rajkumar38 opened 4 years ago
"applyView: current view switches: 0 != temporary view switches: 1, FATAL" suggests that there was no switch in the previous run of syncd. What scenario is this? Is it cold boot, then warm boot, or cold boot only? This seems to be running on VS; is this a consistent repro?
Something is wrong here:

Feb 14 10:22:37.915367 sonic NOTICE syncd#syncd: :- helperLoadColdVids: read 1346 COLD VIDS
Feb 14 10:22:44.154247 sonic NOTICE syncd#syncd: :- onSyncdStart: hard reinit succeeded

Hard reinit succeeded, and there were over 1k objects in the db.

Feb 14 10:23:37.043951 sonic NOTICE syncd#syncd: :- onSwitchCreateInInitViewMode: new switch oid:0x21000000000000 contains hardware info: '' // this indicates that the switch is already present in syncd memory, which is ok

Feb 14 10:23:37.450999 sonic NOTICE syncd#syncd: :- dump: getting took 0.000143 sec
Feb 14 10:23:37.451152 sonic NOTICE syncd#syncd: :- getAsicView: ASIC_STATE switch count: 0: // and this is wrong: no switches in the database dump, even though it was already loaded. The database dump time is also very low, which suggests the db is empty. Was the db somehow cleared?

Feb 14 10:23:37.451342 sonic NOTICE syncd#syncd: :- getAsicView: get asic view from ASIC_STATE took 0.000990 sec
Feb 14 10:23:37.454001 sonic NOTICE syncd#syncd: :- dump: getting took 0.002015 sec
Feb 14 10:23:37.454562 sonic NOTICE syncd#syncd: :- getAsicView: TEMP_ASIC_STATE switch count: 1:
Feb 14 10:23:37.454699 sonic NOTICE syncd#syncd: :- getAsicView: oid:0x21000000000000: objects count: 64
Feb 14 10:23:37.455059 sonic NOTICE syncd#syncd: :- getAsicView: get asic view from TEMP_ASIC_STATE took 0.003421 sec
Feb 14 10:23:37.455239 sonic ERR syncd#syncd: :- applyView: current view switches: 0 != temporary view switches: 1, FATAL
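For readers following along: the consistency check that trips here can be sketched in Python. This is a hypothetical mirror of syncd's C++ applyView logic, not the actual sairedis implementation; the function names and key layout are illustrative only, modeled on the redis keys quoted in the logs.

```python
# Hypothetical sketch of the view-switch consistency check syncd performs
# before comparing the current (ASIC_STATE) and temporary (TEMP_ASIC_STATE)
# views. A dict of redis-style keys stands in for each view dump.

def count_switches(view):
    """Count SAI_OBJECT_TYPE_SWITCH entries in an ASIC view dump."""
    return sum(1 for key in view if ":SAI_OBJECT_TYPE_SWITCH:" in key)

def apply_view(current_view, temp_view):
    cur = count_switches(current_view)
    tmp = count_switches(temp_view)
    if cur != tmp:
        # syncd treats this mismatch as fatal and requests a switch shutdown
        raise RuntimeError(
            f"applyView: current view switches: {cur} != "
            f"temporary view switches: {tmp}, FATAL")
    return cur

# The state from the logs above: the current view is empty (db was cleared),
# while the temp view holds the one switch populated by orchagent.
current = {}
temp = {"TEMP_ASIC_STATE:SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000": {}}
try:
    apply_view(current, temp)
except RuntimeError as e:
    print(e)  # prints the 0 != 1 FATAL message
```

So the FATAL is not itself the bug; it is syncd correctly refusing to reconcile a temporary view against a current view that unexpectedly lost its switch.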
After reaching this state, could you run "redis-cli -n 1" and then "keys *" to see whether there are any ASIC_STATE keys other than TEMP_ASIC_STATE?
Using an ET-6448M board comprising an ARMADA-385 CPU, with 48 1G copper ports and 4 10G SFP+ ports.
1) In a stand-alone box, load the t1 preset config and reboot the box.
2) Verify all the ports are present and IP addresses are set.
3) Restart syncd multiple times using "sudo systemctl restart syncd".

I'm consistently seeing this issue on the second iteration of the syncd restart.
Attached the redis-cli output in the issue state: ASIC.txt
In ASIC.txt there are no ASIC_STATE* entries. It seems those entries were removed or the database was cleared. Could we work together on this this week?
Yes. We can work on this.
Since the ASIC.txt dump doesn't contain "HIDDEN" or any ASIC_STATE* keys, the database was flushed. I also suspect the RIDTOVID and VIDTORID hashes have a small entry count, likely less than 100. Could you also check that with "keys RIDTOVID"?

If so, then some script probably did FLUSHALL on the database, most likely outside syncd/sairedis/OA.
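The reasoning above can be expressed as a small heuristic. This is a sketch only, not syncd code: it simply checks a key dump for the signatures kcudnik describes (no ASIC_STATE:* keys and no HIDDEN key), with all key names taken from the thread.

```python
# Hypothetical heuristic: a database 1 key dump with neither ASIC_STATE:*
# keys nor the HIDDEN key indicates the db was flushed from outside
# syncd/sairedis/OA, since syncd always maintains those when healthy.

def looks_flushed(keys):
    has_asic_state = any(k.startswith("ASIC_STATE:") for k in keys)
    has_hidden = any("HIDDEN" in k for k in keys)
    return not has_asic_state and not has_hidden

# What ASIC.txt reportedly looked like: temp view and mapping hashes only.
flushed_dump = [
    "TEMP_ASIC_STATE:SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000",
    "RIDTOVID",
    "VIDTORID",
]
# A healthy dump would additionally carry the current view and HIDDEN key.
healthy_dump = flushed_dump + [
    "ASIC_STATE:SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000",
    "HIDDEN",
]
print(looks_flushed(flushed_dump), looks_flushed(healthy_dump))
```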
"keys RIDTOVID" shows around 116 entries. ASIC_RIDTOVID_keys.txt
Additionally, I collected the "redis-cli monitor" logs. redis_monitor_logs.txt
And syncd logs with "DEBUG" level enabled. syslog_DEBUG_logs.txt
This shows that the lua scripts loaded in the syncd context as part of the docker restart clean up the ASIC state. Hope these logs are helpful.
Note that in the above log file, syncd crashed. This is because of the 32-bit arch, which we will fix in parallel.
Hey, thanks for those logs, this explains the issue. Take a look here:

1550139701.569199 [1 unix:/var/run/redis/redis.sock] "KEYS" "ASIC_STATE:" // full of objects
...
1550139724.135934 [0 127.0.0.1:53888] "SELECT" "1"
1550139724.137403 [1 127.0.0.1:53888] "FLUSHDB" // THIS CLEARS DB !
...
1550139761.398453 [1 lua] "HSET" "TEMP_ASIC_STATE:SAI_OBJECT_TYPE_ROUTE_ENTRY:{\"de // OA populates temp view
...
1550139761.738593 [1 lua] "KEYS" "ASIC_STATE:" // this one is empty
1550139760.677580 [1 unix:/var/run/redis/redis.sock] "KEYS" "TEMP_ASIC_STATE:*" // this one is full of objects
Will try to investigate this internally.
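The monitor sequence above can be replayed with a plain dict standing in for redis database 1. This is a simulation only (the real flush comes from an external script, not from syncd); it just shows why syncd later finds zero ASIC_STATE switches while TEMP_ASIC_STATE is fully populated.

```python
# Simulation of the redis monitor trace: db 1 holds a populated current
# view, an external FLUSHDB wipes everything, then orchagent's lua scripts
# repopulate only the temporary view. Key names mirror the trace above.

db1 = {
    "ASIC_STATE:SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000": {},
    "ASIC_STATE:SAI_OBJECT_TYPE_PORT:oid:0x1000000000002": {},
}

# Equivalent of the "FLUSHDB" issued outside syncd: all keys are removed.
db1.clear()

# orchagent then writes only TEMP_ASIC_STATE:* entries via lua HSETs.
db1["TEMP_ASIC_STATE:SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000"] = {}

asic_keys = [k for k in db1 if k.startswith("ASIC_STATE:")]
temp_keys = [k for k in db1 if k.startswith("TEMP_ASIC_STATE:")]
print(len(asic_keys), len(temp_keys))  # prints: 0 1
```

That 0-vs-1 split is exactly the switch-count mismatch that applyView reports as FATAL.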
This is related to sonic-buildimage/files/scripts/swss.sh execution; it seems this script is executed when the OA and syncd are already running.
Could you try executing "sudo systemctl restart swss" or "config reload" instead of restarting "syncd"? I think restarting syncd alone is currently not supported. Is there a reason why you want to restart syncd only?
Thanks for the update. I see this issue only with restart_syncd; all other cases (reboot, config reload, restart swss) are fine. Any reason why restart_syncd is not supported? I see a restart_syncd test-case in the sonic-mgmt repo: https://github.com/Azure/sonic-mgmt/blob/master/ansible/roles/test/tasks/restart_syncd.yml
@yxieca @kcudnik Hi, is this bug scheduled for a fix? Let me know if you need any more logs to fix this issue. Thanks.
I don't know why restart_syncd is not supported; maybe @yxieca can help here. From the syncd/sairedis perspective this is not a bug, and supporting the restart_syncd command is a system-level matter.
Restarting syncd should be the same as restarting swss, which could trigger some dependent service failure. The issue is not really with swss or syncd. At this moment, config reload is the safest way to recover.
With that said, I think we should try to make restarting syncd/swss work.
@kcudnik @lguohan : Pls check.
All dockers were fully up, and then I issued "sudo systemctl restart syncd"; it fails in apply_view, which causes an orchagent crash. Attached full logs for reference.
sonic-buildimage commit: f31eabb5ee65f7d37d57d0da85dacf39d3b5fad1
sonic-buildimage/src/sonic-sairedis commit: 322dd01db434d00394898d4d1ac34433e95447cd
syslog_FATAL_error.txt
Feb 14 10:23:37.448931 sonic NOTICE swss#orchagent: :- syncd_apply_view: Notify syncd APPLY_VIEW
Feb 14 10:23:37.448931 sonic NOTICE swss#orchagent: :- notifySyncd: sending syncd: APPLY_VIEW
Feb 14 10:23:37.450194 sonic WARNING syncd#syncd: :- processNotifySyncd: syncd received APPLY VIEW, will translate
Feb 14 10:23:37.450999 sonic NOTICE syncd#syncd: :- dump: getting took 0.000143 sec
Feb 14 10:23:37.451152 sonic NOTICE syncd#syncd: :- getAsicView: ASIC_STATE switch count: 0:
Feb 14 10:23:37.451342 sonic NOTICE syncd#syncd: :- getAsicView: get asic view from ASIC_STATE took 0.000990 sec
Feb 14 10:23:37.454001 sonic NOTICE syncd#syncd: :- dump: getting took 0.002015 sec
Feb 14 10:23:37.454562 sonic NOTICE syncd#syncd: :- getAsicView: TEMP_ASIC_STATE switch count: 1:
Feb 14 10:23:37.454699 sonic NOTICE syncd#syncd: :- getAsicView: oid:0x21000000000000: objects count: 64
Feb 14 10:23:37.455059 sonic NOTICE syncd#syncd: :- getAsicView: get asic view from TEMP_ASIC_STATE took 0.003421 sec
Feb 14 10:23:37.455239 sonic ERR syncd#syncd: :- applyView: current view switches: 0 != temporary view switches: 1, FATAL
Feb 14 10:23:37.475524 sonic NOTICE syncd#syncd: :- applyView: apply took 0.025017 sec
Feb 14 10:23:37.477195 sonic ERR syncd#syncd: :- run: Runtime error: :- applyView: current view switches: 0 != temporary view switches: 1, FATAL
Feb 14 10:23:37.477339 sonic NOTICE syncd#syncd: :- sendShutdownRequest: sending switch_shutdown_request notification to OA for switch: oid:0x21000000000000
Feb 14 10:23:37.485238 sonic NOTICE syncd#syncd: :- sendShutdownRequestAfterException: notification send successfull
Feb 14 10:23:37.505620 sonic INFO swss#supervisord: orchagent terminate called after throwing an instance of 'std::invalid_argument'
Feb 14 10:23:37.506549 sonic INFO swss#supervisord: orchagent what(): parse error - unexpected 'o'
=============================================================================
root@sonic:/home/admin# show version
SONiC Software Version: SONiC.jun9.0-dirty-20200706.194346
Distribution: Debian 10.4
Kernel: 4.19.0-6-armmp
Build commit: f31eabb5
Build date: Mon Jul 6 14:54:32 UTC 2020
Built by: marvell@cpss-build7

Platform: armhf-marvell_et6448m_52x-r0
HwSKU: et6448m
ASIC: marvell
Serial Number: 201908150001
Uptime: 11:38:48 up 1:27, 1 user, load average: 0.51, 0.56, 0.52

Docker images:
REPOSITORY                 TAG                            IMAGE ID       SIZE
docker-orchagent           jun9.0-dirty-20200706.194346   9af6ab3099fe   352MB
docker-orchagent           latest                         9af6ab3099fe   352MB
docker-fpm-frr             jun9.0-dirty-20200706.194346   1dedd2aa317c   300MB
docker-fpm-frr             latest                         1dedd2aa317c   300MB
docker-sflow               jun9.0-dirty-20200706.194346   b6fb327d6add   286MB
docker-sflow               latest                         b6fb327d6add   286MB
docker-nat                 jun9.0-dirty-20200706.194346   696e475b4d4c   287MB
docker-nat                 latest                         696e475b4d4c   287MB
docker-teamd               jun9.0-dirty-20200706.194346   c203e683b9a5   286MB
docker-teamd               latest                         c203e683b9a5   286MB
docker-syncd-mrvl          jun9.0-dirty-20200706.194346   a15ba5251617   531MB
docker-syncd-mrvl          latest                         a15ba5251617   531MB
docker-snmp-sv2            jun9.0-dirty-20200706.194346   1967491bf338   348MB
docker-snmp-sv2            latest                         1967491bf338   348MB
docker-lldp                jun9.0-dirty-20200706.194346   f49ecd7684d8   328MB
docker-lldp                latest                         f49ecd7684d8   328MB
docker-database            jun9.0-dirty-20200706.194346   c85327cc927e   313MB
docker-database            latest                         c85327cc927e   313MB
docker-platform-monitor    jun9.0-dirty-20200706.194346   9752ae58bea8   322MB
docker-platform-monitor    latest                         9752ae58bea8   322MB
docker-router-advertiser   jun9.0-dirty-20200706.194346   be949c4f52b2   270MB
docker-router-advertiser   latest                         be949c4f52b2   270MB
docker-dhcp-relay          jun9.0-dirty-20200706.194346   a05bf6ab2d92   277MB
docker-dhcp-relay          latest                         a05bf6ab2d92   277MB