Javatar81 closed 2 years ago
storm3 log, last messages before reboot:
Apr 19 16:15:50 storm3 sanlock[2063]: 2022-04-19 16:15:50 251755 [2063]: s1 kill 570915 sig 15 count 26
Apr 19 16:15:51 storm3 wdmd[2062]: test failed rem 20 now 251755 ping 251705 close 251715 renewal 251650 expire 251730 client 2063 sanlock_24aaade7-ee5a-49a4-8220-5dbb4f71f8aa:1
Apr 19 16:15:51 storm3 named[2570]: resolver priming query complete
Apr 19 16:15:51 storm3 sanlock[2063]: 2022-04-19 16:15:51 251756 [2063]: s1 kill 570915 sig 15 count 27
Apr 19 16:15:52 storm3 wdmd[2062]: test failed rem 19 now 251756 ping 251705 close 251715 renewal 251650 expire 251730 client 2063 sanlock_24aaade7-ee5a-49a4-8220-5dbb4f71f8aa:1
Apr 19 16:15:52 storm3 sanlock[2063]: 2022-04-19 16:15:52 251757 [2063]: s1 kill 570915 sig 15 count 28
Apr 19 16:15:53 storm3 wdmd[2062]: test failed rem 18 now 251757 ping 251705 close 251715 renewal 251650 expire 251730 client 2063 sanlock_24aaade7-ee5a-49a4-8220-5dbb4f71f8aa:1
Apr 19 16:15:53 storm3 vdsm[4420]: WARN executor state: count=5 workers={<Worker name=qgapoller/1 waiting task#=12582 at 0x7fd6f904d400>, <Worker name=qgapoller/0 waiting task#=12582 at 0x7fd6f903c860>, <Worker name=qgapoller/2 running <Task discardable <Operation action=<bound method QemuGuestAgentPoller._poller of <vdsm.virt.qemuguestagent.QemuGuestAgentPoller object at 0x7fd6f904dcc0>> at 0x7fd6f903c550> timeout=30, duration=30.00 at 0x7fd6faf20358> discarded task#=12581 at 0x7fd6f905e470>, <Worker name=qgapoller/3 waiting task#=12581 at 0x7fd6f905e518>, <Worker name=qgapoller/4 waiting task#=0 at 0x7fd6f84f8da0>}
Apr 19 16:15:53 storm3 sanlock[2063]: 2022-04-19 16:15:53 251758 [2063]: s1 kill 570915 sig 15 count 29
Apr 19 16:15:54 storm3 wdmd[2062]: test failed rem 17 now 251758 ping 251705 close 251715 renewal 251650 expire 251730 client 2063 sanlock_24aaade7-ee5a-49a4-8220-5dbb4f71f8aa:1
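To compare these wdmd entries across hosts, a small parser can help. The following is only a sketch: the regular expression is derived from the lines pasted above, and the function name and the two derived values (`since_renewal`, `past_expire`) are made up for illustration. The meaning of each counter is assumed from its label in the log, not taken from the wdmd documentation.

```python
import re

# Pattern built from the "test failed" lines in the log above.
WDMD_RE = re.compile(
    r"wdmd\[\d+\]: test failed "
    r"rem (?P<rem>\d+) now (?P<now>\d+) ping (?P<ping>\d+) "
    r"close (?P<close>\d+) renewal (?P<renewal>\d+) expire (?P<expire>\d+)"
)


def parse_wdmd_line(line):
    """Return the numeric fields of a wdmd 'test failed' line, or None."""
    match = WDMD_RE.search(line)
    if match is None:
        return None
    fields = {name: int(value) for name, value in match.groupdict().items()}
    # Derived values (assumption: all counters share one clock and unit).
    fields["since_renewal"] = fields["now"] - fields["renewal"]
    fields["past_expire"] = fields["now"] - fields["expire"]
    return fields


sample = (
    "Apr 19 16:15:51 storm3 wdmd[2062]: test failed rem 20 now 251755 "
    "ping 251705 close 251715 renewal 251650 expire 251730 client 2063 "
    "sanlock_24aaade7-ee5a-49a4-8220-5dbb4f71f8aa:1"
)
print(parse_wdmd_line(sample))
```

For the sample line, `now` minus `renewal` is 105 and `now` minus `expire` is 25, while `rem` drops by one with each logged line.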
Both storm3 and storm6 were rebooted; RHEV is back online. Keeping this issue open to monitor it.
On storm6, I changed the BIOS setting for the watchdog timer. It was:
Now it is ENABLED - let's see if this changes anything.
It happened again today; both storm3 and storm6 were affected.
Changing the watchdog setting in the BIOS did not help; the system now really just freezes. What I find strange is that storm3 and storm6 do this more or less simultaneously. I will put the RHEV engine on storm3 into local maintenance mode; let's see if this helps.
[root@storm3 ~]# hosted-engine --set-maintenance --mode=local
storm3 is down again; storm6 is running. Memtest is running on storm3.
Memtest looks good.
Reboot into RHEV
Just saw a note regarding the BIOS: https://www.reddit.com/r/vmware/comments/udodo0/avoid_dell_bios_2140_on_r730/
The "dsu" command line actually shows that downgrade. Executing it now:
Let's see if this helps.
Downgraded storm6 today, too (pending reboot), as it hung yesterday evening. storm3 may have had another hang yesterday evening as well (Steffen reported a hang and rebooted it via iDRAC).
storm2 hung on Jul 2, which was fixed by a fence agent reboot. Applied the BIOS downgrade to storm2 and storm5 today, so all hypervisors are now downgraded to BIOS 2.13.0.
Looks like the BIOS downgrade fixed the issue: we have had no hang or freeze since then. Closing this issue now.
storm6 as well; it was the watchdog issue again: