stormshift / support

This repo should serve as a central source for reporting issues with stormshift
GNU General Public License v3.0

Storm3 is down #82

Closed Javatar81 closed 2 years ago

DanielFroehlich commented 2 years ago

storm6 is down as well; it was the watchdog issue again: [image]

DanielFroehlich commented 2 years ago

storm3 log, last messages before reboot:

Apr 19 16:15:50 storm3 sanlock[2063]: 2022-04-19 16:15:50 251755 [2063]: s1 kill 570915 sig 15 count 26
Apr 19 16:15:51 storm3 wdmd[2062]: test failed rem 20 now 251755 ping 251705 close 251715 renewal 251650 expire 251730 client 2063 sanlock_24aaade7-ee5a-49a4-8220-5dbb4f71f8aa:1
Apr 19 16:15:51 storm3 named[2570]: resolver priming query complete
Apr 19 16:15:51 storm3 sanlock[2063]: 2022-04-19 16:15:51 251756 [2063]: s1 kill 570915 sig 15 count 27
Apr 19 16:15:52 storm3 wdmd[2062]: test failed rem 19 now 251756 ping 251705 close 251715 renewal 251650 expire 251730 client 2063 sanlock_24aaade7-ee5a-49a4-8220-5dbb4f71f8aa:1
Apr 19 16:15:52 storm3 sanlock[2063]: 2022-04-19 16:15:52 251757 [2063]: s1 kill 570915 sig 15 count 28
Apr 19 16:15:53 storm3 wdmd[2062]: test failed rem 18 now 251757 ping 251705 close 251715 renewal 251650 expire 251730 client 2063 sanlock_24aaade7-ee5a-49a4-8220-5dbb4f71f8aa:1
Apr 19 16:15:53 storm3 vdsm[4420]: WARN executor state: count=5 workers={<Worker name=qgapoller/1 waiting task#=12582 at 0x7fd6f904d400>, <Worker name=qgapoller/0 waiting task#=12582 at 0x7fd6f903c860>, <Worker name=qgapoller/2 running <Task discardable <Operation action=<bound method QemuGuestAgentPoller._poller of <vdsm.virt.qemuguestagent.QemuGuestAgentPoller object at 0x7fd6f904dcc0>> at 0x7fd6f903c550> timeout=30, duration=30.00 at 0x7fd6faf20358> discarded task#=12581 at 0x7fd6f905e470>, <Worker name=qgapoller/3 waiting task#=12581 at 0x7fd6f905e518>, <Worker name=qgapoller/4 waiting task#=0 at 0x7fd6f84f8da0>}
Apr 19 16:15:53 storm3 sanlock[2063]: 2022-04-19 16:15:53 251758 [2063]: s1 kill 570915 sig 15 count 29
Apr 19 16:15:54 storm3 wdmd[2062]: test failed rem 17 now 251758 ping 251705 close 251715 renewal 251650 expire 251730 client 2063 sanlock_24aaade7-ee5a-49a4-8220-5dbb4f71f8aa:1

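The wdmd "test failed" lines in this log can be read mechanically. Below is a minimal Python sketch that pulls out the numeric fields and shows how `rem` counts down while `now` advances. The field order is inferred only from the lines quoted in this issue, and the names `parse_wdmd` and `summarize` are illustrative helpers, not part of sanlock or wdmd.

```python
import re

# Field layout inferred from the wdmd lines quoted in this issue
# (an assumption, not taken from wdmd documentation).
WDMD_RE = re.compile(
    r"wdmd\[\d+\]: test failed "
    r"rem (?P<rem>\d+) now (?P<now>\d+) ping (?P<ping>\d+) "
    r"close (?P<close>\d+) renewal (?P<renewal>\d+) expire (?P<expire>\d+)"
)


def parse_wdmd(line):
    """Return the numeric wdmd fields as a dict, or None for other lines."""
    match = WDMD_RE.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


def summarize(lines):
    """For each wdmd line: (rem, seconds since renewal, seconds past expire)."""
    rows = []
    for line in lines:
        fields = parse_wdmd(line)
        if fields is None:
            continue  # skip sanlock, named, vdsm and other log lines
        rows.append((
            fields["rem"],
            fields["now"] - fields["renewal"],
            fields["now"] - fields["expire"],
        ))
    return rows


# First and last wdmd lines from the storm3 log, plus one unrelated line.
SAMPLE = [
    "Apr 19 16:15:51 storm3 wdmd[2062]: test failed rem 20 now 251755 "
    "ping 251705 close 251715 renewal 251650 expire 251730 client 2063 "
    "sanlock_24aaade7-ee5a-49a4-8220-5dbb4f71f8aa:1",
    "Apr 19 16:15:51 storm3 named[2570]: resolver priming query complete",
    "Apr 19 16:15:54 storm3 wdmd[2062]: test failed rem 17 now 251758 "
    "ping 251705 close 251715 renewal 251650 expire 251730 client 2063 "
    "sanlock_24aaade7-ee5a-49a4-8220-5dbb4f71f8aa:1",
]

for rem, since_renewal, past_expire in summarize(SAMPLE):
    print(f"rem={rem} since_renewal={since_renewal}s past_expire={past_expire}s")
```

On these lines, `renewal` and `expire` stay fixed while `now` keeps advancing and `rem` drops by one per second, which matches the host being reset shortly afterwards.
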
DanielFroehlich commented 2 years ago

Both storm3 and storm6 have been rebooted, and RHEV is back online. Keeping this issue open to monitor it.

On storm6, I changed the BIOS setting for the watchdog timer. It was: [image]

Now it is ENABLED - let's see if this changes something.

DanielFroehlich commented 2 years ago

It happened again today; both storm3 and storm6 were affected. Changing the watchdog setting in the BIOS did not help: the system now really just freezes. What I find strange is that storm3 and storm6 do this more or less simultaneously. I will put the RHEV engine on storm3 into local maintenance mode; let's see if this helps.

[root@storm3 ~]# hosted-engine --set-maintenance --mode=local

rbo commented 2 years ago

storm3 is down again; storm6 is running. Memtest is running on storm3: [image]

rbo commented 2 years ago

[image] Memtest looks good.

rbo commented 2 years ago

Reboot into RHEV

DanielFroehlich commented 2 years ago

Just saw a note regarding the BIOS: https://www.reddit.com/r/vmware/comments/udodo0/avoid_dell_bios_2140_on_r730/ [image]

The "dsu" command line actually shows that downgrade. Executing it now: [image]

Let's see if this helps.

DanielFroehlich commented 2 years ago

Downgraded storm6 today as well (pending reboot), since it hung yesterday evening. storm3 may have had another hang yesterday evening, too (Steffen reported a hang and did an iDRAC reboot).

DanielFroehlich commented 2 years ago

We had a storm2 hang on Jul 2, which was fixed by a fence agent reboot. Applied the BIOS downgrade to storm2 and storm5 today, so all hypervisors are now downgraded to BIOS 2.13.0.
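To double-check that no hypervisor is still on the flagged BIOS, a small Python sketch like the following could compare version strings. It assumes the flagged release is 2.14.0 (read from the Reddit thread title linked earlier in this issue) and that the BIOS versions have already been collected per host by some other means; `parse_version`, `hosts_on_flagged_bios`, and the `inventory` mapping are illustrative names, not an existing tool.

```python
def parse_version(text):
    """Turn a dotted version such as '2.13.0' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


# Assumption: 2.14.0 is the BIOS release the linked Reddit thread warns about.
FLAGGED = parse_version("2.14.0")


def hosts_on_flagged_bios(bios_by_host):
    """Return the sorted host names whose BIOS equals the flagged version."""
    return sorted(
        host
        for host, version in bios_by_host.items()
        if parse_version(version) == FLAGGED
    )


# Example input reflecting the state described in this comment; in practice
# the versions would have to be gathered from each host.
inventory = {
    "storm2": "2.13.0",
    "storm3": "2.13.0",
    "storm5": "2.13.0",
    "storm6": "2.13.0",
}

print(hosts_on_flagged_bios(inventory))
```

Comparing tuples of integers rather than raw strings avoids ordering mistakes such as "2.9.0" sorting after "2.13.0".
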

DanielFroehlich commented 2 years ago

Looks like the BIOS downgrade fixed the issue: we have had no hang or freeze since then. Closing this issue now.