illinois-scicomp / machine-shop-maintenance

Scripts and Issues for the birds and the beers

Dunkel is dead #66

Open inducer opened 2 years ago

inducer commented 2 years ago

Currently, dmesg shows lots of messages like this:

[Mon Oct  3 13:52:42 2022] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[Mon Oct  3 13:52:42 2022] EDAC sbridge MC1: CPU 0: Machine Check Event: 0 Bank 7: cc00418000010091
[Mon Oct  3 13:52:42 2022] EDAC sbridge MC1: TSC 0 
[Mon Oct  3 13:52:42 2022] EDAC sbridge MC1: ADDR 2031c14940 
[Mon Oct  3 13:52:42 2022] EDAC sbridge MC1: MISC 150481a86 
[Mon Oct  3 13:52:42 2022] EDAC sbridge MC1: PROCESSOR 0:406f1 TIME 1664823350 SOCKET 0 APIC 0
[Mon Oct  3 13:52:42 2022] EDAC MC1: 262 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x2031c14 offset:0x940 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0091 socket:0 ha:0 channel_mask:2 rank:1)
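For triage, the last EDAC line is the one that names the affected module: it carries the error count, the error kind (CE, a corrected error), and the DIMM label. Below is a minimal Python sketch that pulls those fields out of such a line. The regex, the function name `parse_edac_summary`, and the dictionary keys are illustrative choices rather than any standard tool's interface, and the address reconstruction assumes 4 KiB pages.

```python
import re

# Regex for the EDAC summary line. The group names are this sketch's own choice.
EDAC_SUMMARY = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) (?P<msg>.+?) on "
    r"(?P<label>\S+) \((?P<details>.*)\)"
)

PAGE_SHIFT = 12  # assumption: 4 KiB pages


def parse_edac_summary(line):
    """Return the fields of one EDAC summary line, or None if it does not match."""
    m = EDAC_SUMMARY.search(line)
    if m is None:
        return None
    # The parenthesized part is a run of key:value tokens. Tokens without a
    # colon (such as "-" and "OVERFLOW") are skipped.
    details = dict(
        tok.split(":", 1) for tok in m.group("details").split() if ":" in tok
    )
    # Rebuild the physical address from the page number and in-page offset.
    address = (int(details["page"], 16) << PAGE_SHIFT) | int(details["offset"], 16)
    return {
        "mc": int(m.group("mc")),
        "count": int(m.group("count")),
        "kind": m.group("kind"),
        "message": m.group("msg"),
        "label": m.group("label"),
        "address": address,
        "details": details,
    }


# The summary line from the dmesg excerpt above.
sample = (
    "[Mon Oct  3 13:52:42 2022] EDAC MC1: 262 CE memory read error on "
    "CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x2031c14 "
    "offset:0x940 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM "
    "err_code:0001:0091 socket:0 ha:0 channel_mask:2 rank:1)"
)

info = parse_edac_summary(sample)
print(info["count"], info["kind"], info["label"], hex(info["address"]))
```

Under the 4 KiB page assumption, the reconstructed address for this line is 0x2031c14940, which agrees with the ADDR value reported a few lines earlier in the same event.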

cc @rckirby

inducer commented 2 years ago

These errors seem to recur every few seconds.

Rebooted it, to see if that helps.

rckirby commented 2 years ago

Seems not to have come back up?

***@***.***:~/2022-nsf-transform-nonlocal$ ssh dunkel
channel 0: open failed: connect failed: No route to host
stdio forwarding failed
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535

inducer commented 2 years ago

https://www.complang.tuwien.ac.at/anton/failing-memory.html has a description of someone troubleshooting a similar issue.

inducer commented 2 years ago

@lukeolson What's the latest here? @kaushikcfd will move the GPU out of dunkel this afternoon so that the GPU stays usable.