Open jrl290 opened 1 month ago
Some ideas.
There are a bunch of environment variables you can set to enable debugging, the most prominent of which is AMD_LOG_LEVEL. At the highest level this spews out a lot of logging if not masked with AMD_LOG_MASK, and a lot of it is super technical, so it's not generally that helpful. But it might track the problem down to the last thing that was being attempted. As that page describes, you can also use rocgdb to get a backtrace on a crash, but while powerful, gdb is a bit of an ogre if you're not already familiar with it.
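To make this concrete, here is a minimal sketch of how the logging variables can be set from a PyTorch script (the mask value is just the "log everything" default; exporting the variables in the shell before launching the script works equally well):

# Sketch: enable verbose HIP runtime logging before any GPU work happens.
# AMD_LOG_LEVEL=4 is the most verbose; AMD_LOG_MASK is an optional bitmask
# for filtering subsystems. Shell equivalent: AMD_LOG_LEVEL=4 python app.py > full.log 2>&1
import os
os.environ["AMD_LOG_LEVEL"] = "4"
os.environ["AMD_LOG_MASK"] = "0x7FFFFFFF"

import torch

x = torch.randn(1024, 1024, device="cuda")  # any GPU work from here on is logged
y = x @ x
torch.cuda.synchronize()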
I'm not familiar with how the setup on mobile works, but try to ensure that the GPU you're doing compute on is not also the one driving the primary display. This should work fine (generally the worst that can happen is running out of memory, not a crash), but it's still noise you're better off without. If some fancy desktop compositor is tickling the card in a way that's not appropriately combined with compute, this could cause the driver to choke. That would still generally be a driver bug as that should never trigger a reset, but such things do happen.
Lastly, although this doesn't really sound like a hardware problem given the random nature of the crash, you can still try installing corectrl (or building it from this SDK if your distro doesn't offer it) and see if limiting the GPU by power or clock speed improves things, since compute tends to hammer things much harder than 3D apps. Note that not all GPU models actually support this kind of tuning (the iGPU in desktop processors does not, for example) and you may need a kernel parameter to unlock the support in the first place. (Note that the wiki talks about overclocking, but it's really about unlocking voltage/speed parameters that also allow underclocking, which can help give you better performance by not driving the thing up to the thermal limit constantly. I do not recommend manually writing things to /sys as described in the wiki; just use a tool.)
Thanks so much for the reply. I will definitely dive in with the tools you mentioned tomorrow.
As for primary display, it's being run in server mode. I'm ssh-ing in to run console commands
And while trying to put together minimal code that demonstrates the problem, I was able to crank the compute to the max. I even used the same model and FFT functions that are at the heart of my application and couldn't get it to crash. But the rest of the code is much more complex, so while I'm coming at it from both directions, it's still taking me a long time to unravel and isolate pieces of code to test.
It sounds like one or the other piece might be introducing some kind of memory corruption or resource exhaustion that then catches up with the other operations. Unfortunately such things are notoriously hard to debug, since the offending operation isn't necessarily the one that crashes. However, if you have cases where it crashes right up front, these should at least minimize the amount of logging/tracing you have to trawl through, it's just a matter of retrying.
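One way to make that kind of creeping exhaustion visible (a sketch, not something from this thread) is to log allocator state between pipeline stages, so the last healthy reading is still on screen even when a later, unrelated kernel is the one that dies:

# Sketch: checkpoint GPU allocator state between suspect operations.
import torch

def gpu_checkpoint(tag):
    torch.cuda.synchronize()  # surface errors from earlier asynchronous work first
    alloc = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f"[{tag}] allocated={alloc:.1f} MiB reserved={reserved:.1f} MiB", flush=True)

# usage between stages of the pipeline:
# gpu_checkpoint("after stft")
# result = model(x)
# gpu_checkpoint("after model")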
One thing to try is to build the very latest kernel from git (6.11-rc4), as there are quite a few fixes for APUs in the latest kernel. I have also sometimes seen "gpu hang" issues on gfx1103 that I do not see on gfx1035 when running something more extensive.
If you have some code you could share that will very likely trigger the problem, that would help the testing. I have a feeling that if the problem persists even with the latest kernel, it is either on the kernel side or in the userspace code that communicates with the kernel to submit work and receive responses. I may have some old notes somewhere for tracing similar problems, from a long time ago when I traced this type of problem on a 2400G/Vega APU.
Installed the latest kernel. No luck
Turned on logging. This is the error level log. The GPU Hang doesn't appear in the error log when it happens. I'm still parsing through the "everything" log. But maybe something jumps out at you.
:1:hip_fatbin.cpp :259 : 0469299906 us: [pid:1892 tid:0x7c50f5445b80] Cannot find CO in the bundle /opt/rocm_sdk_612/lib/libhipblaslt.so.0.7.60102 for ISA: amdgcn-amd-amdhsa--gfx1103
:1:hip_fatbin.cpp :112 : 0469299920 us: [pid:1892 tid:0x7c50f5445b80] Missing CO for these ISAs -
:1:hip_fatbin.cpp :115 : 0469299923 us: [pid:1892 tid:0x7c50f5445b80] amdgcn-amd-amdhsa--gfx1103
:1:hip_fatbin.cpp :259 : 0470625704 us: [pid:1892 tid:0x7c50f5445b80] Cannot find CO in the bundle /opt/rocm_sdk_612/lib/libhipblaslt.so.0.7.60102 for ISA: amdgcn-amd-amdhsa--gfx1103
:1:hip_fatbin.cpp :112 : 0470625717 us: [pid:1892 tid:0x7c50f5445b80] Missing CO for these ISAs -
:1:hip_fatbin.cpp :115 : 0470625720 us: [pid:1892 tid:0x7c50f5445b80] amdgcn-amd-amdhsa--gfx1103
:1:hip_code_object.cpp :624 : 0474035591 us: [pid:1892 tid:0x7c50f5445b80] Cannot find the function: Cijk_Ailk_Bljk_SB_MT32x32x8_SN_1LDSB0_AMAS0_BL1_BS1_EPS0_GLVWA1_GLVWB1_GRVW1_GSU1_GSUASB_ISA1103_IU1_K1_KLA_LBSPPA0_LBSPPB0_LPA0_LPB0_LRVW1_MIAV0_MMFGLC_NLCA1_NLCB1_PGR0_PLR1_SIA1_SS0_SU32_SUS256_SVW4_TT2_2_TLDS0_UMLDSA0_UMLDSB0_USFGROn1_VAW1_VSn1_VW1_VWB1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8
:1:hip_module.cpp :84 : 0474035608 us: [pid:1892 tid:0x7c50f5445b80] Cannot find the function: Cijk_Ailk_Bljk_SB_MT32x32x8_SN_1LDSB0_AMAS0_BL1_BS1_EPS0_GLVWA1_GLVWB1_GRVW1_GSU1_GSUASB_ISA1103_IU1_K1_KLA_LBSPPA0_LBSPPB0_LPA0_LPB0_LRVW1_MIAV0_MMFGLC_NLCA1_NLCB1_PGR0_PLR1_SIA1_SS0_SU32_SUS256_SVW4_TT2_2_TLDS0_UMLDSA0_UMLDSB0_USFGROn1_VAW1_VSn1_VW1_VWB1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 for module: 0x71101ab0
I'm still having trouble isolating the problem even just to collect a log from a single command that hangs (otherwise it's megabytes of text). But I'm still working on it. I'll send some code when I finally get it down to a reasonable enough length to be readable.
Well, it's good to know that the fix is not in the new kernel. Just to verify one other thing: once the GPU reset happens, is the system still able to reset the GPU without you needing to do a full reboot?
I've had kernels and other Ubuntu versions fully lock up. On the versions I'm using now, the GPU is able to recover, though of course the full Python process is killed.
Ok, here are two Level 4 logs: one in which the crash occurs almost immediately, and another which gets past the crash point (with everything past that point trimmed). I'm looking through them now. noncrash.txt crash.txt
Let me know if you think it would be useful for me to go through and match the two logs line by line
I ended up going through matching the log anyway. Here's the Google Sheet with the comparison: https://docs.google.com/spreadsheets/d/1ZbOBMm2xRa-i0djBYTJwff2Eoee0qow9lBKyyTRBXTM/edit?usp=sharing
The two match up pretty substantially. Most discrepancies are Host wait on completion_signal= and Host active wait for Signal = lines, and then there are a couple of sections towards the end where the order of calls is shuffled a bit.
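For comparisons like this, a small script can strip the fields that always differ (timestamps, pid/tid, pointer and signal addresses) before diffing, so only structural differences remain. A sketch, assuming the two attachments above are saved as noncrash.txt and crash.txt:

# Sketch: normalize two AMD_LOG_LEVEL=4 logs and diff them.
import re
import difflib

def normalize(path):
    lines = []
    for line in open(path, errors="replace"):
        line = re.sub(r"\b\d{6,}\s*us\b", "<ts>", line)                    # timestamps
        line = re.sub(r"pid:\d+ tid:0x[0-9a-fA-F]+", "pid:<p> tid:<t>", line)
        line = re.sub(r"0x[0-9a-fA-F]+", "<addr>", line)                   # pointers/signals
        lines.append(line)
    return lines

diff = difflib.unified_diff(normalize("noncrash.txt"), normalize("crash.txt"),
                            "noncrash", "crash", n=0)
print("".join(diff))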
EDIT: I eliminated more code
Ok, sorry, I got a bit sidetracked on this. Here is the minimum code to cause the crash (files):
import torch
import numpy as np
from onnx import load
from onnx2pytorch import ConvertModel
import os

os.environ["AMD_LOG_LEVEL"] = "4"

if __name__ == "__main__":
    model_path = "model.onnx"
    device = 'cuda'

    model_run = ConvertModel(load(model_path))
    model_run.to(device).eval()

    # It does not seem to want to crash if this line is commented out
    random = np.random.rand(1, 4, 3072, 256)

    while True:
        print("Loop Start")
        tensor = torch.randn(1, 4, 3072, 256, dtype=torch.float32, device=device)
        print("The crash happens here:")
        result = model_run(tensor)
Hopefully this makes it easy to diagnose the issue
First off, no matter how long I run it, if that numpy.random line isn't in there, the script doesn't crash. What could that possibly mean?
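One way to narrow that down (purely an assumption on my part, not something tested in this thread) is to check whether it's the ~24 MiB host allocation itself, rather than the random number generation, that makes the crash reproducible; swapping the line for a same-sized plain allocation would separate the two:

# Sketch: substitute the np.random.rand line with a same-sized allocation.
import numpy as np

random = np.empty((1, 4, 3072, 256))        # same-sized host buffer, no RNG work
# random = np.random.rand(1, 4, 3072, 256)  # original line from the repro
# (removing the line entirely is reported above not to crash)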
Also it looks like there are two separate crashes. One comes on malloc:
Success
hipMalloc ( 0x7ffc8851c648, 18874368 )
hipMalloc ( 0x7ffe3fa35778, 75497472 )
hipMalloc ( 0x7ffe3fa36cb8, 150994944 )
Crash
hipMalloc ( 0x7fff9307f708, 12582912 )
hipMalloc ( 0x7ffc06bbd5f8, 12582912 )
hipMalloc ( 0x7ffddf593878, 75497472 )
The other one seems to come on some sort of synchronization/lock/barrier:
Success
:4:rocvirtual.cpp :1071: 9886723816 us: [pid:4059 tid:0x73adad3b1b80] HWq=0x73ac70f00000, BarrierAND Header = 0x1503 (type=3, barrier=1, acquire=2, release=2), dep_signal=[0x0, 0x0, 0x0, 0x0, 0x0], completion_signal=0x73ac73bff900
:3:rocvirtual.hpp :66 : 9886723837 us: [pid:4059 tid:0x73adad3b1b80] Host active wait for Signal = (0x73ac73bffa80) for -1 ns
:4:rocvirtual.cpp :898 : 9887923934 us: [pid:4059 tid:0x73adad3b1b80] HWq=0x73ac70f00000, Dispatch Header = 0xb02 (type=2, barrier=1, acquire=1, release=1), setup=1, grid=[3072, 1, 1], workgroup=[512, 1, 1], private_seg_size=0, group_seg_size=0, kernel_obj=0x73ad39ca5980, kernarg_address=0x73ac70500000, completion_signal=0x0
Crash:
:4:command.cpp :346 : 1946568251 us: [pid:2421 tid:0x7b54a1492b80] Command (CopyDeviceToHost) enqueued: 0x57c567691200
:4:rocmemory.cpp :988 : 1946568973 us: [pid:2421 tid:0x7b54a1492b80] Locking to pool 0x57c5673e0860, size 0xc01000, HostPtr = 0x57c5702b8000, DevPtr = 0x57c5702b8000
:4:rocvirtual.cpp :1071: 1946568987 us: [pid:2421 tid:0x7b54a1492b80] HWq=0x7b5364f00000, BarrierAND Header = 0x1503 (type=3, barrier=1, acquire=2, release=2), dep_signal=[0x0, 0x0, 0x0, 0x0, 0x0], completion_signal=0x7b5367bfea00
:3:rocvirtual.hpp :66 : 1946568998 us: [pid:2421 tid:0x7b54a1492b80] Host active wait for Signal = (0x7b5367bfea00) for 10000 ns
:4:rocblit.cpp :750 : 1946569026 us: [pid:2421 tid:0x7b54a1492b80] HSA Async Copy on copy_engine=0x1, dst=0x57c5702b8080, src=0x7b5259e00000, size=12582912, forceSDMA=0, wait_event=0x7b5367bfea00, completion_signal=0x7b5367bfe980
:4:rocvirtual.cpp :570 : 1946569040 us: [pid:2421 tid:0x7b54a1492b80] Host wait on completion_signal=0x7b5367bfe980
:3:rocvirtual.hpp :66 : 1946569055 us: [pid:2421 tid:0x7b54a1492b80] Host active wait for Signal = (0x7b5367bfe980) for -1 ns
Just worth mentioning: it seems there are major AMDGPU changes happening in recent Linux kernel updates, so it's probably best to wait before doing any more diagnosis of such issues: https://www.phoronix.com/news/AMDGPU-Linux-6.12-More-PQ-Reset https://www.phoronix.com/news/Linux-6.11-rc7-AMDGPU-Fix
I'm still having this random GPU Hang on my 7840U (gfx1103) and not on my 6800U (forced to gfx1030):
HW Exception by GPU node-1 (Agent handle: 0x5ab48bbcc960) reason :GPU Hang
I've been racking my brain to figure out what's causing it: deleting sections of my code, trying to build a minimum crashing sample to provide. But sometimes it takes running many iterations of the processing I'm doing, and sometimes it crashes right up front. There's a lot of code to go through, so I'm still trying to narrow things down. But my guess is that the crash occurs as a result of the state of the GPU rather than the actual instruction, which makes things much trickier.
Maybe there's something much more obvious to you or an easier way to track down the issue
Some commands it has crashed on:
torch.stft(x, n_fft=self.n_fft, hop_length=self.hop_length, window=window, center=True,return_complex=False).to(device)
torch.zeros([*batch_dims, c, n - f, t]).to(device)
torch.istft(x, n_fft=self.n_fft, hop_length=self.hop_length, window=window, center=True)
torch.cuda.synchronize()
Here's the kernel log with a few of these crashes
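For isolating the failing operations, a minimal stress loop exercising the same ops listed above might help; this is only a sketch, and the shapes, n_fft and hop_length values here are placeholders rather than the values from the real application:

# Sketch: hammer stft/istft/zeros in a loop to see if they hang in isolation on gfx1103.
import torch

device = "cuda"
n_fft, hop_length = 4096, 1024
window = torch.hann_window(n_fft, device=device)

for i in range(1000):
    x = torch.randn(4, 10 * 44100, device=device)          # assumed batch of audio
    spec = torch.stft(x, n_fft=n_fft, hop_length=hop_length, window=window,
                      center=True, return_complex=True)
    y = torch.istft(spec, n_fft=n_fft, hop_length=hop_length, window=window,
                    center=True)
    z = torch.zeros(4, 8, 128, 128, device=device)          # assumed stand-in shape
    torch.cuda.synchronize()
    print(f"iteration {i} ok", flush=True)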