Open jrl290 opened 1 month ago
Some ideas.
There are a bunch of environment variables you can set to enable debugging, the most prominent of which is AMD_LOG_LEVEL. At the highest level this spews out a lot of logging if not masked with AMD_LOG_MASK, and a lot of it is super technical, so it's not generally that helpful. But it might track the problem down to the last thing that was being attempted. As that page describes, you can also use rocgdb to get a backtrace on a crash, but while powerful, gdb is a bit of an ogre if you're not already familiar with it.
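To make this concrete, here is a minimal sketch of how the logging variables can be set from a PyTorch script (the mask value is just the "log everything" default; exporting the variables in the shell before launching the script works equally well):

# Sketch: enable verbose HIP runtime logging before any GPU work happens.
# AMD_LOG_LEVEL=4 is the most verbose; AMD_LOG_MASK is an optional bitmask
# for filtering subsystems. Shell equivalent: AMD_LOG_LEVEL=4 python app.py > full.log 2>&1
import os
os.environ["AMD_LOG_LEVEL"] = "4"
os.environ["AMD_LOG_MASK"] = "0x7FFFFFFF"

import torch

x = torch.randn(1024, 1024, device="cuda")  # any GPU work from here on is logged
y = x @ x
torch.cuda.synchronize()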
I'm not familiar with how the setup on mobile works, but try to ensure that the GPU you're doing compute on is not also the one driving the primary display. This should work fine (generally the worst that can happen is running out of memory, not a crash), but it's still noise you're better off without. If some fancy desktop compositor is tickling the card in a way that's not appropriately combined with compute, this could cause the driver to choke. That would still generally be a driver bug as that should never trigger a reset, but such things do happen.
Lastly, although this doesn't really sound like a hardware problem given the random nature of the crash, you can still try installing corectrl (or building it from this SDK if your distro doesn't offer it) and see if limiting the GPU by power or clock speed improves things, since compute tends to hammer things much harder than 3D apps. Note that not all GPU models actually support this kind of tuning (the iGPU in desktop processors does not, for example) and you may need a kernel parameter to unlock the support in the first place. (Note that the wiki talks about overclocking, but it's really about unlocking voltage/speed parameters that also allow underclocking, which can help give you better performance by not driving the thing up to the thermal limit constantly. I do not recommend manually writing things to /sys as described in the wiki; just use a tool.)
Thanks so much for the reply. I will definitely dive in with the tools you mentioned tomorrow.
As for primary display, it's being run in server mode. I'm ssh-ing in to run console commands
And while trying to put together minimal code that demonstrates the problem, I was able to crank the compute to the max. I even used the same model and FFT functions that are at the heart of my application and couldn't get it to crash. But the rest of the code is much more complex, so while I'm coming at it from both directions, it's still taking me a long time to unravel and isolate pieces of code to test.
It sounds like one or the other piece might be introducing some kind of memory corruption or resource exhaustion that then catches up with the other operations. Unfortunately such things are notoriously hard to debug, since the offending operation isn't necessarily the one that crashes. However, if you have cases where it crashes right up front, these should at least minimize the amount of logging/tracing you have to trawl through, it's just a matter of retrying.
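One way to make that kind of creeping exhaustion visible (a sketch, not something from this thread) is to log allocator state between pipeline stages, so the last healthy reading is still on screen even when a later, unrelated kernel is the one that dies:

# Sketch: checkpoint GPU allocator state between suspect operations.
import torch

def gpu_checkpoint(tag):
    torch.cuda.synchronize()  # surface errors from earlier asynchronous work first
    alloc = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f"[{tag}] allocated={alloc:.1f} MiB reserved={reserved:.1f} MiB", flush=True)

# usage between stages of the pipeline:
# gpu_checkpoint("after stft")
# result = model(x)
# gpu_checkpoint("after model")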
One thing to try is to build the very latest kernel from git (6.11-rc4), as there are quite a few fixes for APUs in the latest kernel. I have also sometimes seen "gpu hang" issues on gfx1103 that I do not see on gfx1035 when running something more extensive.
If you have some code you could share that will very likely trigger the problem, that would help the testing. I have a feeling that if the problem persists even with the latest kernel, it is either on the kernel side or in the userspace code that communicates with the kernel to submit work and receive responses. I may have some old notes somewhere for tracing similar problems, from a long time ago when I traced this type of problem on a 2400G/Vega APU.
Installed the latest kernel. No luck
Turned on logging. This is the error level log. The GPU Hang doesn't appear in the error log when it happens. I'm still parsing through the "everything" log. But maybe something jumps out at you.
:1:hip_fatbin.cpp :259 : 0469299906 us: [pid:1892 tid:0x7c50f5445b80] Cannot find CO in the bundle /opt/rocm_sdk_612/lib/libhipblaslt.so.0.7.60102 for ISA: amdgcn-amd-amdhsa--gfx1103
:1:hip_fatbin.cpp :112 : 0469299920 us: [pid:1892 tid:0x7c50f5445b80] Missing CO for these ISAs -
:1:hip_fatbin.cpp :115 : 0469299923 us: [pid:1892 tid:0x7c50f5445b80] amdgcn-amd-amdhsa--gfx1103
:1:hip_fatbin.cpp :259 : 0470625704 us: [pid:1892 tid:0x7c50f5445b80] Cannot find CO in the bundle /opt/rocm_sdk_612/lib/libhipblaslt.so.0.7.60102 for ISA: amdgcn-amd-amdhsa--gfx1103
:1:hip_fatbin.cpp :112 : 0470625717 us: [pid:1892 tid:0x7c50f5445b80] Missing CO for these ISAs -
:1:hip_fatbin.cpp :115 : 0470625720 us: [pid:1892 tid:0x7c50f5445b80] amdgcn-amd-amdhsa--gfx1103
:1:hip_code_object.cpp :624 : 0474035591 us: [pid:1892 tid:0x7c50f5445b80] Cannot find the function: Cijk_Ailk_Bljk_SB_MT32x32x8_SN_1LDSB0_AMAS0_BL1_BS1_EPS0_GLVWA1_GLVWB1_GRVW1_GSU1_GSUASB_ISA1103_IU1_K1_KLA_LBSPPA0_LBSPPB0_LPA0_LPB0_LRVW1_MIAV0_MMFGLC_NLCA1_NLCB1_PGR0_PLR1_SIA1_SS0_SU32_SUS256_SVW4_TT2_2_TLDS0_UMLDSA0_UMLDSB0_USFGROn1_VAW1_VSn1_VW1_VWB1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8
:1:hip_module.cpp :84 : 0474035608 us: [pid:1892 tid:0x7c50f5445b80] Cannot find the function: Cijk_Ailk_Bljk_SB_MT32x32x8_SN_1LDSB0_AMAS0_BL1_BS1_EPS0_GLVWA1_GLVWB1_GRVW1_GSU1_GSUASB_ISA1103_IU1_K1_KLA_LBSPPA0_LBSPPB0_LPA0_LPB0_LRVW1_MIAV0_MMFGLC_NLCA1_NLCB1_PGR0_PLR1_SIA1_SS0_SU32_SUS256_SVW4_TT2_2_TLDS0_UMLDSA0_UMLDSB0_USFGROn1_VAW1_VSn1_VW1_VWB1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 for module: 0x71101ab0
I'm still having trouble isolating the problem even just to collect a log from a single command that hangs (otherwise it's megabytes of text). But I'm still working on it. I'll send some code when I finally get it down to a reasonable enough length to be readable.
Well, it's good to know that the fix is not in the new kernel. Just to verify one other thing: once the GPU reset happens, is the system still able to reset the GPU without you needing to do a full reboot?
I've had kernels and other Ubuntu versions fully lock up. On the versions I'm using now, the GPU is able to recover, though of course the full Python process is killed.
Ok, here are two Level 4 logs: one in which the crash occurs almost immediately, and another which gets past the crash point (with everything past that point trimmed). I'm looking through them now. noncrash.txt crash.txt
Let me know if you think it would be useful for me to go through and match the two logs line by line
I ended up going through matching the log anyway. Here's the Google Sheet with the comparison: https://docs.google.com/spreadsheets/d/1ZbOBMm2xRa-i0djBYTJwff2Eoee0qow9lBKyyTRBXTM/edit?usp=sharing
The two match up pretty substantially. Most discrepancies are Host wait on completion_signal= and Host active wait for Signal = lines, and then there are a couple of sections towards the end where the order of calls is shuffled a bit.
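For comparisons like this, a small script can strip the fields that always differ (timestamps, pid/tid, pointer and signal addresses) before diffing, so only structural differences remain. A sketch, assuming the two attachments above are saved as noncrash.txt and crash.txt:

# Sketch: normalize two AMD_LOG_LEVEL=4 logs and diff them.
import re
import difflib

def normalize(path):
    lines = []
    for line in open(path, errors="replace"):
        line = re.sub(r"\b\d{6,}\s*us\b", "<ts>", line)                    # timestamps
        line = re.sub(r"pid:\d+ tid:0x[0-9a-fA-F]+", "pid:<p> tid:<t>", line)
        line = re.sub(r"0x[0-9a-fA-F]+", "<addr>", line)                   # pointers/signals
        lines.append(line)
    return lines

diff = difflib.unified_diff(normalize("noncrash.txt"), normalize("crash.txt"),
                            "noncrash", "crash", n=0)
print("".join(diff))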
EDIT: I eliminated more code
Ok, sorry, I got a bit sidetracked on this. Here is the minimum code to cause the crash (files):
import torch
import numpy as np
from onnx import load
from onnx2pytorch import ConvertModel
import os

os.environ["AMD_LOG_LEVEL"] = "4"

if __name__ == "__main__":
    model_path = "model.onnx"
    device = 'cuda'

    model_run = ConvertModel(load(model_path))
    model_run.to(device).eval()

    # It does not seem to want to crash if this line is commented out
    random = np.random.rand(1, 4, 3072, 256)

    while True:
        print("Loop Start")
        tensor = torch.randn(1, 4, 3072, 256, dtype=torch.float32, device=device)
        print("The crash happens here:")
        result = model_run(tensor)
Hopefully this makes it easy to diagnose the issue
First off, no matter how long I run it, if that numpy.random line isn't in there, the script doesn't crash. What could that possibly mean?
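One way to narrow that down (purely an assumption on my part, not something tested in this thread) is to check whether it's the ~24 MiB host allocation itself, rather than the random number generation, that makes the crash reproducible; swapping the line for a same-sized plain allocation would separate the two:

# Sketch: substitute the np.random.rand line with a same-sized allocation.
import numpy as np

random = np.empty((1, 4, 3072, 256))        # same-sized host buffer, no RNG work
# random = np.random.rand(1, 4, 3072, 256)  # original line from the repro
# (removing the line entirely is reported above not to crash)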
Also it looks like there are two separate crashes. One comes on malloc:
Success
hipMalloc ( 0x7ffc8851c648, 18874368 )
hipMalloc ( 0x7ffe3fa35778, 75497472 )
hipMalloc ( 0x7ffe3fa36cb8, 150994944 )
Crash
hipMalloc ( 0x7fff9307f708, 12582912 )
hipMalloc ( 0x7ffc06bbd5f8, 12582912 )
hipMalloc ( 0x7ffddf593878, 75497472 )
The other one seems to come on some sort of synchronization/lock/barrier:
Success
:4:rocvirtual.cpp :1071: 9886723816 us: [pid:4059 tid:0x73adad3b1b80] HWq=0x73ac70f00000, BarrierAND Header = 0x1503 (type=3, barrier=1, acquire=2, release=2), dep_signal=[0x0, 0x0, 0x0, 0x0, 0x0], completion_signal=0x73ac73bff900
:3:rocvirtual.hpp :66 : 9886723837 us: [pid:4059 tid:0x73adad3b1b80] Host active wait for Signal = (0x73ac73bffa80) for -1 ns
:4:rocvirtual.cpp :898 : 9887923934 us: [pid:4059 tid:0x73adad3b1b80] HWq=0x73ac70f00000, Dispatch Header = 0xb02 (type=2, barrier=1, acquire=1, release=1), setup=1, grid=[3072, 1, 1], workgroup=[512, 1, 1], private_seg_size=0, group_seg_size=0, kernel_obj=0x73ad39ca5980, kernarg_address=0x73ac70500000, completion_signal=0x0
Crash:
:4:command.cpp :346 : 1946568251 us: [pid:2421 tid:0x7b54a1492b80] Command (CopyDeviceToHost) enqueued: 0x57c567691200
:4:rocmemory.cpp :988 : 1946568973 us: [pid:2421 tid:0x7b54a1492b80] Locking to pool 0x57c5673e0860, size 0xc01000, HostPtr = 0x57c5702b8000, DevPtr = 0x57c5702b8000
:4:rocvirtual.cpp :1071: 1946568987 us: [pid:2421 tid:0x7b54a1492b80] HWq=0x7b5364f00000, BarrierAND Header = 0x1503 (type=3, barrier=1, acquire=2, release=2), dep_signal=[0x0, 0x0, 0x0, 0x0, 0x0], completion_signal=0x7b5367bfea00
:3:rocvirtual.hpp :66 : 1946568998 us: [pid:2421 tid:0x7b54a1492b80] Host active wait for Signal = (0x7b5367bfea00) for 10000 ns
:4:rocblit.cpp :750 : 1946569026 us: [pid:2421 tid:0x7b54a1492b80] HSA Async Copy on copy_engine=0x1, dst=0x57c5702b8080, src=0x7b5259e00000, size=12582912, forceSDMA=0, wait_event=0x7b5367bfea00, completion_signal=0x7b5367bfe980
:4:rocvirtual.cpp :570 : 1946569040 us: [pid:2421 tid:0x7b54a1492b80] Host wait on completion_signal=0x7b5367bfe980
:3:rocvirtual.hpp :66 : 1946569055 us: [pid:2421 tid:0x7b54a1492b80] Host active wait for Signal = (0x7b5367bfe980) for -1 ns
Just worth mentioning: it seems there are major AMDGPU changes happening in recent Linux kernel updates, so it's probably best to wait before doing any more diagnosis of such issues: https://www.phoronix.com/news/AMDGPU-Linux-6.12-More-PQ-Reset https://www.phoronix.com/news/Linux-6.11-rc7-AMDGPU-Fix
I'm still having this random GPU Hang on my 7840U (gfx1103) and not on my 6800U (forced to gfx1030):
HW Exception by GPU node-1 (Agent handle: 0x5ab48bbcc960) reason :GPU Hang
I've been racking my brain to figure out what's causing it: deleting sections of my code, trying to build a minimum crashing sample to provide. But sometimes it takes running many iterations of the processing I'm doing, and sometimes it crashes right up front. There's a lot of code to go through, so I'm still trying to narrow things down. But my guess is that the crash occurs as a result of the state of the GPU rather than the actual instruction, which makes things much trickier.
Maybe there's something much more obvious to you or an easier way to track down the issue
Some commands it has crashed on:
torch.stft(x, n_fft=self.n_fft, hop_length=self.hop_length, window=window, center=True,return_complex=False).to(device)
torch.zeros([*batch_dims, c, n - f, t]).to(device)
torch.istft(x, n_fft=self.n_fft, hop_length=self.hop_length, window=window, center=True)
torch.cuda.synchronize()
Here's the kernel log with a few of these crashes
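For isolating the failing operations, a minimal stress loop exercising the same ops listed above might help; this is only a sketch, and the shapes, n_fft and hop_length values here are placeholders rather than the values from the real application:

# Sketch: hammer stft/istft/zeros in a loop to see if they hang in isolation on gfx1103.
import torch

device = "cuda"
n_fft, hop_length = 4096, 1024
window = torch.hann_window(n_fft, device=device)

for i in range(1000):
    x = torch.randn(4, 10 * 44100, device=device)          # assumed batch of audio
    spec = torch.stft(x, n_fft=n_fft, hop_length=hop_length, window=window,
                      center=True, return_complex=True)
    y = torch.istft(spec, n_fft=n_fft, hop_length=hop_length, window=window,
                    center=True)
    z = torch.zeros(4, 8, 128, 128, device=device)          # assumed stand-in shape
    torch.cuda.synchronize()
    print(f"iteration {i} ok", flush=True)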