Open happysmash27 opened 2 years ago
The error is a generic GPU timeout, meaning that something got corrupted and the GPU didn't return results as expected.
The typical response to this is a GPU reset; however, it seems that in your case the reset has also failed.
Since you run an atypical PCIe configuration, I think maybe GPU reset isn't working at all for you? Try the following command and see if it gives the same error spam (WARNING: a GPU reset will kill all of your graphical workloads, so save everything before doing this):
sudo cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover
For the timeout, I honestly have no idea. It's possible that Mesa is depending on PCIe atomics in some way, which does not work in your configuration as you mentioned before. There's a small chance that this is an acominer bug, although I don't think so given it happens with other miners too and no other users have reported such hangs.
Thank you so much for the quick and knowledgeable response!
If I reset right now when nothing is happening, resetting works acceptably, with a few glitches at first but overall success. My dmesg contains the following:
[ 5258.602302] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
[ 5258.874321] amdgpu 0000:03:00.0: amdgpu: BACO reset
[ 5259.060660] amdgpu 0000:03:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 5259.061938] [drm] PCIE GART of 256M enabled (table at 0x000000F400500000).
[ 5259.061955] [drm] VRAM is lost due to GPU reset!
[ 5259.125771] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5259.332365] [drm] UVD and UVD ENC initialized successfully.
[ 5259.433361] [drm] VCE initialized successfully.
[ 5259.439057] amdgpu 0000:03:00.0: amdgpu: recover vram bo from shadow start
[ 5259.439075] amdgpu 0000:03:00.0: amdgpu: recover vram bo from shadow done
[ 5259.439093] amdgpu 0000:03:00.0: amdgpu: GPU reset(1) succeeded!
[ 5259.439159] [drm] Skip scheduling IBs!
[ 5260.124776] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5260.147738] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5260.299254] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5261.125595] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5261.800112] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5262.125534] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5262.125609] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5263.125618] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5263.300595] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5264.801740] amdgpu_cs_ioctl: 1 callbacks suppressed
[ 5264.801744] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5265.125582] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5266.124915] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5266.303067] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5267.126019] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5267.803650] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5268.125449] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5268.125524] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5269.009588] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5269.143642] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5270.125148] amdgpu_cs_ioctl: 2 callbacks suppressed
[ 5270.125153] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5270.805663] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5271.125429] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5272.125532] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5272.306532] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5273.125071] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5273.807558] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5274.125146] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5275.125142] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5275.308223] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5276.125679] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5276.809169] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5277.125857] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5278.126085] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5278.310105] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5279.125354] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5279.125448] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5279.811053] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5280.125636] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5281.125751] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5281.312006] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5282.124939] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5282.812832] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5283.125471] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5284.124768] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5284.313666] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5285.124461] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5285.148289] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5285.814552] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5287.125286] amdgpu_cs_ioctl: 1 callbacks suppressed
[ 5287.125291] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5287.315461] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5288.125452] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5288.816122] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5289.125110] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5290.125597] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5290.316520] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5290.316635] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5291.125326] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5291.817888] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5293.125287] amdgpu_cs_ioctl: 1 callbacks suppressed
[ 5293.125291] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5293.318223] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5294.125071] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5294.818622] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5295.124570] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5296.125003] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5296.318969] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5297.124711] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5297.819455] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5298.124403] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5299.124951] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5299.319941] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5300.124731] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5300.820387] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5301.124320] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5302.125889] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5302.320854] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5302.320975] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5303.124802] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5303.821449] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5305.125874] amdgpu_cs_ioctl: 1 callbacks suppressed
[ 5305.125878] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5305.321793] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5306.124635] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5306.822309] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5307.124206] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5307.148012] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5308.125322] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5308.323363] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5309.125863] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5309.824585] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5311.124991] amdgpu_cs_ioctl: 1 callbacks suppressed
[ 5311.124995] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5311.324901] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5312.124817] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5312.825613] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5313.124527] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5314.125054] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5314.125105] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5314.325921] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5315.125015] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5315.826768] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5317.125248] amdgpu_cs_ioctl: 1 callbacks suppressed
[ 5317.125252] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5317.327114] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5318.125262] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5318.828042] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5319.124928] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5320.125559] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5320.328389] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5321.125513] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5321.829222] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5322.125106] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5323.125787] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5323.329650] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5324.124727] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5324.830579] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5325.124706] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5326.124244] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5326.124285] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5326.149110] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5326.330613] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5327.124851] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
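A log like the one above is easier to judge once the repeated lines are counted. The following is a minimal sketch, not part of the original report, that tallies identical messages, adds up the "callbacks suppressed" counts, and checks for the "GPU reset(1) succeeded!" line. The function name `summarize` and the `SAMPLE` excerpt are illustrative assumptions; the sample lines are copied from the log above.

```python
import re
from collections import Counter

# Matches dmesg lines of the form "[ 5259.439093] message text".
LINE_RE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<msg>.*)$")
# Matches the kernel's rate-limit notice, e.g. "amdgpu_cs_ioctl: 1 callbacks suppressed".
SUPPRESSED_RE = re.compile(r"(\d+) callbacks suppressed")

# Hypothetical excerpt for demonstration, taken from the log above.
SAMPLE = """\
[ 5259.439093] amdgpu 0000:03:00.0: amdgpu: GPU reset(1) succeeded!
[ 5260.124776] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
[ 5264.801740] amdgpu_cs_ioctl: 1 callbacks suppressed
[ 5264.801744] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser -125!
"""


def summarize(log_text):
    """Count repeated dmesg messages and report the time span covered."""
    counts = Counter()
    suppressed = 0
    first_ts = last_ts = None
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a timestamped dmesg line
        ts = float(match.group("ts"))
        msg = match.group("msg")
        if first_ts is None:
            first_ts = ts
        last_ts = ts
        notice = SUPPRESSED_RE.search(msg)
        if notice:
            # Rate-limited messages that the kernel did not print at all.
            suppressed += int(notice.group(1))
            continue
        counts[msg] += 1
    return {
        "counts": counts,
        "suppressed": suppressed,
        "reset_succeeded": any("GPU reset" in m and "succeeded!" in m for m in counts),
        "span_seconds": 0.0 if first_ts is None else last_ts - first_ts,
    }


if __name__ == "__main__":
    summary = summarize(SAMPLE)
    for message, count in summary["counts"].most_common():
        print(f"{count:4d}  {message}")
    print("suppressed:", summary["suppressed"])
    print("reset succeeded:", summary["reset_succeeded"])
```

To run it on a real log, pass the saved text of `dmesg` to `summarize` instead of `SAMPLE`.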
There have also been one or two times when SteamVR crashed things and the manual reset worked successfully as well. I am aware of this command and usually try to invoke it when things go wrong; it fails 80+% of the time, but it does work occasionally. I have also had automatic resets work at least once.
Would you happen to know what amdgpu_gpu_recover printing "-11" might mean? When I tried to reset the GPU during the first acominer-related crash, it always printed that when I tried to recover but I can find no results about what this actually means online.
Hmm, thank you, and I see that GPU reset is working fine (Failed to initialize parser -125 just means that the applications need to be restarted because their contexts are lost).
The "-11" probably means that a GPU reset is already in progress. I don't know if -11 is actually an errno, but if it is, it would mean "Resource temporarily unavailable" (EAGAIN). Since the GPU reset fails, it probably never finishes, leaving the system in a hung state.
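For reference, negative numbers like these can be looked up with Python's standard `errno` and `os` modules. This is a minimal sketch that assumes the values are negated Linux errno codes, which the thread itself is not sure about for the "-11" printed by amdgpu_gpu_recover; the helper name `decode_kernel_code` is hypothetical. On Linux, 11 is EAGAIN and 125 is ECANCELED, but the numbering differs on other operating systems.

```python
import errno
import os


def decode_kernel_code(code):
    """Return (symbolic name, description) for a negated errno value.

    Assumes the number follows the kernel convention of returning -errno;
    the result depends on the errno table of the platform running this.
    """
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


if __name__ == "__main__":
    # -11 is what amdgpu_gpu_recover printed; -125 appears in the dmesg log.
    for code in (-11, -125):
        name, text = decode_kernel_code(code)
        print(f"{code}: {name} ({text})")
```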
I honestly don't have an idea how this can be solved, but here's an attempt anyway. https://github.com/ishitatsuyuki/acominer/actions/runs/1733336306
@happysmash27 Did you have a chance to try out the experimental build (ishitatsuyuki/acominer/actions/runs/1733336306)? No need to hurry, just let me know if it didn't work.
This has happened twice now in the past two weeks, and both times the GPU fails to recover with these four messages:
Over and over again, about two seconds apart.
It is impossible to recover from this without a hard reset. GPU reset fails, shutdown does not commence, magic sysrq does not reboot, and even the reset button on my case does not work. I have to actually hold the power button to reset it manually. This is very problematic, as it causes downtime for my mining pool and several servers.
Apparently dmesg only keeps a limited number of lines, so the start of the error from the first time, on January 15th, is lost. The second time, however, I did manage to catch it:
If you know of anywhere better to send this issue, I would really appreciate that as well. I can get similar issues with some other Ethereum miners and with SteamVR, but am not sure if I can put this in a bug report to Mesa, since it uses a custom fork rather than the official one.
This seems to happen more often when I am doing something else in 3D. The first time was when I launched the miner before Cities: Skylines had completely closed, which is relatively understandable since CS uses a crazy amount of VRAM. The second time, however, I was only zooming in in Blender, with Google Earth running far in the background. I thought I could avoid the crashes by not running anything GPU-intensive, but it appears that this is not the case.