Error on load (zeroed residue) with ROCm 3.1 - introduced by commit 7c09e38 (21 Jun 2020)

a-repko commented 4 years ago

I'm getting the following error with ROCm 3.1 and Linux kernels 4.19 and 5.4, on both a Raven Ridge APU (tried FFT 8M) and a Radeon VII (FFT 18M with a different exponent):

    149250001 EE 10000 loaded: blockSize 400, 0000000000000000 (expected 93a8c0647456f82d)

The error also occurs on a fresh run. I traced the bug to commit 7c09e38 (21 Jun 2020).

I'm currently using ROCm 3.1 because the kernels of mfakto 0.15pre7 are compiled incorrectly with ROCm 3.3 (I don't have a working ROCm 3.3 installation at the moment, so I can't check upstream mfakto, nor whether the above error is indeed specific to ROCm 3.1).

selroc commented 4 years ago

I am also using mfakto with ROCm 3.1; it fails with 3.3. I have kernel 5.4, but I run gpuowl with ROCm 3.3 and I don't get this error.

preda commented 4 years ago

@a-repko can you please try running with -safeMath and let us know if the error is still present?

How easy is it to reproduce?

do you see it for each exponent?
do you see it for some exponent, every time?
or for some exponent, some time?

a-repko commented 4 years ago

I have now tried 1xLL + 3xPRP on the Raven Ridge APU (Gentoo Linux with kernel 4.19, ROCm 3.1), this time as fresh jobs with no save files:

snapshot at commit 7c09e38 (21 Jun 2020):

    56606819 FFT: 3M 1K:6:256 (17.99 bpw)
    56606819 LL 0 loaded: 0000000000000004
    56606819 LL 2000 0.00%; 15442 us/it; ETA 10d 02:48; 0000000000000000
    91789433 FFT: 5M 1K:10:256 (17.51 bpw)
    91789433 EE 0 loaded: blockSize 400, 0000000000000000 (expected 0000000000000003)
    149250083 FFT: 8M 1K:8:512 (17.79 bpw)
    149250083 EE 0 loaded: blockSize 400, 0000000000000000 (expected 0000000000000003)
    332500879 FFT: 18M 1K:9:1K (17.62 bpw)
    332500879 EE 0 loaded: blockSize 400, 0000000000000000 (expected 0000000000000003)

All the same with -safeMath, as well as with the newest version of gpuowl.

snapshot at previous commit 635c455 (21 Jun 2020):

    56606819 FFT: 3M 1K:6:256 (17.99 bpw)
    56606819 LL 0 loaded: 0000000000000004
    56606819 LL 3000 0.01%; 18975 us/it; ETA 12d 10:21; 37ec703fc5095c63
    91789433 FFT: 5M 1K:10:256 (17.51 bpw)
    91789433 OK 0 loaded: blockSize 400, 0000000000000003
    149250083 FFT: 8M 1K:8:512 (17.79 bpw)
    149250083 OK 0 loaded: blockSize 400, 0000000000000003
    332500879 FFT: 18M 1K:9:1K (17.62 bpw)
    332500879 OK 0 loaded: blockSize 400, 0000000000000003

And Gerbicz checks are fulfilled in the following iterations.
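
For anyone who wants to scan a long log for this symptom, here is a minimal Python sketch. It assumes the PRP "loaded" lines look exactly like the ones quoted above; the regular expression and the helper name `check_load_line` are my own illustration and are not taken from gpuowl's source.

```python
import re

# Line shape inferred from the log excerpts in this thread (an assumption):
#   <exponent> <OK|EE> <iteration> loaded: blockSize <n>, <res64> [(expected <res64>)]
LOAD_RE = re.compile(
    r"^(?P<exponent>\d+)\s+(?P<status>OK|EE)\s+(?P<iteration>\d+)\s+loaded: "
    r"blockSize (?P<block>\d+), (?P<res64>[0-9a-f]{16})"
    r"(?: \(expected (?P<expected>[0-9a-f]{16})\))?$"
)


def check_load_line(line):
    """Parse one PRP 'loaded' line; return a dict, or None if the line has another shape."""
    m = LOAD_RE.match(line.strip())
    if m is None:
        return None
    info = m.groupdict()
    # The symptom reported here: the residue read back is all zeros.
    info["zeroed"] = info["res64"] == "0" * 16
    info["ok"] = info["status"] == "OK"
    return info


if __name__ == "__main__":
    samples = [
        "91789433 EE 0 loaded: blockSize 400, 0000000000000000 (expected 0000000000000003)",
        "91789433 OK 0 loaded: blockSize 400, 0000000000000003",
        "91789433 FFT: 5M 1K:10:256 (17.51 bpw)",
    ]
    for s in samples:
        print(check_load_line(s))
```

Lines of any other shape (for example the FFT lines) return None, so the helper can be applied to every line of a log without pre-filtering.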

OK, I will rebuild ROCm at version 3.3 to make sure that the problem is indeed specific to 3.1. As for mfakto, I will just keep the properly compiled kernels (from ROCm 3.1), which can then be run even on ROCm 3.3. I will report new results tomorrow.

a-repko commented 4 years ago

The error indeed goes away with ROCm 3.3 (both with and without -safeMath). If you decide to revert that commit, I can test it again with ROCm 3.1 on another machine (with Radeon VII).

Moreover, there is also a speed regression of about 1% (at least for FFT 18M on Radeon VII, ROCm 3.1, kernel 5.4), introduced somewhere between 15 May and 21 Jun. I will investigate this later. If it is due to increased precision, I can live with that.

valeriob01 commented 4 years ago

Indeed, I am noticing a slowdown, probably because the FFT size went from 5.5M for 104M/105M exponents to 6M for 106M exponents. FFT 18M seems the same as before; a 1% slowdown is hard to notice when the timing is not stable.
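
The "bpw" (bits per word) figures in the logs above are consistent with the exponent divided by the FFT length, taking "M" as 2**20 words. The sketch below reproduces them in Python; the function name is hypothetical, and the point at which gpuowl actually switches to a larger FFT is not stated in this thread, so this only shows the arithmetic.

```python
M = 1 << 20  # "M" in an FFT size is taken to mean 2**20 words (matches the logged bpw values)


def bits_per_word(exponent, fft_size_m):
    """Average number of exponent bits stored per FFT word."""
    return exponent / (fft_size_m * M)


if __name__ == "__main__":
    # (exponent, FFT size in M) pairs taken from the logs earlier in the thread.
    for exponent, fft in [(56606819, 3), (91789433, 5), (149250083, 8), (332500879, 18)]:
        print(exponent, fft, round(bits_per_word(exponent, fft), 2))
    # Round-number stand-ins for the exponent ranges mentioned in this comment.
    print(round(bits_per_word(105_000_000, 5.5), 2))
    print(round(bits_per_word(106_000_000, 6), 2))
```

With these round-number stand-ins, a 105M exponent at FFT 5.5M comes to about 18.2 bits per word, while a 106M exponent at FFT 6M comes to about 16.8. The larger FFT does more work per iteration, which fits the slowdown described.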

a-repko commented 4 years ago

Brief testing showed that the slowdown appeared first between May 15 and May 22, with some fluctuation (up to 1%) among later commits, so I chose Jun 2 for now, which has slightly better speed (I need Jacobi, but not yet the proof). This fluctuation is probably related to the different optimization machinery of ROCm 3.1, so you can safely ignore it. I will later briefly recheck the speed if you decide to make the upstream gpuowl version ROCm 3.1-friendly.

In the meantime, I found out that mfakto fails the self-test on Radeon VII in some runs of the cl_barrett15_70_gs_2 kernel (and also in the cl_barrett15_{69,71,73,74}_gs_2 kernels when -st2 is run). Forcing GCN5 (or using only appropriate exponents / bit-depths) solves the problem (at least for -st). I haven't opened an issue in Bdot42/mfakto yet, since I don't have time for detailed testing, and I expect that occasional mfakto users on Radeon VII perhaps come here more often anyway.

preda commented 4 years ago

I'm going to close this, please re-open if it's still an active issue.

a-repko commented 4 years ago

OK, I tried the upstream version of gpuowl again, and the error is still there. Nevertheless, I will close this issue, since ROCm still seems to be considered a work in progress and old versions are rapidly abandoned, although some of their functionality may get lost. Let me briefly summarize the state of the ROCm versions:

ROCm 3.1 - mfakto works; gpuowl - only old versions work (635c455 and before)
ROCm 3.3 - mfakto doesn't work (but you can keep the old mfakto_Kernels.elf file, which will work); gpuowl works
ROCm 3.5 - mfakto works; gpuowl gave some errors (wrong residue). I didn't investigate this further and didn't check the upstream version. If somebody has a problem there, they can open a new issue.
ROCm 3.7 and 3.8 - don't work with APU Raven Ridge (AMD Ryzen 2000G/H/U), see RadeonOpenCompute/ROCm#1219, but otherwise it seems that mfakto works on dGPU; gpuowl was not checked (namely, I recently tried it with an R9 Nano (gfx803), which provided only OpenCL 1.2)
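
The same summary as a small lookup table, in case somebody wants to script a check. This is only a sketch: the dictionary layout, the wording of the entries, and the `report` helper are illustrative, and the data is nothing more than what was observed in this thread.

```python
# Observations from this thread only; not an official compatibility list.
APU_NOTE = "ROCm itself fails on Raven Ridge APU (see RadeonOpenCompute/ROCm#1219)"

COMPAT = {
    "3.1": {"mfakto": "works",
            "gpuowl": "only old versions work (635c455 and before)"},
    "3.3": {"mfakto": "does not work, but a saved mfakto_Kernels.elf from 3.1 works",
            "gpuowl": "works"},
    "3.5": {"mfakto": "works",
            "gpuowl": "wrong residue seen; not investigated further"},
    "3.7": {"mfakto": "seems to work on dGPU; " + APU_NOTE,
            "gpuowl": "not checked"},
    "3.8": {"mfakto": "seems to work on dGPU; " + APU_NOTE,
            "gpuowl": "not checked"},
}


def report(rocm_version, tool):
    """Return the observation recorded for a ROCm version and tool, if any."""
    try:
        return COMPAT[rocm_version][tool]
    except KeyError:
        return "no report in this thread"


if __name__ == "__main__":
    for version in sorted(COMPAT):
        for tool in ("mfakto", "gpuowl"):
            print(f"ROCm {version} / {tool}: {report(version, tool)}")
```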

So it seems that this bug report is useful only for somebody who is trying to use integrated graphics with ROCm to run mfakto and gpuowl. They should first install ROCm 3.1, compile and save mfakto kernels for various ini parameters (for example, GCN3 appears to use much less power than GCN5, while being slightly slower), and then switch to ROCm 3.3.

Another possibility is to use AMDGPU-PRO; see the install script at https://gist.github.com/kytulendu/3351b5d0b4f947e19df36b1ea3c95cbe, which I tested on Raven Ridge just now (currently AMDGPU-PRO 20.40). It works well for both mfakto and gpuowl (speed may vary by about ±5% compared to ROCm). You will probably need to uninstall the rocm-opencl-runtime package first.

preda / gpuowl

Error on load (zeroed residue) with ROCm 3.1 - introduced by commit 7c09e38 (21 Jun 2020) #178