Radeon VII, Severe error: probably ROCm related

valeriob01 commented 5 years ago

2019-09-07 12:06:44 90348611    33410000 36.98%;  886 us/sq; ETA 0d 14:01; bac38bb8e27196e5
2019-09-07 12:06:53 90348611    33420000 36.99%;  886 us/sq; ETA 0d 14:01; 5dc04e6cd38ab191
2019-09-07 12:07:02 90348611    33430000 37.00%;  887 us/sq; ETA 0d 14:01; b91d6d315cae4932
Queue at 0x7f23e803a000 inactivated due to async error:
        HSA_STATUS_ERROR_ILLEGAL_INSTRUCTION:  The agent attempted to execute an illegal shader instruction.

This needs reboot.

I don't know whether gpuowl registers an error in this case; I don't think so. This is a severe error that blocks the program. The nErrors indication can only capture certain events, ones I would call "less severe than this one".
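For anyone who wants to watch a log for this condition from the outside, here is a minimal sketch that classifies lines like the ones quoted above. The regular expression is inferred only from the three progress lines in this report, not from gpuowl's source, and the names `parse_progress` and `classify` are hypothetical helpers, not part of gpuowl.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the progress lines quoted in this issue (an assumption,
# not gpuowl's documented log format).
PROGRESS_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<exponent>\d+)\s+(?P<iteration>\d+)\s+(?P<percent>\d+\.\d+)%;\s+"
    r"(?P<us_per_sq>\d+) us/sq;\s+ETA (?P<eta>.+?);\s+"
    r"(?P<residue>[0-9a-f]{16})\s*$"
)


class Progress(NamedTuple):
    exponent: int
    iteration: int
    residue: str


def parse_progress(line: str) -> Optional[Progress]:
    """Return the parsed progress fields, or None if the line is not a progress line."""
    m = PROGRESS_RE.match(line)
    if m is None:
        return None
    return Progress(int(m["exponent"]), int(m["iteration"]), m["residue"])


def classify(line: str) -> str:
    """Label a log line as 'fatal', 'all-zero', 'ok', or 'other'."""
    if "HSA_STATUS_ERROR" in line:
        return "fatal"       # the runtime error reported in this issue
    progress = parse_progress(line)
    if progress is None:
        return "other"
    if set(progress.residue) == {"0"}:
        return "all-zero"    # residue printed as sixteen zeros
    return "ok"
```

A wrapper script could feed each line of gpuowl's output through `classify` and raise an alert on `fatal`, since the program itself may not count that event.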

preda commented 5 years ago

Hi, I've never encountered this error myself; probably I'll have to wait until I can repro.

valeriob01 commented 5 years ago

Hi Mihai, this was a one-time error that I never reproduced myself, but I have two Radeon VII cards and both show the same computation errors, including the all-zero residue error. I have also tested them on separate, different mainboards and on both Debian and Ubuntu; the computation errors are common to all of them. It seems to me that the dealer got a batch of buggy Radeon VII cards.

preda commented 5 years ago

I don't know; I also don't see the all-zero residue. Many things could be causing it... we need more information.

valeriob01 commented 5 years ago

One thing I can say is that each all-zero occurrence corresponds to a page fault.

valeriob01 commented 5 years ago

Also, more information here: when the all-zero error occurs, it is repeated over and over until the next Gerbicz check, which fails; on reload the error may then disappear. It may reappear randomly later. I have also seen 3 consecutive errors, which makes gpuowl exit. I have observed this behaviour scrupulously: the error rate tends to increase with temperature. By cooling the GPU very well I can keep the number of occurrences to a minimum, but I still cannot eliminate the error reliably. Tested on two different mainboards, CPUs, RAM and hard disks, with two different Radeon VII cards.
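To illustrate why a zero residue keeps repeating until the check catches it, here is a small sketch of a Gerbicz-style check on a toy modulus. It is not gpuowl's implementation: the function name `first_failed_block`, the modulus 2^61 - 1, and the choice to verify after every block are all assumptions made to keep the example short. The check relies on the identity d(t) = base * d(t-1)^(2^L) mod n, where d is the running product of the residues at block boundaries and L is the block length.

```python
from typing import Optional


def first_failed_block(n: int, blocks: int, block_len: int,
                       corrupt_iter: Optional[int] = None,
                       base: int = 3) -> Optional[int]:
    """Square `base` repeatedly mod n and verify the Gerbicz identity per block.

    Returns the 1-based index of the first block whose check fails, or None
    if every check passes. `corrupt_iter` (hypothetical test hook) forces the
    residue to zero right after that iteration.
    """
    x = base % n   # x == base^(2^i) mod n after i squarings
    d = x          # running product of x taken at block boundaries
    iteration = 0
    for t in range(1, blocks + 1):
        d_prev = d
        for _ in range(block_len):
            x = x * x % n
            iteration += 1
            if iteration == corrupt_iter:
                x = 0  # simulated all-zero residue; 0 * 0 stays 0 afterwards
        d = d_prev * x % n
        # Gerbicz identity: d(t) == base * d(t-1)^(2^block_len) (mod n)
        expected = base * pow(d_prev, 1 << block_len, n) % n
        if d != expected:
            return t
    return None


M61 = 2**61 - 1  # a small Mersenne prime, used here only as a toy modulus
print(first_failed_block(M61, blocks=5, block_len=10))                   # None
print(first_failed_block(M61, blocks=5, block_len=10, corrupt_iter=23))  # 3
```

In the second call the residue becomes zero at iteration 23 and stays zero for the rest of block 3, and the check at the end of that block fails. Once the running product d is itself zero, the identity reduces to 0 == base * 0 and holds trivially, which is why this sketch verifies at every block boundary.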

valeriob01 commented 5 years ago

It just happened again on the dual Radeon VII system. The GPU in error is idle now and gpuowl has been killed, but the other GPU is still working and computing. I thought the error was more severe, but I do need to reboot to restart the GPU in error.

[Image: gpuerror]

preda commented 5 years ago

Are you using PCIe risers?

valeriob01 commented 5 years ago

Are you using PCIe risers?

No. ROCm doesn't support PCIe risers. Risers are a thing of the past for me. Maybe the source of the errors is some other component involved in the computation.

However, the Radeon VII is the only GPU model that shows these errors. My other GPUs, an RX580 and a Verga64, have never shown a single error...

selroc commented 5 years ago

I typed one r too many; that's Vega64! Well, I will investigate whether the RAM is suffering from being too near the CPU cooler fan. This is a new account I created to divide my work.

valeriob01 commented 5 years ago

I went ahead and installed Debian 10.1 with ROCm 2.8. This seems to have reduced the errors a great deal, and the all-zero residue error has not occurred so far.

valeriob01 commented 5 years ago

I will just use mprime stress test to verify the RAM.

valeriob01 commented 5 years ago

https://github.com/RadeonOpenCompute/ROCm/issues/873#issuecomment-538859418

preda commented 5 years ago

Are you overclocking the GPU RAM, or undervolting? If so, maybe that is too aggressive.

selroc commented 5 years ago

Are you overclocking the GPU RAM, or undervolting? If so, maybe that is too aggressive.

The irony is that I never touch voltage/clock settings; it is just that I have found a way to cool the GPU very well. With Debian 10.1 things are going better: the number of errors has dropped by 90%.

preda / gpuowl

Radeon VII, Severe error: probably ROCm related #63