Closed valeriob01 closed 5 years ago
2019-09-07 12:06:44 90348611 33410000 36.98%; 886 us/sq; ETA 0d 14:01; bac38bb8e27196e5
2019-09-07 12:06:53 90348611 33420000 36.99%; 886 us/sq; ETA 0d 14:01; 5dc04e6cd38ab191
2019-09-07 12:07:02 90348611 33430000 37.00%; 887 us/sq; ETA 0d 14:01; b91d6d315cae4932
Queue at 0x7f23e803a000 inactivated due to async error: HSA_STATUS_ERROR_ILLEGAL_INSTRUCTION: The agent attempted to execute an illegal shader instruction.
This needs a reboot.
I don't know whether gpuowl registers an error in this case; I don't think so. This is a severe error that blocks the program. The nErrors indication can only capture certain events, ones I would call "less severe than this one".
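For anyone who wants to watch a log like the one above from outside gpuowl, here is a minimal sketch of a parser for the progress lines. The field layout is inferred only from the three lines quoted in this issue, so other gpuowl versions may print something different; the names `PROGRESS_RE`, `Progress` and `parse_progress` are hypothetical and not part of gpuowl.

```python
import re
from typing import NamedTuple, Optional

# Layout inferred from the progress lines quoted in this issue (assumption):
# date time exponent iteration percent%; N us/sq; ETA Dd HH:MM; 64-bit hex residue
PROGRESS_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<exponent>\d+) (?P<iteration>\d+) (?P<percent>\d+\.\d+)%; "
    r"(?P<us_per_sq>\d+) us/sq; ETA (?P<eta>\d+d \d{2}:\d{2}); "
    r"(?P<residue>[0-9a-f]{16})$"
)


class Progress(NamedTuple):
    timestamp: str
    exponent: int
    iteration: int
    percent: float
    us_per_sq: int
    eta: str
    residue: str


def parse_progress(line: str) -> Optional[Progress]:
    """Return the parsed fields, or None if the line is not a progress line."""
    m = PROGRESS_RE.match(line.strip())
    if m is None:
        # Anything else (for example the HSA "Queue ... inactivated" message)
        # is left for the caller to handle.
        return None
    return Progress(
        timestamp=f"{m['date']} {m['time']}",
        exponent=int(m["exponent"]),
        iteration=int(m["iteration"]),
        percent=float(m["percent"]),
        us_per_sq=int(m["us_per_sq"]),
        eta=m["eta"],
        residue=m["residue"],
    )
```

Lines that return `None`, such as the `HSA_STATUS_ERROR_ILLEGAL_INSTRUCTION` message, can then be flagged separately by whatever script is tailing the log.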
Hi, I've never encountered this error myself; I'll probably have to wait until I can reproduce it.
Hi Mihai, this was a one-time error that I have not reproduced myself, but I have two Radeon VII cards and both show the same computation errors, including the all-zero residue error. I have also tested them on separate, different mainboards and on both Debian and Ubuntu; the computation errors are common to all of these setups. It seems to me that the dealer got a batch of buggy Radeon VIIs.
I don't know; I also don't see the all-zero error. Many things could be causing it. We need more information.
One thing I can say is that each all-zero occurrence coincides with a page fault.
Also, more information here: when the all-zero error occurs, it is repeated over and over until the next Gerbicz check, which fails; then on reload the error may disappear. It may then reappear at random. I have also seen 3 consecutive errors, which makes gpuowl exit. I have observed this behaviour carefully, and the error rate tends to increase with temperature. By cooling the GPU very well I can keep the number of occurrences to a minimum, but I still cannot eliminate the error reliably. Tested on two different mainboards, with different CPUs, RAM and hard disks, and with two different Radeon VIIs.
It just happened again on the dual Radeon VII system. The GPU in error is idle now and gpuowl has been killed, but the other GPU is still working and computing. I thought the error was more severe, but I still need to reboot to restart the GPU in error.
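To make the "repeated over and over" pattern easier to spot in a saved log, here is a small sketch that measures the longest run of consecutive all-zero residues. It assumes the all-zero residue is printed as sixteen hex zeros, which the log excerpt in this issue does not confirm, and it is an external helper, not a description of how gpuowl itself counts errors. The names `ZERO_RESIDUE` and `max_consecutive_zero_residues` are hypothetical.

```python
from typing import Iterable

# Assumption: an all-zero 64-bit residue shows up in the log as 16 hex zeros.
ZERO_RESIDUE = "0" * 16


def max_consecutive_zero_residues(residues: Iterable[str]) -> int:
    """Return the length of the longest run of all-zero residues."""
    longest = 0
    current = 0
    for residue in residues:
        if residue == ZERO_RESIDUE:
            current += 1
            longest = max(longest, current)
        else:
            # Any non-zero residue ends the current run.
            current = 0
    return longest
```

A run length of 3 or more in a log would match the situation described above where gpuowl exits, though the threshold here comes only from that observation.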
Are you using PCIe risers?
No. ROCm doesn't support PCIe risers. Risers are a thing of the past for me. Maybe the source of the errors is some other component involved in the computation.
However, Radeon VII is the only GPU model to show these errors. The other GPUs I have, RX580 and Verga64, have never shown a single error...
I typed an extra r; that's Vega64! Well, I will investigate whether the RAM is suffering from being too close to the CPU cooler fan. This is a new account I created to divide my work.
I went ahead and installed Debian 10.1 with ROCm 2.8. This seems to have reduced the errors a great deal, and the all-zero residue error has not occurred so far.
I will just use the mprime stress test to verify the RAM.
Are you overclocking the GPU RAM, or undervolting? If so, maybe that is too aggressive.
The irony is that I never touch voltage/clock settings; I have simply found a way to cool the GPU very well. With Debian 10.1 things are going better: the number of errors has dropped by 90%.