Open protox opened 6 years ago
AMD asked me, and probably other clients, during the RMA to specifically try higher SOC voltages. It still crashed. So if you don't check it yourself in advance, AMD will probably ask you to rule out memory issues.
The test is sensitive to failures in different subsystems, including memory corruption. Generally, I would propose to first get a stable system [1] and then look for the bug in your Ryzen processor. However, raising the SOC voltage also increases power consumption (probably beyond the 65W in the case of the R7 1700), which makes it a workaround and not a solution. Running the memory outside the spec (i.e., overclocking the interconnect) to get a stable system does not seem like a good solution to me.
In my opinion, a system must be stable with the default settings, without any tuning or tweaking. If it is not, there is a problem and the manufacturers should help you troubleshoot it, be it a processor bug, defective memory, or wrong BIOS settings.
Feel free to propose an addition to the readme and send a pull request.
For AMD there seems to be a major disconnect between motherboard and RAM manufacturers. All I am saying is that running at the lowest possible specification perhaps isn't a good idea for this test, because the test is so extensive.
I'm not an expert in this field, so I'll leave it for someone else to make pull requests and decide if such a suggestion is a good idea or not.
At the same time, it seems like this test is currently the primary way people determine whether they have a faulty CPU. Could a tiny bump to VDDCR_SOC be enough to make a post-week-25 system stable, avoiding both this issue and an RMA?
You are completely right, and objectively it seems to be a good suggestion. But subjectively, I have run out of patience.
I am utterly frustrated by AMD's communication on this issue and I am not willing to help them any further:
If they still(!) don't care, I don't mind unsettled users bugging them. It's up to AMD to say something, to do something, to provide a tool!
I completely agree with you, AMD needs to do FAR more and at least come out and tell us what the issue is and which CPUs are affected. People are jumping through hoops, paying for shipping in certain countries and getting long delays to get their RMA at the moment.
If anyone is looking for a solution that does not require an RMA, there seems to be one (I did not test it on many CPUs, just on my 1800X from week 22): I disabled the CPU micro-op cache (called uOpCache, Op Cache, ...). On some boards, e.g. ASUS X370, this requires a modded BIOS that enables the AMD CBS menu; other boards may allow it on the stock BIOS.
After disabling uOpCache, the tests (both this one and the Windows 'bzip2 compiler' killer) stopped crashing and ran for hours until terminated manually. The performance loss is negligible (around 3% on 7zip and compilation times; WinRAR seemingly even benefits slightly (~1%) in multi-threaded mode). Since some people have noticed slight performance drops on 'fixed' Ryzens, these probably just come with uOpCache internally disabled or limited.
If you are getting segfaults with kill-ryzen, double-check your RAM settings. On my Taichi X370 motherboard with my UA1733PGS, loading the default settings drops the RAM to 2133MHz and VDDCR_SOC to 0.880V, with DRAM at 1.2V. If kill-ryzen is run with these settings, you get a segfault within a couple of minutes.
However, loading the 2933MHz XMP profile bumps VDDCR_SOC to 1.096V and the DRAM voltage to 1.368V, and kill-ryzen is then able to run for many hours with zero issues.
The voltage is probably far too low for such an extensive test, especially the default SOC voltage.
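To compare the two configurations described above in a repeatable way, one option is to count the segfault entries in the kernel log after each run. The following is only a rough sketch: `count_segfaults` is a hypothetical helper (not part of kill-ryzen), and it assumes the usual Linux kernel log format `name[pid]: segfault at ...`.

```python
import re
from collections import Counter

# Assumed kernel log format for a user-space segfault, e.g.
#   [  100.1] bash[1234]: segfault at 0 ip ... sp ... error 4 in libc-2.26.so[...]
SEGFAULT_RE = re.compile(r"(?P<proc>[^\s\[\]]+)\[(?P<pid>\d+)\]: segfault at ")


def count_segfaults(log_text):
    """Return a Counter mapping process name -> number of segfault lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = SEGFAULT_RE.search(line)
        if match:
            counts[match.group("proc")] += 1
    return counts
```

Passing it the text printed by `dmesg` once with default settings and once with the XMP profile loaded gives a simple per-process count to compare between the two runs.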
More info can be found here: https://www.reddit.com/r/Amd/comments/7ho4uv/is_the_culprit_of_linux_segfault_on_ryzen_cpu/
Make sure to double-check your settings, or you might think you have a faulty CPU when you don't, especially if it was made after week 25.
Might be a good idea to add some of this information to the readme? I've read many forum posts recommending running this script at 2133MHz with default settings, which can be a problem.
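As a quick way to check whether a system is still sitting at the 2133MHz default before running the test, the memory speed can be read from `dmidecode --type memory`. This is a minimal sketch under assumptions: the helper names are made up for illustration, and the field name is assumed to be `Configured Memory Speed` (or `Configured Clock Speed` on older dmidecode versions). It does not report VDDCR_SOC, which still has to be checked in the BIOS.

```python
import re

# Assumed dmidecode field names; older versions print "Configured Clock Speed"
# in MHz, newer ones "Configured Memory Speed" in MT/s.
SPEED_RE = re.compile(
    r"^\s*Configured (?:Memory|Clock) Speed:\s*(\d+)\s*(?:MT/s|MHz)",
    re.MULTILINE,
)


def configured_speeds(dmidecode_text):
    """Return the configured speed of every populated memory slot."""
    return [int(speed) for speed in SPEED_RE.findall(dmidecode_text)]


def at_default_speed(dmidecode_text, default_speed=2133):
    """True if modules were found and none runs above default_speed."""
    speeds = configured_speeds(dmidecode_text)
    return bool(speeds) and all(s <= default_speed for s in speeds)
```

Empty slots report `Unknown` instead of a number and are simply skipped.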