suaefar / ryzen-test

Tools to reproduce randomly crashing processes under load on AMD Ryzen processors on Linux
GNU General Public License v3.0
224 stars, 59 forks

Is my CPU buggy? #27

Open PhantomR opened 6 years ago

PhantomR commented 6 years ago

Hey, I ran ryzen-test for more than one hour on this config:

EDIT: OS: Ubuntu 18.04 LTS
Ryzen 5 1600 @ stock speed
MSI B350 Tomahawk @ latest BIOS
2x4GB RAM @ default BIOS speeds/voltage (2133MHz)

IMPORTANT: My RAM is actually not officially supported on my motherboard according to its MSI page. The RAM model I have is CMK8GX4M2B3000C15. From time to time my PC makes 3 beeps and auto-restarts upon powering on or a manual restart. This is probably because of the incompatibility, but the system runs with no problem except for this rare boot problem.

Can you please confirm that 'build failed' means segfault? (See picture below.)

NOTES: I actually built GMP, then NTL, then HElib, all of them while ryzen-test was running. After I started the NTL build (I think), the system became terribly unresponsive and hardly usable. It seemed to come back to life after some of the ryzen-test processes failed, but then, even though all processes had seemingly crashed, it went back to being almost unresponsive. I think all these lags were because the builds ate up all my RAM and the OS started to use swap space. Strangely enough, the RAM was apparently not released after the processes failed.

EDIT: the builds for those 3 libraries actually completed successfully. Do you think they could have been what led to the failure of the ryzen-test processes? But even if they did, should that actually happen?

This is the picture of the result: https://imgur.com/a/vh25LH0 I also wanted to take a shot of the System Resources screen, but didn't have enough patience to stay through that terrible unresponsiveness anymore. In any case, what I last saw was that the RAM was like 7.7GB out of 7.8 consumed and the swap space was in use: ~700-800MB out of ~10GB (I set 10GB for this test.. because I read it needs around 16GB of RAM, but even that could be insufficient).
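For anyone wanting to record the RAM and swap figures described above without opening a GUI monitor, here is a minimal sketch. It assumes a Linux system exposing `/proc/meminfo` with the usual `MemTotal`, `MemAvailable`, `SwapTotal` and `SwapFree` fields; the function names are made up for this example and are not part of ryzen-test.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def memory_pressure(info):
    """Return (fraction of RAM in use, swap in use in kB)."""
    total = info["MemTotal"]
    # MemAvailable is missing on very old kernels; fall back to MemFree.
    available = info.get("MemAvailable", info.get("MemFree", 0))
    swap_used = info.get("SwapTotal", 0) - info.get("SwapFree", 0)
    return (total - available) / total, swap_used


def read_memory_pressure(path="/proc/meminfo"):
    """Read the live values from the running system (Linux only)."""
    with open(path) as f:
        return memory_pressure(parse_meminfo(f.read()))
```

Calling `read_memory_pressure()` from a second terminal while the test runs would show whether the machine is close to exhausting RAM and already swapping.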

Oxalin commented 6 years ago

Hi @PhantomR. You say your RAM is not officially supported. However, can you tell us more: are you using the latest BIOS available? Are you using an XMP (or A-XMP, but I don't think it is available) profile? I don't think 2133 is the "default" speed, but rather the maximum speed supported by the RAM using one of the XMP 2.0 profiles available.

A "build failed" doesn't necessarily mean it encountered a segfault. You should be able to have a look at the build logs to see what is the error.
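As a rough illustration of that check, the sketch below sorts a build log into "crash" (the compiler died, e.g. a segfault) versus "compile-error" (an ordinary source or toolchain error). The patterns are assumptions based on messages GCC and the shell commonly print; they are not taken from ryzen-test, and the function name is hypothetical.

```python
import re

# Assumed patterns: messages typically seen when a compiler process crashes.
CRASH_PATTERNS = [
    re.compile(r"segmentation fault", re.IGNORECASE),
    re.compile(r"internal compiler error", re.IGNORECASE),
    re.compile(r"bus error", re.IGNORECASE),
]
# Ordinary compiler diagnostics look like "file.c:10:5: error: ...".
ERROR_PATTERN = re.compile(r"\berror:", re.IGNORECASE)


def classify_build_log(text):
    """Return (kind, matching_lines); kind is 'crash', 'compile-error' or 'unknown'."""
    crash_lines = []
    error_lines = []
    for line in text.splitlines():
        if any(p.search(line) for p in CRASH_PATTERNS):
            crash_lines.append(line.strip())
        elif ERROR_PATTERN.search(line):
            error_lines.append(line.strip())
    if crash_lines:
        return "crash", crash_lines
    if error_lines:
        return "compile-error", error_lines
    return "unknown", []
```

A "compile-error" result that repeats identically in every loop points at the sources or the toolchain rather than the CPU, whereas crashes that appear at random places are the symptom this test is looking for.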

Now, by default, the script doesn't flush the RAM disk on exit. You have to set the parameter. From what I understand from your description, you are probably hitting the full capacity of your RAM, thus having a sluggish system and encountering build failures. Try lowering the number of loops (keep the maximum threads by increasing the number of threads per loop).
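The trade-off between loops and threads per loop can be sketched as simple arithmetic. This is only an illustration: the memory needed per loop (`per_loop_gb`) and the reserve left for the OS (`reserve_gb`) are hypothetical inputs you would have to measure yourself, and `plan_run` is not part of the script.

```python
def plan_run(total_threads, ram_gb, per_loop_gb, reserve_gb=1.0):
    """Pick a loop count that fits in RAM, then spread the threads over the loops.

    per_loop_gb and reserve_gb are assumed figures, not values from ryzen-test.
    """
    if per_loop_gb <= 0:
        raise ValueError("per_loop_gb must be positive")
    usable = max(ram_gb - reserve_gb, 0.0)
    loops = int(usable // per_loop_gb)
    # Always run at least one loop, and never more loops than hardware threads.
    loops = max(1, min(loops, total_threads))
    threads_per_loop = total_threads // loops
    return loops, threads_per_loop
```

For a 12-thread CPU with 8GB of RAM, `plan_run(12, 8, 2.0)` gives 3 loops of 4 threads, while a heavier assumed footprint, `plan_run(12, 8, 6.0)`, gives 1 loop of 12 threads. Either way all 12 threads stay busy.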

Tell us more once you have done these verifications.

PhantomR commented 6 years ago

The RAM is not on XMP (it's disabled by default in the BIOS). My RAM is actually rated at 2933MHz in its specs and there is an XMP profile for that, but I thought it's better to keep it at the default setting for the test. Do you think it would be a good idea to try setting the speeds/voltage to the factory ones manually?

The BIOS is also the latest (I actually mentioned this in my original post :) "@latest BIOS").

About the logs: I wanted to check them, but I didn't know where they were. I managed to find them now.

So, I tried your suggestion and ran 4 processes with 3 threads each (params 4 3). The result is below. Each iteration failed with the exact same error and all 4 failed almost simultaneously (3 of them after 444s, one of them after 445s).

LOG: https://justpaste.it/5zdpg

It appears this is not a segfault, is it? I wonder what's causing this error. Could it be the fact that I'm running Ubuntu 18.04 and not the recommended 17.04? EDIT: I found that this is actually a bug in the 7.1.0 sources that the script downloads. I'll try modifying the script to download newer sources (7.2.0 seemingly fixed this bug) and run it again.

THANK YOU very much for your help!!

Oxalin commented 6 years ago

Sorry, I missed the "latest BIOS". However, as an MSI motherboard owner myself, I can only suggest that you look on MSI's website to see if there is a newer BIOS. I had to manually download and install the latest BIOS on an X370 Gaming Pro a few weeks ago because the provided Updater would not find the newer version. This should only be related to your random reboot though.

Indeed, as you identified in your "EDIT", you need to use an updated GCC version. See the following link to better understand the problem: https://github.com/suaefar/ryzen-test/issues/6#issuecomment-335323283

You could try my modified script which can be found on my GitHub account, it uses GCC 7.2.0: https://github.com/Oxalin/ryzen-test

However, as @suaefar has stated in many issues related to this bug: by changing the tools, we can't be sure that the hardware bug can be reproduced under the same conditions. Personally, I tested a modified version of suaefar's script (one that built GCC 7.2.0) with my previous buggy CPU under ArchLinux, and that is how I was able to identify that it was defective (I had previously encountered segfaults while building AUR packages, so everything was pointing at a problematic CPU).

PhantomR commented 6 years ago

Once again, thank you for your help :). To be honest, I've restarted the PC quite a lot since yesterday when I updated the BIOS and the beeping thing doesn't seem to occur anymore, which is interesting :D. I've also checked just now and there's no newer BIOS.

I downloaded your version since my modification used 7.3.0 and I hope 7.2.0 is more likely to cause segfaults. Sadly, I tried running it with 4 processes and 3 threads each and after a while my screen turned black and the computer seemed laggy (I could only see the mouse pointer). So, I gave it some thought and went for 1 process and 12 threads. So far this has been running successfully for about 3 hours and has just begun "start 7" of its (only) loop. Do you think this is relevant, however? Maybe I should have used at least 2 processes, but I was afraid I'd be wasting time if it used all the RAM again.

UPDATE: The test has been running with no errors for the last ~8 hours. I'm going to stop it now. I wonder if I should keep trying with 2 and 3 processes.. what do you suggest? Also, is compiling the GCC 7.1.0 sources supposed to actually work with Ubuntu 17.04???

Oxalin commented 6 years ago

@PhantomR : Hi again, sorry for the silence of the last few days. You could launch the script with 2 loops, but I don't think you'll find anything different. At the time when I had a faulty CPU, I hit build errors (segfaults) with other applications while installing AUR packages (on ArchLinux). So there was only 1 application being built with a maximum of threads and it wouldn't have to run for many hours to happen.

PhantomR commented 6 years ago

@Oxalin No worries :). Thank you for all your help!! I thought I might have the segfault issue because a program I was running was failing from time to time (at runtime; compilation worked just fine). In the end, I found out that it was due to a bug in the program itself (which I have not found yet), since the same segfault errors occurred while running the same program on an Intel i5 mobile 2nd gen CPU. I think I may be fortunate enough that my CPU is not buggy, but I'd like to do some more testing when I have more time (in a month or so). Once again, thank you!! I'll come back here and report my results / ask more questions should I do more testing. Best wishes :).

nift4 commented 5 years ago

The README suggests 16GB RAM