suaefar / ryzen-test

Tools to reproduce randomly crashing processes under load on AMD Ryzen processors on Linux
GNU General Public License v3.0

Correct testing procedure #20

Closed CRCinAU closed 7 years ago

CRCinAU commented 7 years ago

So my CPU used to crash all the time and be affected by this problem. The failures were under 10 minutes - with most happening in the first 2-3 minutes.

Fast forward to today, and I started the RMA process with AMD. They want a screenshot of what happens when I run the test and it fails. One problem, though: I can't get my CPU to fail anymore.

I'm running Fedora 27, which has:

$ rpm -qa | grep gcc | sort
gcc-7.2.1-2.fc27.x86_64
gcc-c++-7.2.1-2.fc27.x86_64
gcc-gdb-plugin-7.2.1-2.fc27.x86_64
libgcc-7.2.1-2.fc27.i686
libgcc-7.2.1-2.fc27.x86_64

I've tried running with the args of 8 2 as well as 4 4 or even 8 4 - but nothing seems to cause the segfault.

Gigabyte have a newish 'opcache control' on the AB350 Gaming 3 mainboard - and that doesn't seem to make any combination fail - I've tried the same tests for 2+ hours with it set to all three options - Auto, Enabled, Disabled.

So, has this been fixed in software somehow? Has gcc fixed the compiler? Has something else happened?

Thoughts?

Oxalin commented 7 years ago

Fedora 27 uses glibc 2.26. GCC 7.1.0, previously used by the script, had an incompatibility with glibc 2.26, which was fixed in 7.2.0. What you were experiencing was most probably not a random segmentation fault under heavy load but the result of that incompatibility. That was what I was encountering under ArchLinux. However, segmentation faults were still generated with my faulty CPU when I moved to GCC 7.2.0. If you don't experience them anymore, you most probably don't have a faulty CPU. To be sure, you can use a previous version of the script, or change the GCC version downloaded by the current script back to 7.1.0.

See issue #6 for all the details (especially my last comment, which points to the official GCC / glibc bug report).
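
For anyone comparing setups, here is a minimal sketch for printing the host compiler and C library versions, since that combination is what decides whether the build error discussed above shows up. The function name `report_toolchain` is made up for illustration and is not part of the script. `getconf GNU_LIBC_VERSION` only answers on glibc-based systems.

```shell
# Hypothetical helper (not part of ryzen-test): print host gcc and glibc versions.
report_toolchain() {
  if command -v gcc >/dev/null 2>&1; then
    # First line of `gcc --version` carries the version string.
    gcc --version | head -n 1
  else
    echo "gcc: not found"
  fi
  # GNU_LIBC_VERSION is a glibc-specific getconf variable; fall back if absent.
  getconf GNU_LIBC_VERSION 2>/dev/null || echo "glibc: version unavailable"
}

report_toolchain
```

On Fedora the `rpm -qa | grep gcc` listing earlier in the thread gives the same compiler information; this is just a distribution-neutral variant.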

CRCinAU commented 7 years ago

That's an interesting bit of info.... Would it be worthwhile adding some of that to README.md to make things a little more obvious?

Oxalin commented 7 years ago

Yes, I agree that it would be a good idea.

suaefar commented 7 years ago

This is problematic. @CRCinAU could you try Ubuntu 17.04 (maybe as a live system from a USB drive) with the following version of the script: https://github.com/suaefar/ryzen-test/tree/ac77b35195c45805ab1d795a13f012d19337ce44 Maybe we need to revert. To trigger the bug reliably, pretty specific parallel instruction patterns seem to be needed.

CRCinAU commented 7 years ago

@suaefar I'm happy to give things a go.... Do you have a link for something USB bootable in the ubuntu land? I'm not at all an ubuntu person....

suaefar commented 7 years ago

...and these might not be achieved on all platforms, distributions, versions of libraries, etc. If this turns out to be true, maybe I will again restrict the script to one version of one distribution.

suaefar commented 7 years ago

@CRCinAU This should work: http://de.releases.ubuntu.com/zesty/ubuntu-17.04-desktop-amd64.iso

CRCinAU commented 7 years ago

Ok - so I tried that ISO on a USB stick and couldn't get to a prompt or desktop after ~10 minutes... Seems to not like my system - either the nVidia graphics card - or something else....

suaefar commented 7 years ago

That's unfortunate. I don't have any faulty CPUs anymore to test it myself. But what you experience (script does not trigger the bug anymore) is what I fear with every change to the script.

Oxalin commented 7 years ago

If I may suggest: stay on Fedora as you normally would, go back to commit ac77b35195c45805ab1d795a13f012d19337ce44 (before the move from GCC 7.1.0 to 7.2.0) and run the script. This should generate the same errors you were seeing before. Open the logs after it outputs an error and paste them here. We'll compare the error with the one I was experiencing and with the one reported in GCC's bug database.

CRCinAU commented 7 years ago

I can't see any changes in the compilation procedure in the script that would cause it to fail - but changing the instances of 7.2.0 to 7.1.0 should have the same effect....

I just ran: s/7.2.0/7.1.0/g

Then changed the extension from .xz to .bz2...

It's running now as $ ./crc-kill-ryzen.sh 8 4 - where in my version I pretty much only change the download location....
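
The edit described above can be sketched roughly as follows. This is a sketch under assumptions, not the exact change that was made: the helper name `downgrade_gcc_refs` is hypothetical, and it assumes the script names the tarball with a `.tar.xz` suffix. Unlike the bare `s/7.2.0/7.1.0/g`, the dots are escaped so they only match literal dots.

```shell
# Hypothetical helper: rewrite GCC 7.2.0 references to 7.1.0 and swap the
# archive extension. Reads the files given as arguments, or stdin if none.
# Assumption: the script names the tarball with a ".tar.xz" suffix.
downgrade_gcc_refs() {
  sed -e 's/7\.2\.0/7.1.0/g' -e 's/\.tar\.xz/.tar.bz2/g' "$@"
}

# Demo on a sample line rather than the real script:
echo 'gcc-7.2.0.tar.xz' | downgrade_gcc_refs   # prints gcc-7.1.0.tar.bz2
```

To apply it to a file, something like `downgrade_gcc_refs kill-ryzen.sh > crc-kill-ryzen.sh` would do, assuming the upstream script is named `kill-ryzen.sh`. If the script passes a compression flag to `tar`, that may need adjusting as well.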

EDIT: Seems every build failed after 580 or so seconds - so I figure there's something else going on.... Reverted to commit id ac77b35195c45805ab1d795a13f012d19337ce44 - will test that...

CRCinAU commented 7 years ago

ok - so even now - I can't quite get it to fail the same way.....

I did end up getting the following failure - with just about all of the loops at the same time - no matter how many I tried:

/mnt/ramdisk/workdir/buildloop.d/loop-0/./gcc/xgcc -B/mnt/ramdisk/workdir/buildloop.d/loop-0/./gcc/ -B/usr/local/x86_64-pc-linux-gnu/bin/ -B/usr/local/x86_64-pc-linux-gnu/lib/ -isystem /usr/local/x86_64-pc-linux-gnu/include -isystem /usr/local/x86_64-pc-linux-gnu/sys-include    -g -O2 -O2  -g -O2 -DIN_GCC    -W -Wall -Wno-narrowing -Wwrite-strings -Wcast-qual -Wno-format -Wstrict-prototypes -Wmissing-prototypes -Wold-style-definition  -isystem ./include   -fpic -mlong-double-80 -DUSE_ELF_SYMVER -g -DIN_LIBGCC2 -fbuilding-libgcc -fno-stack-protector   -fpic -mlong-double-80 -DUSE_ELF_SYMVER -I. -I. -I../.././gcc -I/mnt/ramdisk/workdir/gcc-7.1.0/libgcc -I/mnt/ramdisk/workdir/gcc-7.1.0/libgcc/. -I/mnt/ramdisk/workdir/gcc-7.1.0/libgcc/../gcc -I/mnt/ramdisk/workdir/gcc-7.1.0/libgcc/../include -I/mnt/ramdisk/workdir/gcc-7.1.0/libgcc/config/libbid -DENABLE_DECIMAL_BID_FORMAT -DHAVE_CC_TLS  -DUSE_TLS -o unwind-dw2.o -MT unwind-dw2.o -MD -MP -MF unwind-dw2.dep -fexceptions -c /mnt/ramdisk/workdir/gcc-7.1.0/libgcc/unwind-dw2.c -fvisibility=hidden -DHIDE_EXPORTS
In file included from /mnt/ramdisk/workdir/gcc-7.1.0/libgcc/unwind-dw2.c:403:0:
./md-unwind-support.h: In function 'x86_64_fallback_frame_state':
./md-unwind-support.h:65:47: error: dereferencing pointer to incomplete type 'struct ucontext'
       sc = (struct sigcontext *) (void *) &uc_->uc_mcontext;
                                               ^~
make[3]: *** [/mnt/ramdisk/workdir/gcc-7.1.0/libgcc/shared-object.mk:14: unwind-dw2.o] Error 1
make[3]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-0/x86_64-pc-linux-gnu/libgcc'
make[2]: *** [Makefile:21950: all-stage1-target-libgcc] Error 2
make[2]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-0'
make[1]: *** [Makefile:27079: stage1-bubble] Error 2
make[1]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-0'
make: *** [Makefile:942: all] Error 2

From what I understand, this is the problem with v7.1.0 that is fixed in v7.2.0?

suaefar commented 7 years ago

We don't know exactly what triggers the bug. We only know how to trigger it by running certain software (probably in very specific versions). Hence, changing anything (even the timezone, the room temperature, and certainly any software update or even the distribution) can result in missing the "desired" instruction pattern.

To have a "reproducible" outcome the bits really should be the same. I think I will need to revert everything to when it worked for me with a clean Ubuntu 17.04 live USB distro.

This script has only one purpose: Reliably trigger the bug! It does not really matter with which distribution or software or which version. However, for me it worked in a certain configuration which I cannot test anymore. So, without (massive) user input I only have one option: Revert!

No distribution other than Ubuntu 17.04 will be supported for now.

CRCinAU commented 7 years ago

I'm not so sure it's a distro problem - even on Fedora 27 beta I could hit a problem, and it was a segfault. But I'm wondering if it was a problem with gcc 7.1.0 all along, as the crash I see now is different - and even with the same version (i.e. the older commit), I can't reproduce the fault....

This actually makes me wonder if it was something different that fixed the problem.... If so, this would probably be a good thing for users.... Hmmmm...

suaefar commented 7 years ago

I invested waaaay too much time into this already. I just wanted a fast, shiny, new CPU from AMD. From that point of view everything around this issue was and still is very disappointing.

I don't want to continue this semi-supported script without official word from AMD acknowledging they have a big f...... problem.

Please fork the code and I will add a link to it in the README.

Oxalin commented 7 years ago

@CRCinAU The error you encountered is the expected one arising from the incompatibility between gcc 7.1.0 and glibc 2.26. You could still have caught a segfault along the way (that was happening on my setup) if the CPU hit the bug before the build reached the problematic gcc code.

While I had my faulty CPU, on ArchLinux, I hit the segfault here and there while using the original script (with gcc 7.1.0), even though most of the time I ended up with the gcc vs. glibc build error. I also encountered segfaults while building some AUR packages. As far as I know, the CPU bug can be triggered by any heavy load, but the exact conditions to hit it are still unknown.

Since @suaefar doesn't want to spend more time supporting his script outside of Ubuntu 17.04, I'll create a fork in the next few days. I'll be happy to support testers as far as I can, even without any faulty CPU at hand.
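
To tell the two failure modes apart when reading logs, a rough sketch like the following may help. The helper name `classify_build_log` is hypothetical. The `struct ucontext` pattern comes from the log pasted earlier in this thread, while the `Segmentation fault` pattern is an assumption about what a crashed build writes to the log.

```shell
# Hypothetical helper: classify a build log read from stdin.
# "segfault" takes priority, since that is the symptom of interest here.
classify_build_log() {
  log=$(cat)
  case "$log" in
    *"Segmentation fault"*) echo "segfault" ;;
    # Build error caused by the toolchain mismatch, not by the CPU.
    *"incomplete type"*"struct ucontext"*) echo "glibc-incompatibility" ;;
    *) echo "unknown" ;;
  esac
}

# Demo on the error line quoted earlier in this thread:
echo "md-unwind-support.h:65:47: error: dereferencing pointer to incomplete type 'struct ucontext'" | classify_build_log
```

A log classified as `glibc-incompatibility` says nothing either way about the CPU; only the segfault case is relevant for an RMA.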

Oxalin commented 7 years ago

Also, do you have your batch number easily available? (It's on the CPU, see this: https://www.reddit.com/r/Amd/comments/6scnlg/ryzen_reading_your_production_batch_number/)

CRCinAU commented 7 years ago

I don't have it handy - however the other ones my retailer had in stock were 1705 - so I wouldn't be surprised if mine is around the same.... If I find my white tube of thermal goop somewhere, I'll get the watercooler off and take a look...