CRCinAU closed this issue 7 years ago
Fedora 27 uses glibc 2.26. GCC 7.1.0, previously used by the script, had an incompatibility with glibc 2.26, which was fixed in 7.2.0. What you were experiencing was most probably not a random segmentation fault under heavy load but the result of that incompatibility. That is what I was encountering under ArchLinux. However, segmentation faults were still generated with my faulty CPU when I moved to GCC 7.2.0. If you don't experience it anymore, you most probably don't have a faulty CPU. To be sure, you can use a previous version of the script, or change the version downloaded by the current script back to 7.1.0.
See issue #6 for all the details (mostly my last comment, which points to the official GCC / glibc bug report).
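As a rough illustration of the version reasoning above, here is a minimal sketch that flags the combination expected to end in the build error rather than a CPU segfault. The function names `version_lt` and `affected_combo` are made up for this example, and treating every GCC source older than 7.2.0 and every glibc from 2.26 onward as affected is an assumption; the thread itself only names GCC 7.1.0 and glibc 2.26.

```shell
#!/bin/sh
# Returns success (0) if version $1 is strictly older than version $2.
# Relies on `sort -V` (version sort, available in GNU coreutils).
version_lt() {
    [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Hypothetical check, arguments: <gcc source version> <glibc version>.
# Assumption: GCC sources older than 7.2.0 fail to build against
# glibc 2.26 or newer (only 7.1.0 with 2.26 is confirmed in this thread).
affected_combo() {
    gcc_ver=$1
    glibc_ver=$2
    version_lt "$gcc_ver" "7.2.0" && ! version_lt "$glibc_ver" "2.26"
}

# Example use on a glibc-based system:
#   affected_combo "7.1.0" "$(getconf GNU_LIBC_VERSION | awk '{print $2}')" \
#       && echo "expect the build error, not necessarily a CPU segfault"
```

If the check succeeds, a failed build says little about the CPU, since the compile error would occur on healthy hardware too.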
That's an interesting bit of info.... Would it be worthwhile adding some of that to README.md to make things a little more obvious?
Yes, I agree that it would be a good idea.
This is problematic. @CRCinAU could you try Ubuntu 17.04 (maybe as a live system from a USB drive) with the following version of the script: https://github.com/suaefar/ryzen-test/tree/ac77b35195c45805ab1d795a13f012d19337ce44 Maybe we need to revert. To trigger the bug reliably, pretty specific parallel instruction patterns seem to be needed.
@suaefar I'm happy to give things a go.... Do you have a link for something USB bootable in the ubuntu land? I'm not at all an ubuntu person....
...and these might not be achieved on all platforms, distributions, versions of libraries, etc. If this turns out to be true, maybe I will again restrict the script to one version of one distribution.
@CRCinAU This should work: http://de.releases.ubuntu.com/zesty/ubuntu-17.04-desktop-amd64.iso
Ok - so I tried that ISO on a USB stick and couldn't get to a prompt or desktop after ~10 minutes... Seems to not like my system - either the nVidia graphics card - or something else....
That's unfortunate. I don't have any faulty CPUs anymore to test it myself. But what you experience (script does not trigger the bug anymore) is what I fear with every change to the script.
If I may suggest, use Fedora as you would, go back to commit ac77b35195c45805ab1d795a13f012d19337ce44 (before the move from GCC 7.1.0 to 7.2.0) and run the script. This should generate the same errors as you were seeing before. Open the logs after it outputs an error and paste them here. We'll compare the error with the one I was experiencing and with the one reported in GCC's bug database.
I can't see any changes in the compilation procedure in the script that would cause it to fail - but changing the instances of 7.2.0 to 7.1.0 should have the same effect....
I just ran: s/7.2.0/7.1.0/g
Then changed the extension from .xz to .bz2...
It's running now as $ ./crc-kill-ryzen.sh 8 4
- where in my version I pretty much only change the download location....
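For anyone wanting to repeat that substitution, a minimal sketch follows. It assumes the script contains the version literally as 7.2.0 and refers to a `.tar.xz` archive; the function name `downgrade_gcc` and the file names in the usage comment are placeholders, not taken from the repository.

```shell
#!/bin/sh
# Hypothetical helper: rewrite GCC 7.2.0 references to 7.1.0 and switch
# the archive extension from .tar.xz to .tar.bz2, as described above.
# Reads the script text on stdin and writes the modified text to stdout.
downgrade_gcc() {
    sed -e 's/7\.2\.0/7.1.0/g' \
        -e 's/\.tar\.xz/.tar.bz2/g'
}

# Example use (file names are assumptions):
#   downgrade_gcc < kill-ryzen.sh > crc-kill-ryzen.sh
```

Escaping the dots keeps the pattern from matching unrelated strings such as `7x2y0`, which the unescaped `s/7.2.0/7.1.0/g` would also rewrite.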
EDIT: Seems every build failed after 580 or so seconds - so I figure there's something else going on.... Reverted to commit id ac77b35195c45805ab1d795a13f012d19337ce44 - will test that...
ok - so even now - I can't quite get it to fail the same way.....
I did end up getting the following failure - with just about all of the loops at the same time - no matter how many I tried:
/mnt/ramdisk/workdir/buildloop.d/loop-0/./gcc/xgcc -B/mnt/ramdisk/workdir/buildloop.d/loop-0/./gcc/ -B/usr/local/x86_64-pc-linux-gnu/bin/ -B/usr/local/x86_64-pc-linux-gnu/lib/ -isystem /usr/local/x86_64-pc-linux-gnu/include -isystem /usr/local/x86_64-pc-linux-gnu/sys-include -g -O2 -O2 -g -O2 -DIN_GCC -W -Wall -Wno-narrowing -Wwrite-strings -Wcast-qual -Wno-format -Wstrict-prototypes -Wmissing-prototypes -Wold-style-definition -isystem ./include -fpic -mlong-double-80 -DUSE_ELF_SYMVER -g -DIN_LIBGCC2 -fbuilding-libgcc -fno-stack-protector -fpic -mlong-double-80 -DUSE_ELF_SYMVER -I. -I. -I../.././gcc -I/mnt/ramdisk/workdir/gcc-7.1.0/libgcc -I/mnt/ramdisk/workdir/gcc-7.1.0/libgcc/. -I/mnt/ramdisk/workdir/gcc-7.1.0/libgcc/../gcc -I/mnt/ramdisk/workdir/gcc-7.1.0/libgcc/../include -I/mnt/ramdisk/workdir/gcc-7.1.0/libgcc/config/libbid -DENABLE_DECIMAL_BID_FORMAT -DHAVE_CC_TLS -DUSE_TLS -o unwind-dw2.o -MT unwind-dw2.o -MD -MP -MF unwind-dw2.dep -fexceptions -c /mnt/ramdisk/workdir/gcc-7.1.0/libgcc/unwind-dw2.c -fvisibility=hidden -DHIDE_EXPORTS
In file included from /mnt/ramdisk/workdir/gcc-7.1.0/libgcc/unwind-dw2.c:403:0:
./md-unwind-support.h: In function 'x86_64_fallback_frame_state':
./md-unwind-support.h:65:47: error: dereferencing pointer to incomplete type 'struct ucontext'
sc = (struct sigcontext *) (void *) &uc_->uc_mcontext;
^~
make[3]: *** [/mnt/ramdisk/workdir/gcc-7.1.0/libgcc/shared-object.mk:14: unwind-dw2.o] Error 1
make[3]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-0/x86_64-pc-linux-gnu/libgcc'
make[2]: *** [Makefile:21950: all-stage1-target-libgcc] Error 2
make[2]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-0'
make[1]: *** [Makefile:27079: stage1-bubble] Error 2
make[1]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-0'
make: *** [Makefile:942: all] Error 2
From what I understand, this is the problem with v7.1.0 that is fixed in v7.2.0?
We don't know exactly what triggers the bug. We only know how to trigger it by running certain software (probably in very specific versions). Hence, changing anything (even the timezone, the room temperature, and certainly any software update or even the distribution) can result in missing the "desired" instruction pattern.
To have a "reproducible" outcome the bits really should be the same. I think I will need to revert everything to when it worked for me with a clean Ubuntu 17.04 live USB distro.
This script has only one purpose: Reliably trigger the bug! It does not really matter with which distribution or software or which version. However, for me it worked in a certain configuration which I cannot test anymore. So, without (massive) user input I only have one option: Revert!
No distribution other than Ubuntu 17.04 will be supported for now.
I'm not so sure it's a distro problem - as even on Fedora 27 beta, I could hit a problem - and it was a segfault - but I'm wondering if it was a problem with gcc 7.1.0 all along - as the crash I see now was different - but even in the same version (i.e. the older commit), I can't reproduce the fault....
This actually makes me wonder if it was something different that fixed the problem.... If so, this would probably be a good thing for users.... Hmmmm...
I invested waaaay too much time into this already. I just wanted a fast, shiny, new CPU from AMD. From that point of view everything around this issue was and still is very disappointing.
I don't want to continue this semi-supported script without official word from AMD acknowledging they have a big f...... problem.
Please fork the code and I will add a link to it in the README.
@CRCinAU The error you encountered is the expected one arising from the incompatibility between gcc 7.1.0 and glibc 2.26. You could have caught a segfault along the way (it was happening on my setup) if the CPU had hit the bug before the build reached the problematic gcc code.
While I had my faulty CPU, on ArchLinux, I hit the segfault here and there while using the original script (with gcc 7.1.0), even if most of the time I ended up with the gcc vs. glibc build error. I also encountered segfaults while building some AUR packages. The CPU bug can be triggered by any heavy load as far as I know, but the exact conditions to hit it are still unknown.
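To tell the two failure modes apart when going through logs, something like the following sketch could help. The function name `classify_log` is hypothetical, and the assumption that a CPU-related crash shows up in the log as the text "Segmentation fault" is mine; the compile-error pattern is the message quoted earlier in this thread.

```shell
#!/bin/sh
# Hypothetical helper: classify a build log by the failure it contains.
# Prints "segfault" if the log mentions a segmentation fault (assumed
# wording), "build-error" for the 'struct ucontext' compile error quoted
# in this thread, and "unknown" otherwise.
# A segfault is checked first because it is the more interesting finding.
classify_log() {
    log=$1
    if grep -qi "segmentation fault" "$log"; then
        echo "segfault"
    elif grep -q "incomplete type 'struct ucontext'" "$log"; then
        echo "build-error"
    else
        echo "unknown"
    fi
}

# Example use (the path is a placeholder):
#   classify_log path/to/build.log
```

A result of "build-error" on every loop at roughly the same time matches the gcc vs. glibc incompatibility rather than the CPU bug.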
Since @suaefar doesn't want to spend more time supporting his script outside of Ubuntu 17.04, I'll create a fork in the next few days. I'll be happy to support testers as far as I can, even without any faulty CPU at hand.
Also, do you have your batch number easily available? (It's on the CPU, see this: https://www.reddit.com/r/Amd/comments/6scnlg/ryzen_reading_your_production_batch_number/)
I don't have it handy - however the other ones my retailer had in stock were 1705 - so I wouldn't be surprised if mine is around the same.... If I find my white tube of thermal goop somewhere, I'll get the watercooler off and take a look...
So my CPU used to crash all the time and be affected by this problem. The failures were under 10 minutes - with most happening in the first 2-3 minutes.
Fast forward to today, and I started the RMA process with AMD. They want a screenshot of what happens when I run the test and it fails. But - one problem, I can't get my CPU to fail anymore.
I'm running Fedora 27, which has:
I've tried running with the args of 8 2, as well as 4 4, or even 8 4 - but nothing seems to cause the segfault.
Gigabyte have a newish 'opcache control' on the AB350 Gaming 3 mainboard - and that doesn't seem to make any combination fail - I've tried the same tests for 2+ hours with it set to all three options - Auto, Enabled, Disabled.
So, has this been fixed in software somehow? Has gcc fixed the compiler? Has something else happened?
Thoughts?