suaefar / ryzen-test

Tools to reproduce randomly crashing processes under load on AMD Ryzen processors on Linux
GNU General Public License v3.0

Does "build failed" mean anything? #29

Open ssaavedra opened 6 years ago

ssaavedra commented 6 years ago

Hi, I have just tested this tool on a Ryzen 1500X 1746PGS, and it seems to segfault when using the default settings on an Asus Prime X370-Pro (BIOS v.4012).

However, if I go to the AMD CBS config in the BIOS and set the OpCode Optimization setting to Disabled, I am able to run the tool without any [KERN] segfault warning.

Even so, I still get a "TIME TO FAIL: 671 s" (followed by later ones for the remaining processes: 674 s, 1101 s, 1120 s...), and the line before it just says "build failed". I looked at /mnt/ramdisk/workdir/buildloop.d/loop-0/build.log and found a gcc error:

/mnt/ramdisk/workdir/buildloop.d/loop-0/./gcc/xgcc -B/mnt/ramdisk/workdir/buildloop.d/loop-0/./gcc/ -B/usr/local/x86_64-pc-linux-gnu/bin/ -B/usr/local/x86_64-pc-linux-gnu/lib/ -isystem /usr/local/x86_64-pc-linux-gnu/include -isystem /usr/local/x86_64-pc-linux-gnu/sys-include    -g -O2 -O2  -g -O2 -DIN_GCC    -W -Wall -Wno-narrowing -Wwrite-strings -Wcast-qual -Wno-format -Wstrict-prototypes -Wmissing-prototypes -Wold-style-definition  -isystem ./include   -fpic -mlong-double-80 -DUSE_ELF_SYMVER -g -DIN_LIBGCC2 -fbuilding-libgcc -fno-stack-protector   -fpic -mlong-double-80 -DUSE_ELF_SYMVER -I. -I. -I../.././gcc -I/mnt/ramdisk/workdir/gcc-7.1.0/libgcc -I/mnt/ramdisk/workdir/gcc-7.1.0/libgcc/. -I/mnt/ramdisk/workdir/gcc-7.1.0/libgcc/../gcc -I/mnt/ramdisk/workdir/gcc-7.1.0/libgcc/../include -I/mnt/ramdisk/workdir/gcc-7.1.0/libgcc/config/libbid -DENABLE_DECIMAL_BID_FORMAT -DHAVE_CC_TLS  -DUSE_TLS -o unwind-dw2.o -MT unwind-dw2.o -MD -MP -MF unwind-dw2.dep -fexceptions -c /mnt/ramdisk/workdir/gcc-7.1.0/libgcc/unwind-dw2.c -fvisibility=hidden -DHIDE_EXPORTS
In file included from /mnt/ramdisk/workdir/gcc-7.1.0/libgcc/unwind-dw2.c:403:0:
./md-unwind-support.h: In function 'x86_64_fallback_frame_state':
./md-unwind-support.h:65:47: error: dereferencing pointer to incomplete type 'struct ucontext'
       sc = (struct sigcontext *) (void *) &uc_->uc_mcontext;
                                               ^~
make[3]: *** [/mnt/ramdisk/workdir/gcc-7.1.0/libgcc/shared-object.mk:14: unwind-dw2.o] Error 1
make[3]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-0/x86_64-pc-linux-gnu/libgcc'
make[2]: *** [Makefile:21950: all-stage1-target-libgcc] Error 2
make[2]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-0'
make[1]: *** [Makefile:27079: stage1-bubble] Error 2
make[1]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-0'
make: *** [Makefile:942: all] Error 2

Maybe we are building a version of gcc that is known not to work, but I am worried about whether the test's reproducibility is jeopardized in any way.

drescherjm commented 6 years ago

When I tested my 2700, the default gcc version did not build on Ubuntu 18.04. I had to take the Arch pull request https://github.com/suaefar/ryzen-test/pull/26 and modify it back for use on Ubuntu. John

m-r-s commented 6 years ago

The overall issue is rather complex. AMD still has not released any information on it.

If you know what the expected output of a gcc build process looks like, you should be able to figure out how to adapt this script to build another version of GCC on another version of the operating system than the one I used to successfully identify erroneous CPUs.

If you don't know what the expected output of a gcc build process looks like, you had better follow the instructions, i.e., use the versions of Ubuntu and GCC that I used.

Maybe you can also find someone who can confirm that the script is able to expose erroneous CPUs with Ubuntu 18.04 and a newer version of GCC. I can't, because I don't have any faulty CPUs anymore.
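
As a rough illustration of what "knowing the expected output" can mean in practice, the sketch below separates a log that contains an ordinary compiler diagnostic (such as the 'struct ucontext' error quoted above, which should recur at the same place on every run) from one that shows signs of a crashing process. It is not part of ryzen-test: the function name classify_build_log and the pattern lists are assumptions made up for this example, and the heuristic is deliberately crude.

```python
import re

# Hypothetical patterns chosen for illustration; not taken from ryzen-test.
# A crashing compiler typically reports an internal compiler error or a signal.
CRASH_PATTERNS = [
    re.compile(r"internal compiler error", re.IGNORECASE),
    re.compile(r"segmentation fault", re.IGNORECASE),
    re.compile(r"\bbus error\b", re.IGNORECASE),
]

# Ordinary GCC diagnostics look like "file:line:col: error: message".
DIAGNOSTIC = re.compile(r"^\S+:\d+(?::\d+)?: (?:fatal )?error: .+$")


def classify_build_log(text):
    """Return (kind, line), where kind is "crash", "source-error" or "unknown".

    "crash" wins over "source-error" because a crash is the symptom the
    test is looking for; "source-error" reports the first plain diagnostic.
    """
    first_diagnostic = None
    for raw in text.splitlines():
        line = raw.strip()
        if any(p.search(line) for p in CRASH_PATTERNS):
            return "crash", line
        if first_diagnostic is None and DIAGNOSTIC.match(line):
            first_diagnostic = line
    if first_diagnostic is not None:
        return "source-error", first_diagnostic
    return "unknown", None


if __name__ == "__main__":
    # Two lines copied from the build.log quoted in this thread.
    sample = (
        "./md-unwind-support.h:65:47: error: dereferencing pointer to "
        "incomplete type 'struct ucontext'\n"
        "make: *** [Makefile:942: all] Error 2\n"
    )
    print(classify_build_log(sample))
```

A "source-error" result that shows up identically in every loop suggests that the chosen GCC source does not build on the installed system, so the "TIME TO FAIL" values would then say nothing about the CPU.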

stolsvik commented 5 years ago

I am also testing a Ryzen and got this error. It is apparently well known for gcc 6.4.0, but the strange thing is that I get it when compiling gcc 7.1.0, which is what the tool fetches. https://stackoverflow.com/questions/52498431/compile-gcc6-4-0-using-gcc8-1-1

.. it was fixed in 7.2.0. I am currently running the tool with the gcc 8.3.0 source.