[Ryzen] Crashes at framework

Vizvezdenec commented 6 years ago

Idk if I should create this issue on /fishtest, but w/e. In recent tests there are a lot of crashes, seems like some of the latest commits causes them. http://tests.stockfishchess.org/tests/view/5a9379630ebc590297cc89a4 http://tests.stockfishchess.org/tests/view/5a9379120ebc590297cc89a2 http://tests.stockfishchess.org/tests/view/5a93198b0ebc590297cc8942

noobpwnftw commented 6 years ago

I don't see any crash dumps but here are aggregated results from /var/log/messages from my workers.

errors.log

crossbr commented 6 years ago

My 3 core also crashed with a segfault, and normally it never crashes.

AndyGrant commented 6 years ago

That commit seems to be where the issue started, but can you identify why?

vondele commented 6 years ago

@noobpwnftw any chance you could resolve these addresses to a code line ?

FYI the test of the mergeCapture patch shows a rather normal crash pattern: http://tests.stockfishchess.org/tests/view/5a91a8bb0ebc590297cc875b which of course doesn't prove it correct.

AndyGrant commented 6 years ago

http://tests.stockfishchess.org/tests/view/5a923dc80ebc590297cc8806 This test has a fair number of crashes, and happens before the Vondele patch we are looking at.

Vizvezdenec commented 6 years ago

Issue is still there and it's really bad in some tests http://tests.stockfishchess.org/tests/view/5a9379630ebc590297cc89a4

Stefano80 commented 6 years ago

Did we understand whether it is a Stockfish or a fishtest problem?

Vizvezdenec commented 6 years ago

@Stefano80 I have 0 idea that's why I was doubting where to open this issue.

noobpwnftw commented 6 years ago

I suggest we pick one patch that crashes the most and rebase it against each of the recent master commit and find out. http://tests.stockfishchess.org/tests/view/5a93eee60ebc590297cc8a07 http://tests.stockfishchess.org/tests/view/5a93ee230ebc590297cc8a02 http://tests.stockfishchess.org/tests/view/5a93ecaa0ebc590297cc89ff

locutus2 commented 6 years ago

I started with a serious of self-play tests of the various master commits to isolate the possible source of the crashes. Auto purge is set off so crashes could not be removed and masked the results. The first two are: "Join all capture init stages in MovePicker" http://tests.stockfishchess.org/tests/view/5a9404f70ebc590297cc8a26

"More robust interaction of singular search and iid" http://tests.stockfishchess.org/tests/view/5a94056e0ebc590297cc8a28

Stefano80 commented 6 years ago

Danke Stefan!

noobpwnftw commented 6 years ago

http://tests.stockfishchess.org/tests/view/5a94056e0ebc590297cc8a28 This one was actually tested with crosscheck before, no crash occurred. This is one of them: http://tests.stockfishchess.org/tests/view/5a927b4a0ebc590297cc8833

However, I suspect it may occur more often with LTC and Hash=64. Let's see how it goes with STC first.

lp-- commented 6 years ago

I see zero crashes in my tests, and they touch only time management, so it is almost master. So it would be weird if they are coming from master in other tests

More likely some specific patches trigger execution of that part of the code from where crash is coming

locutus2 commented 6 years ago

@noobpwnftw Thanks, i checked the corssCheck tests and indeed the show no crashes. So i delete the one test and start the next.

Vizvezdenec commented 6 years ago

Seems like there are 0 crashes in sg tests also...

xoto10 commented 6 years ago

@snicolet If we're not sure if we can trust the framework at the moment, is this a good moment for you to release some of those pending tests of yours that you mentioned that are likely to fail? :-)

locutus2 commented 6 years ago

Additionally i will try a harder crash test on current master with debug mode on. With debug mode off errors can be hidden which doesn't gives a crash.

Vizvezdenec commented 6 years ago

Idk... 2 fresh patches from Andy Grant http://tests.stockfishchess.org/tests/view/5a943fa50ebc590297cc8a58 - flooded with crashes http://tests.stockfishchess.org/tests/view/5a9440f10ebc590297cc8a5a - 0 crashes Both based on the same master. I have no idea what can cause such behaviour, maybe it's a fishtest problem? @Stefano80

VoyagerOne commented 6 years ago

It's a framework issue... Too many requests... If you want to see crashes just change the priority to 0 for this tuning patch: http://tests.stockfishchess.org/tests/view/5a9200d10ebc590297cc87a3

vondele commented 6 years ago

I tested current master (71cc01c2ef45bab91bdcd72c699f66e19dd518b3) with 40'000 games locally, and there has not been a single crash. So, this suggests it is the (fishtest?) environment that is part of the problem. I'm somewhat surprised that this can lead to segfaulting SF binaries (see kernel messages from @noobpwnftw)...

lp-- commented 6 years ago

@vondele My guess is that you need to take patch which shows crashes and test it. As there are so many of such patches reason for crashes is not in specific patch but somewhere else in the code. Patches just allow to reach that place with specific parameters.

vondele commented 6 years ago

@lp-- the kernel messages show that it affects both base and patched binaries, and it is happening for a wide range of patches, so just testing master should be sufficient.

Having said that, to get more info we need to get a specific trace, ideally by having a binary compiled with -g (by providing the file fishtest/worker/custom_make.txt containing a line like

CXXFLAGS="-g" make build COMP=gcc ARCH=x86-64-modern

and using addr2line on that binary with the address obtained in the kernel logs (i.e. the number after ip 0000000000423d1d). Even more useful would be to enable core dumps (ulimit -c 10000000) before starting the fishtest worker.

Stefano80 commented 6 years ago

If it is a transient fishtest problem, then the crashing tests should be located temporally.

Now that we have start and test time recorded for all tests it should be checkable.

I will give it a try tomorrow.

AndyGrant commented 6 years ago

I cannot recreate crashes on my end, on any of my machines. In my opinion the issue must be with fishtest. I would recommend we stop submitting anything new, and finish clearing out all the stuff that is currently running, and the two dozen soon-to-be reds sitting at -1.

noobpwnftw commented 6 years ago

Apparently the book is downloaded only once and never checked for sanity, I found 2 of 100 workers had downloaded an incomplete book! (0-sized file) Now I have synchronized all of them with correct book file and cleared logs. Let's wait and see if any more crashes emerges after this. This may fix some of the crashes from my workers but I don't see how this could cause other workers to crash too and does not solve tuning crashes I presume.

AndyGrant commented 6 years ago

There is - or was - a thing where if a worker's task was ended (which I imagine happens in your bookless case) another worker could pick up the stats, and continue it. I suppose it is possible that your crashed workers had their results picked up by others, as you are the only person I have seen report seeing actual crashes.

If only 2 of your workers had this problem, then I am likely wrong. If you checked 8 and found 2 with errors.... I might be right.

noobpwnftw commented 6 years ago

Caught one: Feb 27 06:23:01 174 kernel: stockfish[18611]: segfault at 0 ip 0000000000424e60 sp 00007f104d41de70 error 4 in stockfish[400000+47000] Files are built at Feb 27 06:16 and collected at Feb 27 06:23 crashed_files.zip

Rocky640 commented 6 years ago

Your run_id.txt indicates 5a9480c70ebc590297cc8adc

So it seems that the Error 4 relates to the following test http://tests.stockfishchess.org/tests/view/5a9480c70ebc590297cc8adc which does not have any crashes !!!

Maybe the error 4 happened because the tests were finished. Looks like a false alarm...

noobpwnftw commented 6 years ago

There: Feb 27 07:20:01 193 kernel: base[2596]: segfault at ffff7076 ip 0000000000423d1d sp 00007fdf719d8640 error 6 in base[400000+47000] crash.zip

UPDATE: Ok guys, Threadripper processors unaffected by the segfault bug is obviously not true, now I have proof.

This particular commit forms a code pattern that will trigger the bug:

Where the crash occurred is a multi-byte "nop" alignment instruction after a jump table for the switch in MovePicker::next_move() function that should never access any memory, I have collected 20 crashes on different machines with the exact same location of segfault. Similar symptoms has been reported with https://github.com/suaefar/ryzen-test.

So it is not a stockfish issue nor fishtest issue, yet hard to reproduce even by testing against itself since it is a processor bug. I'm retiring this 100 workers and deploy another 100 with Intel chips, let's see if I'm wrong about this.

AndyGrant commented 6 years ago

So I think the conclussion is the bug in your CPUS, +

There is - or was - a thing where if a worker's task was ended (which I imagine happens in your bookless case) another worker could pick up the stats, and continue it. I suppose it is possible that your crashed workers had their results picked up by others, as you are the only person I have seen report seeing actual crashes.

Unless others are seeing crashes personally?

Vizvezdenec commented 6 years ago

there were crashes for other users too as you can see there: http://tests.stockfishchess.org/tests/view/5a9379630ebc590297cc89a4 Allnautilus, bcross, dew, solarlight and others - their machines usually don't crash. And yes, this are not "picked up" tasks since I remember this test running and it had Allonautilus from start, also you can see comment by @crossbr Also in your test http://tests.stockfishchess.org/tests/view/5a9379120ebc590297cc89a2 you can see 2 other workers crashing and they were running it from the start (since you stopped it after like 2 minutes or so).

noobpwnftw commented 6 years ago

As I said, with recent changes in SF, it happened to carry out the piece of code that will trigger the processor bug.

Now I have replaced with 100 Intel chip workers(ChessDB_CN_WorkerPool-8cores-xxx), and ran through the same test with zero crash now.

http://tests.stockfishchess.org/tests/view/5a94be3e0ebc590297cc8b16 vs http://tests.stockfishchess.org/tests/view/5a9379120ebc590297cc89a2

noobpwnftw commented 6 years ago

We did nothing wrong here and even if this has to be avoided on the software side, it must be done in the compiler so it won't produce bug triggering code, there seems to be no reliable way otherwise. Actually the processor bug can be triggered from any running process and cause random segfaults everywhere, it just so happens here because my worker machines run nothing else than fishtest.

vondele commented 6 years ago

@noobpwnftw ... great debugging.

I think the debugging was made more complicated due the to fishtest feature of 'reassigning' batches to different workers. This has come up in the past repeatedly, and I wonder if this can be avoided (@Stefano80).

Stefano80 commented 6 years ago

Yes, this should be changed. Could you open an issue on fishtest such that we do not forget this?

snicolet commented 6 years ago

@noobpwnftw Yes, impressive debugging, thanks.

Where the crash occurred is a multi-byte "nop" alignment instruction after a jump table for the switch in MovePicker::next_move() function that should never access any memory, I have collected 20 crashes on different machines with the exact same location of segfault. Similar symptoms has been reported with https://github.com/suaefar/ryzen-test.

Do we understand why the compiled binary on ARM has this bug in the switch statement in this commit? https://github.com/official-stockfish/Stockfish/commit/8dd6875240e05dbcc1fb6467ffb11ad360aa474c

As I understand it, you have removed the ARM machines from fishtest and put heaps of Intel machines instead, so this gives us a couple hours of relief for taking actions. We can either revert the patch, or rewrite it in a way that doesn't produce the bogus code pattern (my preference), but for that we must know exactly how and test again on ARM...

Thanks for the awesome work, anyway :-)

noobpwnftw commented 6 years ago

@snicolet The previous 100 workers are not ARM machines, they are AMD Ryzens(Threadripper).

This SF case may be the first "clean" confirmation of such bug exists beyond previously confirmed affected products.

Previous test scenario was to do some parallel GCC complications, it will segfault somewhere. AMD guys said that the cause is "complicated" and told everyone who was affected to do RMA. Discussions here: https://community.amd.com/thread/215773

However I suspect this might not entirely the same bug as before, because:

it only affected early Ryzens
all my machines passed https://github.com/suaefar/ryzen-test
Running fishtest actually does some GCC work like 2 and I don't see any of those segfaults from GCC

Unfortunately even for the previous bug there is no practical workaround other than replacing the physical processors. Code pattern like this looks very common, there is no guarantee that if we just revert the changes here and it won't present somewhere else.

Vizvezdenec commented 6 years ago

I think that we can close this issue and you can create a new one where you can discuss how to reformat code so it wouldn't crash sf on ryzen CPUs, is this ok? @snicolet

noobpwnftw commented 6 years ago

This is a standalone test package to reproduce the crash on Ryzen.

Binaries were built with devtoolset-7 GCC under CentOS7, corresponding source codes are also included, it will crash like this on Ryzen: kernel: stockfish[2896]: segfault at 7fa859390124 ip 0000000000426200 sp 00007fa81520d870 error 6 in stockfish[400000+45000]

Now it should be able produce a crash within minutes simply by running run.sh.

crash_test.zip

tomtor commented 6 years ago

@noobpwnftb I cannot reproduce the crash with your latest crash_test.zip on my Ryzen 1700x. The run.sh finished without issues.

It is the 7-core worker on fishtest and it never crashed on fishtest. I did have the RMA problem on an old 1700 which I replaced with this recent 1700x.

Are you sure it is cannot be kernel related? I am using 4.4.0-109-generic #132-Ubuntu SMP. Or perhaps it is specific to threadrippers?

vondele commented 6 years ago

If we want to try code level workarounds (really?).... Take 1: https://github.com/vondele/Stockfish/commit/1db66a753e828282de6c48feec1152621a08d43a

snicolet commented 6 years ago

@Vizvezdenec Let's keep this issue open for the moment.

At this moment I want to really understand, I want to be open and doubt about everything on the subject of these crashes.

I even want to be able to doubt that the crash is a direct consequence of the Movepicker patch, because as @AndyGrant noted in this comment: official-stockfish#1433 (comment) there were already some crashes on the Ryzen machines (most probably Ryzen, to be exact...) before the MovePicker patch.

Still, the crash report by @noobpwnftw shows a crash in the jump table instructions for MovePicker.cpp...

snicolet commented 6 years ago

Take 2: https://github.com/snicolet/Stockfish/commit/9c077ac5271b455627f53776deb66b672c4eeddf

Move the XXX_INIT stages at the end of the switch, and put the switch in an infinite loop to go to the next stages. This version is a simplification which gains 7 lines of code and is a tad faster than master, but that is not the question.

noobpwnftw commented 6 years ago

Although in my case most of the crashes are related to the MovePicker patch, I object to reverting it because it will only reduce the chances of having crashes for now, but not "solving" it, as @AndyGrant suggested with earlier tests. And it may come back again in the future.

It just so happened because we introduced crash triggering code sequences in a highly frequent execution path, thus on faulty Ryzens(including mine) we are more likely to encounter crashes.

@tomtor suggested that not every Ryzen chips are faulty, so we might just need to prepare the convincing test suite for the users proving the bug to AMD and have their faulty chips replaced.

snicolet commented 6 years ago

Take 3: https://github.com/snicolet/Stockfish/commit/32d42d7d8edf1df9d092890119907465adfea68d

Get rid of the four stages CAPTURES_INIT, PROBCUT_CAPTURES_INIT, QCAPTURES_INIT and QSEARCH_RECAPTURES_INIT altogether, inlining their computations at the beginning of their next stages using an helper function called "generate_and_score_captures()".

Maybe not elegant (I am not sure, there are arguments pros and cons), but at least the compiled binary has a simpler jump table...

Same speed as master.

noobpwnftw commented 6 years ago

@snicolet If we'd go this far maybe we should add an ARCH=x86-64-ryzen in Makefiles and fork the source code like:

#ifdef (RYZEN)

#endif

snicolet commented 6 years ago

I am coming back to the assembly display of the jump table given by @noobpwnftw (thanks!)

36705431-01f96306-1ba0-11e8-8131-865c5178dced

• The crash occurs indeed in a multi-byte NOP alignment instruction, probably 0f1f8000000000 according to this thread: https://stackoverflow.com/questions/25545470/long-multi-byte-nops-commonly-understood-macros-or-other-notation

These are seven bytes of code emitted at the end of the first cases of the switch statement...

  case MAIN_SEARCH: case EVASION: case QSEARCH: case PROBCUT:
      ++stage;
      return ttMove;

to align the beginning of the block handling the next stages (the new four stages collapsed by the patch) on 16 bytes boundary:

  case CAPTURES_INIT:
  case PROBCUT_CAPTURES_INIT:
  case QCAPTURES_INIT:
  case QSEARCH_RECAPTURES:
      endBadCaptures = cur = moves;
      endMoves = generate<CAPTURES>(pos, cur);
     ....

This is why I have tried in my two versions of workarounds (take 2 and take 3) to reorder, shake or shrink the jump table, hoping that the alignment instructions would disappear naturally by reformatting the C++ code. Of course we should check the generated code by hand to be sure that the NOP DWORD ptr [EAX + 00000000H] is no longer there.

• Is there a way to force GCC, via a compiler flag, to not emit padding nops in a switch table? We probably want to keep function alignment and loop alignment, and selectively remove that switch alignment (anyway there are very few jump tables in Stockfish code). That would maybe, or maybe not, be a slow-down, the effect of byte alignment is very variable in modern processors.

noobpwnftw commented 6 years ago

@snicolet Some updates: it is not the multi-byte NOP instruction itself that caused the crash. because in other builds it crashed in the middle of some MOV instructions, it is more likely a bug in the processors' branch predictor while handling complex switch tables, I think our next_move() logic looks too fancy.

mcostalba commented 6 years ago

Do you really think we have to address compiler/processor bugs in SF sources?

snicolet commented 6 years ago

In general, no. In this particular case I think we should if we can, yes.

For two reasons:

a) it will be very bad publicity for the project if Stockfish stops working on Ryzen.

For instance, SF8 was used as benchmark for speed by some hardware sites when Ryzen came out: https://www.hardware.fr/articles/970-1/intel-core-i7-8700k-core-i5-8600k-core-i5-8400-core-i3-8350k-test.html , linked to in ChessBase: https://en.chessbase.com/post/amd-releases-new-ryzen-processor. I just don't want their next article to begin with "last year we used Stockfish 8 and Komodo for testing the speed of chess IA on Ryzen, but the last version of Stockfish crashes randomly so this year we only publish results for Komodo."

b) the 1600 Ryzen cores that @noobpwnftw kindly provided for fishtest really boosted the speed of tests while they were there. Now he has replaced them by 800 Intel cores, which are nice too :-) but still gives fishtest 50% less global power. Having the 1600 cores back would definitely accelerate SF development.

official-stockfish / Stockfish

[Ryzen] Crashes at framework #1433