official-stockfish / Stockfish

A free and strong UCI chess engine
https://stockfishchess.org/

Weird nps drop #2508

Closed NKONSTANTAKIS closed 4 years ago

NKONSTANTAKIS commented 4 years ago

At CCC (90 threads, 16GB hash, 1+1 TC, Ct=0), in every game SF starts at about 80 Mnps on move 1, then drops to as low as 3 Mnps (~1/30) for a few moves. This slowdown affects moves 2-7, after which it reaches its usual maximum of ~100 Mnps. Is this normal? Can we reproduce it on a highly threaded machine?

I see now that in some games the drop affects just 1-2 moves, so I reckon it needs time to get the hash working. When it takes more time to play a move, it can reach max nps on the next move; when it doesn't, it can be slow for a series of moves, missing critical continuations on book exit. For example game 20, a KGA forced win that SF misplayed from the start, opting for a drawn line.

I remember that Komodo always had a similar issue, and Mark mitigated it by forcing it to use more time on the first move(s).

NKONSTANTAKIS commented 4 years ago

I tracked down archived games and spotted this behavior going back almost a year, but it was less frequent; many games reached max speed on move 2 (no dip). All 1+1, hash 16GB. Maybe that is too much for this TC?

NKONSTANTAKIS commented 4 years ago

It seems that the nps drop happens more severely and for more moves when SF gets an unexpected reply, because the previous hash buildup becomes largely obsolete. When SF gets an expected reply it can often reach max nps from move 2, but when it is repeatedly surprised it can struggle at up to ~1/20 of its nps for 2-3 moves in a row, resulting in a very weak early game.

Note that in self-play this is a much rarer occurrence, as it is the same engine producing a very similar search tree. Possibly the Elo gain of some of the tricks we do (like re-searching the same depth, lowering the main-thread depth on fail-high, and others) comes from further hash buildup on the same base. This time economy and hash reuse can be wasted by an unexpected move, but it works due to self-play bias.

It's like further exploring a neighborhood you have already explored, because exploring an unknown neighborhood is an expensive task, and benefiting because the usual battleground is your own area!

Naturally, unexpected weak moves are not a problem, but NNs routinely produce unexpected superior moves, and SF is more clueless than it could be due to this sterile self-play optimization. A solution would be testing vs NN engines, probably beneficial even at 1/4 of the game production (on CPU). With GPUs and more than 1/2 productivity it would of course be ideal, but that comes with big framework implementation challenges.

All in all, these issues tend to diminish at higher TCs, so no huge concern is justified.

NKONSTANTAKIS commented 4 years ago

Besides hash buildup, another explanation could be that the aspiration window early on (when the hash has not grown yet) struggles with some variations much more than others, and the nps drop is the result of many fail-highs and fail-lows. But that is still not a convincing explanation for such a severe slowdown, so the root cause is probably some conditional, weird interaction of our techniques.

The link below is the earliest occurrence of a 1+1 CCC tournament, from last June. This slowdown behavior was already there, seemingly somewhat less severe. It is apparently evident in every single 1+1 event; strange how it passed unnoticed.

https://www.chess.com/computer-chess-championship#event=ccc-9-the-gauntlet-semifinals-test&game=13

NKONSTANTAKIS commented 4 years ago

Regarding hash size, in the current match vs Scorpio it was lowered to 8GB as requested. This seems to have a speedup effect both at move 1 (~80 Mnps vs ~50 Mnps with 16GB) and in recovering from the dip, which remains.

A direct game-by-game comparison of this match with Leela's shows that the dip pattern is closely related to the position, but it is occasionally immensely relieved by the smaller hash. All in all a clearly superior setting. The next test will be with 4GB.

vondele commented 4 years ago

Based on the log provided by @crossbr one can exactly replay the game (i.e. send the same sequence of UCI commands). I used the same hash (16GB) and threads (90), with one exception: no tablebases, as they are not on that machine. This shows no drop locally:

[graph: replay]

So I don't think it is related to the position or the opponent's moves.
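For anyone who wants to attempt the same reproduction, a minimal sketch of such a replay (the file name ccc.inp.txt is illustrative here; it stands for the UCI command sequence extracted from the log):

# feed the logged UCI command sequence to a local build and capture the output
./stockfish < ccc.inp.txt > replay.log

# pull the reported search speeds out of the output to look for an early-move dip
grep -o 'nps [0-9]*' replay.log | tail -n 50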

NKONSTANTAKIS commented 4 years ago

That's great, it means SF has no universal problem. But on their specific HW, which is slower than yours, there is a distinct yet not absolute relation to the position.

CPUs: 2 x Intel Xeon Platinum 8168 @ 2.70 GHz, 33 MB L3
Threads: 90 threads with HT on
Hash: 16384 MB
RAM: 256 GB DDR4-2666 ECC Registered RDIMM
SSD: 2x Crucial MX300 (1TB) in RAID1
OS: CentOS 7

All of this points to a memory bottleneck/latency issue, which shows up more when the rate of hash expansion/alteration is high (pushing RAM reads/writes to the limit). This would explain the relation of the dip to unexpected moves, as well as the considerable relief with 8GB. A further indication is that in the match vs Ethereal the dip occurred in only ~1/3 of the games (due to Ethereal producing more moves within expectations, preserving the hash buildup).

I will soon ask for some tests to affirm this.

vondele commented 4 years ago

Looking more closely at the original logs, there are no tbhits in the first few moves, so that more or less excludes tablebase access as a source of the slowdown.

I don't think there is a memory bottleneck/latency issue related to hash usage in SF; no amount of TT writes/reads would slow SF down by ~10x. More likely some other process running on that machine is interfering.
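If an interfering process is the suspect, a quick check along these lines while a game is running could confirm or rule it out (a sketch, assuming standard Linux tools such as sysstat and procps are available):

# per-second CPU usage of the engine; if its %CPU drops while nps is low,
# something else is likely competing for the cores
pidstat -u 1 -p "$(pgrep -x stockfish)"

# snapshot of the heaviest processes on the box
top -b -n 1 -o %CPU | head -n 20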

NKONSTANTAKIS commented 4 years ago

I had 3 CCC windows of 3 matches side by side: 16GB vs Leela, 16GB vs Ethereal and 8GB vs Scorpio. Clicking through every game one by one and comparing the speed graphs, the same openings showed very similar dip patterns. 8GB completely alleviated the symptom in some cases and mitigated it in others. Vs Ethereal the dip had a much lower rate of occurrence.

So if other interfering processes were hindering SF, they should have hindered Ethereal or other AB engines too. But that was never the case; only SF has been hindered, in every 1+1 event since June. Komodo too was showing its typical behavior of a very slow nps start (slowest at move 1, then gradually rising), which hinders its bullet performance.

NKONSTANTAKIS commented 4 years ago

@vondele To try harder to induce it locally, how about more hash and/or a lower TC? It was not observed in CCC games at TCs higher than 1+1.

I asked greco for 2 short tests:

  1. Without TBs, in order to easily exclude them.
  2. Test Cfish: if it does not dip, we narrow things down to its few differences from SF.

vondele commented 4 years ago

[graph: replay_hash]

Same replay with 250 threads and 128GB hash...

Vizvezdenec commented 4 years ago

Then maybe it's the TBs that are for some (unknown) reason responsible for this?

Tilps commented 4 years ago

I was trying to think of some way that the partial 7-piece TB could be the problem: because it is only partial, if there are probes against the missing files there won't be any TB hits, but somehow it's still slow? Seems a bit of an unreasonable explanation at first glance.

vondele commented 4 years ago

[graph: replay_hash_tb]

With added 6-men TB (on slow storage).

NKONSTANTAKIS commented 4 years ago

Digging through old stuff, I found this description of a severe slowdown related to the 7-men TBs (which CCC uses) and problematic OS paging under memory pressure. The comments of syzygy1 are noteworthy: https://github.com/official-stockfish/Stockfish/issues/1789
And the solution: https://github.com/official-stockfish/Stockfish/pull/1790
Are we totally clean here?

That could also explain:

  1. The onset of the dip: move 1 is fast due to few TB hits, but as they rise the OS soaks up resources trying to increase performance.
  2. The positional relation: as the move-by-move increase in TB hits differs from position to position, so does the OS reaction.

This could also be related to CCC's SSD RAID configuration; such setups, in trying to ensure maximum performance, often abuse system resources. It would also further explain why the dip has been getting worse over the months: SF prunes more and so reaches more TB hits early and suddenly.

Edit: TBs off, threads=1: dip remains. Hash=0: no dip.

skiminki commented 4 years ago

It is also possible that deferred transparent huge page (THP) defragmentation is the cause of the initial dip (this depends on the CCC server's system settings). If so, today's SF-dev might already fix this issue, since we now request huge pages for the TT hash allocation with an madvise() call, and that should do all the defragmenting before move 1. This was committed yesterday: https://github.com/official-stockfish/Stockfish/commit/39437f4e55aaa26ef9f0d5a1c762e560e9ffde32
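A rough way to check on the CCC box whether the hash actually ends up backed by huge pages after that change (a sketch, assuming a Linux system with /proc; STOCKFISH_PID is a placeholder for the engine's process id):

# per-mapping THP usage; the TT should appear as one large mapping with a
# non-zero AnonHugePages value once the madvise() request has been honored
grep AnonHugePages /proc/"$STOCKFISH_PID"/smaps | sort -k 2 -n | tail -n 3

# system-wide view of anonymous huge page usage
grep AnonHugePages /proc/meminfo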

skiminki commented 4 years ago

This should tell whether deferred huge page defragmentation is in use:

cat /sys/kernel/mm/transparent_hugepage/defrag

vondele commented 4 years ago

@skiminki I'm wondering about your statement. Since we explicitly memset (rather than calloc) the TT prior to search, I would expect this defragmenting not to be an issue... but maybe that's not right. What do you think?

skiminki commented 4 years ago

@vondele Hard to say for sure whether this is the issue here, but I'd consider it to match the symptoms. So it is at least a possibility (depending on which defrag mode is set).

https://www.kernel.org/doc/Documentation/vm/transhuge.txt says:

"defer" means that an application will wake kswapd in the background to reclaim pages and wake kcompactd to compact memory so that THP is available in the near future. It's the responsibility of khugepaged to then install the THP pages later.

I would interpret this as: memset() touching the pages for the first time triggers physical memory defrag, and kcompactd/khugepaged then later transform the small pages into large pages. Later = during the first moves.

This mode has the benefit that launching the application is faster than with defrag=always. But it comes at the cost of the application being significantly slower during a warm-up period, due to physical page migration (essentially similar to a page being swapped out and back in).
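If that is what happens on the CCC machine, it should show up in the kernel's counters during the first few moves; a sketch of what one could watch (Linux-specific; the exact set of counters varies between kernel versions):

# compaction and THP activity; counters climbing steadily while nps is low would
# point at deferred defrag / page migration as the culprit
watch -n 1 "grep -E 'compact_|thp_' /proc/vmstat"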

vondele commented 4 years ago

OK, let's see what the testing of current master yields.

It would be useful indeed to see output of cat /sys/kernel/mm/transparent_hugepage/defrag on that machine.

vondele commented 4 years ago

Is there any update on this after master switched to large pages?

NKONSTANTAKIS commented 4 years ago

Nope, the next 1+1 event will show. If someone wants to find out earlier, they can ask for a test.

skiminki commented 4 years ago

TBH, this feels like a server configuration issue rather than an SF bug. TCEC does not encounter these issues on Linux, and they run with more threads and bigger hashes. They also run SF at TCs as short as 1.8s + sub-second increment in testing, so this low-nps dip would get noticed.

To determine whether THP may be the issue, we need the output of these commands:

cat /sys/kernel/mm/transparent_hugepage/enabled

cat /sys/kernel/mm/transparent_hugepage/defrag

Depending on the settings, transparent huge pages will try to compact all process memory (not just the hash) in the background, and there are still a number of ways in which khugepaged and related daemons may cause havoc if configured incorrectly.

A very simple experiment to determine whether the THP config is to blame is to run the following commands:

echo never >/sys/kernel/mm/transparent_hugepage/enabled

echo never >/sys/kernel/mm/transparent_hugepage/defrag

and see whether the slowdown is gone.

It is also a possibility that this slowdown is completely unrelated to THP, so it would be useful to try to rule things out.

I've also recently inspected what SF does on init (for launch time optimization), and I haven't found anything that would explain the reported slowdown.

"I remember that Komodo always had a similar issue, and Mark mitigated it by forcing it to use more time on the first move(s)."

This was due to a particular bug in Komodo. The bug was a global counter of the number of occupied hash entries, incremented whenever a zeroed entry was written for the first time. On high-thread NUMA machines such a global counter becomes a huge bottleneck. This explains why Komodo's nps gradually increased as the hash became more utilized (fewer counter increments).

But this issue has nothing to do with SF or most other engines, which do not use such a counter. Komodo doesn't have that bottleneck anymore, either.

vondele commented 4 years ago

Looks like this is still happening at CCC in their tournament using 140220 64 BMI2. I guess that can only be resolved with the help of the sysadmin over there.

NKONSTANTAKIS commented 4 years ago

And Komodo is even more affected, starting with as low as 0.5Mnps for a couple of moves and slowly rising. But Houdini and Fire are fine...such a mystery.

skiminki commented 4 years ago

Is CCC doing something to resolve this issue?

Here's another knob to try:

echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page

vondele commented 4 years ago

I'll close this, since there is no new information, and in all likelihood this is not related to SF. If new info appears that suggests otherwise, please reopen.

jessegersensonchess commented 4 years ago

@vondele I'm the CCC sysadmin. Is there anything you'd like me to try?

vondele commented 4 years ago

@jessegersensonchess so I assume the problem still persists?

My guess would be that it has something to do with how the system deals with memory being provided to the process, e.g. freeing pages, going into swap, etc.

I think the first things to try would be the suggestions by @skiminki, namely the output of

cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag

afterwards see if

echo never >/sys/kernel/mm/transparent_hugepage/enabled
echo never >/sys/kernel/mm/transparent_hugepage/defrag
echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page

make a difference

jessegersensonchess commented 4 years ago

swap is disabled on this host:

/data/engines]$ free -h
              total        used        free      shared  buff/cache   available
Mem:           251G         10G        8.3G        914M        232G        238G
Swap:            0B          0B          0B

cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

cat /sys/kernel/mm/transparent_hugepage/defrag
[always] madvise never

jessegersensonchess commented 4 years ago

@NKONSTANTAKIS The issue with the slow Komodo start was reviewed recently. The cause was a bug in Komodo, related to the order in which UCI parameters were being set.

vondele commented 4 years ago

I'm not excluding bugs, but the graphs above are an exact replay of the UCI commands that were sent (taken from the logs), and they didn't show that problem. It would be good to confirm that the problem appears when the same UCI command sequence is sent on that machine. I'll attach the input here; one should be able to run it as ./stockfish < ccc.inp.txt > out
ccc.inp.txt

noobpwnftw commented 4 years ago

Xeon Scalable CPUs (models without the M suffix) are, AFAIK, known to have poor NUMA bandwidth. To work around that, you might want 1) zone_reclaim_mode=0 and 2) numactl --interleave=all, so that pages are not evicted on one node with the kernel then trying to move them around. With interleaving, the kernel may migrate threads instead of moving memory pages, which is less damaging to performance while the process runs.
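A sketch of those two workarounds as they could be applied on the CCC host (the engine path is illustrative; numactl must be installed, and the sysctl needs root):

# do not reclaim/evict pages on the local node before falling back to remote nodes
sysctl -w vm.zone_reclaim_mode=0

# spread the hash allocation evenly across the NUMA nodes instead of filling one node first
numactl --interleave=all ./stockfish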