lichess-org / fishnet

Distributed Stockfish analysis for lichess.org
https://lichess.org/get-fishnet
GNU General Public License v3.0

fairy-stockfish timeouts in 2.8.1 #254

Closed Stecors closed 10 months ago

Stecors commented 11 months ago

Since the 2.8.1 update, I have been seeing occasional worker crashes. That did not happen on 2.7.1, which I had been running 24/7 on a server for weeks.

Examples:

2024-01-05 21:49:00 W: Fairy-Stockfish timed out in worker 2.
2024-01-05 21:49:02 W: Fairy-Stockfish timed out in worker 3.
2024-01-05 21:49:02 W: Fairy-Stockfish timed out in worker 0.
2024-01-05 21:49:26 W: Fairy-Stockfish timed out in worker 1.

arch: x86_64-unknown-linux-musl

The same error occurs with the new parameter --cpu-priority unchanged.

niklasf commented 10 months ago

Hi, thanks for reporting. Can you please try the current development version (or binary snapshots from https://github.com/lichess-org/fishnet/actions/runs/7432495919) to see if 9f1a11097cb6ede7ea21e6975bc39e2b690467f9 fixes the issue?

CarsonV commented 10 months ago

Had the same issue on Windows. It seems to be working better with 9f1a110, but I'm seeing what looks like lower nodes per second: around 4-6k nps, down from 7-10k nps before the 2.8 changes (5800X CPU).

Stecors commented 10 months ago

I have let 2.8.2-dev run overnight. Even though there were only a handful of fairy-stockfish jobs, I haven't seen any timeouts anymore. Thanks for the quick fix.

niklasf commented 10 months ago

Thank you both.

For nps, since it is measured as the nodes of real positions (excluding the newly introduced chunk overlap) divided by the total time taken for the whole batch, a ~20% drop is expected. The degree of parallelism also varies much more now, so there's more variance in this measurement. We could measure something smoother like nodes per CPU time, but ultimately wall clock time is what's relevant for the user experience.
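
As a rough illustration of that definition, here is a minimal sketch in Rust. It is not fishnet's actual code: the `PositionResult` type, its `overlap` flag, and the `batch_nps` function are hypothetical names made up for this example, and the numbers in the usage note are invented.

```rust
use std::time::Duration;

/// Result for one analysed position in a batch.
/// Hypothetical type for illustration only, not taken from fishnet.
struct PositionResult {
    /// Nodes searched for this position.
    nodes: u64,
    /// True if the position was only analysed as chunk overlap
    /// and is not one of the "real" positions of the batch.
    overlap: bool,
}

/// Nodes of real positions divided by the wall-clock time of the whole batch.
/// Overlap positions add to the time spent but not to the node count.
/// Returns `None` if no time has elapsed.
fn batch_nps(results: &[PositionResult], wall_time: Duration) -> Option<u64> {
    let real_nodes: u64 = results
        .iter()
        .filter(|r| !r.overlap)
        .map(|r| r.nodes)
        .sum();

    let secs = wall_time.as_secs_f64();
    if secs > 0.0 {
        Some((real_nodes as f64 / secs) as u64)
    } else {
        None
    }
}
```

With made-up numbers: four real positions and one overlap position at 10,000 nodes each, analysed in 5 seconds of wall-clock time, give 40,000 / 5 = 8,000 nps, whereas counting all five positions would give 10,000 nps. The engine is no slower in that case; the reported figure drops because the overlap work is excluded from the numerator.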