Closed vondele closed 4 years ago
Thank you for testing c-chess-cli. Your analysis is correct. In detail, here's the logic:
So, under your testing conditions, that means (at least one of the 500 instances of) SF took more than 3s to respond "uciok". This is enormous, even for loading a 20MB file from disk. Do you really have 128 cores / 256 hyper-threads?
Regarding the time losses you experience with 250 threads, which disappear when you share the parallelism with the OS (e.g. 25 c-chess-cli instances with concurrency=10): can you completely remove this loop in main.c:main()?
    // Main thread loop: check deadline overdue at regular intervals
    do {
        system_sleep(100);

        for (int i = 0; i < options.concurrency; i++) {
            const int64_t delay = deadline_overdue(&Workers[i]);

            // We want some tolerance on small delays here. Given a choice, it's best to wait for
            // the worker thread to notice an overdue deadline, which it will handle nicely by
            // counting the game as lost for the offending engine, and continue. Enforcing deadlines
            // from the master thread is the last resort solution, because it is an unrecoverable
            // error. At this point we are likely to face a completely unresponsive engine, where
            // any attempt at I/O will block the master thread, on top of the already blocked
            // worker. Hence, we must DIE().
            if (delay > 1000)
                DIE("[%d] engine %s is unresponsive\n", i, Workers[i].deadline.engineName.buf);
        }
    } while (!job_queue_done(&jq));
This is the part that does the aforementioned deadline check. Every 0.1s, the master thread checks (under lock protection) 250 deadlines (one for each worker), which could introduce some contention...
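To make the contention concern concrete, here is a minimal sketch of what a lock-protected deadline check could look like. The `Deadline` struct and the `deadline_init`, `deadline_arm`, `deadline_clear` and `deadline_overdue_at` functions are hypothetical simplifications for illustration, not the actual c-chess-cli code; in particular, it is an assumption that workers take the same mutex when they arm and clear their deadline.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

// Hypothetical, simplified per-worker deadline (not the real c-chess-cli struct).
typedef struct {
    pthread_mutex_t mtx;
    bool set;           // true while the worker waits for an engine reply
    int64_t timeLimit;  // absolute time (ms) by which the engine must answer
} Deadline;

void deadline_init(Deadline *d) {
    pthread_mutex_init(&d->mtx, NULL);
    d->set = false;
    d->timeLimit = 0;
}

// Worker side (assumed): arm the deadline before a blocking read from the engine.
void deadline_arm(Deadline *d, int64_t timeLimit) {
    assert(timeLimit >= 0);
    pthread_mutex_lock(&d->mtx);
    d->set = true;
    d->timeLimit = timeLimit;
    pthread_mutex_unlock(&d->mtx);
}

// Worker side (assumed): clear the deadline once the engine has answered.
void deadline_clear(Deadline *d) {
    pthread_mutex_lock(&d->mtx);
    d->set = false;
    pthread_mutex_unlock(&d->mtx);
}

// Master side: milliseconds overdue at time `now`, or 0 if the deadline is
// not armed or not yet passed. Every call takes the worker's mutex.
int64_t deadline_overdue_at(Deadline *d, int64_t now) {
    pthread_mutex_lock(&d->mtx);
    const int64_t overdue = (d->set && now > d->timeLimit) ? now - d->timeLimit : 0;
    pthread_mutex_unlock(&d->mtx);
    return overdue;
}
```

Under these assumptions, one pass of the master loop over 250 workers is 250 lock/unlock pairs every 100ms, each of which can collide with a worker arming or clearing its own deadline. Whether that is enough to cause measurable time losses is exactly what removing the loop would test.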
> This is enormous, even for loading a 20MB file from disk. Do you really have 128 cores / 256 hyper-threads?
Yes, 128 cores / 256 threads, but note that the binary is on a network-mounted disk (which is quite common on such hardware).
OK, I updated to current master (from the version I had before, e5e24c4b255dcc498d18fa67c1ced47f7f54ab44, Thu Oct 22), and indeed there are no time losses in this round of testing, with or without the code in main removed, at either 2+0.02s or 1+0.01s. So either something changed in the code, or the machine is somehow in a healthier state (it was rebooted, I see).
I had one match that failed to start ('[7] engine Stockfish 211020 is unresponsive' at 1+0.01) even though I have the 2000ms -> 20000ms change in place.
> OK, I updated to current master (from the version I had before e5e24c4 Thu Oct 22), and indeed there are no time losses in this round of testing (with or without the code in main removed). (neither at 2+0.02s nor at 1+0.01s). So either something changed in the code, or somehow the machine is in a healthier state (was rebooted, I see).
Was that with or without the 2s -> 20s change?
That's with that change.
Doubled the timeout to 4s. That's 4+1=5s total, which should be enough. I don't want to increase this more, as it's a bad trade-off between the normal use case and the special use case. In the normal use case, we want quick feedback if the engine does not respond "uciok", for example because the user provided a CECP engine instead of a UCI one.
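The 4+1=5s figure combines the 4s allowed for "uciok" with the 1s tolerance in the master loop (`delay > 1000`). A minimal sketch of that arithmetic, where the constant and function names are hypothetical and chosen only for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum {
    UCIOK_TIMEOUT_MS = 4000,     // time allowed for "uciok" after the doubling
    MASTER_TOLERANCE_MS = 1000   // master loop tolerance: DIE() when delay > 1000
};

// Returns true if the master thread would abort, given the time elapsed (ms)
// since the deadline for "uciok" was armed. Hypothetical helper.
bool master_would_abort(int64_t elapsedMs) {
    assert(elapsedMs >= 0);
    const int64_t delay = elapsedMs - UCIOK_TIMEOUT_MS;  // ms past the deadline
    return delay > MASTER_TOLERANCE_MS;
}
```

With these values, an engine that answers within 5000ms of the deadline being armed never triggers the fatal path; since the master only checks every 100ms, the actual abort can come slightly later than 5s.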
OK, do you want me to close the issue or do any further testing?
Thanks. Closing for now.
I'm going to take this question https://github.com/cutechess/cutechess/issues/630#issuecomment-720011788 to this issue tracker, since that might be useful for future exchanges.
I note one thing, I had to change the time-out to be able to run with large (250) concurrency:
Without this change, c-chess-cli would stop, saying that it failed to start the engines (I forgot the precise error message). Given that it is loading 500 20MB binaries from disk, I think that is plausible.