time losses at high concurrency

vondele commented 4 years ago

I'm going to take this question https://github.com/cutechess/cutechess/issues/630#issuecomment-720011788 to this issue tracker. Since that might be useful for future exchange.

I note one thing, I had to change the time-out to be able to run with large (250) concurrency:

diff --git a/src/engine.c b/src/engine.c
index 09da452..998a2ad 100644
--- a/src/engine.c
+++ b/src/engine.c
@@ -124,7 +124,7 @@ Engine engine_new(const char *cmd, const char *name, const str_t *options, Deadl
     vec_del_rec(args, str_del);

     // Start the uci..uciok dialogue
-    deadline_set(deadline, &e, system_msec() + 2000);
+    deadline_set(deadline, &e, system_msec() + 20000);
     engine_writeln(&e, "uci");
     scope(str_del) str_t line = str_new();

without this change, c-chess-cli would stop saying that it failed to start the engines (forgot the precise error message). Given it is loading 500 20MB binaries from disk, that is possible I think.

lucasart commented 4 years ago

Thank you for testing c-chess-cli. Your analysis is correct. In detail, here's the logic:

worker thread sends "uci", sets a deadline (2s), and starts reading (using standard blocking I/O functions).
master thread monitors deadlines (one per worker), at regular intervals (0.1s). When it sees a deadline overshoot by a large margin (1s), it aborts with error message "engine is unresponsive".

So, under your testing conditions, that means (at least one of the 500 instances of) SF took more than 3s to respond "uciok". This is enormous, even for loading a 20MB file from disk. Do you really have 128 cores / 256 hyper-threads ?

lucasart commented 4 years ago

Regarding the time losses you experience with 250 threads, which disappear when you share the parallelism with the OS (eg. 25 c-chess-cli instances with concurrency=10). Can you completely remove this loop in main.c:main() ?

    // Main thread loop: check deadline overdue at regular intervals
    do {
        system_sleep(100);

        for (int i = 0; i < options.concurrency; i++) {
            const int64_t delay = deadline_overdue(&Workers[i]);

            // We want some tolerance on small delays here. Given a choice, it's best to wait for
            // the worker thread to notice an overdue deadline, which it will handled nicely by
            // counting the game as lost for the offending engine, and continue. Enforcing deadlines
            // from the master thread is the last resort solution, because it is an unrecovrable
            // error. At this point we are likely to face a completely unresponsive engine, where
            // any attempt at I/O will block the master thread, on top of the already blocked
            // worker. Hence, we must DIE().
            if (delay > 1000)
                DIE("[%d] engine %s is unresponsive\n", i, Workers[i].deadline.engineName.buf);
        }
    } while (!job_queue_done(&jq));

This is the part that does the aformentionned deadline check. Every 0.1s, the master thread checks (under lock protection) 250 deadlines (one for each worker). Which could introduce some contention...

vondele commented 4 years ago

This is enormous, even for loading a 20MB file from disk. Do you really have 128 cores / 256 hyper-threads ?

yes 128/256 cores/threads, but note that the binary is on a network mounted disk (which is quite common on such hardware).

vondele commented 4 years ago

OK, I updated to current master (from the version I had before e5e24c4b255dcc498d18fa67c1ced47f7f54ab44 Thu Oct 22), and indeed there are no time losses in this round of testing (with or without the code in main removed). (neither at 2+0.02s nor at 1+0.01s). So either something changed in the code, or somehow the machine is in a healthier state (was rebooted, I see).

I had one match that failed to start ('[7] engine Stockfish 211020 is unresponsive' at 1+0.01) even though I have the 2000ms -> 20000ms in place.

lucasart commented 4 years ago

OK, I updated to current master (from the version I had before e5e24c4 Thu Oct 22), and indeed there are no time losses in this round of testing (with or without the code in main removed). (neither at 2+0.02s nor at 1+0.01s). So either something changed in the code, or somehow the machine is in a healthier state (was rebooted, I see).

Was that with or without the 2s -> 20s change ?

vondele commented 4 years ago

That's with that change.

lucasart commented 4 years ago

doubled timeout to 4s. that's 4+1=5s total. should be enough. i don't want to increase this more, as it's a bad trade-off between the normal use case and the special use case. in the normal use case, we want to get a quick feedback if the engine does not respond uciok, for example the user provided a CECP engine instead of an UCI one, etc.

vondele commented 4 years ago

OK, you want me to close the issue or do any further testing?

lucasart commented 4 years ago

thanks. closing for now.

lucasart / c-chess-cli

time losses at high concurrency #22