official-stockfish / fishtest

The Stockfish testing framework
https://tests.stockfishchess.org/tests

upload_pgn pserve and primary pserve can hang #2094

Open vondele opened 4 months ago

vondele commented 4 months ago

Now observed repeatedly: under load these two pserves hang simultaneously. Since it always involves the upload_pgn api, this may be the place to start looking?

vdbergh commented 4 months ago

Did you confirm that the hang of the primary pserve is not a deadlock inside Fishtest?

I am quite sure it is not, since dead tasks were properly cleared (indicating that the task thread was running normally).

vdbergh commented 4 months ago

One idea I had is that during a fleet visit we could run the root logger at debug level. I think waitress would then print much more information on what is going on.
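
For reference, in a default Pyramid production.ini this amounts to raising the level of the root logger section, roughly as follows (a sketch assuming the standard scaffold layout; the section names in Fishtest's actual config may differ):

[logger_root]
# DEBUG instead of the usual WARN; everything propagated to the root logger is printed
level = DEBUG
handlers = console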

vdbergh commented 4 months ago

> One idea I had is that during a fleet visit we could run the root logger at debug level. I think waitress would then print much more information on what is going on.

This turns out not to be true: I just tried it, and not much more information is printed.

vondele commented 4 months ago

we have relatively little to go on, in that the tracing doesn't give useful information. mongo seems responsive. It always involves upload_pgn, so that would be a logical first place to start looking IMO.

vdbergh commented 4 months ago

I guess upload_pgn is the only api that sends a substantial amount of data from the client to the server.

vondele commented 4 months ago

it is also probably one of the most frequent calls (on the non-primary pserves)

vdbergh commented 4 months ago

We could do some instrumentation of the upload_pgn api (using #2067 we could even create a specific logger for this).
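
A minimal sketch of what such instrumentation could look like (the logger name, the view function and handle_upload are illustrative stand-ins, not the actual Fishtest code):

import logging
import time

# Hypothetical dedicated logger for the upload_pgn api (issue #2067 would make
# it easy to give it its own handler and level).
upload_pgn_logger = logging.getLogger("fishtest.upload_pgn")

def handle_upload(request):
    # Stand-in for the real work done by the api (decompress, store, ...).
    return {}

def upload_pgn(request):
    # Illustrative wrapper, not the real view: log entry, exit and timing so a
    # hang can be narrowed down to this api.
    start = time.time()
    upload_pgn_logger.info("upload_pgn: start, body size %s", request.content_length)
    try:
        return handle_upload(request)
    finally:
        upload_pgn_logger.info("upload_pgn: done in %.3fs", time.time() - start)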

vondele commented 4 months ago

I found a couple of suspicious errors in the logs:

Jun 30 12:18:21 ns304014 pserve[2691]: RuntimeError while executing task: dictionary changed size during iteration
Jun 30 13:52:45 ns304014 pserve[6871]: RuntimeError while executing task: dictionary changed size during iteration
Jun 30 16:03:07 ns304014 pserve[14360]: RuntimeError while executing task: dictionary changed size during iteration
Jun 30 16:37:17 ns304014 pserve[16530]: RuntimeError while executing task: dictionary changed size during iteration
Jun 30 18:23:45 ns304014 pserve[18385]: RuntimeError while executing task: dictionary changed size during iteration
Jun 30 18:53:09 ns304014 pserve[33684]: RuntimeError while executing task: dictionary changed size during iteration
Jun 30 19:03:16 ns304014 pserve[36228]: RuntimeError while executing task: dictionary changed size during iteration
Jun 30 19:44:11 ns304014 pserve[39924]: RuntimeError while executing task: dictionary changed size during iteration
Jun 30 19:59:31 ns304014 pserve[39924]: RuntimeError while executing task: dictionary changed size during iteration
Jul 01 18:34:56 ns304014 pserve[56080]: RuntimeError while executing task: dictionary changed size during iteration
Jul 01 20:18:10 ns304014 pserve[929]: RuntimeError while executing task: dictionary changed size during iteration
Jul 01 21:39:04 ns304014 pserve[14959]: RuntimeError while executing task: dictionary changed size during iteration
Jul 01 21:53:33 ns304014 pserve[14959]: RuntimeError while executing task: dictionary changed size during iteration
Jul 01 23:11:13 ns304014 pserve[28507]: RuntimeError while executing task: dictionary changed size during iteration
Jul 02 00:11:28 ns304014 pserve[28507]: RuntimeError while executing task: dictionary changed size during iteration
Jul 02 00:56:47 ns304014 pserve[28507]: RuntimeError while executing task: dictionary changed size during iteration
vondele commented 4 months ago

Seems to always happen at the same time as Validate_data_structures (but not necessarily related to the hang); a stand-alone reproduction of the error is sketched after the log excerpt:

Jun 30 19:03:16 ns304014 pserve[36228]: Validate_data_structures: validating Fishtest's internal data structures...
Jun 30 19:03:16 ns304014 pserve[36228]: RuntimeError while executing task: dictionary changed size during iteration
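
The error itself is easy to reproduce in isolation: iterating over a dict while another thread inserts or deletes keys can raise exactly this RuntimeError (a stand-alone illustration, not Fishtest code):

import threading

d = {i: i for i in range(100_000)}

def mutate():
    # Concurrently change the dict's size while the main thread iterates.
    for i in range(100_000, 200_000):
        d[i] = i

t = threading.Thread(target=mutate)
t.start()
try:
    for k in d:  # may raise "dictionary changed size during iteration"
        pass
except RuntimeError as e:
    print(e)
t.join()
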
vdbergh commented 4 months ago

Thanks. I guess I either need to make a copy of those dicts before iterating over them, or else wrap them in an appropriate lock.
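
Either fix is a one-liner at the iteration site; a minimal sketch of the two options (names are illustrative):

import threading

lock = threading.Lock()              # hypothetical lock guarding the shared dict
shared = {"run_a": 1, "run_b": 2}    # stand-in for the shared data structure

# Option 1: snapshot the items before iterating; concurrent inserts or deletes
# no longer affect the loop, at the cost of a shallow copy.
for key, value in list(shared.items()):
    print("validating", key, value)  # placeholder for the actual validation

# Option 2: hold a lock around the iteration (every writer must take the same
# lock around its mutations).
with lock:
    for key, value in shared.items():
        print("validating", key, value)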

vdbergh commented 4 months ago

I found out how to log waitress connections, following https://docs.pylonsproject.org/projects/pyramid/en/main/narr/logging.html. production.ini has to be modified as follows:

[app:fishtest]
use = egg:fishtest_server

pyramid.reload_templates = false
pyramid.debug_authorization = false
pyramid.debug_notfound = false
pyramid.debug_routematch = false
pyramid.default_locale_name = en

mako.directories = fishtest:templates

fishtest.port = %(http_port)s
fishtest.primary_port = 6543

[filter:translogger]
use = egg:Paste#translogger
setup_console_handler = False

[pipeline:main]
pipeline = translogger
           fishtest

Output looks like this:

Starting server in PID 1239.
2024-07-02 05:38:25,985 INFO  [wsgi][waitress-0] 84.195.131.98 - - [02/Jul/2024:05:38:25 +0000] "GET /tests HTTP/1.0" 200 86132 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:127.0) Gecko/20100101 Firefox/127.0"
2024-07-02 05:38:30,877 INFO  [wsgi][waitress-1] 84.195.131.98 - - [02/Jul/2024:05:38:30 +0000] "GET /login HTTP/1.0" 200 18819 "https://3.82.105.159/tests" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:127.0) Gecko/20100101 Firefox/127.0"
2024-07-02 05:38:38,208 INFO  [wsgi][waitress-2] 84.195.131.98 - - [02/Jul/2024:05:38:38 +0000] "GET /api/active_runs HTTP/1.0" 200 9949 "https://3.82.105.159/login" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:127.0) Gecko/20100101 Firefox/127.0"
2024-07-02 05:38:39,401 INFO  [wsgi][waitress-3] 84.195.131.98 - - [02/Jul/2024:05:38:39 +0000] "POST /login HTTP/1.0" 302 169 "https://3.82.105.159/login" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:127.0) Gecko/20100101 Firefox/127.0"
2024-07-02 05:38:39,513 INFO  [wsgi][waitress-0] 84.195.131.98 - - [02/Jul/2024:05:38:39 +0000] "GET / HTTP/1.0" 302 194 "https://3.82.105.159/login" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:127.0) Gecko/20100101 Firefox/127.0"
2024-07-02 05:38:39,635 INFO  [wsgi][waitress-1] 84.195.131.98 - - [02/Jul/2024:05:38:39 +0000] "GET /tests HTTP/1.0" 200 87702 "https://3.82.105.159/login" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:127.0) Gecko/20100101 Firefox/127.0"
Failed_task: failure for: https://tests.stockfishchess.org/tests/view/66824033546a9b654e82da4b, task_id: 17, worker: user01-1cores-deruyts-81da, reason: 'Terminated by signal SIGINT'
2024-07-02 05:38:55,412 INFO  [wsgi][waitress-2] 84.195.131.98 - - [02/Jul/2024:05:38:55 +0000] "POST /api/failed_task HTTP/1.0" 200 22 "-" "python-requests/2.25.1"
2024-07-02 05:38:57,699 INFO  [wsgi][waitress-3] 84.195.131.98 - - [02/Jul/2024:05:38:56 +0000] "POST /api/upload_pgn HTTP/1.0" 200 22 "-" "python-requests/2.25.1"
2024-07-02 05:39:01,916 INFO  [wsgi][waitress-0] 84.195.131.98 - - [02/Jul/2024:05:39:01 +0000] "POST /api/request_version HTTP/1.0" 200 36 "-" "python-requests/2.25.1"
2024-07-02 05:39:02,269 INFO  [wsgi][waitress-1] 84.195.131.98 - - [02/Jul/2024:05:39:02 +0000] "POST /api/request_version HTTP/1.0" 200 38 "-" "python-requests/2.25.1"
2024-07-02 05:39:02,885 INFO  [wsgi][waitress-2] 84.195.131.98 - - [02/Jul/2024:05:39:02 +0000] "POST /api/request_version HTTP/1.0" 200 38 "-" "python-requests/2.25.1"
2024-07-02 05:39:03,500 INFO  [wsgi][waitress-3] 84.195.131.98 - - [02/Jul/2024:05:39:03 +0000] "POST /api/request_task HTTP/1.0" 200 1241 "-" "python-requests/2.25.1"
Validate_data_structures: validating Fishtest's internal data structures...
Validate_random_run: validated cache run 6680543e4eb1934d34b5aba5...
Clean_wtt_map: 1 active workers...
2024-07-02 05:41:02,901 INFO  [wsgi][waitress-0] 84.195.131.98 - - [02/Jul/2024:05:41:02 +0000] "POST /api/beat HTTP/1.0" 200 22 "-" "python-requests/2.25.1"
Clean_wtt_map: 1 active workers...
The run object 66823f9c546a9b654e82da47 does not validate: run['results_info'] is not in the schema
2024-07-02 05:43:03,325 INFO  [wsgi][waitress-1] 84.195.131.98 - - [02/Jul/2024:05:43:03 +0000] "POST /api/beat HTTP/1.0" 200 22 "-" "python-requests/2.25.1"
2024-07-02 05:45:03,857 INFO  [wsgi][waitress-2] 84.195.131.98 - - [02/Jul/2024:05:45:03 +0000] "POST /api/beat HTTP/1.0" 200 22 "-" "python-requests/2.25.1"
vondele commented 4 months ago

I have been looking at the upload_pgn code, and in principle it looks 'really simple'. What I can't figure out is how the communication between the two pserves actually happens, and in particular whether it could just be waiting forever (i.e. no timeout) in case some error happens there.
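
In general, an HTTP call made with python-requests (which the worker uses, per the access log above) blocks indefinitely when no timeout= argument is given and the peer stops responding; a minimal illustration with a made-up URL:

import requests

# Without timeout= this call can block forever if the server accepts the
# connection but never answers; with timeout= it raises instead.
try:
    r = requests.post(
        "https://primary-pserve.example/api/upload_pgn",  # made-up URL for illustration
        data={"pgn": "..."},
        timeout=(5, 30),  # (connect, read) timeouts in seconds
    )
except requests.exceptions.RequestException as e:
    print("request failed or timed out:", e)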

vdbergh commented 4 months ago

Hmm. Without looking at the source code, offhand I would say there is no communication between the two pserves. We have been experimenting with an api where a secondary instance asks the primary instance for some information, but this is currently on hold.

vondele commented 4 months ago

right... so the only 'communication' is through mongodb in some sense.

vdbergh commented 4 months ago

During a fleet visit we should probably turn on access logging for waitress (see https://github.com/official-stockfish/fishtest/issues/2094#issuecomment-2201994277). If pserve then hangs (and it really seems to be a pserve hang, and not Fishtest), we would know which api was served last, and by which thread.
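
Complementary to access logging, dumping every thread's stack at the moment the hang is observed would show directly where each waitress thread is stuck. Python's standard faulthandler module can be wired to a signal for this (a sketch, assuming it is added to the server start-up code):

import faulthandler
import signal

# After this call, sending SIGUSR1 to the hung pserve process writes the current
# stack trace of every thread to stderr without killing the process.
faulthandler.register(signal.SIGUSR1, all_threads=True)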