vondele opened this issue 4 months ago
Did you confirm that the hang of the primary pserve is not a deadlock inside Fishtest?
I am quite sure it is not since dead tasks were properly cleared (indicating that the task thread was running normally).
One idea I had is that during a fleet visit we may run the root logger at debug level. I think waitress will then print much more information about what's going on.

> One idea I had is that during a fleet visit we may run the root logger at debug level. I think waitress will then print much more information about what's going on.

This seems not to be the case. I just tried it and not much more information is printed.
We have relatively little info: the tracing doesn't give anything useful, and mongo seems responsive. The hang always involves upload_pgn, so that would be a logical first place to start looking IMO.
I guess upload_pgn is the only api that sends a substantial amount of data from the client to the server.
It is also probably one of the most frequent calls (on the non-primary pserves).
We could do some instrumentation of the upload_pgn api (using #2067 we could even create a specific logger for this).
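For instance, a minimal sketch of what such instrumentation could look like, assuming a plain function view and a dedicated logger (the decorator, logger name and view signature are illustrative, not the actual fishtest code):

```python
import logging
import time
from functools import wraps

# Hypothetical dedicated logger for the upload_pgn api.
upload_pgn_logger = logging.getLogger("fishtest.upload_pgn")

def instrument(logger):
    """Log start, duration and exceptions of a view callable."""
    def decorator(view):
        @wraps(view)
        def wrapper(request):
            start = time.monotonic()
            logger.info("upload_pgn: start, %s bytes", request.content_length)
            try:
                return view(request)
            except Exception:
                logger.exception("upload_pgn: raised")
                raise
            finally:
                logger.info("upload_pgn: done in %.3fs", time.monotonic() - start)
        return wrapper
    return decorator

# @instrument(upload_pgn_logger)
# def upload_pgn(request):
#     ...
```

With the `finally` clause we would at least see in the journal whether a call entered upload_pgn but never logged a matching "done" line.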
I found a number of suspicious errors in the logs:
Jun 30 12:18:21 ns304014 pserve[2691]: RuntimeError while executing task: dictionary changed size during iteration
Jun 30 13:52:45 ns304014 pserve[6871]: RuntimeError while executing task: dictionary changed size during iteration
Jun 30 16:03:07 ns304014 pserve[14360]: RuntimeError while executing task: dictionary changed size during iteration
Jun 30 16:37:17 ns304014 pserve[16530]: RuntimeError while executing task: dictionary changed size during iteration
Jun 30 18:23:45 ns304014 pserve[18385]: RuntimeError while executing task: dictionary changed size during iteration
Jun 30 18:53:09 ns304014 pserve[33684]: RuntimeError while executing task: dictionary changed size during iteration
Jun 30 19:03:16 ns304014 pserve[36228]: RuntimeError while executing task: dictionary changed size during iteration
Jun 30 19:44:11 ns304014 pserve[39924]: RuntimeError while executing task: dictionary changed size during iteration
Jun 30 19:59:31 ns304014 pserve[39924]: RuntimeError while executing task: dictionary changed size during iteration
Jul 01 18:34:56 ns304014 pserve[56080]: RuntimeError while executing task: dictionary changed size during iteration
Jul 01 20:18:10 ns304014 pserve[929]: RuntimeError while executing task: dictionary changed size during iteration
Jul 01 21:39:04 ns304014 pserve[14959]: RuntimeError while executing task: dictionary changed size during iteration
Jul 01 21:53:33 ns304014 pserve[14959]: RuntimeError while executing task: dictionary changed size during iteration
Jul 01 23:11:13 ns304014 pserve[28507]: RuntimeError while executing task: dictionary changed size during iteration
Jul 02 00:11:28 ns304014 pserve[28507]: RuntimeError while executing task: dictionary changed size during iteration
Jul 02 00:56:47 ns304014 pserve[28507]: RuntimeError while executing task: dictionary changed size during iteration
These always seem to occur at the same time as Validate_data_structures (though not necessarily related to the hang):
Jun 30 19:03:16 ns304014 pserve[36228]: Validate_data_structures: validating Fishtest's internal data structures...
Jun 30 19:03:16 ns304014 pserve[36228]: RuntimeError while executing task: dictionary changed size during iteration
Thanks. I guess I either need to make a copy of those dicts before iterating over them, or else wrap them in an appropriate lock.
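For concreteness, the two options might look like this (a sketch with a made-up shared dict; not the actual validation code):

```python
import threading

tasks = {}                      # shared dict mutated by other threads (illustrative)
tasks_lock = threading.Lock()   # lock that the writers would also have to take

def validate_with_copy():
    # Option 1: iterate over a snapshot, so concurrent inserts/deletes in
    # other threads cannot trigger "dictionary changed size during iteration".
    for key, value in list(tasks.items()):
        ...  # validate one entry

def validate_with_lock():
    # Option 2: hold the writers' lock for the whole iteration.
    with tasks_lock:
        for key, value in tasks.items():
            ...  # validate one entry
```

The copy is cheaper to retrofit since it does not require touching the writers, but it validates a possibly slightly stale snapshot.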
I found out how to log waitress connections, following https://docs.pylonsproject.org/projects/pyramid/en/main/narr/logging.html. production.ini has to be modified as follows:
[app:fishtest]
use = egg:fishtest_server
pyramid.reload_templates = false
pyramid.debug_authorization = false
pyramid.debug_notfound = false
pyramid.debug_routematch = false
pyramid.default_locale_name = en
mako.directories = fishtest:templates
fishtest.port = %(http_port)s
fishtest.primary_port = 6543
[filter:translogger]
use = egg:Paste#translogger
setup_console_handler = False
[pipeline:main]
pipeline = translogger
           fishtest
The output looks like this:
Starting server in PID 1239.
2024-07-02 05:38:25,985 INFO [wsgi][waitress-0] 84.195.131.98 - - [02/Jul/2024:05:38:25 +0000] "GET /tests HTTP/1.0" 200 86132 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:127.0) Gecko/20100101 Firefox/127.0"
2024-07-02 05:38:30,877 INFO [wsgi][waitress-1] 84.195.131.98 - - [02/Jul/2024:05:38:30 +0000] "GET /login HTTP/1.0" 200 18819 "https://3.82.105.159/tests" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:127.0) Gecko/20100101 Firefox/127.0"
2024-07-02 05:38:38,208 INFO [wsgi][waitress-2] 84.195.131.98 - - [02/Jul/2024:05:38:38 +0000] "GET /api/active_runs HTTP/1.0" 200 9949 "https://3.82.105.159/login" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:127.0) Gecko/20100101 Firefox/127.0"
2024-07-02 05:38:39,401 INFO [wsgi][waitress-3] 84.195.131.98 - - [02/Jul/2024:05:38:39 +0000] "POST /login HTTP/1.0" 302 169 "https://3.82.105.159/login" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:127.0) Gecko/20100101 Firefox/127.0"
2024-07-02 05:38:39,513 INFO [wsgi][waitress-0] 84.195.131.98 - - [02/Jul/2024:05:38:39 +0000] "GET / HTTP/1.0" 302 194 "https://3.82.105.159/login" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:127.0) Gecko/20100101 Firefox/127.0"
2024-07-02 05:38:39,635 INFO [wsgi][waitress-1] 84.195.131.98 - - [02/Jul/2024:05:38:39 +0000] "GET /tests HTTP/1.0" 200 87702 "https://3.82.105.159/login" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:127.0) Gecko/20100101 Firefox/127.0"
Failed_task: failure for: https://tests.stockfishchess.org/tests/view/66824033546a9b654e82da4b, task_id: 17, worker: user01-1cores-deruyts-81da, reason: 'Terminated by signal SIGINT'
2024-07-02 05:38:55,412 INFO [wsgi][waitress-2] 84.195.131.98 - - [02/Jul/2024:05:38:55 +0000] "POST /api/failed_task HTTP/1.0" 200 22 "-" "python-requests/2.25.1"
2024-07-02 05:38:57,699 INFO [wsgi][waitress-3] 84.195.131.98 - - [02/Jul/2024:05:38:56 +0000] "POST /api/upload_pgn HTTP/1.0" 200 22 "-" "python-requests/2.25.1"
2024-07-02 05:39:01,916 INFO [wsgi][waitress-0] 84.195.131.98 - - [02/Jul/2024:05:39:01 +0000] "POST /api/request_version HTTP/1.0" 200 36 "-" "python-requests/2.25.1"
2024-07-02 05:39:02,269 INFO [wsgi][waitress-1] 84.195.131.98 - - [02/Jul/2024:05:39:02 +0000] "POST /api/request_version HTTP/1.0" 200 38 "-" "python-requests/2.25.1"
2024-07-02 05:39:02,885 INFO [wsgi][waitress-2] 84.195.131.98 - - [02/Jul/2024:05:39:02 +0000] "POST /api/request_version HTTP/1.0" 200 38 "-" "python-requests/2.25.1"
2024-07-02 05:39:03,500 INFO [wsgi][waitress-3] 84.195.131.98 - - [02/Jul/2024:05:39:03 +0000] "POST /api/request_task HTTP/1.0" 200 1241 "-" "python-requests/2.25.1"
Validate_data_structures: validating Fishtest's internal data structures...
Validate_random_run: validated cache run 6680543e4eb1934d34b5aba5...
Clean_wtt_map: 1 active workers...
2024-07-02 05:41:02,901 INFO [wsgi][waitress-0] 84.195.131.98 - - [02/Jul/2024:05:41:02 +0000] "POST /api/beat HTTP/1.0" 200 22 "-" "python-requests/2.25.1"
Clean_wtt_map: 1 active workers...
The run object 66823f9c546a9b654e82da47 does not validate: run['results_info'] is not in the schema
2024-07-02 05:43:03,325 INFO [wsgi][waitress-1] 84.195.131.98 - - [02/Jul/2024:05:43:03 +0000] "POST /api/beat HTTP/1.0" 200 22 "-" "python-requests/2.25.1"
2024-07-02 05:45:03,857 INFO [wsgi][waitress-2] 84.195.131.98 - - [02/Jul/2024:05:45:03 +0000] "POST /api/beat HTTP/1.0" 200 22 "-" "python-requests/2.25.1"
I have been looking at the upload_pgn code, and in principle it looks 'really simple'. What I can't figure out is how the communication between the two pserves actually happens, and in particular whether it could just be waiting forever (i.e. with no timeout) in case some error happens there.
Hmm. Without looking at the source code, offhand I would say there is no communication between the two pserves. We have been experimenting with an api where a secondary instance asks the primary instance for some information, but this is currently on hold.
Right... so the only 'communication' is through mongodb, in some sense.
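Given that, one thing worth checking is whether the mongodb client on the secondary pserves has any client-side timeouts at all; pymongo lets us set them when the client is created (a sketch; the URI and values are illustrative):

```python
from pymongo import MongoClient

# Client-side timeouts so a stuck mongod or a half-dead connection cannot
# block a request thread indefinitely (values are illustrative):
client = MongoClient(
    "mongodb://localhost:27017",
    serverSelectionTimeoutMS=5000,  # wait at most 5s to find a usable server
    connectTimeoutMS=5000,          # TCP connect timeout
    socketTimeoutMS=60000,          # per-operation send/receive timeout
)
```

By default pymongo sets no socketTimeoutMS, so a broken connection can in principle block the calling thread for a very long time.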
During a fleet visit we should probably turn on access logging for waitress (see https://github.com/official-stockfish/fishtest/issues/2094#issuecomment-2201994277). If pserve then hangs (and it really does seem to be a pserve hang, not a Fishtest one), we would know which api was served last, and by which thread.
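Independently of the access log, we could also make the hung process tell us directly where each thread is stuck: the standard library's faulthandler can dump all thread stacks on a signal (a sketch; where exactly to hook this into the server startup is left open):

```python
import faulthandler
import signal
import sys

# On SIGUSR1, write a Python traceback for every thread to stderr,
# which ends up in the pserve journal. Cheap enough to leave enabled.
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)
```

Then `kill -USR1 <pid of the hung pserve>` would show which api call each waitress thread is blocked in.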
I have now repeatedly observed under load that these two pserves hang simultaneously. Since it always involved the upload_pgn api, this may be the place to start looking?