@benoit74 the same thing happened again just now. I didn't gather information on the failing container as I saw the alert quite late and rushed to restore the service. We need to look into this soon.
OK 😢
Do not hesitate to alert me when it happens again (if I'm awake 😉); without diving a bit into the container to see what happens, it will be hard to fix. From my understanding, it looks like a uwsgi overload. Some potential explanations here: https://stackoverflow.com/questions/41377059/how-to-solve-errno-11-resource-temporarily-unavailable-using-uwsgi-nginx
In particular, running `ss -xl` during an outage would be very precious.
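For reference, a minimal sketch of what to capture during an outage (the interpretation relies on standard `ss` semantics for listening sockets, nothing Zimfarm-specific):

```sh
# Inside the API pod. For a listening unix socket, Recv-Q is the number of
# connections currently queued and Send-Q is the configured backlog limit:
# Recv-Q at or above Send-Q means the listen queue is full and nginx gets
# EAGAIN (errno 11) when connecting.
ss -xl

# Established peers on the uwsgi socket, to see how many requests are in flight
ss -x | grep uwsgi
```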
I don't really get why the system does not recover automatically once the load decreases... maybe because workers keep overwhelming the API once it starts returning errors, so the load never decreases?
Do you have any good source of information about how to observe what is currently happening to uwsgi processes, how to restart them, etc.? Is it correct that we do not use gunicorn at all in the Zimfarm setup?
Indeed we don't. This ticket might be our opportunity to change the ZF's base image.
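In the meantime, a hedged sketch for observing and restarting uwsgi workers, assuming we can adjust the uwsgi invocation (the ini and pidfile paths below are hypothetical examples):

```sh
# Expose the uwsgi stats server on a local socket
uwsgi --ini /app/uwsgi.ini --stats /tmp/stats.sock

# Live per-worker view (busy/idle, requests in flight); uwsgitop is a separate package
pip install uwsgitop && uwsgitop /tmp/stats.sock

# Or dump the raw JSON counters
uwsgi --connect-and-read /tmp/stats.sock

# Gracefully reload all workers via the master process, assuming a pidfile is configured
kill -HUP "$(cat /tmp/uwsgi.pid)"
```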
Yet more outages this morning: at 2:03:42 AM (for 2 hours and 42 minutes), at 4:53:27 AM (for 18 minutes and 48 seconds) and at 5:18:57 AM (I then restarted the pod).
Inside the pod:

```
root@api-deployment-d5dfb9b6f-6v9jm:/app# ss -xl
Netid State  Recv-Q Send-Q Local Address:Port                    Peer Address:Port
u_str LISTEN 0      1024   /var/run/supervisor.sock.1 1444064086            * 0
u_str LISTEN 101    100    /tmp/uwsgi.sock            1444065071            * 0
```
Clearly, UWSGI is overloaded: on /tmp/uwsgi.sock, Recv-Q (101) is at the Send-Q backlog limit (100), i.e. the listen queue is full. And the automatic resolution of the first two incidents shows that the system does manage to recover; it is probably just overloaded with very long-running requests / queries.
I had not noticed this before (even though it was already present), but we have many messages like:

```
ERRO pool zimfarm-periodic event buffer overflowed, discarding event xxxx
```
Don't know if this is the root cause or only a side effect.
Slightly worried about this, as the user-visible effects of such a problem can vary a lot and be difficult to trace down IMO. Do we have a way to identify which requests take more than a few seconds to execute?
We have two different things to do: fix the logger issue, which will let us know what is happening, and switch the base image, which will allow us to better configure the WSGI bridge. Since we're currently on a single instance, maybe increase the number of workers until those tasks are completed.
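A hedged sketch of what that tuning could look like (flag values are examples, not a recommendation; the slow-request logging also addresses the question above):

```sh
#   --processes : more workers on the single instance
#   --listen    : larger socket backlog (the current 100 matches the Send-Q seen above)
#   --log-slow  : log requests slower than the given threshold (milliseconds)
#   --harakiri  : recycle a worker stuck on a single request for more than N seconds
uwsgi --ini /app/uwsgi.ini --processes 8 --listen 1024 --log-slow 3000 --harakiri 60

# A backlog above 128 also requires raising the kernel limit
sysctl -w net.core.somaxconn=1024
```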
The user-visible effect is simple: during an outage, all requests to the API time out or return a 502. This is however worrying, since it means the workers cannot update their status anymore, complete tasks, start new ones, etc.
During an outage, we should run the following SQL query to find activity at the DB level:

```sql
SELECT
    pid,
    now() - pg_stat_activity.query_start AS duration,
    query,
    state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes';
```
Currently there are no long-running queries at the DB.
Finishing https://github.com/kiwix/k8s/issues/5 will help; it will be done this week.
Having more info afterwards at the DB level is also possible, for instance with the pg_stat_statements module, but it also consumes resources, so enabling it is not something to do without scrutiny. There are maybe other tools as well, but I have too little experience with this; we need to document ourselves should we decide to make progress on this track of better understanding activity at the DB level (I'm not sure this is the issue).
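For the record, a hedged sketch of what enabling pg_stat_statements involves (it requires a Postgres restart, and as said above it adds some overhead):

```sh
psql -c "ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements'"
# ... restart Postgres, then:
psql -c "CREATE EXTENSION IF NOT EXISTS pg_stat_statements"
# Top queries by cumulated execution time (the column is total_time before PG 13)
psql -c "SELECT calls, total_exec_time, query FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10"
```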
As already mentioned yesterday, it looks like there is a very strong correlation with https://github.com/openzim/zimfarm/issues/817
Probable explanation is:
Closing it for now, since it has been two weeks without an occurrence and the probable root cause has been identified.
Would it not be wise to code these tasks in an asynchronous manner, i.e. using a worker which does not block the main loop?
Of course; we already discussed it with @rgaudin but I forgot to open an issue. Now it is done: https://github.com/openzim/zimfarm/issues/822
Re-opening it since it occurred today. See https://github.com/openzim/zimfarm/issues/826 for a more detailed understanding of the situation.
Closed again
At 2023-08-22 01:37:07, the Zimfarm API stopped working properly.
Logs were full of errors:

```
connect() to unix:///tmp/uwsgi.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: 100.64.6.28
```

100.64.6.28 is the nginx ingress IP. Requests were still received but they all finished with the same error message.
Restarting the API pod manually (at 4:33 AM UTC) was sufficient to immediately get rid of those errors.
This ticket is meant to trace the event, should it happen again in the future. I do not expect to take any immediate action.