@benoit74 the same thing happened again just now. I didn't gather information on the failing container as I saw the alert quite late and rushed to restore the service. We need to look into this soon.
OK 😢
Do not hesitate to alert me when it happens again (if I'm awake 😉); without diving a bit into the container to see what happens, it will be hard to fix. From my understanding, it looks like a uwsgi overload. Some potential explanations here: https://stackoverflow.com/questions/41377059/how-to-solve-errno-11-resource-temporarily-unavailable-using-uwsgi-nginx
In particular, running `ss -xl` during an outage would be very precious.
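For reference, a minimal sketch of what to capture during an outage (the interpretation relies on standard `ss` semantics for listening sockets, nothing Zimfarm-specific):

```sh
# Inside the API pod. For a listening unix socket, Recv-Q is the number of
# connections currently queued and Send-Q is the configured backlog limit:
# Recv-Q at or above Send-Q means the listen queue is full and nginx gets
# EAGAIN (errno 11) when connecting.
ss -xl

# Established peers on the uwsgi socket, to see how many requests are in flight
ss -x | grep uwsgi
```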
I don't really get why the system does not recover automatically once the load decreases... maybe because workers keep overwhelming the API once it starts returning errors, so the load never decreases?
Do you have any good source of information about how to observe what is currently happening to uwsgi processes, how to restart them, etc.? Is it correct that we do not use gunicorn at all in the Zimfarm setup?
Indeed we don't. This ticket might be our opportunity to change the ZF's base image.
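In the meantime, a hedged sketch for observing and restarting uwsgi workers, assuming we can adjust the uwsgi invocation (the ini and pidfile paths below are hypothetical examples):

```sh
# Expose the uwsgi stats server on a local socket
uwsgi --ini /app/uwsgi.ini --stats /tmp/stats.sock

# Live per-worker view (busy/idle, requests in flight); uwsgitop is a separate package
pip install uwsgitop && uwsgitop /tmp/stats.sock

# Or dump the raw JSON counters
uwsgi --connect-and-read /tmp/stats.sock

# Gracefully reload all workers via the master process, assuming a pidfile is configured
kill -HUP "$(cat /tmp/uwsgi.pid)"
```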
Yet more outages this morning: at 2:03:42 AM (for 2 hours and 42 minutes), at 4:53:27 AM (for 18 minutes and 48 seconds) and at 5:18:57 AM (I then restarted the pod).
Inside the pod:

```
root@api-deployment-d5dfb9b6f-6v9jm:/app# ss -xl
Netid State  Recv-Q Send-Q Local Address:Port                    Peer Address:Port
u_str LISTEN 0      1024   /var/run/supervisor.sock.1 1444064086            * 0
u_str LISTEN 101    100    /tmp/uwsgi.sock            1444065071            * 0
```
Clearly, UWSGI is overloaded: on /tmp/uwsgi.sock, Recv-Q (101) is at the Send-Q backlog limit (100), i.e. the listen queue is full. And the automatic resolution of the first two incidents shows that the system does manage to recover; it is probably just overloaded with very long-running requests / queries.
I had not noticed this before (even though it was already present), but we have many messages like:

```
ERRO pool zimfarm-periodic event buffer overflowed, discarding event xxxx
```
Don't know if this is the root cause or only a side effect.
Slightly worried about this, as the user-visible effects of such a problem can vary a lot and be difficult to trace down IMO. Do we have a way to identify which requests take more than a few seconds to execute?
We have two different things to do: fix the logger issue, which will let us know what is happening, and switch the base image, which will allow us to better configure the WSGI bridge. Since we're currently on a single instance, maybe increase the number of workers until those tasks are completed.
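A hedged sketch of what that tuning could look like (flag values are examples, not a recommendation; the slow-request logging also addresses the question above):

```sh
#   --processes : more workers on the single instance
#   --listen    : larger socket backlog (the current 100 matches the Send-Q seen above)
#   --log-slow  : log requests slower than the given threshold (milliseconds)
#   --harakiri  : recycle a worker stuck on a single request for more than N seconds
uwsgi --ini /app/uwsgi.ini --processes 8 --listen 1024 --log-slow 3000 --harakiri 60

# A backlog above 128 also requires raising the kernel limit
sysctl -w net.core.somaxconn=1024
```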
The user-visible effect is simple: during an outage, all requests to the API time out or return a 502. This is however worrying, since it means the workers cannot update their status anymore, complete tasks, start new ones, etc.
During an outage, we should run the following SQL query to find activity at the DB level:

```sql
SELECT
    pid,
    now() - pg_stat_activity.query_start AS duration,
    query,
    state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes';
```
Currently there are no long-running queries at the DB.
Finishing https://github.com/kiwix/k8s/issues/5 will help; it will be done this week.
Having more info afterwards at the DB level is also possible, for instance with the pg_stat_statements module, but it also consumes resources, so enabling it is not something to do without scrutiny. There are maybe other tools as well, but I have too little experience with this; we need to document ourselves should we decide to make progress on this track of better understanding activity at the DB level (I'm not sure this is the issue).
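For the record, a hedged sketch of what enabling pg_stat_statements involves (it requires a Postgres restart, and as said above it adds some overhead):

```sh
psql -c "ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements'"
# ... restart Postgres, then:
psql -c "CREATE EXTENSION IF NOT EXISTS pg_stat_statements"
# Top queries by cumulated execution time (the column is total_time before PG 13)
psql -c "SELECT calls, total_exec_time, query FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10"
```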
As already mentioned yesterday, it looks like there is a very strong correlation with https://github.com/openzim/zimfarm/issues/817
Probable explanation is:
Closing it for now, since it has been two weeks without an occurrence and the probable root cause has been identified.
Would it not be wise to code these tasks in an asynchronous manner, i.e. using a worker which does not block the main loop?
Of course; we already discussed it with @rgaudin but I forgot to open an issue. Now it is done: https://github.com/openzim/zimfarm/issues/822
Re-opening it since it occurred today. See https://github.com/openzim/zimfarm/issues/826 for a more detailed understanding of the situation.
Closed again
At 2023-08-22 01:37:07, the Zimfarm API stopped working properly.
Logs were full of errors:

```
connect() to unix:///tmp/uwsgi.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: 100.64.6.28
```

100.64.6.28 is the nginx ingress IP. Requests were still received but they all finished with the same error message.
Restarting the API pod manually (at 4:33 AM UTC) was sufficient to immediately get rid of those errors.
This ticket is meant to trace the event, should it happen again in the future. I do not expect to take any immediate action.