farm.openzim.org is regularly experiencing bursts of 502 errors

kiwix / operations

Kiwix Kubernetes Cluster

http://charts.k8s.kiwix.org/


farm.openzim.org is regularly experiencing bursts of 502 errors #149

Closed benoit74 closed 6 months ago

benoit74 commented 7 months ago

Since "forever", farm.openzim.org has been experiencing bursts of 502 errors.

Last month overview:

Looking at the logs, this appears to be a timeout in the Python code, which fails to respond to Nginx in time, but it is not identical to the 499 errors we already know about.

It usually happens on GET /v1/schedules/{schedule_name}, always with 5 or 10 requests arriving at the same time.

Sometimes it happens on OPTIONS /v1/schedules/{schedule_name} and rarely on POST /v1/auth/token (but still with 5 or 10 requests all at the same time).

I did not find a component in our code which would perform all these requests.

Impact seems minimal so far.

rgaudin commented 6 months ago

Might be related to FAILED Pipeline https://farm.openzim.org/pipeline/filter-failed

benoit74 commented 6 months ago

I confirm this is related to failed pipelines. I did not realize on first look that it was looping over all schedules. The issue is that:

the UI is issuing one API request per schedule displayed (and there could be up to 200 items) and it is issuing them all at once
the UI is issuing these calls so fast that uvicorn doesn't have enough time to create more workers and handle the load

Once the workers have been scaled, the UI refresh no longer causes 502 errors, which suggests that the server is capable of handling the load.
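One client-side mitigation for the burst described above would be to cap how many per-schedule requests are in flight at once. The following is only a minimal sketch of that idea: the UI's actual code is not shown in this thread, Python is used purely for illustration, and `fetch_all`, `fake_fetch`, the limit of 5 and the schedule names are all hypothetical.

```python
import asyncio


async def fetch_all(names, fetch_one, max_concurrency=5):
    """Call fetch_one(name) for every name, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(name):
        # Wait for a free slot before issuing the request.
        async with semaphore:
            return await fetch_one(name)

    # gather() returns results in the same order as the input names.
    return await asyncio.gather(*(bounded(name) for name in names))


async def demo():
    in_flight = 0
    peak = 0

    async def fake_fetch(name):
        # Hypothetical stand-in for GET /v1/schedules/{schedule_name}.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # simulated network latency
        in_flight -= 1
        return {"name": name}

    names = [f"schedule_{i}" for i in range(200)]  # hypothetical names
    results = await fetch_all(names, fake_fetch, max_concurrency=5)
    return len(results), peak
```

Running `asyncio.run(demo())` issues all 200 simulated calls while never keeping more than 5 open simultaneously, which would give the server time to scale its workers instead of receiving the whole batch at once.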

rgaudin commented 6 months ago

The reason for those requests is that we need schedule-only information: details of the last run of that schedule.

When we added this feature, the debate was about keeping the API RESTful (i.e. a concept endpoint exposes only that concept), so the /task/xxx endpoint could not provide info from the schedule (which is actually detail about another task).

Now that we are using an RDBMS, the line is a bit blurred (these were different collections in mongo). We can discuss exposing it, or exposing it conditionally, via the task endpoint.
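To illustrate why an RDBMS blurs that line: a single query can return each failed task together with the last run of its schedule, so the client would not need one request per schedule. This is only a sketch under assumptions, not the Zimfarm schema: the `task` table, its columns, the sample rows and the `failed_tasks_with_last_run` helper are all hypothetical, and SQLite stands in for the real database.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory database with a hypothetical, simplified task table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE task (
            id INTEGER PRIMARY KEY,
            schedule_name TEXT NOT NULL,
            status TEXT NOT NULL,
            updated_at TEXT NOT NULL
        );
        INSERT INTO task (id, schedule_name, status, updated_at) VALUES
            (1, 'wikipedia_fr', 'failed',    '2024-01-01'),
            (2, 'wikipedia_fr', 'succeeded', '2024-01-05'),
            (3, 'gutenberg_en', 'succeeded', '2024-01-02'),
            (4, 'gutenberg_en', 'failed',    '2024-01-06');
        """
    )
    return conn


def failed_tasks_with_last_run(conn):
    """Return (task id, schedule, last run id, last run status) for failed tasks."""
    query = """
        SELECT t.id, t.schedule_name, last.id, last.status
        FROM task AS t
        JOIN task AS last
          ON last.schedule_name = t.schedule_name
         AND last.updated_at = (
                 -- most recent run of the same schedule
                 SELECT MAX(updated_at) FROM task
                 WHERE schedule_name = t.schedule_name
             )
        WHERE t.status = 'failed'
        ORDER BY t.id
    """
    return conn.execute(query).fetchall()


rows = failed_tasks_with_last_run(make_demo_db())
```

With the sample rows above, the failed task 1 is paired with the later successful run 2 of the same schedule, and the failed task 4 is paired with itself because it is the latest run of its schedule.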

I am 100% sure that those 200 requests were handled fine at the time, but this was a long time ago, before the k8s switch. @benoit74 can you commit the uwsgi config change and close this ticket?

If there's interest in changing the behavior, we can open a separate enhancement ticket.

benoit74 commented 6 months ago

As discussed live:

You didn't get me correctly: I did not find the solution, I only assumed it could be the problem.
I am closing this issue in favor of upstream ones (including https://github.com/openzim/zimfarm/issues/883)
882 will be fixed soon
