kiwix / operations

Kiwix Kubernetes Cluster
http://charts.k8s.kiwix.org/

farm.openzim.org is regularly experiencing bursts of 502 errors #149

Closed benoit74 closed 6 months ago

benoit74 commented 7 months ago

Since "forever", farm.openzim.org has been experiencing bursts of 502 errors.

Last month overview: image

The logs suggest that the Python code times out and fails to respond to Nginx in due time, but this is not identical to the 499 errors we already know.

It usually happens on GET /v1/schedules/{schedule_name}, always with 5 or 10 requests arriving at the same time.

Sometimes it happens on OPTIONS /v1/schedules/{schedule_name} and rarely on POST /v1/auth/token (but still with 5 or 10 requests all at the same time).
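
For anyone who wants to reproduce this observation from the access logs, here is a minimal sketch that counts 502 responses per second and per HTTP method and reports the groups that look like a burst. It assumes the Nginx "combined" log format, which may not match the actual farm.openzim.org configuration; `find_bursts`, `LOG_RE` and the `min_count` threshold are hypothetical names chosen for illustration.

```python
import re
from collections import Counter

# Assumption: Nginx "combined" log format. Adjust the pattern if the
# real log_format differs.
LOG_RE = re.compile(
    r'\[(?P<ts>[^\]]+)\] "(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) '
)


def find_bursts(lines, status="502", min_count=5):
    """Return {(timestamp, method): count} for groups of at least
    `min_count` responses with the given status in the same second."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.search(line)
        if match and match.group("status") == status:
            # The combined format timestamp has one-second resolution,
            # so requests logged in the same second share the same key.
            counts[(match.group("ts"), match.group("method"))] += 1
    return {key: n for key, n in counts.items() if n >= min_count}
```

Feeding it the lines of an access log (for example `find_bursts(open(path))`, with `path` pointing at the log file) should surface the groups of 5 or 10 simultaneous requests described above, if the format assumption holds.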

I did not find a component in our code which would perform all these requests.

Impact seems minimal so far.

rgaudin commented 6 months ago

Might be related to FAILED Pipeline https://farm.openzim.org/pipeline/filter-failed

benoit74 commented 6 months ago

I confirm this is related to failed pipelines. I did not realize at first glance that it was looping over all schedules. The issue is that:

Once the workers have been scaled, refreshing the UI no longer causes 502 errors, which suggests that the server is capable of handling the load.

rgaudin commented 6 months ago

The reason for those requests is that we need schedule-only information: details of the last run of that schedule.

When we added this feature, the debate was about keeping the API RESTful (i.e. a concept endpoint exposes only that concept), so the /task/xxx endpoint could not provide info from the schedule (which is actually detail about another task).

Now that we are using an RDBMS, the line is a bit blurred (these were different collections in Mongo). We can discuss exposing it, or exposing it conditionally, via the task endpoint.
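
To make the "conditional" option concrete, here is a rough sketch of what it could look like. It is not the actual API code: `task_payload`, the `include_schedule` flag and the `schedule_last_run` field are hypothetical names used only for illustration.

```python
def task_payload(task, schedule_last_run=None, include_schedule=False):
    """Build the response body for a task.

    By default only task data is returned, which keeps the endpoint
    limited to its own concept. When the caller opts in, details of the
    schedule's last run are embedded so that the UI does not need one
    extra request per schedule.
    """
    payload = dict(task)  # copy so the caller's dict is not mutated
    if include_schedule and schedule_last_run is not None:
        payload["schedule_last_run"] = dict(schedule_last_run)
    return payload
```

With an opt-in flag, existing clients would keep receiving the same response, and only the failed-pipeline view would need to ask for the extra field.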

I am 100% sure that those 200 requests were handled fine at the time but this was a long time ago, before the k8s switch. @benoit74 can you commit the uwsgi config change and close this ticket?

If there's interest in changing the behavior, we can open a separate enhancement ticket.

benoit74 commented 6 months ago

As discussed live: