Might be related to the FAILED pipeline page: https://farm.openzim.org/pipeline/filter-failed
I confirm this is related to failed pipelines. I did not realize at first look that it was looping over all schedules. The issue is that:
Once the workers have been scaled, the UI refresh no longer causes 502 errors, which somewhat proves that the server is capable of handling the load.
The reason for those requests is that we need schedule-only information: details of the last run of that schedule.
When we added this feature, the debate was about keeping the API RESTful (i.e. an endpoint for a given concept exposes only that concept), so the /task/xxx endpoint could not provide info from the schedule (which is actually a detail about another task).
Now that we are using an RDBMS, the line is a bit blurred (these were separate collections in Mongo). We can discuss exposing it, or exposing it conditionally, via the task endpoint.
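For the sake of discussion, a minimal sketch of what "exposing it conditionally" via the task endpoint could look like. This is illustrative only: the in-memory data, the `with_schedule` query parameter, and the payload shape are assumptions, not the actual zimfarm API.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# In-memory stand-ins for the tasks and schedules tables (illustrative only)
TASKS = {
    "task-1": {"id": "task-1", "status": "succeeded", "schedule_name": "wikipedia_fr"},
}
SCHEDULES = {
    "wikipedia_fr": {
        "name": "wikipedia_fr",
        "most_recent_task": {"id": "task-1", "status": "succeeded"},
    },
}

@app.route("/v1/tasks/<task_id>")
def get_task(task_id):
    task = TASKS.get(task_id)
    if task is None:
        return jsonify({"error": "not found"}), 404
    payload = dict(task)
    # Stay RESTful by default: only join schedule data (a detail about
    # another concept) when the caller opts in explicitly.
    if request.args.get("with_schedule") == "true":
        payload["schedule"] = SCHEDULES.get(task["schedule_name"])
    return jsonify(payload)
```

With such an opt-in parameter, the UI could get the last-run details in the same round-trip instead of issuing one extra GET /v1/schedules/{schedule_name} per task.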
I am 100% sure that those 200 requests were handled fine at the time, but that was a long time ago, before the k8s switch. @benoit74, can you commit the uwsgi config change and close this ticket?
If there's interest in changing the behavior, we can open a separate enhancement ticket.
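For reference, the uwsgi change mentioned above would presumably be along these lines; the option values below are assumptions for illustration, not the committed configuration:

```ini
[uwsgi]
module = main:application
; more worker processes, so a burst of 5-10 simultaneous requests
; does not queue behind a few busy workers
processes = 8
threads = 2
; abort requests that run too long instead of letting Nginx
; time out and answer 502 on its own
harakiri = 60
; deeper socket backlog to absorb bursts
listen = 256
```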
As discussed live:
Since "forever", it looks like farm.openzim.org is experiencing some burst of 502 errors.
Last month overview: ![image](https://github.com/kiwix/k8s/assets/7102089/4241320a-35a1-484e-8908-d9c9f59be02d)
Looking into the logs, this appears to be a timeout in the Python code, which fails to respond to Nginx in time; it is not identical to the 499 errors we already know about.
It usually happens on

`GET /v1/schedules/{schedule_name}`

always with 5 or 10 requests all happening at the same time. Sometimes it happens on

`OPTIONS /v1/schedules/{schedule_name}`

and rarely on

`POST /v1/auth/token`

(but still with 5 or 10 requests all at the same time). I did not find a component in our code which would perform all these requests.
Impact seems minimal so far.
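For anyone wanting to reproduce it, a burst like the one described above can be simulated with a few concurrent requests, mimicking a UI that loops over all schedules. This is a standalone sketch, not zimfarm code; the base URL and schedule names are made-up assumptions:

```python
"""Fire N simultaneous GET /v1/schedules/{name} requests and count
the response codes, to mimic the observed 5-10 request bursts."""
from concurrent.futures import ThreadPoolExecutor

import requests

API = "https://api.farm.openzim.org/v1"  # assumed base URL
# Hypothetical names; a real test would list them via GET /v1/schedules
NAMES = [f"schedule_{i}" for i in range(10)]

def fetch(name: str) -> int:
    return requests.get(f"{API}/schedules/{name}", timeout=30).status_code

with ThreadPoolExecutor(max_workers=10) as pool:
    codes = list(pool.map(fetch, NAMES))

# e.g. {200: 8, 502: 2} when some workers fail to answer in time
print({code: codes.count(code) for code in set(codes)})
```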