severo opened this issue 1 year ago
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Let's keep it open for now, even if it's not a priority.
keep open
Note that the `/tmp` directory is deleted when the admin pod is deleted (i.e., when we delete it manually, or when the image has changed, which occurs on each deployment). Maybe that's sufficient.
keep open
Not very important, let's close
Reopening, because I'm wondering whether this issue, or https://github.com/huggingface/datasets-server/issues/2495, can affect the workers' autoscaling. Autoscaling relies on a metric, and if that metric is wrong, the autoscaler will not respond as expected. In Grafana (internal: https://grafana.huggingface.tech/d/i7gwsO5Vz/global-view?orgId=1&from=now-30m&to=now) we mitigate the issue by taking the median of the values reported by the different pods, but I don't know how the autoscaler handles this. cc @rtrompier @lhoestq
For example, the threshold for heavy workers is currently 20, and we keep scaling up even though there are nearly no waiting jobs for them.
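To illustrate why the aggregation choice matters, here is a toy computation (the numbers are made up, not taken from Grafana): one stale pod reporting an inflated value is enough to mislead an autoscaler that averages, while the median ignores it.

```python
# Hypothetical per-pod readings of the "waiting jobs" metric: two pods
# report the true value (0), one stale pod still reports an old value (500).
readings = [0, 0, 500]

# An autoscaler averaging the readings sees a large queue and keeps scaling up.
mean = sum(readings) / len(readings)

# The median matches reality: there is nothing waiting, no scale-up needed.
median = sorted(readings)[len(readings) // 2]
```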
The API and admin services run using uvicorn with multiple workers. We set the `PROMETHEUS_MULTIPROC_DIR` environment variable (to `/tmp`), as described in https://github.com/prometheus/client_python#multiprocess-mode-eg-gunicorn, which lets every worker write to a dedicated file. In `/tmp` on one of the admin service pods, for example, each of the nine workers has its own file for every metric type: `counter`, `gauge_all`, `gauge_liveall`, `histogram`.

But, as described in the same documentation, the "live" gauge metrics are special: they should only count the live workers, and to do that, the files starting with `gauge_live` should be deleted when a worker child exits. We don't do this, since uvicorn does not allow calling a specific function on child exit.

Two solutions:
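For reference, `prometheus_client` ships the cleanup helper this cleanup normally relies on: `multiprocess.mark_process_dead(pid)` deletes the `gauge_live*` files for a given pid. A minimal single-process sketch of the whole mechanism (the metric name and values are made up for illustration; this is not the services' actual code):

```python
import os
import tempfile

# PROMETHEUS_MULTIPROC_DIR must be set before prometheus_client is imported
# (the services set it to /tmp; a temp dir is used here for the sketch).
os.environ["PROMETHEUS_MULTIPROC_DIR"] = tempfile.mkdtemp()

from prometheus_client import CollectorRegistry, Gauge, generate_latest
from prometheus_client import multiprocess

# A "live" gauge: only processes that are still alive should contribute.
g = Gauge("worker_queue_size", "Jobs queued in this worker",
          multiprocess_mode="liveall")
g.set(3)  # writes gauge_liveall_<pid>.db into PROMETHEUS_MULTIPROC_DIR

# What the /metrics endpoint does: aggregate all per-process files.
registry = CollectorRegistry()
multiprocess.MultiProcessCollector(registry)
output = generate_latest(registry).decode()  # includes worker_queue_size

# The missing step: when a worker child exits, delete that pid's
# gauge_live* files so dead workers stop being counted.
multiprocess.mark_process_dead(os.getpid())
```

Without the final call, the `gauge_liveall` file of every exited worker stays on disk and keeps being aggregated, which is how the exported values drift apart between pods.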