huggingface / dataset-viewer

Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.
https://huggingface.co/docs/dataset-viewer
Apache License 2.0

Fix gauge metrics in api and admin services #889

Open · severo opened 1 year ago

severo commented 1 year ago

The API and admin services run using uvicorn with multiple workers. We set the PROMETHEUS_MULTIPROC_DIR environment variable (to /tmp), as described in https://github.com/prometheus/client_python#multiprocess-mode-eg-gunicorn, so that every worker writes to a dedicated file. See, for example, the content of /tmp for one of the pods of the admin service:

[Screenshot (2023-03-03): listing of /tmp for one admin pod, showing the per-worker metric files]

Each of the nine workers has its own file for every metric type: counter, gauge_all, gauge_liveall, histogram.
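For context, here is a minimal sketch of the multiprocess setup that the prometheus/client_python README describes (the metric name and function below are illustrative, not the dataset-viewer's actual code; PROMETHEUS_MULTIPROC_DIR is assumed to already be set to /tmp in the environment, as above):

```python
from prometheus_client import CollectorRegistry, Gauge, generate_latest, multiprocess

# Each worker process writes its samples to its own mmap'd file in
# PROMETHEUS_MULTIPROC_DIR; the filename encodes the metric type and,
# for gauges, the aggregation mode (e.g. gauge_all_<pid>.db).
QUEUE_SIZE = Gauge(
    "queue_size",  # hypothetical metric name, for illustration only
    "Number of jobs in the queue",
    multiprocess_mode="all",
)

def metrics_payload() -> bytes:
    # The /metrics endpoint aggregates the per-worker files into a
    # single exposition payload using a fresh registry.
    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)
    return generate_latest(registry)
```

Counter and histogram files can simply be merged across pids; only gauges need an explicit `multiprocess_mode`, which is why their filenames carry the mode.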

But, as described in https://github.com/prometheus/client_python#multiprocess-mode-eg-gunicorn, the "live" gauge metrics are special: they should only count live workers, which requires the files starting with gauge_live to be deleted when a worker exits. We don't do this, because uvicorn does not provide a hook to run a function on worker exit.

Two solutions:
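For reference, the cleanup mechanism documented in the prometheus/client_python README relies on gunicorn's child_exit hook; a minimal sketch, assuming the services were moved from bare uvicorn to gunicorn with uvicorn workers (one possible fix, not necessarily the one adopted here):

```python
# gunicorn.conf.py -- hedged sketch, assuming gunicorn + uvicorn workers.
from prometheus_client import multiprocess

def child_exit(server, worker):
    # Delete the dead worker's gauge_live* files so that "live" gauge
    # aggregations only reflect workers that are still running.
    multiprocess.mark_process_dead(worker.pid)
```

The app would then be started with something like `gunicorn -k uvicorn.workers.UvicornWorker -c gunicorn.conf.py app:app` (command shown for illustration).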

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

severo commented 1 year ago

let's keep it open for now, even if it's not a priority

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

severo commented 1 year ago

keep open

severo commented 1 year ago

Note that the /tmp directory is deleted when the admin pod is deleted (i.e., when we delete it manually, or when the image has changed, which happens on each deployment). Maybe that's sufficient.
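If relying on pod recreation turned out to be insufficient, the README also asks for the directory to be wiped between runs; a minimal sketch of doing that explicitly at startup, before any worker is spawned (function name is illustrative):

```python
import glob
import os

def wipe_prometheus_multiproc_dir() -> None:
    # Remove stale per-worker metric files left over from a previous
    # run; this must happen before any worker starts writing.
    multiproc_dir = os.environ.get("PROMETHEUS_MULTIPROC_DIR", "/tmp")
    for path in glob.glob(os.path.join(multiproc_dir, "*.db")):
        os.remove(path)
```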

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

severo commented 1 year ago

keep open

severo commented 8 months ago

Not very important, let's close

severo commented 8 months ago

Reopening, because I'm wondering whether this issue, or https://github.com/huggingface/datasets-server/issues/2495, can affect the workers' autoscaling. Autoscaling relies on a metric, and if that metric is wrong, the autoscaler will not respond as expected. In Grafana (internal: https://grafana.huggingface.tech/d/i7gwsO5Vz/global-view?orgId=1&from=now-30m&to=now), we mitigate the issue by taking the median of the values reported by the different pods, but I don't know how the autoscaler handles this. cc @rtrompier @lhoestq

severo commented 8 months ago

For example, the values currently reported stay above the threshold for heavy workers (20), so we keep scaling up even though there are nearly no waiting jobs for them.
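As a hedged illustration (not project code) of how a stale live-gauge file produces exactly this symptom: a value written by a worker that later dies keeps being collected until mark_process_dead removes its file.

```python
import os
import tempfile

# PROMETHEUS_MULTIPROC_DIR must be set before prometheus_client is
# imported; a throwaway directory is used here for the demo.
os.environ["PROMETHEUS_MULTIPROC_DIR"] = tempfile.mkdtemp()

from prometheus_client import CollectorRegistry, Gauge, multiprocess

# Hypothetical metric name, standing in for whatever the autoscaler reads.
waiting = Gauge("waiting_jobs", "Waiting jobs", multiprocess_mode="liveall")
waiting.set(25)  # pretend this worker reported 25 and then exited

registry = CollectorRegistry()
multiprocess.MultiProcessCollector(registry)
# Without multiprocess.mark_process_dead(pid) for the dead worker, its
# last value is still collected, and an aggregate over pids stays high.
print(registry.get_sample_value("waiting_jobs", {"pid": str(os.getpid())}))  # 25.0
```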