OK, done here: https://grafana.huggingface.tech/d/SaHl2KX7z/datasets-server?orgId=1. Will be improved progressively
I tried to re-enable the metrics with https://github.com/huggingface/datasets-server/pull/298, since we now have indexes, but had to revert immediately because the pod became unreachable. I'll try to:
Would you want to go through your current monitoring stack together? Maybe doing a bit of pair programming on this can help you out.
Beware: as we now (#304) have 3 replicas of the api app, I'm not sure how we should manage the starlette metrics: https://datasets-server.huggingface.tech/metrics will give the metrics for the particular node (and possibly also the particular uvicorn worker) that has been reached -> we currently have 9 x 3 = 27 parallel starlette apps (isn't that too much, by the way?)
You may indeed not need that many starlette apps. Do you know what your total load looks like in req/s?
Nope: that's why I need to set up the monitoring correctly!
I reduced it to 6 api pods, with only one uvicorn worker each.
About the API:
For the requests: https://grafana.huggingface.tech/d/ednzOLExt/datasets-server-api-endpoints?orgId=1, it's stable at about 2 requests per second (among the 6 pods), mainly for /healthcheck (polled by Kubernetes and by BetterUptime) and /metrics (polled by Prometheus). The requests for the substantive endpoints /splits, /rows and /valid are respectively 0.2, 0.2 and 0.02 requests per second.
The response time of the 0.95 quantile (i.e. 95% of the responses take less than this duration) per endpoint is very interesting (beware: log scale):

- /valid: 8s!
- /splits: 1s!
- /rows: 100ms
- /webhook: 50ms
- /metrics: 9ms
- /healthcheck: 4ms

BEWARE: the requests to /valid are very long: do they block the incoming requests?
It depends on whether your long-running query is blocking the GIL or not. If you have async calls, it should be able to switch and take care of other requests; if it's computing something, then yeah, it's probably blocking everything else.
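To make the point concrete, here is a minimal sketch (the `compute_valid_datasets` function is a hypothetical stand-in for the real /valid logic, not the actual implementation): with `async def`, a synchronous 8-second call blocks the event loop of the uvicorn worker, while offloading it to a thread (or declaring the endpoint as a plain `def`, which Starlette runs in a threadpool) keeps the loop free for /healthcheck and /metrics.

```python
import asyncio
import time

from starlette.applications import Starlette
from starlette.responses import JSONResponse
from starlette.routing import Route


def compute_valid_datasets() -> list:
    # Hypothetical stand-in for the real /valid computation (~8s at p95).
    time.sleep(8)
    return ["dataset-a", "dataset-b"]


async def valid_blocking(request):
    # Synchronous call inside an async endpoint: the event loop is blocked,
    # so every other request handled by this worker has to wait ~8s.
    return JSONResponse({"valid": compute_valid_datasets()})


async def valid_offloaded(request):
    # Running the blocking call in a thread keeps the event loop responsive.
    valid = await asyncio.to_thread(compute_valid_datasets)
    return JSONResponse({"valid": valid})


app = Starlette(routes=[
    Route("/valid-blocking", valid_blocking),
    Route("/valid", valid_offloaded),
])
```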
Regarding https://github.com/huggingface/datasets-server/issues/250#issuecomment-1136328511, it seems like starlette-prometheus supports multiprocess mode: by preparing a directory on disk and setting the PROMETHEUS_MULTIPROC_DIR env var, we should be good.
A discussion about the issue: https://echorand.me/posts/python-prometheus-monitoring-options/
Note also that we copied code from starlette-prometheus to customize the /metrics endpoint, and the copied code includes support for multiprocess mode:
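For reference, a minimal sketch of such a /metrics endpoint using the documented prometheus_client multiprocess pattern (this is the general shape, not the exact code copied into the repo):

```python
import os

from prometheus_client import (
    CONTENT_TYPE_LATEST,
    REGISTRY,
    CollectorRegistry,
    generate_latest,
    multiprocess,
)
from starlette.requests import Request
from starlette.responses import Response


def metrics_endpoint(request: Request) -> Response:
    # When PROMETHEUS_MULTIPROC_DIR is set, each uvicorn worker writes its
    # metrics to files in that directory, and MultiProcessCollector
    # aggregates them so /metrics reflects all workers of the pod.
    if "PROMETHEUS_MULTIPROC_DIR" in os.environ:
        registry = CollectorRegistry()
        multiprocess.MultiProcessCollector(registry)
    else:
        registry = REGISTRY
    return Response(generate_latest(registry), media_type=CONTENT_TYPE_LATEST)
```

Note that this only aggregates the workers of a single pod; metrics from the multiple replicas still have to be scraped per pod and aggregated on the Prometheus side.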
Related to #2: a /metrics endpoint using the Prometheus spec in the API, e.g. using https://github.com/prometheus/client_python - see #258. Beware: cache and queue metrics were removed after https://github.com/huggingface/datasets-server/issues/279.
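As an illustration of what re-adding cache and queue metrics with client_python could look like, here is a minimal sketch; the metric names and the count functions are hypothetical placeholders, not the names used in the datasets-server codebase:

```python
from prometheus_client import Gauge

# Hypothetical metric names and label values, for illustration only.
QUEUE_JOBS = Gauge("queue_jobs_total", "Number of jobs in the queue", ["status"])
CACHE_ENTRIES = Gauge("cache_entries_total", "Number of entries in the cache", ["status"])


def get_jobs_count_by_status() -> dict:
    # Placeholder: the real implementation would query the queue database.
    return {"waiting": 0, "started": 0, "success": 0, "error": 0}


def get_cache_entries_count_by_status() -> dict:
    # Placeholder: the real implementation would query the cache database.
    return {"valid": 0, "error": 0}


def update_metrics() -> None:
    # Called before rendering /metrics so the gauges reflect the current state.
    for status, count in get_jobs_count_by_status().items():
        QUEUE_JOBS.labels(status=status).set(count)
    for status, count in get_cache_entries_count_by_status().items():
        CACHE_ENTRIES.labels(status=status).set(count)
```

If the app runs in multiprocess mode, Gauges additionally need a suitable `multiprocess_mode` argument so the values from the different workers are combined as intended.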