huggingface / dataset-viewer

Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.
https://huggingface.co/docs/dataset-viewer
Apache License 2.0

Set up Prometheus + Grafana #250

Closed: severo closed this issue 2 years ago

severo commented 2 years ago

Related to #2

severo commented 2 years ago

OK, done here: https://grafana.huggingface.tech/d/SaHl2KX7z/datasets-server?orgId=1. It will be improved progressively.

severo commented 2 years ago

I tried to re-enable the metrics with https://github.com/huggingface/datasets-server/pull/298, since we now have indexes, but had to revert immediately because the pod became unreachable. I'll try to:

McPatate commented 2 years ago

Would you want to go through your current monitoring stack together? Maybe doing a bit of pair programming on this could help you out.

severo commented 2 years ago

Beware: now that we have 3 replicas of the api app (#304), I'm not sure how we should manage the Starlette metrics. https://datasets-server.huggingface.tech/metrics will only give the metrics for the particular node (and possibly also the particular uvicorn worker) that has been reached -> we currently have 9 workers x 3 replicas = 27 parallel Starlette apps (isn't that too much, by the way?)

McPatate commented 2 years ago

You may indeed not need that many Starlette apps. Do you know what your total load looks like in req/s?

severo commented 2 years ago

Nope: that's why I need to set up the monitoring correctly!

severo commented 2 years ago

I reduced it to 6 api pods, with only one uvicorn worker each.

severo commented 2 years ago

About the API:

https://grafana.huggingface.tech/d/a164a7f0339f99e89cea5cb47e9be617/kubernetes-compute-resources-workload?orgId=1&refresh=10s&var-datasource=Prometheus%20EKS%20Hub%20Prod&var-cluster=&var-namespace=datasets-server&var-type=deployment&var-workload=datasets-server-prod-api&from=now-12h&to=now

For the requests: https://grafana.huggingface.tech/d/ednzOLExt/datasets-server-api-endpoints?orgId=1, it's stable at about 2 requests per second (across the 6 pods), mainly for /healthcheck (polled by Kubernetes and by BetterUptime) and /metrics (polled by Prometheus). The substantive endpoints /splits, /rows and /valid receive about 0.2, 0.2 and 0.02 requests per second respectively.

[Screenshot 2022-05-31 at 11:45:43]

The 0.95 quantile of the response time (i.e. 95% of the responses take less than this duration) per endpoint is very interesting (beware: log scale):

[Screenshot 2022-05-31 at 11:49:58]

McPatate commented 2 years ago

BEWARE: the requests to /valid are very long: do they block the incoming requests?

It depends on whether your long-running query is holding the GIL or not. If you have async calls, the event loop should be able to switch and take care of other requests; if it's computing something, then yeah, it's probably blocking everything else.
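For illustration, a minimal Starlette sketch of that distinction (hypothetical endpoints, not the actual datasets-server code): a handler that runs heavy work directly on the event loop stalls every other request, while offloading it to the threadpool keeps /healthcheck and the other routes responsive.

```python
from starlette.applications import Starlette
from starlette.concurrency import run_in_threadpool
from starlette.responses import JSONResponse
from starlette.routing import Route


def compute_validity() -> dict:
    # Placeholder for a long, heavy computation (e.g. scanning the cache).
    return {"valid": []}


async def valid_blocking(request) -> JSONResponse:
    # Runs directly on the event loop: nothing else is served until it returns.
    return JSONResponse(compute_validity())


async def valid_non_blocking(request) -> JSONResponse:
    # Runs in a worker thread: the event loop stays free to answer other requests.
    return JSONResponse(await run_in_threadpool(compute_validity))


app = Starlette(routes=[
    Route("/valid-blocking", valid_blocking),
    Route("/valid", valid_non_blocking),
])
```

Note that the threadpool only really helps if the heavy work releases the GIL (I/O, or C-extension calls such as numpy/pyarrow); a pure-Python CPU loop will still contend for it.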

severo commented 2 years ago

Regarding https://github.com/huggingface/datasets-server/issues/250#issuecomment-1136328511, it seems like starlette-prometheus supports multiprocess mode:

By preparing a directory on disk and setting the PROMETHEUS_MULTIPROC_DIR env var, we should be good.
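As a rough sketch (the path below is just an example, not the actual deployment config), the directory has to exist and be emptied before uvicorn forks its workers, so that every worker writes its samples there:

```python
import os
import shutil

# Hypothetical startup snippet: prepare the multiprocess dir before the workers start.
multiproc_dir = os.environ.setdefault("PROMETHEUS_MULTIPROC_DIR", "/tmp/prometheus-multiproc")
shutil.rmtree(multiproc_dir, ignore_errors=True)  # stale files from a previous run would skew the metrics
os.makedirs(multiproc_dir, exist_ok=True)
```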

A discussion about the issue: https://echorand.me/posts/python-prometheus-monitoring-options/

Note also that we copied code from starlette-prometheus to customize the /metrics endpoint, and the copied code includes support for multiple processes:
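A multiprocess-aware /metrics handler built on prometheus_client typically looks like the sketch below (paraphrased, not a verbatim copy of the vendored code): when PROMETHEUS_MULTIPROC_DIR is set, it aggregates the samples written by every worker instead of exposing only the current process's registry.

```python
import os

from prometheus_client import CONTENT_TYPE_LATEST, REGISTRY, CollectorRegistry, generate_latest
from prometheus_client.multiprocess import MultiProcessCollector
from starlette.requests import Request
from starlette.responses import Response


def metrics(request: Request) -> Response:
    if "PROMETHEUS_MULTIPROC_DIR" in os.environ:
        # Aggregate the samples written on disk by every uvicorn worker.
        registry = CollectorRegistry()
        MultiProcessCollector(registry)
    else:
        # Single-process mode: expose the default in-memory registry.
        registry = REGISTRY
    return Response(generate_latest(registry), headers={"Content-Type": CONTENT_TYPE_LATEST})
```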