huggingface / dataset-viewer

Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.
https://huggingface.co/docs/dataset-viewer
Apache License 2.0

Set up Prometheus + Grafana #250

Closed: severo closed this issue 2 years ago

severo commented 2 years ago

Related to #2

severo commented 2 years ago

OK, done here: https://grafana.huggingface.tech/d/SaHl2KX7z/datasets-server?orgId=1. It will be improved progressively.

severo commented 2 years ago

I tried to re-enable the metrics with https://github.com/huggingface/datasets-server/pull/298, since we now have indexes, but had to revert immediately because the pod became unreachable. I'll try to:

McPatate commented 2 years ago

Would you want to go through your current monitoring stack together? Maybe doing a bit of pair programming on this could help you out.

severo commented 2 years ago

Beware: now that we have 3 replicas of the api app (#304), I'm not sure how we should manage the Starlette metrics. https://datasets-server.huggingface.tech/metrics will only give the metrics for the particular node (and possibly also the particular uvicorn worker) that has been reached -> we currently have 9 workers x 3 replicas = 27 parallel Starlette apps (isn't that too much, by the way?)

McPatate commented 2 years ago

You may indeed not need that many Starlette apps. Do you know what your total load looks like in req/s?

severo commented 2 years ago

Nope: that's why I need to set up the monitoring correctly!

severo commented 2 years ago

I reduced it to 6 api pods, with only one uvicorn worker each.

severo commented 2 years ago

About the API:

https://grafana.huggingface.tech/d/a164a7f0339f99e89cea5cb47e9be617/kubernetes-compute-resources-workload?orgId=1&refresh=10s&var-datasource=Prometheus%20EKS%20Hub%20Prod&var-cluster=&var-namespace=datasets-server&var-type=deployment&var-workload=datasets-server-prod-api&from=now-12h&to=now

For the requests: https://grafana.huggingface.tech/d/ednzOLExt/datasets-server-api-endpoints?orgId=1, it's stable at about 2 requests per second (across the 6 pods), mainly for /healthcheck (polled by Kubernetes and by BetterUptime) and /metrics (polled by Prometheus). The substantive endpoints /splits, /rows and /valid receive about 0.2, 0.2 and 0.02 requests per second respectively.

[Screenshot 2022-05-31 at 11:45:43]

The 0.95 quantile of the response time (i.e. 95% of the responses take less than this duration) per endpoint is very interesting (beware: log scale):

[Screenshot 2022-05-31 at 11:49:58]

McPatate commented 2 years ago

BEWARE: the requests to /valid are very long: do they block the incoming requests?

It depends on whether your long-running query is holding the GIL or not. If you have async calls, the event loop should be able to switch and take care of other requests; if it's computing something, then yeah, it's probably blocking everything else.
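For illustration, a minimal Starlette sketch of that distinction (hypothetical endpoints, not the actual datasets-server code): a handler that runs heavy work directly on the event loop stalls every other request, while offloading it to the threadpool keeps /healthcheck and the other routes responsive.

```python
from starlette.applications import Starlette
from starlette.concurrency import run_in_threadpool
from starlette.responses import JSONResponse
from starlette.routing import Route


def compute_validity() -> dict:
    # Placeholder for a long, heavy computation (e.g. scanning the cache).
    return {"valid": []}


async def valid_blocking(request) -> JSONResponse:
    # Runs directly on the event loop: nothing else is served until it returns.
    return JSONResponse(compute_validity())


async def valid_non_blocking(request) -> JSONResponse:
    # Runs in a worker thread: the event loop stays free to answer other requests.
    return JSONResponse(await run_in_threadpool(compute_validity))


app = Starlette(routes=[
    Route("/valid-blocking", valid_blocking),
    Route("/valid", valid_non_blocking),
])
```

Note that the threadpool only really helps if the heavy work releases the GIL (I/O, or C-extension calls such as numpy/pyarrow); a pure-Python CPU loop will still contend for it.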

severo commented 2 years ago

Regarding https://github.com/huggingface/datasets-server/issues/250#issuecomment-1136328511, it seems like starlette-prometheus supports multiprocess mode:

By preparing a directory on disk and setting the PROMETHEUS_MULTIPROC_DIR env var, we should be good.
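As a rough sketch (the path below is just an example, not the actual deployment config), the directory has to exist and be emptied before uvicorn forks its workers, so that every worker writes its samples there:

```python
import os
import shutil

# Hypothetical startup snippet: prepare the multiprocess dir before the workers start.
multiproc_dir = os.environ.setdefault("PROMETHEUS_MULTIPROC_DIR", "/tmp/prometheus-multiproc")
shutil.rmtree(multiproc_dir, ignore_errors=True)  # stale files from a previous run would skew the metrics
os.makedirs(multiproc_dir, exist_ok=True)
```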

A discussion about the issue: https://echorand.me/posts/python-prometheus-monitoring-options/

Note also that we copied code from starlette-prometheus to customize the /metrics endpoint, and the copied code includes support for multiple processes:
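A multiprocess-aware /metrics handler built on prometheus_client typically looks like the sketch below (paraphrased, not a verbatim copy of the vendored code): when PROMETHEUS_MULTIPROC_DIR is set, it aggregates the samples written by every worker instead of exposing only the current process's registry.

```python
import os

from prometheus_client import CONTENT_TYPE_LATEST, REGISTRY, CollectorRegistry, generate_latest
from prometheus_client.multiprocess import MultiProcessCollector
from starlette.requests import Request
from starlette.responses import Response


def metrics(request: Request) -> Response:
    if "PROMETHEUS_MULTIPROC_DIR" in os.environ:
        # Aggregate the samples written on disk by every uvicorn worker.
        registry = CollectorRegistry()
        MultiProcessCollector(registry)
    else:
        # Single-process mode: expose the default in-memory registry.
        registry = REGISTRY
    return Response(generate_latest(registry), headers={"Content-Type": CONTENT_TYPE_LATEST})
```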