huggingface / dataset-viewer

Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.
https://huggingface.co/docs/dataset-viewer
Apache License 2.0

Add a way to gracefully stop the workers #264

Closed: severo closed this issue 2 years ago

severo commented 2 years ago

Currently, if we stop the workers:

kubectl scale --replicas=0 deploy/datasets-server-prod-datasets-worker
kubectl scale --replicas=0 deploy/datasets-server-prod-splits-worker

the started jobs will remain in that state forever and can block other jobs from the same dataset (because of MAX_JOBS_PER_DATASET).

We want:

  1. detect dead jobs (whose worker has been removed) and move them back to the waiting status
  2. if possible, also clean up the workers before stopping them: stop the current jobs and move them back to the waiting status (or move them to an INTERRUPTED state and create new waiting ones); see the sketch below
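
For point 2: on scale-down, Kubernetes sends SIGTERM to the pod and only sends SIGKILL at the end of the grace period, so the worker can intercept it and finish (or requeue) its current job before exiting. A minimal sketch of what the loop could look like; the queue helpers below are hypothetical stand-ins, not the actual code:

import signal
import time

stopping = False

def on_sigterm(signum, frame):
    # Kubernetes sends SIGTERM on scale-down: finish the current job, then exit
    global stopping
    stopping = True

signal.signal(signal.SIGTERM, on_sigterm)

def pop_started_job():
    # hypothetical stand-in for the real queue: move a job from WAITING to STARTED
    return None

def process(job):
    # hypothetical stand-in for the real job processing
    pass

def requeue(job):
    # hypothetical stand-in: move the job back to the WAITING status
    pass

while not stopping:
    job = pop_started_job()
    if job is None:
        time.sleep(1)
        continue
    try:
        process(job)
    except BaseException:
        requeue(job)
        raise
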
severo commented 2 years ago

See https://huggingface.slack.com/archives/C0311GZ7R6K/p1652714403624539

severo commented 2 years ago

See also https://github.com/huggingface/datasets-server/issues/91 (duplicate)

severo commented 2 years ago

In https://github.com/huggingface/datasets-server/pull/275, I added a pod dedicated to launching scripts.

For example, to cancel all the started jobs in the splits queue:
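
Something along these lines (a hypothetical sketch: the queue lives in MongoDB, but the connection string, database/collection names, and status values here are assumptions, not the actual ones):

from pymongo import MongoClient

# hypothetical names: adjust to the actual queue database/collection
queue = MongoClient("mongodb://localhost:27017")["datasets_server_queue"]["split_jobs"]

# cancel every STARTED job of the splits queue by moving it back to WAITING
result = queue.update_many({"status": "started"}, {"$set": {"status": "waiting"}})
print(f"requeued {result.modified_count} jobs")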

XciD commented 2 years ago

Looks like a great short-term solution, but still a manual one. We should keep this issue open in order to implement a long-term solution, wdyt?

severo commented 2 years ago

Absolutely (that's why I didn't close it).

In particular, we have to do https://github.com/huggingface/datasets-server/issues/264#issuecomment-1128622313 for the two queues after every helm upgrade!

severo commented 2 years ago

So, we want to:

severo commented 2 years ago

Related: https://github.com/huggingface/datasets-server/pull/352

severo commented 2 years ago

The main problem is that when a worker pod consumes more RAM than allowed (which happens frequently with big datasets), Kubernetes kills it with SIGKILL, a signal the process cannot intercept. The only options are to dedicate more memory or to avoid using so much of it. And if a pod has been killed (OOMKilled), we have to clean up on the next start, because we cannot do it before the process is killed. That's tricky with multiple jobs. The heartbeat idea might help (see the sketch below).
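
A hypothetical sketch of the heartbeat idea (field, database, and collection names are assumptions): the worker stamps its job every few seconds, and any process, e.g. a worker on startup, requeues the jobs whose stamp is stale, which covers OOMKilled pods that never got a chance to clean up:

from datetime import datetime, timedelta
from pymongo import MongoClient

# hypothetical names: adjust to the actual queue database/collection
jobs = MongoClient("mongodb://localhost:27017")["datasets_server_queue"]["jobs"]

def heartbeat(job_id):
    # called by the worker every few seconds while it processes job_id
    jobs.update_one({"_id": job_id}, {"$set": {"last_heartbeat": datetime.utcnow()}})

def requeue_dead_jobs(max_age=timedelta(minutes=5)):
    # called on worker start (or by a cron job): a STARTED job with a stale
    # heartbeat means its pod is gone (e.g. OOMKilled), so requeue it
    cutoff = datetime.utcnow() - max_age
    result = jobs.update_many(
        {"status": "started", "last_heartbeat": {"$lt": cutoff}},
        {"$set": {"status": "waiting"}, "$unset": {"last_heartbeat": ""}},
    )
    return result.modified_count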

severo commented 2 years ago

See #357 and #358 where we increase the available resources for the workers.

severo commented 2 years ago

Kubernetes trick: to know what happened to a pod that was killed:

k logs datasets-server-prod-datasets-worker-776b774978-rtj9h --previous

It shows the logs until the crash

severo commented 2 years ago

An example of a dataset job (not a split job: we can't even get the list of splits!) that crashes due to a lack of RAM: echarlaix/gqa-lxmert

INFO: 2022-06-16 17:30:51,190 - datasets_server.worker - compute dataset 'echarlaix/gqa-lxmert'
Downloading builder script: 100%|██████████| 5.51k/5.51k [00:00<00:00, 3.92MB/s]
make: *** [Makefile:19: run] Killed

This means that it needs more than 30 GiB of RAM!

severo commented 2 years ago

The same with split jobs:

INFO: 2022-06-16 18:42:21,126 - datasets_server.worker - compute split 'train' from dataset 'imthanhlv/binhvq_news21_raw' with config 'imthanhlv--binhvq_news21_raw'
make: *** [Makefile:19: run] Killed
INFO: 2022-06-16 18:44:44,281 - datasets_server.worker - compute split 'test' from dataset 'openclimatefix/nimrod-uk-1km' with config 'sample'
Downloading builder script: 100%|██████████| 15.2k/15.2k [00:00<00:00, 6.04MB/s]
Downloading builder script: 100%|██████████| 15.2k/15.2k [00:00<00:00, 7.65MB/s]
2022-06-16 18:44:46.305062: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
2022-06-16 18:44:46.389820: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-06-16 18:44:46.389865: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-06-16 18:44:46.390005: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (datasets-server-prod-splits-worker-7f69bdfd9f-kzrdt): /proc/driver/nvidia/version does not exist
2022-06-16 18:44:46.390379: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
make: *** [Makefile:19: run] Killed
INFO: 2022-06-16 20:51:02,467 - datasets_server.worker - compute split 'train' from dataset 'openclimatefix/nimrod-uk-1km' with config 'sample'
Downloading builder script: 100%|██████████| 15.2k/15.2k [00:00<00:00, 6.98MB/s]
Downloading builder script: 100%|██████████| 15.2k/15.2k [00:00<00:00, 9.22MB/s]
2022-06-16 20:51:04.216905: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
make: *** [Makefile:19: run] Killed
INFO: 2022-06-16 20:51:03,387 - datasets_server.worker - compute split 'validation' from dataset 'openclimatefix/nimrod-uk-1km' with config 'sample'
Downloading builder script: 100%|██████████| 15.2k/15.2k [00:00<00:00, 6.65MB/s]
Downloading builder script: 100%|██████████| 15.2k/15.2k [00:00<00:00, 6.14MB/s]
2022-06-16 20:51:05.349829: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
make: *** [Makefile:19: run] Killed
severo commented 2 years ago

Idea by @XciD: change the way we launch jobs. Instead of creating/scaling the worker pods "manually", we could have code that uses native Kubernetes Jobs (https://kubernetes.io/docs/concepts/workloads/controllers/job/):

See an example in the Inference API: https://github.com/huggingface/api-inference/blob/8b89efbb02c995d4f8e0b4f3d0d3b5d4620cd1e0/master/app/deploy.py#L731
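
A minimal sketch of that approach with the official kubernetes Python client (the image name, namespace, resource limit, and the one-dataset-per-job layout are assumptions, not the api-inference code):

from kubernetes import client, config

def launch_worker_job(dataset, namespace="datasets-server"):
    # create one Kubernetes Job per queue item instead of long-lived workers;
    # the cluster then handles retries, completion, and garbage collection
    config.load_incluster_config()  # assumes this code runs inside the cluster
    container = client.V1Container(
        name="worker",
        image="datasets-server-worker:latest",  # hypothetical image
        args=["--dataset", dataset],
        resources=client.V1ResourceRequirements(limits={"memory": "30Gi"}),
    )
    job = client.V1Job(
        metadata=client.V1ObjectMeta(generate_name="dataset-worker-"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(containers=[container], restart_policy="Never")
            ),
            backoff_limit=0,  # don't retry OOMKilled jobs automatically
            ttl_seconds_after_finished=3600,  # let Kubernetes clean up finished pods
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace=namespace, body=job)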

severo commented 2 years ago

See also #526

severo commented 2 years ago

moved to the internal tracker