severo closed this issue 2 years ago.
See also https://github.com/huggingface/datasets-server/issues/91 (duplicate)
I added a pod dedicated to launching scripts (https://github.com/huggingface/datasets-server/pull/275).
For example, to cancel all the started jobs in the splits queue:
kubectl scale --replicas=0 deploy/datasets-server-prod-splits-worker
kubectl exec datasets-server-prod-admin-5b8486886f-ngk2g -- make cancel-started-split-jobs
kubectl scale --replicas=12 deploy/datasets-server-prod-splits-worker
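The `cancel-started-split-jobs` target is not reproduced in this thread. As an illustration only, a minimal sketch of what such a script could do, assuming the jobs live in a MongoDB collection with status/dataset/config/split fields (the connection string, collection and field names are assumptions, not the project's actual schema), would be to mark every started job as cancelled and enqueue a fresh waiting copy:

```python
# Hypothetical sketch: cancel all "started" split jobs and re-enqueue them as "waiting".
# Connection string, collection and field names are assumptions, not the real schema.
from datetime import datetime, timezone

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
jobs = client["queue"]["split_jobs"]               # assumed database/collection names

for job in list(jobs.find({"status": "started"})):
    # Mark the started job as cancelled.
    jobs.update_one(
        {"_id": job["_id"]},
        {"$set": {"status": "cancelled", "finished_at": datetime.now(timezone.utc)}},
    )
    # Clone it as a new waiting job so it gets picked up again later.
    jobs.insert_one(
        {
            "dataset_name": job["dataset_name"],
            "config_name": job.get("config_name"),
            "split_name": job.get("split_name"),
            "status": "waiting",
            "created_at": datetime.now(timezone.utc),
        }
    )
```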
Looks like a great short-term solution, but still a manual one. We should keep this issue open in order to implement a long-term solution, wdyt?
Absolutely (that's why I didn't close it).
In particular, we have to do https://github.com/huggingface/datasets-server/issues/264#issuecomment-1128622313 for the two queues after every helm upgrade!
So, we want to:

- cancel a started job (set its status to CANCELLED, and clone it to a new one with the status WAITING) before a worker ends (SIGINT or any other signal that will stop the worker). Note that since https://github.com/huggingface/datasets-server/pull/285, the max available RAM per pod is limited (to 4GiB currently), which kills the pod, which restarts, but the "started" job stays started forever (and may block the other splits from the same dataset) - should be fixed by #352
- regularly update a field (lastUpdate) in the job entry of the database. A job that has not been updated for long is considered stalled and should be canceled (by... who?). See the heartbeat sketch below.

Related: https://github.com/huggingface/datasets-server/pull/352:
The main problem is that, when a worker pod consumes more RAM than allowed (which occurs frequently with big datasets), Kubernetes kills the pod with the SIGKILL signal, which cannot be intercepted by the killed process. The only solution is to dedicate more memory or avoid using too much memory. And if a pod has been killed (OOMKilled), we should clean up on the next start, because we cannot do it before the process is killed. It's kind of tricky with multiple jobs. The heartbeat idea might help.
See #357 and #358 where we increase the available resources for the workers.
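To make the two wants above concrete: the worker can catch SIGTERM/SIGINT (SIGKILL from the OOM killer cannot be caught, hence the heartbeat as a fallback) and refresh a lastUpdate heartbeat while a job runs, so that a job whose heartbeat stops can later be detected as stalled. A minimal sketch in plain Python; `compute` and `touch` are placeholders for the real job logic and queue layer, which are not shown in this thread:

```python
# Sketch only: catch SIGTERM/SIGINT so the worker can requeue its current job before
# exiting, and refresh a heartbeat (e.g. a lastUpdate field) while a job is running.
import signal
import threading
from typing import Callable

stopping = threading.Event()

def _handle_stop(signum, frame):
    # Kubernetes sends SIGTERM before stopping a pod; SIGKILL (OOMKilled) cannot be
    # intercepted, which is why the heartbeat below is needed as a fallback.
    stopping.set()

signal.signal(signal.SIGTERM, _handle_stop)
signal.signal(signal.SIGINT, _handle_stop)

def run_with_heartbeat(
    compute: Callable[[], None], touch: Callable[[], None], interval: float = 30.0
) -> None:
    """Run `compute` while calling `touch` (e.g. set the job's lastUpdate to now) every `interval` seconds."""
    done = threading.Event()

    def beat() -> None:
        while not done.is_set():
            touch()
            done.wait(interval)

    thread = threading.Thread(target=beat, daemon=True)
    thread.start()
    try:
        compute()
    finally:
        done.set()
        thread.join()
```

The worker loop would call `run_with_heartbeat(job.run, job.touch)` for each job it starts, and check `stopping` between jobs to cancel its current job and enqueue a WAITING clone before exiting (the `job` methods here are hypothetical).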
Kubernetes trick: to know what happened to a pod that was killed:
k logs datasets-server-prod-datasets-worker-776b774978-rtj9h --previous
It shows the logs up to the crash.
An example of a dataset job (not a split job! we can't even know the list of splits!) that crashes due to a lack of RAM: echarlaix/gqa-lxmert
INFO: 2022-06-16 17:30:51,190 - datasets_server.worker - compute dataset 'echarlaix/gqa-lxmert'
Downloading builder script: 100%|██████████| 5.51k/5.51k [00:00<00:00, 3.92MB/s]
make: *** [Makefile:19: run] Killed
This means that it needs more than 30 GiB RAM!
Split jobs that get killed in the same way:
INFO: 2022-06-16 18:42:21,126 - datasets_server.worker - compute split 'train' from dataset 'imthanhlv/binhvq_news21_raw' with config 'imthanhlv--binhvq_news21_raw'
make: *** [Makefile:19: run] Killed
INFO: 2022-06-16 18:44:44,281 - datasets_server.worker - compute split 'test' from dataset 'openclimatefix/nimrod-uk-1km' with config 'sample'
Downloading builder script: 100%|██████████| 15.2k/15.2k [00:00<00:00, 6.04MB/s]
Downloading builder script: 100%|██████████| 15.2k/15.2k [00:00<00:00, 7.65MB/s]
2022-06-16 18:44:46.305062: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
2022-06-16 18:44:46.389820: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-06-16 18:44:46.389865: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-06-16 18:44:46.390005: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (datasets-server-prod-splits-worker-7f69bdfd9f-kzrdt): /proc/driver/nvidia/version does not exist
2022-06-16 18:44:46.390379: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
make: *** [Makefile:19: run] Killed
INFO: 2022-06-16 20:51:02,467 - datasets_server.worker - compute split 'train' from dataset 'openclimatefix/nimrod-uk-1km' with config 'sample'
Downloading builder script: 100%|██████████| 15.2k/15.2k [00:00<00:00, 6.98MB/s]
Downloading builder script: 100%|██████████| 15.2k/15.2k [00:00<00:00, 9.22MB/s]
2022-06-16 20:51:04.216905: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
make: *** [Makefile:19: run] Killed
INFO: 2022-06-16 20:51:03,387 - datasets_server.worker - compute split 'validation' from dataset 'openclimatefix/nimrod-uk-1km' with config 'sample'
Downloading builder script: 100%|██████████| 15.2k/15.2k [00:00<00:00, 6.65MB/s]
Downloading builder script: 100%|██████████| 15.2k/15.2k [00:00<00:00, 6.14MB/s]
2022-06-16 20:51:05.349829: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
make: *** [Makefile:19: run] Killed
Idea by @XciD: change the way we launch jobs. Instead of creating/scaling the worker pods "manually", we could have code that uses native Kubernetes Jobs (https://kubernetes.io/docs/concepts/workloads/controllers/job/):
See an example in the Inference API: https://github.com/huggingface/api-inference/blob/8b89efbb02c995d4f8e0b4f3d0d3b5d4620cd1e0/master/app/deploy.py#L731
See also #526
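As an illustration of the idea (not the actual implementation), creating one Kubernetes Job per queue entry with the official Python client could look like this; the image, command, namespace and resource limits are placeholders:

```python
# Sketch: launch one Kubernetes Job per queue entry instead of long-running worker pods.
# Image, command, namespace and resource limits are placeholders, not the project's values.
from kubernetes import client, config

def launch_split_job(dataset: str, config_name: str, split: str) -> None:
    config.load_incluster_config()  # or config.load_kube_config() outside the cluster
    container = client.V1Container(
        name="splits-worker",
        image="huggingface/datasets-server-worker:latest",  # placeholder image
        command=["python", "-m", "worker", dataset, config_name, split],  # placeholder command
        resources=client.V1ResourceRequirements(limits={"memory": "30Gi"}),
    )
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={"app": "splits-worker"}),
        spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
    )
    job = client.V1Job(
        metadata=client.V1ObjectMeta(generate_name="split-job-"),
        spec=client.V1JobSpec(template=template, backoff_limit=0, ttl_seconds_after_finished=3600),
    )
    client.BatchV1Api().create_namespaced_job(namespace="datasets-server", body=job)
```

With restartPolicy: Never and backoffLimit: 0, an OOMKilled run shows up as a failed Job instead of a worker pod that silently restarts; the corresponding queue entry would still need to be updated.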
Moved to the internal tracker.
Currently, if we stop the workers, the started jobs will remain forever and will potentially block other jobs from the same dataset (because of MAX_JOBS_PER_DATASET).
We want:
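One of the missing pieces mentioned above is who cancels a stalled job. A scheduled sweeper is one option; a minimal sketch, reusing the hypothetical lastUpdate field and MongoDB collection from the earlier sketch (the threshold, collection and field names are assumptions):

```python
# Hypothetical sweeper: requeue "started" jobs whose heartbeat (lastUpdate) is too old,
# e.g. because the worker pod was OOMKilled. All names are placeholders, not the real schema.
from datetime import datetime, timedelta, timezone

from pymongo import MongoClient

STALLED_AFTER = timedelta(minutes=20)  # assumed threshold

jobs = MongoClient("mongodb://localhost:27017")["queue"]["split_jobs"]
cutoff = datetime.now(timezone.utc) - STALLED_AFTER

for job in list(jobs.find({"status": "started", "lastUpdate": {"$lt": cutoff}})):
    # Cancel the stalled job and clone it as a new waiting one.
    jobs.update_one({"_id": job["_id"]}, {"$set": {"status": "cancelled"}})
    jobs.insert_one(
        {
            "dataset_name": job["dataset_name"],
            "config_name": job.get("config_name"),
            "split_name": job.get("split_name"),
            "status": "waiting",
            "created_at": datetime.now(timezone.utc),
        }
    )
```

Run on a schedule (e.g. a Kubernetes CronJob), this would release the slot held under MAX_JOBS_PER_DATASET when a worker dies without cleaning up.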