severo closed this issue 2 years ago.
See also https://github.com/huggingface/datasets-server/issues/91 (duplicate)
I added a pod dedicated to launching scripts (https://github.com/huggingface/datasets-server/pull/275).
For example, to cancel all the started jobs in the splits queue:
kubectl scale --replicas=0 deploy/datasets-server-prod-splits-worker
kubectl exec datasets-server-prod-admin-5b8486886f-ngk2g -- make cancel-started-split-jobs
kubectl scale --replicas=12 deploy/datasets-server-prod-splits-worker
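The `cancel-started-split-jobs` target is not reproduced in this thread. As an illustration only, a minimal sketch of what such a script could do, assuming the jobs live in a MongoDB collection with status/dataset/config/split fields (the connection string, collection and field names are assumptions, not the project's actual schema), would be to mark every started job as cancelled and enqueue a fresh waiting copy:

```python
# Hypothetical sketch: cancel all "started" split jobs and re-enqueue them as "waiting".
# Connection string, collection and field names are assumptions, not the real schema.
from datetime import datetime, timezone

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
jobs = client["queue"]["split_jobs"]               # assumed database/collection names

for job in list(jobs.find({"status": "started"})):
    # Mark the started job as cancelled.
    jobs.update_one(
        {"_id": job["_id"]},
        {"$set": {"status": "cancelled", "finished_at": datetime.now(timezone.utc)}},
    )
    # Clone it as a new waiting job so it gets picked up again later.
    jobs.insert_one(
        {
            "dataset_name": job["dataset_name"],
            "config_name": job.get("config_name"),
            "split_name": job.get("split_name"),
            "status": "waiting",
            "created_at": datetime.now(timezone.utc),
        }
    )
```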
Looks like a great short-term solution, but still a manual one. We should keep this issue open in order to implement a long-term solution, wdyt?
Absolutely (that's why I didn't close it).
In particular, we have to do https://github.com/huggingface/datasets-server/issues/264#issuecomment-1128622313 for the two queues after every helm upgrade!
So, we want to:

- cancel a started job (set its status to CANCELLED, and clone it to a new one with the status WAITING) before a worker ends (SIGINT or any other signal that will stop the worker). Note that since https://github.com/huggingface/datasets-server/pull/285, the max available RAM per pod is limited (to 4GiB currently), which kills the pod, which restarts, but the "started" job stays started forever (and may block the other splits from the same dataset) - should be fixed by #352
- regularly update a field (lastUpdate) in the job entry of the database. A job that has not been updated for long is considered stalled and should be canceled (by... who?). See the heartbeat sketch below.

Related: https://github.com/huggingface/datasets-server/pull/352:
The main problem is that, when a worker pod consumes more RAM than allowed (which occurs frequently with big datasets), Kubernetes kills the pod with the SIGKILL signal, which cannot be intercepted by the killed process. The only solution is to dedicate more memory or avoid using too much memory. And if a pod has been killed (OOMKilled), we should clean up on the next start, because we cannot do it before the process is killed. It's kind of tricky with multiple jobs. The heartbeat idea might help.
See #357 and #358 where we increase the available resources for the workers.
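To make the two wants above concrete: the worker can catch SIGTERM/SIGINT (SIGKILL from the OOM killer cannot be caught, hence the heartbeat as a fallback) and refresh a lastUpdate heartbeat while a job runs, so that a job whose heartbeat stops can later be detected as stalled. A minimal sketch in plain Python; `compute` and `touch` are placeholders for the real job logic and queue layer, which are not shown in this thread:

```python
# Sketch only: catch SIGTERM/SIGINT so the worker can requeue its current job before
# exiting, and refresh a heartbeat (e.g. a lastUpdate field) while a job is running.
import signal
import threading
from typing import Callable

stopping = threading.Event()

def _handle_stop(signum, frame):
    # Kubernetes sends SIGTERM before stopping a pod; SIGKILL (OOMKilled) cannot be
    # intercepted, which is why the heartbeat below is needed as a fallback.
    stopping.set()

signal.signal(signal.SIGTERM, _handle_stop)
signal.signal(signal.SIGINT, _handle_stop)

def run_with_heartbeat(
    compute: Callable[[], None], touch: Callable[[], None], interval: float = 30.0
) -> None:
    """Run `compute` while calling `touch` (e.g. set the job's lastUpdate to now) every `interval` seconds."""
    done = threading.Event()

    def beat() -> None:
        while not done.is_set():
            touch()
            done.wait(interval)

    thread = threading.Thread(target=beat, daemon=True)
    thread.start()
    try:
        compute()
    finally:
        done.set()
        thread.join()
```

The worker loop would call `run_with_heartbeat(job.run, job.touch)` for each job it starts, and check `stopping` between jobs to cancel its current job and enqueue a WAITING clone before exiting (the `job` methods here are hypothetical).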
Kubernetes trick: to know what happened to a pod that was killed:
k logs datasets-server-prod-datasets-worker-776b774978-rtj9h --previous
It shows the logs up to the crash.
An example of a dataset job (not a split job! we can't even know the list of splits!) that crashes due to a lack of RAM: echarlaix/gqa-lxmert
INFO: 2022-06-16 17:30:51,190 - datasets_server.worker - compute dataset 'echarlaix/gqa-lxmert'
Downloading builder script: 100%|██████████| 5.51k/5.51k [00:00<00:00, 3.92MB/s]
make: *** [Makefile:19: run] Killed
This means that it needs more than 30 GiB RAM!
Split jobs that get killed in the same way:
INFO: 2022-06-16 18:42:21,126 - datasets_server.worker - compute split 'train' from dataset 'imthanhlv/binhvq_news21_raw' with config 'imthanhlv--binhvq_news21_raw'
make: *** [Makefile:19: run] Killed
INFO: 2022-06-16 18:44:44,281 - datasets_server.worker - compute split 'test' from dataset 'openclimatefix/nimrod-uk-1km' with config 'sample'
Downloading builder script: 100%|██████████| 15.2k/15.2k [00:00<00:00, 6.04MB/s]
Downloading builder script: 100%|██████████| 15.2k/15.2k [00:00<00:00, 7.65MB/s]
2022-06-16 18:44:46.305062: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
2022-06-16 18:44:46.389820: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-06-16 18:44:46.389865: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-06-16 18:44:46.390005: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (datasets-server-prod-splits-worker-7f69bdfd9f-kzrdt): /proc/driver/nvidia/version does not exist
2022-06-16 18:44:46.390379: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
make: *** [Makefile:19: run] Killed
INFO: 2022-06-16 20:51:02,467 - datasets_server.worker - compute split 'train' from dataset 'openclimatefix/nimrod-uk-1km' with config 'sample'
Downloading builder script: 100%|██████████| 15.2k/15.2k [00:00<00:00, 6.98MB/s]
Downloading builder script: 100%|██████████| 15.2k/15.2k [00:00<00:00, 9.22MB/s]
2022-06-16 20:51:04.216905: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
make: *** [Makefile:19: run] Killed
INFO: 2022-06-16 20:51:03,387 - datasets_server.worker - compute split 'validation' from dataset 'openclimatefix/nimrod-uk-1km' with config 'sample'
Downloading builder script: 100%|██████████| 15.2k/15.2k [00:00<00:00, 6.65MB/s]
Downloading builder script: 100%|██████████| 15.2k/15.2k [00:00<00:00, 6.14MB/s]
2022-06-16 20:51:05.349829: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
make: *** [Makefile:19: run] Killed
Idea by @XciD: change the way we launch jobs. Instead of creating/scaling the worker pods "manually", we could have code that uses native Kubernetes Jobs (https://kubernetes.io/docs/concepts/workloads/controllers/job/):
See an example in the Inference API: https://github.com/huggingface/api-inference/blob/8b89efbb02c995d4f8e0b4f3d0d3b5d4620cd1e0/master/app/deploy.py#L731
See also #526
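As an illustration of the idea (not the actual implementation), creating one Kubernetes Job per queue entry with the official Python client could look like this; the image, command, namespace and resource limits are placeholders:

```python
# Sketch: launch one Kubernetes Job per queue entry instead of long-running worker pods.
# Image, command, namespace and resource limits are placeholders, not the project's values.
from kubernetes import client, config

def launch_split_job(dataset: str, config_name: str, split: str) -> None:
    config.load_incluster_config()  # or config.load_kube_config() outside the cluster
    container = client.V1Container(
        name="splits-worker",
        image="huggingface/datasets-server-worker:latest",  # placeholder image
        command=["python", "-m", "worker", dataset, config_name, split],  # placeholder command
        resources=client.V1ResourceRequirements(limits={"memory": "30Gi"}),
    )
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={"app": "splits-worker"}),
        spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
    )
    job = client.V1Job(
        metadata=client.V1ObjectMeta(generate_name="split-job-"),
        spec=client.V1JobSpec(template=template, backoff_limit=0, ttl_seconds_after_finished=3600),
    )
    client.BatchV1Api().create_namespaced_job(namespace="datasets-server", body=job)
```

With restartPolicy: Never and backoffLimit: 0, an OOMKilled run shows up as a failed Job instead of a worker pod that silently restarts; the corresponding queue entry would still need to be updated.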
Moved to the internal tracker.
Currently, if we stop the workers, the started jobs will remain forever and will potentially block other jobs from the same dataset (because of MAX_JOBS_PER_DATASET).
We want:
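One of the missing pieces mentioned above is who cancels a stalled job. A scheduled sweeper is one option; a minimal sketch, reusing the hypothetical lastUpdate field and MongoDB collection from the earlier sketch (the threshold, collection and field names are assumptions):

```python
# Hypothetical sweeper: requeue "started" jobs whose heartbeat (lastUpdate) is too old,
# e.g. because the worker pod was OOMKilled. All names are placeholders, not the real schema.
from datetime import datetime, timedelta, timezone

from pymongo import MongoClient

STALLED_AFTER = timedelta(minutes=20)  # assumed threshold

jobs = MongoClient("mongodb://localhost:27017")["queue"]["split_jobs"]
cutoff = datetime.now(timezone.utc) - STALLED_AFTER

for job in list(jobs.find({"status": "started", "lastUpdate": {"$lt": cutoff}})):
    # Cancel the stalled job and clone it as a new waiting one.
    jobs.update_one({"_id": job["_id"]}, {"$set": {"status": "cancelled"}})
    jobs.insert_one(
        {
            "dataset_name": job["dataset_name"],
            "config_name": job.get("config_name"),
            "split_name": job.get("split_name"),
            "status": "waiting",
            "created_at": datetime.now(timezone.utc),
        }
    )
```

Run on a schedule (e.g. a Kubernetes CronJob), this would release the slot held under MAX_JOBS_PER_DATASET when a worker dies without cleaning up.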