intel-analytics / ipex-llm


RayTaskError(RuntimeError): ray::Worker.setup() (pid=5180, ip=10.155.171.50, repr=<bigdl.orca.learn.dl_cluster.Worker object at 0x7fb3cc861160>) #9287

Open fatenlouati opened 1 year ago

fatenlouati commented 1 year ago

When running my code on Databricks using bigdl-orca, I got this error: `RayTaskError(RuntimeError): ray::Worker.setup() (pid=5180, ip=10.155.171.50, repr=<bigdl.orca.learn.dl_cluster.Worker object at 0x7fb3cc861160>)`
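The log for this run is long and mostly consists of a few message families repeated by every worker: CUDA library lookups on CPU-only nodes, the TensorFlow oneDNN notice, the SciPy/NumPy version warning, and `Assets written to` lines. A minimal sketch for filtering those out while searching for the `RuntimeError` is given here. The pattern list and the `drop_noise` helper are illustrative assumptions, not part of bigdl-orca or Ray, and treating the NumPy warning as noise is only a triage choice; it may still deserve attention on its own.

```python
import re

# Substrings assumed, for triage only, to mark repeated non-fatal messages.
# This list is hypothetical and taken from the messages visible in the log.
NOISE_PATTERNS = [
    r"Could not load dynamic library 'libcud",
    r"Ignore above cudart dlerror",
    r"failed call to cuInit",
    r"kernel driver does not appear to be running",
    r"This TensorFlow binary is optimized with oneAPI",
    r"To enable them in other operations, rebuild TensorFlow",
    r"A NumPy version >=",
    r"INFO:tensorflow:Assets written to",
]
NOISE_RE = re.compile("|".join(NOISE_PATTERNS))


def drop_noise(lines):
    """Return the non-blank lines that match none of the noise patterns."""
    return [ln for ln in lines if ln.strip() and not NOISE_RE.search(ln)]


# A few lines copied from the log, used here as sample input.
sample = [
    "Launching Ray on cluster with Spark barrier mode",
    "(Worker pid=5180, ip=10.155.171.50) 2023-10-26 13:50:03.747614: W "
    "tensorflow/stream_executor/cuda/cuda_driver.cc:269] "
    "failed call to cuInit: UNKNOWN ERROR (303)",
    "INFO:tensorflow:Assets written to: "
    "ram://748de196-fd38-44b1-9f13-a22f54440ebb/assets",
    "(pid=5180, ip=10.155.171.50)   warnings.warn(f\"A NumPy version "
    ">={np_minversion} and <{np_maxversion} is required for this version of \"",
]

for line in drop_noise(sample):
    print(line)
```

Run against the full driver output, the surviving lines should be the Ray startup messages plus whatever traceback accompanies `ray::Worker.setup()`.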

The log:

Launching Ray on cluster with Spark barrier mode
Start to launch ray driver
Executing command: ray start --address 10.155.164.0:23706 --num-cpus 0 --node-ip-address 10.155.169.193
2023-10-26 13:49:57,646 INFO scripts.py:747 -- Local node IP: 10.155.169.193
2023-10-26 13:49:57,859 SUCC scripts.py:755 -- --------------------
2023-10-26 13:49:57,859 SUCC scripts.py:756 -- Ray runtime started.
2023-10-26 13:49:57,859 SUCC scripts.py:757 -- --------------------
2023-10-26 13:49:57,859 INFO scripts.py:759 -- To terminate the Ray runtime, run
2023-10-26 13:49:57,859 INFO scripts.py:760 --   ray stop

2023-10-26 13:49:59,017 INFO worker.py:842 -- Connecting to existing Ray cluster at address: 10.155.164.0:23706
{'node_ip_address': '10.155.169.193', 'raylet_ip_address': '10.155.169.193', 'redis_address': '10.155.164.0:23706', 'object_store_address': '/tmp/ray/session_2023-10-26_13-49-45_506013_4890/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2023-10-26_13-49-45_506013_4890/sockets/raylet', 'webui_url': '10.155.164.0:8265', 'session_dir': '/tmp/ray/session_2023-10-26_13-49-45_506013_4890', 'metrics_export_port': 64170, 'node_id': 'ffbb08bcc9ea01ba0d3afb234d40a9ff60994ca0252bd0a9b8caacf6'}
INFO:tensorflow:Assets written to: ram://748de196-fd38-44b1-9f13-a22f54440ebb/assets
INFO:tensorflow:Assets written to: ram://42eefa9b-3674-4ebd-ad56-16cbd4799255/assets
(pid=5180, ip=10.155.171.50) /databricks/python/lib/python3.8/site-packages/scipy/__init__.py:138: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.4)
(pid=5180, ip=10.155.171.50)   warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion} is required for this version of "
(Worker pid=5180, ip=10.155.171.50) 2023-10-26 13:50:01.264656: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
(Worker pid=5180, ip=10.155.171.50) 2023-10-26 13:50:01.264697: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
INFO:tensorflow:Assets written to: ram://ee6efe6b-c639-4837-81ea-ef3c4d408956/assets
(pid=5374, ip=10.155.164.0) /databricks/python/lib/python3.8/site-packages/scipy/__init__.py:138: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.4)
(pid=5374, ip=10.155.164.0)   warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion} is required for this version of "
(Worker pid=5374, ip=10.155.164.0) 2023-10-26 13:50:01.862198: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
(Worker pid=5374, ip=10.155.164.0) 2023-10-26 13:50:01.862260: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
INFO:tensorflow:Assets written to: ram://e016a29b-9251-47de-ab32-99c6bc3af8b0/assets
(pid=5194, ip=10.155.187.221) /databricks/python/lib/python3.8/site-packages/scipy/__init__.py:138: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.4)
(pid=5194, ip=10.155.187.221)   warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion} is required for this version of "
(Worker pid=5194, ip=10.155.187.221) 2023-10-26 13:50:02.672441: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
(Worker pid=5194, ip=10.155.187.221) 2023-10-26 13:50:02.672488: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
INFO:tensorflow:Assets written to: ram://3d59f768-cac7-4cff-bfe1-7f7d2da4b277/assets
(pid=5207, ip=10.155.181.21) /databricks/python/lib/python3.8/site-packages/scipy/__init__.py:138: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.4)
(pid=5207, ip=10.155.181.21)   warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion} is required for this version of "
(Worker pid=5207, ip=10.155.181.21) 2023-10-26 13:50:03.236888: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
(Worker pid=5207, ip=10.155.181.21) 2023-10-26 13:50:03.236931: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
INFO:tensorflow:Assets written to: ram://28ddf9e3-4f86-4ba7-94d7-1f33b7716a14/assets
(pid=9132) /databricks/python/lib/python3.8/site-packages/scipy/__init__.py:138: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.4)
(pid=9132)   warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion} is required for this version of "
(Worker pid=5180, ip=10.155.171.50) 2023-10-26 13:50:03.747569: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
(Worker pid=5180, ip=10.155.171.50) 2023-10-26 13:50:03.747614: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
(Worker pid=5180, ip=10.155.171.50) 2023-10-26 13:50:03.747642: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (1018-223841-bhu8vvu6-10-155-171-50): /proc/driver/nvidia/version does not exist
(Worker pid=5180, ip=10.155.171.50) 2023-10-26 13:50:03.747894: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
(Worker pid=5180, ip=10.155.171.50) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(Worker pid=9132) 2023-10-26 13:50:03.926312: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
(Worker pid=9132) 2023-10-26 13:50:03.926365: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
INFO:tensorflow:Assets written to: ram://6b9fa719-d378-4ab3-900a-d102d4b0983a/assets
(pid=5228, ip=10.155.171.50) /databricks/python/lib/python3.8/site-packages/scipy/__init__.py:138: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.4)
(pid=5228, ip=10.155.171.50)   warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion} is required for this version of "
(Worker pid=5374, ip=10.155.164.0) 2023-10-26 13:50:04.472262: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
(Worker pid=5374, ip=10.155.164.0) 2023-10-26 13:50:04.472306: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
(Worker pid=5374, ip=10.155.164.0) 2023-10-26 13:50:04.472335: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (1018-223841-bhu8vvu6-10-155-164-0): /proc/driver/nvidia/version does not exist
(Worker pid=5374, ip=10.155.164.0) 2023-10-26 13:50:04.472584: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
(Worker pid=5374, ip=10.155.164.0) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(Worker pid=5228, ip=10.155.171.50) 2023-10-26 13:50:04.715212: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
(Worker pid=5228, ip=10.155.171.50) 2023-10-26 13:50:04.715270: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
INFO:tensorflow:Assets written to: ram://20116bbe-6866-4775-9056-85e97861a852/assets
(pid=5249, ip=10.155.171.50) /databricks/python/lib/python3.8/site-packages/scipy/__init__.py:138: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.4)
(pid=5249, ip=10.155.171.50)   warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion} is required for this version of "
(Worker pid=5194, ip=10.155.187.221) 2023-10-26 13:50:05.168101: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
(Worker pid=5194, ip=10.155.187.221) 2023-10-26 13:50:05.168141: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
(Worker pid=5194, ip=10.155.187.221) 2023-10-26 13:50:05.168168: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (1018-223841-bhu8vvu6-10-155-187-221): /proc/driver/nvidia/version does not exist
(Worker pid=5194, ip=10.155.187.221) 2023-10-26 13:50:05.168415: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
(Worker pid=5194, ip=10.155.187.221) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(Worker pid=5249, ip=10.155.171.50) 2023-10-26 13:50:05.269847: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
(Worker pid=5249, ip=10.155.171.50) 2023-10-26 13:50:05.269904: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
(Worker pid=5207, ip=10.155.181.21) 2023-10-26 13:50:05.580677: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
(Worker pid=5207, ip=10.155.181.21) 2023-10-26 13:50:05.580722: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
(Worker pid=5207, ip=10.155.181.21) 2023-10-26 13:50:05.580750: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (1018-223841-bhu8vvu6-10-155-181-21): /proc/driver/nvidia/version does not exist
(Worker pid=5207, ip=10.155.181.21) 2023-10-26 13:50:05.581008: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
(Worker pid=5207, ip=10.155.181.21) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO:tensorflow:Assets written to: ram://dd6e5cff-07b9-46d8-8127-2e334822efd4/assets
(pid=9176) /databricks/python/lib/python3.8/site-packages/scipy/__init__.py:138: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.4)
(pid=9176)   warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion} is required for this version of "
INFO:tensorflow:Assets written to: ram://7141830c-e8d7-4787-b9f4-dd6eb551d5d5/assets
(Worker pid=9176) 2023-10-26 13:50:06.455111: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
(Worker pid=9176) 2023-10-26 13:50:06.455155: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
(pid=5273, ip=10.155.181.21) /databricks/python/lib/python3.8/site-packages/scipy/__init__.py:138: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.4)
(pid=5273, ip=10.155.181.21)   warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion} is required for this version of "
(Worker pid=9132) 2023-10-26 13:50:06.685332: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
(Worker pid=9132) 2023-10-26 13:50:06.685391: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
(Worker pid=9132) 2023-10-26 13:50:06.685435: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (1018-223841-bhu8vvu6-10-155-169-193): /proc/driver/nvidia/version does not exist
(Worker pid=9132) 2023-10-26 13:50:06.685737: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
(Worker pid=9132) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(Worker pid=5273, ip=10.155.181.21) 2023-10-26 13:50:06.818439: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
(Worker pid=5273, ip=10.155.181.21) 2023-10-26 13:50:06.818482: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
(Worker pid=5228, ip=10.155.171.50) 2023-10-26 13:50:07.172407: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
(Worker pid=5228, ip=10.155.171.50) 2023-10-26 13:50:07.172451: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
(Worker pid=5228, ip=10.155.171.50) 2023-10-26 13:50:07.172479: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (1018-223841-bhu8vvu6-10-155-171-50): /proc/driver/nvidia/version does not exist
(Worker pid=5228, ip=10.155.171.50) 2023-10-26 13:50:07.172731: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
(Worker pid=5228, ip=10.155.171.50) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO:tensorflow:Assets written to: ram://a68a989f-3d35-4e32-aa47-591e1aeca66a/assets
(Worker pid=5249, ip=10.155.171.50) 2023-10-26 13:50:07.792797: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
(Worker pid=5249, ip=10.155.171.50) 2023-10-26 13:50:07.792839: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
(Worker pid=5249, ip=10.155.171.50) 2023-10-26 13:50:07.792868: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (1018-223841-bhu8vvu6-10-155-171-50): /proc/driver/nvidia/version does not exist
(Worker pid=5249, ip=10.155.171.50) 2023-10-26 13:50:07.793115: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
(Worker pid=5249, ip=10.155.171.50) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(pid=9213) /databricks/python/lib/python3.8/site-packages/scipy/__init__.py:138: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.4)
(pid=9213)   warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion} is required for this version of "
INFO:tensorflow:Assets written to: ram://3ae61665-e586-430e-8b4b-93c6dd608674/assets
(Worker pid=9213) 2023-10-26 13:50:08.206110: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
(Worker pid=9213) 2023-10-26 13:50:08.206147: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
(pid=5222, ip=10.155.172.21) /databricks/python/lib/python3.8/site-packages/scipy/__init__.py:138: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.4)
(pid=5222, ip=10.155.172.21)   warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion} is required for this version of "
(Worker pid=5222, ip=10.155.172.21) 2023-10-26 13:50:08.473534: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
(Worker pid=5222, ip=10.155.172.21) 2023-10-26 13:50:08.473575: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
(Worker pid=5273, ip=10.155.181.21) 2023-10-26 13:50:08.720859: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
(Worker pid=5273, ip=10.155.181.21) 2023-10-26 13:50:08.720902: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
(Worker pid=5273, ip=10.155.181.21) 2023-10-26 13:50:08.720932: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (1018-223841-bhu8vvu6-10-155-181-21): /proc/driver/nvidia/version does not exist
(Worker pid=5273, ip=10.155.181.21) 2023-10-26 13:50:08.721173: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
(Worker pid=5273, ip=10.155.181.21) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO:tensorflow:Assets written to: ram://1b3c1cd8-2521-4470-9f59-85009a285fa8/assets
(pid=9266) /databricks/python/lib/python3.8/site-packages/scipy/__init__.py:138: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.4)
(pid=9266)   warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion} is required for this version of "
(Worker pid=9176) 2023-10-26 13:50:10.236178: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
(Worker pid=9176) 2023-10-26 13:50:10.236239: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
(Worker pid=9176) 2023-10-26 13:50:10.236277: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (1018-223841-bhu8vvu6-10-155-169-193): /proc/driver/nvidia/version does not exist
(Worker pid=9176) 2023-10-26 13:50:10.236908: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
(Worker pid=9176) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(Worker pid=9266) 2023-10-26 13:50:10.413099: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory

*** WARNING: skipped 61042 bytes of output ***

(Worker pid=5619, ip=10.155.171.50) 2023-10-26 13:50:52.292913: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
(Worker pid=5619, ip=10.155.171.50) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(Worker pid=5634, ip=10.155.164.0) 2023-10-26 13:50:52.402541: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
(Worker pid=5634, ip=10.155.164.0) 2023-10-26 13:50:52.402584: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
INFO:tensorflow:Assets written to: ram://33b6e429-b3d3-4b0a-85a3-65ab0bb076cc/assets
(pid=5515, ip=10.155.181.21) /databricks/python/lib/python3.8/site-packages/scipy/__init__.py:138: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.4)
(pid=5515, ip=10.155.181.21)   warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion} is required for this version of "
(Worker pid=5515, ip=10.155.181.21) 2023-10-26 13:50:52.995453: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
(Worker pid=5515, ip=10.155.181.21) 2023-10-26 13:50:52.995492: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
INFO:tensorflow:Assets written to: ram://a7a9551c-fc35-41a9-b5e0-8e6fe1b7e7ac/assets
(pid=5719, ip=10.155.171.50) /databricks/python/lib/python3.8/site-packages/scipy/__init__.py:138: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.4)
(pid=5719, ip=10.155.171.50)   warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion} is required for this version of "
INFO:tensorflow:Assets written to: ram://e268cc86-820c-40d6-89fd-de3bbc1cb57d/assets
(Worker pid=5662, ip=10.155.171.50) 2023-10-26 13:50:53.801841: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
(Worker pid=5662, ip=10.155.171.50) 2023-10-26 13:50:53.801902: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
(Worker pid=5662, ip=10.155.171.50) 2023-10-26 13:50:53.801946: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (1018-223841-bhu8vvu6-10-155-171-50): /proc/driver/nvidia/version does not exist
(Worker pid=5662, ip=10.155.171.50) 2023-10-26 13:50:53.802268: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
(Worker pid=5662, ip=10.155.171.50) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(Worker pid=5719, ip=10.155.171.50) 2023-10-26 13:50:53.758750: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
(Worker pid=5719, ip=10.155.171.50) 2023-10-26 13:50:53.758845: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
(pid=5559, ip=10.155.187.221) /databricks/python/lib/python3.8/site-packages/scipy/__init__.py:138: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.4)
(pid=5559, ip=10.155.187.221)   warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion} is required for this version of "
(Worker pid=5559, ip=10.155.187.221) 2023-10-26 13:50:54.320375: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
(Worker pid=5559, ip=10.155.187.221) 2023-10-26 13:50:54.320420: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
INFO:tensorflow:Assets written to: ram://d6d73b9d-afce-47a3-b0e0-14aa7875730c/assets
2023-10-26 13:50:54,535 WARNING worker.py:1245 -- WARNING: 17 PYTHON worker processes have been started on node: ffbb08bcc9ea01ba0d3afb234d40a9ff60994ca0252bd0a9b8caacf6 with address: 10.155.169.193. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
(Worker pid=5634, ip=10.155.164.0) 2023-10-26 13:50:54.483540: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
(Worker pid=5634, ip=10.155.164.0) 2023-10-26 13:50:54.483576: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
(Worker pid=5634, ip=10.155.164.0) 2023-10-26 13:50:54.483605: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (1018-223841-bhu8vvu6-10-155-164-0): /proc/driver/nvidia/version does not exist
(Worker pid=5634, ip=10.155.164.0) 2023-10-26 13:50:54.483858: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
(Worker pid=5634, ip=10.155.164.0) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(pid=5516, ip=10.155.172.21) /databricks/python/lib/python3.8/site-packages/scipy/__init__.py:138: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.4)
(pid=5516, ip=10.155.172.21)   warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion} is required for this version of "
(Worker pid=5516, ip=10.155.172.21) 2023-10-26 13:50:54.842045: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
(Worker pid=5516, ip=10.155.172.21) 2023-10-26 13:50:54.842085: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
(Worker pid=5515, ip=10.155.181.21) 2023-10-26 13:50:54.858471: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
(Worker pid=5515, ip=10.155.181.21) 2023-10-26 13:50:54.858512: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
(Worker pid=5515, ip=10.155.181.21) 2023-10-26 13:50:54.858539: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (1018-223841-bhu8vvu6-10-155-181-21): /proc/driver/nvidia/version does not exist
(Worker pid=5515, ip=10.155.181.21) 2023-10-26 13:50:54.858767: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
(Worker pid=5515, ip=10.155.181.21) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO:tensorflow:Assets written to: ram://e27a94de-b06f-42ba-8526-3c2e074e3f9c/assets
(pid=10414) /databricks/python/lib/python3.8/site-packages/scipy/__init__.py:138: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.4)
(pid=10414)   warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion} is required for this version of "
(Worker pid=10414) 2023-10-26 13:50:55.796590: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
(Worker pid=10414) 2023-10-26 13:50:55.796632: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
INFO:tensorflow:Assets written to: ram://43765f18-dc48-459a-9a0c-a35f575e50ac/assets
(Worker pid=5719, ip=10.155.171.50) 2023-10-26 13:50:55.901215: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
(Worker pid=5719, ip=10.155.171.50) 2023-10-26 13:50:55.901277: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
(Worker pid=5719, ip=10.155.171.50) 2023-10-26 13:50:55.901324: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (1018-223841-bhu8vvu6-10-155-171-50): /proc/driver/nvidia/version does not exist
(Worker pid=5719, ip=10.155.171.50) 2023-10-26 13:50:55.901609: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
(Worker pid=5719, ip=10.155.171.50) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(pid=5553, ip=10.155.172.21) /databricks/python/lib/python3.8/site-packages/scipy/__init__.py:138: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.4)
(pid=5553, ip=10.155.172.21)   warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion} is required for this version of "
(Worker pid=5553, ip=10.155.172.21) 2023-10-26 13:50:56.448293: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
(Worker pid=5553, ip=10.155.172.21) 2023-10-26 13:50:56.448333: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
(Worker pid=5559, ip=10.155.187.221) 2023-10-26 13:50:56.527104: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
(Worker pid=5559, ip=10.155.187.221) 2023-10-26 13:50:56.527153: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
(Worker pid=5559, ip=10.155.187.221) 2023-10-26 13:50:56.527183: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (1018-223841-bhu8vvu6-10-155-187-221): /proc/driver/nvidia/version does not exist
(Worker pid=5559, ip=10.155.187.221) 2023-10-26 13:50:56.527420: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
(Worker pid=5559, ip=10.155.187.221) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO:tensorflow:Assets written to: ram://30cc441f-daf8-493b-b8c5-aa719c4d33c9/assets
(pid=5606, ip=10.155.187.221) /databricks/python/lib/python3.8/site-packages/scipy/__init__.py:138: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.4)
(pid=5606, ip=10.155.187.221)   warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion} is required for this version of "
2023-10-26 13:50:56,980 WARNING worker.py:1245 -- WARNING: 18 PYTHON worker processes have been started on node: ffbb08bcc9ea01ba0d3afb234d40a9ff60994ca0252bd0a9b8caacf6 with address: 10.155.169.193. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
(Worker pid=5516, ip=10.155.172.21) 2023-10-26 13:50:57.057287: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
(Worker pid=5516, ip=10.155.172.21) 2023-10-26 13:50:57.057341: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
(Worker pid=5516, ip=10.155.172.21) 2023-10-26 13:50:57.057390: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (1018-223841-bhu8vvu6-10-155-172-21): /proc/driver/nvidia/version does not exist
(Worker pid=5516, ip=10.155.172.21) 2023-10-26 13:50:57.057636: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
(Worker pid=5516, ip=10.155.172.21) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(Worker pid=5606, ip=10.155.187.221) 2023-10-26 13:50:57.177151: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
(Worker pid=5606, ip=10.155.187.221) 2023-10-26 13:50:57.177194: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
INFO:tensorflow:Assets written to: ram://630b63d8-d7f3-48f9-b44c-8314e6508053/assets
2023-10-26 13:50:57,879 WARNING worker.py:1245 -- WARNING: 19 PYTHON worker processes have been started on node: ffbb08bcc9ea01ba0d3afb234d40a9ff60994ca0252bd0a9b8caacf6 with address: 10.155.169.193. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
(pid=10457) /databricks/python/lib/python3.8/site-packages/scipy/__init__.py:138: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.4)
(pid=10457)   warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion} is required for this version of "
(Worker pid=10457) 2023-10-26 13:50:58.513114: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
(Worker pid=10457) 2023-10-26 13:50:58.513172: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
(Worker pid=5553, ip=10.155.172.21) 2023-10-26 13:50:58.560672: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
(Worker pid=5553, ip=10.155.172.21) 2023-10-26 13:50:58.560716: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
(Worker pid=5553, ip=10.155.172.21) 2023-10-26 13:50:58.560743: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (1018-223841-bhu8vvu6-10-155-172-21): /proc/driver/nvidia/version does not exist
(Worker pid=5553, ip=10.155.172.21) 2023-10-26 13:50:58.560993: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
(Worker pid=5553, ip=10.155.172.21) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(Worker pid=10414) 2023-10-26 13:50:58.651538: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
(Worker pid=10414) 2023-10-26 13:50:58.651588: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
(Worker pid=10414) 2023-10-26 13:50:58.651629: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (1018-223841-bhu8vvu6-10-155-169-193): /proc/driver/nvidia/version does not exist
(Worker pid=10414) 2023-10-26 13:50:58.651921: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
(Worker pid=10414) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO:tensorflow:Assets written to: ram://e9ba9d01-e354-4eb4-b847-d604a49da750/assets
(Worker pid=5606, ip=10.155.187.221) 2023-10-26 13:50:59.303879: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
(Worker pid=5606, ip=10.155.187.221) 2023-10-26 13:50:59.303928: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
(Worker pid=5606, ip=10.155.187.221) 2023-10-26 13:50:59.303959: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (1018-223841-bhu8vvu6-10-155-187-221): /proc/driver/nvidia/version does not exist
(Worker pid=5606, ip=10.155.187.221) 2023-10-26 13:50:59.304199: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
(Worker pid=5606, ip=10.155.187.221) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(pid=10493) /databricks/python/lib/python3.8/site-packages/scipy/__init__.py:138: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.4)
(pid=10493)   warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion} is required for this version of "
(Worker pid=10493) 2023-10-26 13:50:59.654789: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
(Worker pid=10493) 2023-10-26 13:50:59.654837: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
(pid=5633, ip=10.155.172.21) /databricks/python/lib/python3.8/site-packages/scipy/__init__.py:138: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.4)
(pid=5633, ip=10.155.172.21)   warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion} is required for this version of "
(Worker pid=5633, ip=10.155.172.21) 2023-10-26 13:50:59.926667: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
(Worker pid=5633, ip=10.155.172.21) 2023-10-26 13:50:59.926722: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
(Worker pid=10457) 2023-10-26 13:51:01.404304: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
(Worker pid=10457) 2023-10-26 13:51:01.404363: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
(Worker pid=10457) 2023-10-26 13:51:01.404397: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (1018-223841-bhu8vvu6-10-155-169-193): /proc/driver/nvidia/version does not exist
(Worker pid=10457) 2023-10-26 13:51:01.404647: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
(Worker pid=10457) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(Worker pid=5633, ip=10.155.172.21) 2023-10-26 13:51:01.927520: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
(Worker pid=5633, ip=10.155.172.21) 2023-10-26 13:51:01.927564: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
(Worker pid=5633, ip=10.155.172.21) 2023-10-26 13:51:01.927594: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (1018-223841-bhu8vvu6-10-155-172-21): /proc/driver/nvidia/version does not exist
(Worker pid=5633, ip=10.155.172.21) 2023-10-26 13:51:01.927858: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
(Worker pid=5633, ip=10.155.172.21) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(Worker pid=10493) 2023-10-26 13:51:02.332708: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
(Worker pid=10493) 2023-10-26 13:51:02.332744: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
(Worker pid=10493) 2023-10-26 13:51:02.332772: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (1018-223841-bhu8vvu6-10-155-169-193): /proc/driver/nvidia/version does not exist
(Worker pid=10493) 2023-10-26 13:51:02.332998: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
(Worker pid=10493) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
`

RayTaskError(RuntimeError): ray::Worker.setup() (pid=5180, ip=10.155.171.50, repr=<bigdl.orca.learn.dl_cluster.Worker object at 0x7fb3cc861160>)

Any help fixing this issue would be appreciated. When I use backend="spark", it requires more resources; it seems that the Spark backend does not distribute workloads across multiple nodes the way Ray does. Thank you.

fatenlouati commented 1 year ago

Update: this is the traceback:

--> 314         est = Estimator.from_keras(model_creator=model, workers_per_node=5)

/databricks/python/lib/python3.8/site-packages/bigdl/orca/learn/tf2/estimator.py in from_keras(model_creator, config, verbose, workers_per_node, compile_args_creator, backend, cpu_binding, log_to_driver, model_dir, **kwargs)
     69         if backend in {"ray", "horovod"}:
     70             from bigdl.orca.learn.tf2.ray_estimator import TensorFlow2Estimator
---> 71             return TensorFlow2Estimator(model_creator=model_creator, config=config,
     72                                         verbose=verbose, workers_per_node=workers_per_node,
     73                                         backend=backend, compile_args_creator=compile_args_creator,

/databricks/python/lib/python3.8/site-packages/bigdl/orca/learn/tf2/ray_estimator.py in __init__(self, model_creator, compile_args_creator, config, verbose, backend, workers_per_node, cpu_binding)
    116             urls = ["{ip}:{port}".format(ip=ips[i], port=ports[i])
    117                     for i in range(len(self.remote_workers))]
--> 118             ray.get([worker.setup.remote() for worker in self.remote_workers])
    119             # Get setup tasks in order to throw errors on failure
    120             ray.get([

/databricks/python/lib/python3.8/site-packages/ray/_private/client_mode_hook.py in wrapper(*args, **kwargs)
    103             if func.__name__ != "init" or is_client_mode_enabled_by_default:
    104                 return getattr(ray, func.__name__)(*args, **kwargs)
--> 105         return func(*args, **kwargs)
    106 
    107     return wrapper

/databricks/python/lib/python3.8/site-packages/ray/worker.py in get(object_refs, timeout)
   1711                     worker.core_worker.dump_object_store_memory_usage()
   1712                 if isinstance(value, RayTaskError):
-> 1713                     raise value.as_instanceof_cause()
   1714                 else:
   1715                     raise value

RayTaskError(RuntimeError): ray::Worker.setup() (pid=5180, ip=10.155.171.50, repr=<bigdl.orca.learn.dl_cluster.Worker object at 0x7fb3cc861160>)
  File "/databricks/python/lib/python3.8/site-packages/bigdl/orca/learn/tf2/tf_runner.py", line 271, in setup
    tf.config.threading.set_inter_op_parallelism_threads(self.inter_op_parallelism)
  File "/databricks/python/lib/python3.8/site-packages/tensorflow/python/framework/config.py", line 144, in set_inter_op_parallelism_threads
    context.context().inter_op_parallelism_threads = num_threads
  File "/databricks/python/lib/python3.8/site-packages/tensorflow/python/eager/context.py", line 1841, in inter_op_parallelism_threads
    raise RuntimeError(
RuntimeError: Inter op parallelism cannot be modified after initialization.
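
For context on the final `RuntimeError`: TensorFlow only accepts a new inter-op thread count before its runtime has been initialized in the current process. The sketch below is a minimal illustration of that ordering rule, not a diagnosis of this job. The function name and thread counts are hypothetical, and the function returns a note instead of failing if TensorFlow is not installed.

```python
def show_threading_order_matters(first=2, second=4):
    """Return the message TensorFlow raises when inter-op threads change too late."""
    try:
        import tensorflow as tf
    except ImportError:
        return "tensorflow not installed"

    try:
        # Accepted only while the TensorFlow runtime is still uninitialized.
        tf.config.threading.set_inter_op_parallelism_threads(first)
    except RuntimeError as err:
        # Something in this process already initialized TensorFlow.
        return str(err)

    # Creating any tensor initializes the eager context.
    tf.constant(0)

    try:
        # A different value after initialization is rejected.
        tf.config.threading.set_inter_op_parallelism_threads(second)
    except RuntimeError as err:
        return str(err)
    return "no error raised"


if __name__ == "__main__":
    print(show_threading_order_matters())
```

If the same rule applies here, something ran TensorFlow ops in the worker process before `Worker.setup()` reached `set_inter_op_parallelism_threads`. That is one plausible reading of the traceback, not something confirmed in this thread.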

sgwhat commented 1 year ago

Hi @fatenlouati, I tried to reproduce the error but was not successful: a TensorFlow Estimator application with the Ray backend ran fine for me on Databricks. Could you please share your Databricks cluster configuration and some sample code? That would help us analyze and resolve your issue 😄.

fatenlouati commented 1 year ago

Thank you @sgwhat, this is my cluster configuration. As for the code, I train my model (RL) over multiple iterations. image

image

When I run with backend="spark", it stops after some iterations with this error: org.apache.spark.scheduler.BarrierJobSlotsNumberCheckFailed: [SPARK-24819]: Barrier execution mode does not allow run a barrier stage that requires more slots than the total number of slots in the cluster currently. Please init a new cluster with more resources(e.g. CPU, GPU) or repartition the input RDD(s) to reduce the number of slots required to run this barrier stage. What could be the problem?

fatenlouati commented 1 year ago

Maybe it is because I am using the 14-day free trial and there are some limits?

sgwhat commented 1 year ago

Hi @fatenlouati ,

I believe this error is caused by either cluster resource limits or non-uniform input data, and it may be related to the free trial. Could you share a portion of your code to help us pinpoint the root cause?

By the way, this error may also be related to your configuration. You can follow our configuration guide at https://bigdl.readthedocs.io/en/latest/doc/UserGuide/databricks.html#set-spark-configuration and restart the cluster, and also refer to this known issue to resolve the error you hit with the Ray estimator.
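
To make the barrier-mode message concrete, here is a rough sketch of the slot arithmetic behind it. It assumes the usual Spark rule that the number of concurrent task slots is executors × (cores per executor ÷ `spark.task.cpus`), and that a barrier stage needs one slot per partition at the same time. The function names and the cluster sizes are hypothetical, not taken from your cluster.

```python
def barrier_slots(num_executors, cores_per_executor, task_cpus=1):
    """Concurrent task slots available across the cluster."""
    return num_executors * (cores_per_executor // task_cpus)


def barrier_stage_fits(num_partitions, num_executors, cores_per_executor, task_cpus=1):
    """A barrier stage launches all of its tasks at once, so every partition needs a slot."""
    return num_partitions <= barrier_slots(num_executors, cores_per_executor, task_cpus)


if __name__ == "__main__":
    # Hypothetical cluster: 4 executors with 4 cores each.
    print(barrier_slots(4, 4))
    print(barrier_stage_fits(20, 4, 4))
    print(barrier_stage_fits(16, 4, 4))
```

If the check fails for your numbers, the two options named in the error message apply: add resources, or repartition the input so the barrier stage needs fewer slots.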