richardliaw closed this issue 4 years ago.
Hey look, the entire cluster died:
2020-03-28 11:49:19,414 INFO updater.py:264 -- NodeUpdater: i-0bff9d862ad11f9be: Running python ~/cifar_pytorch_example.py --use-gpu --num-epochs 20 --num-workers 4 --address='auto' on 34.228.215.12...
2020-03-28 18:49:21,144 WARNING worker.py:792 -- When connecting to an existing cluster, _internal_config must match the cluster's _internal_config.
E0328 18:49:21.674247 81818 raylet_client.cc:74] Retrying to connect to socket for pathname /tmp/ray/session_2020-03-28_18-07-49_270759_76175/sockets/raylet (num_attempts = 1, num_retries = 5)
E0328 18:49:22.174716 81818 raylet_client.cc:74] Retrying to connect to socket for pathname /tmp/ray/session_2020-03-28_18-07-49_270759_76175/sockets/raylet (num_attempts = 2, num_retries = 5)
E0328 18:49:22.674959 81818 raylet_client.cc:74] Retrying to connect to socket for pathname /tmp/ray/session_2020-03-28_18-07-49_270759_76175/sockets/raylet (num_attempts = 3, num_retries = 5)
E0328 18:49:23.175170 81818 raylet_client.cc:74] Retrying to connect to socket for pathname /tmp/ray/session_2020-03-28_18-07-49_270759_76175/sockets/raylet (num_attempts = 4, num_retries = 5)
F0328 18:49:23.675355 81818 raylet_client.cc:83] Could not connect to socket /tmp/ray/session_2020-03-28_18-07-49_270759_76175/sockets/raylet
*** Check failure stack trace: ***
@ 0x7fbc5da7753d google::LogMessage::Fail()
@ 0x7fbc5da789ac google::LogMessage::SendToLog()
@ 0x7fbc5da77219 google::LogMessage::Flush()
@ 0x7fbc5da77431 google::LogMessage::~LogMessage()
@ 0x7fbc5d802ef9 ray::RayLog::~RayLog()
@ 0x7fbc5d6408f3 ray::raylet::RayletConnection::RayletConnection()
@ 0x7fbc5d641239 ray::raylet::RayletClient::RayletClient()
@ 0x7fbc5d5ed2ab ray::CoreWorker::CoreWorker()
@ 0x7fbc5d55e94e __pyx_tp_new_3ray_7_raylet_CoreWorker()
@ 0x560d1aac30e5 type_call
@ 0x560d1aa35bcb _PyObject_FastCallDict
@ 0x560d1aac2f4e call_function
@ 0x560d1aae794a _PyEval_EvalFrameDefault
@ 0x560d1aabc62e _PyEval_EvalCodeWithName
@ 0x560d1aabd1cf fast_function
@ 0x560d1aac2ed5 call_function
@ 0x560d1aae8715 _PyEval_EvalFrameDefault
@ 0x560d1aabc206 _PyEval_EvalCodeWithName
@ 0x560d1aabd1cf fast_function
@ 0x560d1aac2ed5 call_function
@ 0x560d1aae8715 _PyEval_EvalFrameDefault
@ 0x560d1aabdcb9 PyEval_EvalCodeEx
@ 0x560d1aabea4c PyEval_EvalCode
@ 0x560d1ab3ac44 run_mod
@ 0x560d1ab3b041 PyRun_FileExFlags
@ 0x560d1ab3b244 PyRun_SimpleFileExFlags
@ 0x560d1ab3ed24 Py_Main
@ 0x560d1aa0675e main
@ 0x7fbcbe51fb97 __libc_start_main
@ 0x560d1aaee47b (unknown)
Shared connection to 34.228.215.12 closed.
Error: Command failed:
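For context, the fatal line means the driver's core worker could not attach to the raylet's Unix domain socket for that session (the stack trace is the CoreWorker constructor failing inside RayletConnection). Below is a minimal sketch of that attach step with an illustrative socket-path check; the `session_latest` path and the check itself are assumptions for illustration, not part of the original report:

```python
# Minimal sketch (assumes a Ray cluster is already running on this node and
# that the active session's sockets live under /tmp/ray/session_latest/sockets/).
import os
import ray

# Symlink Ray keeps pointing at the most recent session directory.
socket_path = "/tmp/ray/session_latest/sockets/raylet"
if not os.path.exists(socket_path):
    print("raylet socket is missing -- the local raylet has likely died or the session was restarted")

# This is the call that fails in the log above: attaching to the existing
# cluster makes the core worker connect to the raylet socket, retrying a few
# times before aborting with "Could not connect to socket ...".
ray.init(address="auto")
```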
Should this be a release blocker?
Same here. Did you find a way around this?
I think this should be resolved in ray==1.0
What is the problem?
Python 3.7, PyTorch 1.4, Linux 18
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
The script is here:
https://github.com/richardliaw/ray/blob/improve_tune_trainable/python/ray/util/sgd/torch/examples/cifar_pytorch_example.py
Submitted to the cluster with:
ray submit [cluster] cifar_pytorch_example.py --args="--use-gpu --tune --num-epochs 20 --num-workers 4 --address='auto'"
Cluster yaml:
If we cannot run your script, we cannot fix your issue.
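Since the linked example pulls in torch/torchvision, a dependency-free stand-in that exercises the same connection path (attach to the running cluster with address='auto', then run a trivial remote task) could look like the hypothetical sketch below; it is an illustration, not the script from this issue:

```python
# Hypothetical minimal reproduction: connect to an existing cluster the same
# way the example does (address='auto') and run a trivial remote task.
# If the raylet socket is gone, ray.init() fails exactly as in the log above.
import ray

ray.init(address="auto")

@ray.remote
def ping():
    return "ok"

print(ray.get(ping.remote()))
ray.shutdown()
```

Submitting a file like this with the same `ray submit` invocation (minus the training flags) would presumably surface the same raylet connection failure whenever the cluster is in the state shown in the log above.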