movy opened this issue 11 months ago
@iycheng do you have any ideas about the Core failure?
As a temporary solution I wrapped .fit() with try...except, and at least I can terminate such problematic trials:
try:
    results = tune.Tuner(
        tune.with_resources(backtest_rungs, resources),  # per-trial resources, defined elsewhere like hyperparams
        param_space=hyperparams,
        tune_config=tune_config,
        run_config=run_config,
    ).fit()
except Exception as e:
    print("❌ Exception in worker:", e)
    train.report({})
    ray.shutdown()
    os.kill(os.getpid(), signal.SIGTERM)
So far it has been running for nearly 48 hours on 4 nodes and I did not see a single hang, so I assume such termination works as expected.
It looks like there are actually two issues here: I can look into issue 1, but I don't know about issue 2. @matthewdeng can you find someone for issue 2?
By the way, @movy you can pass the environment variable RAY_TASK_MAX_RETRIES=0 when starting your driver to override the default number of retries. The only problem is that this will not work if libraries override the number of retries (which Tune may do internally).
It might also be helpful to find out what tasks are failing. Ray's state API might help with that. Here's the one for tasks (actors should already be in the dashboard).
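For example, something like this should list the failed tasks from Python (a rough sketch based on my reading of the state API docs; the exact fields available may differ by Ray version):

from ray.util.state import list_tasks

# List tasks that have ended up in the FAILED state; each entry carries
# the task name, its state, and (when available) the error type.
failed = list_tasks(filters=[("state", "=", "FAILED")], limit=100)
for task in failed:
    print(task.name, task.state, task.error_type)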
@movy It's a little hard to pinpoint the issue right now without a reliable repro -- we have this release test that runs 10k short-lived trials and has been pretty stable: https://github.com/ray-project/ray/blob/master/release/tune_tests/scalability_tests/workloads/test_bookkeeping_overhead.py#L5-L6
When the next hang happens, would it be possible for you to try figuring out where in the driver code things are hanging? The Ray Dashboard would be useful to determine whether the actor is failing to be terminated, or if it's just the driver script getting stuck at a certain place.
Once you encounter the hang, I can also help to debug more via Slack or a call.
Also, have you tried upgrading to the latest version of Ray? In particular, Ray 2.7+ introduced some major internal refactors.
ray stack is also useful for pinpointing where in Python/C++ the driver and workers are hanging.
Thanks for all your input. I tried RAY_TASK_MAX_RETRIES=0 -- it did not help.
@stephanie-wang how do I use ray stack? I have a hung Ray process with the message
[2023-12-14 08:56:35,320 E 3620529 3620803] core_worker.cc:593: :info_message: Attempting to recover 6 lost objects by resubmitting their tasks. To disable object reconstruction, set @ray.remote(max_retries=0).
but ray stack shows:
Stack dump for ubuntu 3756186 1.8 1.0 22599752 709792 pts/51 SNl+ 08:52 3:09 ray::IDLE
Process 3756186: ray::IDLE
Python v3.11.6 (/usr/bin/python3.11)
Error: Failed to merge native and python frames (Have 1 native and 2 python)
Stack dump for ubuntu 3779883 1.4 0.1 21825680 118900 pts/51 SNl+ 08:54 2:24 ray::IDLE
Process 3779883: ray::IDLE
Python v3.11.6 (/usr/bin/python3.11)
Error: Failed to merge native and python frames (Have 1 native and 2 python)
The core failure is fixed. Removing a release blocker. Seems like there's still discussion going on, so I will keep the issue open (@stephanie-wang let us know the priority of the other issue!)
@movy sorry for the unrelated message (and apologies to @rkooo567 and @anyscalesam), but could you draft some examples of how backtesters like bt or backtesting.py can work with Ray Tune? There is a lack of guides on how to tune strategies (parameters, or worse, tree-based strategies) to optimize returns.
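Something along these lines is roughly what I have in mind (only a sketch; the SmaCross strategy and GOOG sample data come from backtesting.py's quickstart, and the metric and parameter ranges are arbitrary):

from backtesting import Backtest, Strategy
from backtesting.lib import crossover
from backtesting.test import GOOG, SMA  # sample data and indicator helper shipped with backtesting.py
from ray import train, tune

class SmaCross(Strategy):
    # strategy parameters exposed as class attributes so Backtest.run() can override them
    n1 = 10
    n2 = 30

    def init(self):
        self.sma1 = self.I(SMA, self.data.Close, self.n1)
        self.sma2 = self.I(SMA, self.data.Close, self.n2)

    def next(self):
        if crossover(self.sma1, self.sma2):
            self.buy()
        elif crossover(self.sma2, self.sma1):
            self.sell()

def trainable(config):
    # one trial == one backtest with the sampled parameters
    bt = Backtest(GOOG, SmaCross, cash=10_000, commission=0.002)
    stats = bt.run(n1=config["n1"], n2=config["n2"])
    train.report({"return_pct": stats["Return [%]"]})

tuner = tune.Tuner(
    trainable,
    param_space={"n1": tune.randint(5, 50), "n2": tune.randint(20, 200)},
    tune_config=tune.TuneConfig(metric="return_pct", mode="max", num_samples=100),
)
results = tuner.fit()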
What happened + What you expected to happen
I run a simple hyperparameter optimisation (backtesting) using Ray Tune and Optuna. Tasks are very short (1-3 seconds long, 1-5 trials per iteration with early termination, 300-1500 iterations per backtest). The simplified code is attached below, but the problem is not with the code but with how Ray handles its own internal errors. Everything was working very stably until about Ray v2.4, and since then I have observed this problematic behaviour on every version, including the current nightly wheels. I used to stay on v2.3 because of this, but with the recent upgrade to Python 3.11 I cannot do that any longer, hence this report.
During the trials, occasionally (roughly once in 10k trials) a raylet crashes with an unexplained error related to Ray's serialization mechanisms, but the other raylets continue and all trials eventually end, moving on to the next backtest.
Once in a while, however, the main Ray process gets stuck with the following error, i.e. instead of exiting, Ray hangs until I notice it and press Ctrl-C, which blocks the whole pipeline. I found no setting to let Ray simply quit in such a case.
In the error message below Ray offers to set max_retries=0 to avoid resubmitting, but I could not find a way to pass this parameter to a worker defined as a function instead of a class (i.e. I cannot use a @ray.remote() decorator). As a last resort, I tried modifying the Ray source code directly to change DEFAULT_TASK_MAX_RETRIES to 0: https://github.com/ray-project/ray/blob/27a88bad7e827d86399b36c3e35f3f2a21a7fa77/python/ray/_private/ray_constants.py#L425 yet Ray insists on resubmitting the tasks and hangs with the same error.
I guess my main question is not how to avoid this error but rather how to recover from it and move on, or at least exit the main process, even if that means sacrificing some trials.
Ray is run on powerful bare-metal servers without Docker, i.e. there is enough RAM/CPU/disk space without external constraints. I run the code with the same Ray/Python versions on multiple machines, either via separate Ray instances or in cluster mode, and can still encounter this problematic behaviour regardless of execution mode, i.e. networking problems are out of the question as well.
Versions / Dependencies
Since I've been experiencing this behaviour with the current 2.8.0 release (and all releases after 2.3.0), I tried installing Ray from the nightly wheels via
pip install -U "ray @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp311-cp311-manylinux2014_x86_64.whl"
The result is the same unfortunately.
✗ ray --version
ray, version 3.0.0.dev0
✗ python --version
Python 3.11.6
✗ uname -a
Linux backtest 6.5.0-1008-oem #8-Ubuntu SMP PREEMPT_DYNAMIC Fri Nov 10 13:08:33 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Reproduction script
(This is not a full reproduction, unfortunately, as it calls external storage to fetch candles and runs a backtest function. Once again, it works fine 99% of the time, but when it hangs, I cannot find a way to recover from the hang.)
Please note that I've disabled checkpoints, but with checkpoints the behaviour was the same.
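For reference, the overall shape of the run is roughly the following (a heavily stripped-down skeleton; the backtest body, data access, and hyperparameter names are placeholders, not the real code):

from ray import train, tune
from ray.tune.search.optuna import OptunaSearch

def backtest(config):
    # placeholder for the real trial body: fetch candles from external
    # storage, run the strategy with the sampled parameters, and score
    # the resulting equity curve
    score = config["fast"] / config["slow"]
    train.report({"score": score})

tuner = tune.Tuner(
    backtest,
    param_space={"fast": tune.randint(2, 50), "slow": tune.randint(10, 200)},
    tune_config=tune.TuneConfig(
        metric="score",
        mode="max",
        search_alg=OptunaSearch(),
        num_samples=300,          # one such Tuner.fit() call per backtest, repeated many times
        max_concurrent_trials=5,
    ),
)
results = tuner.fit()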
Issue Severity
High: It blocks me from completing my task.