Open tianlinzx opened 2 years ago
Here is the source code: qlib_ray_gat_158_tune_v2.py.zip.
Hey @tianlinzx do you mind also sharing your cluster configuration or the cluster yaml that you used? Thanks!
And is there anything else shown in the stdout when tune hangs?
@amogkam We just used the default Helm chart https://github.com/ray-project/ray/blob/master/deploy/charts/ray/Chart.yaml and made some changes to the default values https://github.com/ray-project/ray/blob/master/deploy/charts/ray/values.yaml.
There is no other error message; that's why it confused me.
BTW, I observed that the job hangs but the process is still running. Is there any mechanism to force-kill the process when a job hangs or times out?
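For reference, Tune can enforce time limits at two levels: a per-trial stop criterion and an overall time budget. The sketch below uses placeholder values and a placeholder trainable, not the attached script; note that stop criteria are only evaluated when a trial reports a result, so a trial that hangs before reporting will not be caught by them.

```python
from ray import tune

# Hypothetical trainable standing in for the user's own train_model function.
def train_model(config):
    tune.report(loss=config["lr"])  # placeholder objective

analysis = tune.run(
    train_model,
    config={"lr": tune.loguniform(1e-6, 1e-3)},
    num_samples=200,
    stop={"time_total_s": 3600},  # stop a trial once it has reported >= 1 hour of runtime
    time_budget_s=6 * 3600,       # global time budget for the whole experiment
)
```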
@amogkam Here is the configuration file: cluster.txt.
Hey @tianlinzx thanks for providing all the info! I tried out your code and was able to reproduce the hanging behavior after 10 successful trials:
Looking at the call stack, it seems that the trials are in fact hanging, and in particular on termination of joblib multiprocessing during dataset creation:
Thread 0x7F3D79E3B700 (active): "Thread-3"
poll (multiprocessing/popen_fork.py:28)
wait (multiprocessing/popen_fork.py:48)
join (multiprocessing/process.py:140)
_terminate_pool (multiprocessing/pool.py:617)
__call__ (multiprocessing/util.py:201)
terminate (multiprocessing/pool.py:548)
terminate (joblib/pool.py:329)
terminate (joblib/_parallel_backends.py:243)
terminate (joblib/_parallel_backends.py:476)
_terminate_backend (joblib/parallel.py:759)
__call__ (joblib/parallel.py:1066)
dataset_processor (qlib/data/data.py:526)
dataset (qlib/data/data.py:787)
features (qlib/data/data.py:1051)
load_group_df (qlib/data/dataset/loader.py:212)
<dictcomp> (qlib/data/dataset/loader.py:137)
load (qlib/data/dataset/loader.py:137)
setup_data (qlib/data/dataset/handler.py:141)
setup_data (qlib/data/dataset/handler.py:570)
__init__ (qlib/data/dataset/handler.py:97)
__init__ (qlib/data/dataset/handler.py:434)
__init__ (qlib/contrib/data/handler.py:178)
init_instance_by_config (qlib/utils/__init__.py:336)
__init__ (qlib/data/dataset/__init__.py:115)
__init__ (qlib/data/dataset/__init__.py:565)
init_instance_by_config (qlib/utils/__init__.py:336)
__init__ (tune_job.py:44)
train_model (tune_job.py:171)
_trainable_func (ray/tune/function_runner.py:639)
_resume_span (ray/util/tracing/tracing_helper.py:462)
entrypoint (ray/tune/function_runner.py:352)
run (ray/tune/function_runner.py:277)
_bootstrap_inner (threading.py:926)
_bootstrap (threading.py:890)
It seems like the `qlib` library is using `joblib.Parallel` during `dataset_processor` (https://github.com/microsoft/qlib/blob/v0.8.4/qlib/data/data.py#L526). `qlib` also sets the `joblib` backend to `multiprocessing` (https://github.com/microsoft/qlib/blob/v0.8.4/qlib/config.py#L123), resulting in `multiprocessing` being used within each Ray actor.
In general, it's not best practice to mix `multiprocessing` with Ray (https://discuss.ray.io/t/best-solution-to-have-multiprocess-working-in-actor/2165), so I would recommend seeing if there's a way in `qlib` to disable multiprocessing, or to use the `loky` backend (joblib's default) for `joblib.Parallel` instead of the `multiprocessing` backend.
Another option is to use the Ray backend for `joblib` (more info here: https://docs.ray.io/en/latest/ray-more-libs/joblib.html), which fits more seamlessly with Tune; see the sketch below.
@amogkam Thanks for your reply. I will try to modify the qlib library later. Temporarily closing this issue.
@amogkam Hi Amog, I have modified my code but the job hangs again after 8 rounds of Tune trials. My colleague will upload the latest source code and screen dump later on. Would you please have a look?
@amogkam Hi Amog. As you suggested, we first tried to use the loky backend for qlib in Ray Tune, but we soon abandoned this solution due to a serious memory leak. Then we switched to the Ray backend for joblib with Tune, but the jobs still hang after 7 successful trials (please ignore the errored one). Would you please take a look at our code to see if there's anything going wrong? And could you please share where and how we should inspect the whole call stack? Here are our code and trial results: qlib_ray_gat_158_tune.zip. Thanks in advance!
Hey @tianlinzx @1059692261, while Tune is hanging, can you run `ray stack` on each of the nodes and share the output?
@amogkam Here is the result of `ray stack`. The reproduction code is qlib_ray_gat_158_tune.zip.
Hey @tianlinzx - from the second screenshot it looks like joblib is still using the multiprocessing backend. It seems like it's being hardcoded by qlib here: https://github.com/microsoft/qlib/blob/v0.8.4/qlib/config.py#L123.
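If that hardcoded setting is the problem, one possible workaround, untested here and dependent on qlib v0.8.4 internals, is to override the config value after `qlib.init` inside the trainable. The `joblib_backend` key name, the item-assignment style, and the data path below are assumptions based on the linked config line, not confirmed API:

```python
import qlib
from qlib.config import C  # qlib's global config object

# Placeholder provider_uri; substitute your own data path.
qlib.init(provider_uri="~/.qlib/qlib_data/cn_data")

# Assumption: "joblib_backend" is the key hardcoded to "multiprocessing" at the linked
# line and C supports item assignment. Switching it to "loky" (joblib's default) avoids
# nesting raw multiprocessing inside Ray actors.
C["joblib_backend"] = "loky"
```

Since each Tune trial runs in its own Ray worker process, any such override would need to happen inside the trainable, after qlib has been initialized in that process.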
What happened + What you expected to happen
I use Tune to optimize hyperparameters, num_samples is set to 200, and it hangs after 8 trials. I have tried this several times.
```
(run pid=1032772) Number of trials: 42/200 (30 PENDING, 4 RUNNING, 8 TERMINATED)
(run pid=1032772) +-------------------------+------------+--------------------+-----------+---------------+--------+--------------+--------+------------------+
(run pid=1032772) | Trial name              | status     | loc                |   dropout |   hidden_size |     lr |   num_layers |   iter |   total time (s) |
(run pid=1032772) |-------------------------+------------+--------------------+-----------+---------------+--------+--------------+--------+------------------|
(run pid=1032772) | train_model_6ba95_00001 | RUNNING    | 10.244.1.242:59816 |       0.5 |           100 | 0.0001 |            1 |        |                  |
(run pid=1032772) | train_model_6ba95_00002 | RUNNING    | 10.244.1.241:49849 |       0.8 |           200 |  1e-05 |            4 |        |                  |
(run pid=1032772) | train_model_6ba95_00008 | RUNNING    | 10.244.1.241:68242 |       0.6 |            10 |  1e-05 |            3 |        |                  |
(run pid=1032772) | train_model_6ba95_00011 | RUNNING    | 10.244.1.242:90467 |       0.5 |           200 |  0.001 |            1 |        |                  |
(run pid=1032772) | train_model_6ba95_00012 | PENDING    |                    |       0.5 |           500 |  1e-06 |            3 |        |                  |
(run pid=1032772) | train_model_6ba95_00013 | PENDING    |                    |       0.6 |           300 |  1e-06 |            4 |        |                  |
(run pid=1032772) | train_model_6ba95_00014 | PENDING    |                    |       0.6 |           500 | 0.0001 |            2 |        |                  |
(run pid=1032772) | train_model_6ba95_00015 | PENDING    |                    |       0.5 |           100 |  0.001 |            1 |        |                  |
(run pid=1032772) | train_model_6ba95_00016 | PENDING    |                    |       0.7 |            10 |  1e-06 |            3 |        |                  |
(run pid=1032772) | train_model_6ba95_00017 | PENDING    |                    |       0.5 |           300 | 0.0001 |            4 |        |                  |
(run pid=1032772) | train_model_6ba95_00018 | PENDING    |                    |       0.8 |            10 |  1e-06 |            1 |        |                  |
(run pid=1032772) | train_model_6ba95_00019 | PENDING    |                    |       0.6 |           500 |  1e-05 |            1 |        |                  |
(run pid=1032772) | train_model_6ba95_00000 | TERMINATED | 10.244.1.242:59782 |       0.5 |           500 |  1e-05 |            4 |      1 |          834.522 |
(run pid=1032772) | train_model_6ba95_00003 | TERMINATED | 10.244.1.241:49850 |       0.4 |           100 | 0.0001 |            2 |      1 |          496.198 |
(run pid=1032772) | train_model_6ba95_00004 | TERMINATED | 10.244.1.241:56024 |       0.8 |           100 | 0.0001 |            3 |      1 |          432.626 |
(run pid=1032772) | train_model_6ba95_00005 | TERMINATED | 10.244.1.242:66059 |       0.4 |            10 |  1e-06 |            4 |      1 |          432.914 |
(run pid=1032772) | train_model_6ba95_00006 | TERMINATED | 10.244.1.241:62137 |       0.6 |           500 |  0.001 |            2 |      1 |          544.445 |
(run pid=1032772) | train_model_6ba95_00007 | TERMINATED | 10.244.1.242:72172 |       0.6 |           100 |  0.001 |            4 |      1 |          456.016 |
(run pid=1032772) | train_model_6ba95_00009 | TERMINATED | 10.244.1.242:78266 |       0.8 |           100 | 0.0001 |            1 |      1 |          381.592 |
(run pid=1032772) | train_model_6ba95_00010 | TERMINATED | 10.244.1.242:84370 |       0.6 |            10 | 0.0001 |            3 |      1 |          399.034 |
(run pid=1032772) +-------------------------+------------+--------------------+-----------+---------------+--------+--------------+--------+------------------+
(run pid=1032772) ... 22 more trials not shown (22 PENDING)
```
Versions / Dependencies
Ray 1.11.0 or Ray 1.12.0RC1, pyqlib
Reproduction script