ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

ray.tune.error.TuneError: ('Trials did not complete', [A3C_A3C_four_way_train-v0_ff741_00000]) #46501

Open SExpert12 opened 3 weeks ago

SExpert12 commented 3 weeks ago

What happened + What you expected to happen

I am trying to run the A3C algorithm, but I get this error:

2024-07-09 09:46:19.039023: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/ryzen/miniconda3/envs/macad-gym-benchmarking/lib/python3.7/site-packages/cv2/../../lib64:
2024-07-09 09:46:19.039133: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/ryzen/miniconda3/envs/macad-gym-benchmarking/lib/python3.7/site-packages/cv2/../../lib64:
2024-07-09 09:46:19.039144: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
/home/ryzen/carla_out
--------------------------------------------
2024-07-09 09:46:19,916 INFO resource_spec.py:231 -- Starting Ray with 16.26 GiB memory available for workers and up to 8.15 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2024-07-09 09:46:20,399 INFO services.py:1193 -- View the Ray dashboard at localhost:8265
2024-07-09 09:46:21,525 WARNING sample.py:29 -- DeprecationWarning: wrapping <function at 0x7f9bffafebf8> with tune.function() is no longer needed
== Status ==
Memory usage on this node: 5.3/31.2 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 3/16 CPUs, 0/1 GPUs, 0.0/16.26 GiB heap, 0.0/5.62 GiB objects (0/1.0 GPUType:G)
Result logdir: /home/ray_results/A3C_Four_way
Number of trials: 1 (1 RUNNING)
+---------------------------------------+----------+-------+
| Trial name                            | status   | loc   |
|---------------------------------------+----------+-------|
| A3C_A3C_four_way_train-v0_ff741_00000 | RUNNING  |       |
+---------------------------------------+----------+-------+

2024-07-09 09:46:22,997 ERROR trial_runner.py:523 -- Trial A3C_A3C_four_way_train-v0_ff741_00000: Error processing event.
Traceback (most recent call last):
  File "/home/miniconda3/envs/macad-gym-benchmarking/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 471, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/home/miniconda3/envs/macad-gym-benchmarking/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 430, in fetch_result
    result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
  File "/home/miniconda3/envs/macad-gym-benchmarking/lib/python3.7/site-packages/ray/worker.py", line 1538, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ModuleNotFoundError): ray::A3C.train() (pid=10786, ip=192.168.15.93)
  File "python/ray/_raylet.pyx", line 439, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 474, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 478, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 479, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
  File "/home/miniconda3/envs/macad-gym-benchmarking/lib/python3.7/site-packages/ray/rllib/agents/trainer_template.py", line 88, in __init__
    Trainer.__init__(self, config, env, logger_creator)
  File "/home/miniconda3/envs/macad-gym-benchmarking/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 479, in __init__
    super().__init__(config, logger_creator)
  File "/home/miniconda3/envs/macad-gym-benchmarking/lib/python3.7/site-packages/ray/tune/trainable.py", line 245, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/miniconda3/envs/macad-gym-benchmarking/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 564, in setup
    self.env_creator = _global_registry.get(ENV_CREATOR, env)
  File "/home/miniconda3/envs/macad-gym-benchmarking/lib/python3.7/site-packages/ray/tune/registry.py", line 130, in get
    return pickle.loads(value)
ModuleNotFoundError: No module named 'macad_agents'
== Status ==
Memory usage on this node: 5.4/31.2 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 0/16 CPUs, 0/1 GPUs, 0.0/16.26 GiB heap, 0.0/5.62 GiB objects (0/1.0 GPUType:G)
Result logdir: /home/ryzen/ray_results/A3C_Four_way
Number of trials: 1 (1 ERROR)
+---------------------------------------+----------+-------+
| Trial name                            | status   | loc   |
|---------------------------------------+----------+-------|
| A3C_A3C_four_way_train-v0_ff741_00000 | ERROR    |       |
+---------------------------------------+----------+-------+
Number of errored trials: 1
+---------------------------------------+--------------+--------------------------------------------------------------------------------------------------------+
| Trial name                            |   # failures | error file                                                                                             |
|---------------------------------------+--------------+--------------------------------------------------------------------------------------------------------|
| A3C_A3C_four_way_train-v0_ff741_00000 |            1 | /home/ryzen/ray_results/A3C_Four_way/A3C_A3C_four_way_train-v0_0_2024-07-09_09-46-2145o3u74m/error.txt |
+---------------------------------------+--------------+--------------------------------------------------------------------------------------------------------+

Traceback (most recent call last):
  File "Train_A3C_four_way.py", line 440, in <module>
    checkpoint_at_end = True,
  File "/home/miniconda3/envs/macad-gym-benchmarking/lib/python3.7/site-packages/ray/tune/tune.py", line 356, in run
    raise TuneError("Trials did not complete", incomplete_trials)
ray.tune.error.TuneError: ('Trials did not complete', [A3C_A3C_four_way_train-v0_ff741_00000])
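
Note on the traceback: the failure comes from `pickle.loads` inside the worker process, which suggests the registered environment creator refers to a `macad_agents` module that the worker cannot import. The sketch below is one possible way to check this before calling `tune.run`. It is not taken from the linked repository: the helper names and the directory you pass in are hypothetical, and it assumes a single-node run in which the Ray worker processes are started from the training script and inherit its environment.

```python
import importlib.util
import os


def module_is_importable(name: str) -> bool:
    """Return True if the top-level module `name` can be found on sys.path."""
    return importlib.util.find_spec(name) is not None


def prepend_to_pythonpath(directory: str, env=None) -> str:
    """Put `directory` first in PYTHONPATH (in `env`, default os.environ).

    Processes started afterwards inherit the updated variable. The directory
    is whatever folder contains the package; it is an assumption, not a path
    taken from the issue.
    """
    env = os.environ if env is None else env
    existing = env.get("PYTHONPATH", "")
    # Keep existing entries, dropping empties and any duplicate of `directory`.
    others = [p for p in existing.split(os.pathsep) if p and p != directory]
    env["PYTHONPATH"] = os.pathsep.join([directory] + others)
    return env["PYTHONPATH"]


if __name__ == "__main__":
    # Run this in the same environment as the training script.
    print("macad_agents importable:", module_is_importable("macad_agents"))
```

If the check prints `False`, installing the package into the active conda environment, or calling `prepend_to_pythonpath` with the folder that contains `macad_agents` before `ray.init()`, may be enough for the workers to unpickle the environment creator.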

Versions / Dependencies

Ray version: 0.8.7
Python: 3.7.1
OS: Ubuntu

Reproduction script

Run the A3C algorithm from this repo: https://github.com/T3AS/Benchmarking-QRS-2022/tree/master

Issue Severity

High: It blocks me from completing my task.

anyscalesam commented 2 weeks ago

@SExpert12 please upgrade to the latest Ray version and see if you can reproduce the issue. Ray 0.8.7 is nearly 4 years old at this point.

SExpert12 commented 2 weeks ago

Thanks.