System information
Describe the problem
I am trying to set up a cluster manually from machines identified by their IP addresses. However, when I run the A2C algorithm on the cluster, one of the workers fails with ModuleNotFoundError: No module named 'main'. Here, main is the module that contains my custom gym environment. It looks like Ray is not able to sync the files between the nodes. Here is the complete traceback; owl is my worker hostname.
RUNNING trials: Error processing event.
Traceback (most recent call last):
  File "/home/mayankp/ray/python/ray/tune/trial_runner.py", line 261, in _process_events
    result = self.trial_executor.fetch_result(trial)
  File "/home/mayankp/ray/python/ray/tune/ray_trial_executor.py", line 211, in fetch_result
    result = ray.get(trial_future[0])
  File "/home/mayankp/ray/python/ray/worker.py", line 2392, in get
    raise value
ray.worker.RayTaskError: ray_A2CAgent:train() (pid=17551, host=wsl)
  File "/home/mayankp/ray/python/ray/rllib/agents/agent.py", line 278, in train
    result = Trainable.train(self)
  File "/home/mayankp/ray/python/ray/tune/trainable.py", line 146, in train
    result = self._train()
  File "/home/mayankp/ray/python/ray/rllib/agents/a3c/a3c.py", line 68, in _train
    self.optimizer.step()
  File "/home/mayankp/ray/python/ray/rllib/optimizers/sync_samples_optimizer.py", line 48, in step
    e.sample.remote() for e in self.remote_evaluators
ray.worker.RayTaskError: ray_PolicyEvaluator:sample() (pid=16361, host=owl)
  File "/home/mayank/ray/python/ray/utils.py", line 404, in _wrapper
    return orig_attr(*args, **kwargs)
  File "pyarrow/_plasma.pyx", line 556, in pyarrow._plasma.PlasmaClient.get
  File "pyarrow/serialization.pxi", line 448, in pyarrow.lib.deserialize
  File "pyarrow/serialization.pxi", line 411, in pyarrow.lib.deserialize_from
  File "pyarrow/serialization.pxi", line 262, in pyarrow.lib.SerializedPyObject.deserialize
  File "pyarrow/serialization.pxi", line 171, in pyarrow.lib.SerializationContext._deserialize_callback
ModuleNotFoundError: No module named 'main'
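The traceback ends inside a deserialization callback, which is consistent with the worker receiving an object whose class lives in a module it cannot import. The following is a minimal sketch of that failure mode using only the standard-library pickle module, not Ray's own serializer. The module name custom_env_demo and the class DemoEnv are hypothetical stand-ins for main and the custom environment.

```python
import importlib
import os
import pickle
import sys
import tempfile


def demo_missing_module():
    """Pickle an object where its module exists, unpickle where it does not."""
    module_name = "custom_env_demo"  # hypothetical stand-in for the 'main' module
    with tempfile.TemporaryDirectory() as tmp:
        # Write a tiny module that defines a class, as an environment module would.
        with open(os.path.join(tmp, module_name + ".py"), "w") as f:
            f.write("class DemoEnv:\n    pass\n")

        # "Head node": the module is importable, so pickling succeeds.
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        module = importlib.import_module(module_name)
        payload = pickle.dumps(module.DemoEnv())

        # "Worker node": the module is neither loaded nor on the import path.
        sys.path.remove(tmp)
        del sys.modules[module_name]
        importlib.invalidate_caches()

    try:
        pickle.loads(payload)
    except ModuleNotFoundError as exc:
        return str(exc)
    return None


result = demo_missing_module()
print(result)
```

The pickled payload stores only a reference to the class (module name plus class name), so the receiving side has to be able to import the module itself.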
Source code / logs
Here is the algorithm configuration file I used.
local-view-multiple-lanes-improved:
    env: tsim-v0
    run: A2C
    checkpoint_freq: 500
    config:
        sample_batch_size: 20
        num_envs_per_worker: 2
        use_pytorch: false
        num_workers: 20
        num_gpus: 1
        lr: 0.0001
        gamma: 0.995
        grad_clip: 40.0
        horizon: 2000
        observation_filter: "NoFilter"
        callbacks:
            on_episode_end: None
        model:
            use_lstm: true
Commands Used
First, start the Ray head node:
ray start --head --redis-port=6666 --num-cpus=22 --num-gpus=1
Start Ray on the worker machine, pointing it at the Redis address of the head node:
ray start --redis-address=xxx.xxx.xxx.xxx:6666
Finally, start the A2C training:
python a2c_train.py
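To check whether a given node can locate the environment module at all, a small script like the one below could be run on each machine with the same Python interpreter and working directory that Ray uses there. This is only a sketch: the helper name find_missing_modules is hypothetical, and main is taken from the error message above.

```python
import importlib.util


def find_missing_modules(names):
    """Return the top-level module names that cannot be located on this node."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# 'main' comes from the error message; 'json' is a standard-library control
# that should always be found.
missing = find_missing_modules(["main", "json"])
print("not importable on this node:", missing)
```

If main shows up in the list on the worker but not on the head node, the module file is missing from the worker or is not on its import path.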