ModuleNotFoundError - Githubissues

mynkpl1998 commented 5 years ago

System information

OS Platform and Distribution: Ubuntu 16.04.2 LTS
Ray installed from (source or binary): Source
Ray version: 0.6
Python version: 3.6.7

Describe the problem

I am trying to set up a manual cluster of machines by IP address. However, when I tried to run the A2C algorithm on the cluster, one of the workers failed with ModuleNotFoundError: No module named 'main'. Here the main module is my custom gym environment. It looks like Ray was not able to sync the files between the nodes. Here is the complete traceback; owl is my worker hostname.

RUNNING trials: Error processing event.
Traceback (most recent call last):
  File "/home/mayankp/ray/python/ray/tune/trial_runner.py", line 261, in _process_events
    result = self.trial_executor.fetch_result(trial)
  File "/home/mayankp/ray/python/ray/tune/ray_trial_executor.py", line 211, in fetch_result
    result = ray.get(trial_future[0])
  File "/home/mayankp/ray/python/ray/worker.py", line 2392, in get
    raise value
ray.worker.RayTaskError: ray_A2CAgent:train() (pid=17551, host=wsl)
  File "/home/mayankp/ray/python/ray/rllib/agents/agent.py", line 278, in train
    result = Trainable.train(self)
  File "/home/mayankp/ray/python/ray/tune/trainable.py", line 146, in train
    result = self._train()
  File "/home/mayankp/ray/python/ray/rllib/agents/a3c/a3c.py", line 68, in _train
    self.optimizer.step()
  File "/home/mayankp/ray/python/ray/rllib/optimizers/sync_samples_optimizer.py", line 48, in step
    e.sample.remote() for e in self.remote_evaluators
ray.worker.RayTaskError: ray_PolicyEvaluator:sample() (pid=16361, host=owl)
  File "/home/mayank/ray/python/ray/utils.py", line 404, in _wrapper
    return orig_attr(*args, **kwargs)
  File "pyarrow/_plasma.pyx", line 556, in pyarrow._plasma.PlasmaClient.get
  File "pyarrow/serialization.pxi", line 448, in pyarrow.lib.deserialize
  File "pyarrow/serialization.pxi", line 411, in pyarrow.lib.deserialize_from
  File "pyarrow/serialization.pxi", line 262, in pyarrow.lib.SerializedPyObject.deserialize
  File "pyarrow/serialization.pxi", line 171, in pyarrow.lib.SerializationContext._deserialize_callback
ModuleNotFoundError: No module named 'main'
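The last frames of the traceback are in deserialization on the worker, which is consistent with (though it does not prove) an object that was pickled by reference to a module the worker cannot import. The sketch below reproduces that failure mode with only the standard library's pickle; it does not use Ray or pyarrow, and the module name demo_env_module and the class DemoEnv are hypothetical stand-ins for the custom environment.

```python
import importlib
import os
import pickle
import sys
import tempfile

# Hypothetical module standing in for the custom environment module.
MODULE_NAME = "demo_env_module"
MODULE_SOURCE = "class DemoEnv:\n    def __init__(self):\n        self.horizon = 2000\n"


def pickle_then_load_without_module():
    """Pickle an object, hide its module, and return (payload, error on load)."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, MODULE_NAME + ".py"), "w") as f:
            f.write(MODULE_SOURCE)

        # "Driver" side: the module is importable, so pickling succeeds.
        # The payload stores only the module and class names, not the class code.
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        module = importlib.import_module(MODULE_NAME)
        payload = pickle.dumps(module.DemoEnv())

        # "Worker" side: simulate a machine where the file is not on sys.path.
        sys.path.remove(tmp)
        del sys.modules[MODULE_NAME]
        importlib.invalidate_caches()
        try:
            pickle.loads(payload)
        except ModuleNotFoundError as exc:
            return payload, exc
    return payload, None


payload, error = pickle_then_load_without_module()
print(type(error).__name__, error)
```

If this is what is happening in the cluster, the usual remedy is to make the environment module importable on every node (same file, reachable from the worker's Python path), since the serialized object only carries a reference to it.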

Source code / logs

Here is the algorithm configuration file I used.

local-view-multiple-lanes-improved:
    env: tsim-v0
    run: A2C
    checkpoint_freq: 500
    config:
        sample_batch_size: 20
        num_envs_per_worker: 2
        use_pytorch: false
        num_workers: 20
        num_gpus: 1
        lr: 0.0001
        gamma: 0.995
        grad_clip: 40.0
        horizon: 2000
        observation_filter: "NoFilter"
        callbacks:
            on_episode_end: None
        model:
            use_lstm: true

Commands Used

First, start the Ray head:
    ray start --head --redis-port=6666 --num-cpus=22 --num-gpus=1
Start Ray on the worker machine with the above Redis address:
    ray start --redis-address=xxx.xxx.xxx.xxx:6666
Start A2C training:
    python a2c_train.py
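Before starting Ray on each machine, it may help to confirm that the module named in the error can actually be imported there. The following is a minimal standard-library sketch, not part of Ray; the helper names module_origin and report are made up for illustration. It assumes the check is run with the same Python interpreter, working directory, and PYTHONPATH that the Ray worker process will use.

```python
import importlib.util
import socket


def module_origin(name):
    """Return where a module would be imported from, or None if it cannot be found."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        # Raised for a missing parent package or a module with no usable spec.
        return None
    if spec is None:
        return None
    return spec.origin or "<namespace package>"


def report(name):
    """Build a one-line status message tagged with this machine's hostname."""
    origin = module_origin(name)
    status = "found at " + origin if origin else "NOT importable"
    return "{}: module '{}' {}".format(socket.gethostname(), name, status)


if __name__ == "__main__":
    # 'main' is the module named in the ModuleNotFoundError above.
    print(report("main"))
```

Running this on the head and on every worker should show the same module being found on each; a node that prints NOT importable is a candidate for the failure in the traceback.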

robertnishihara commented 5 years ago

Interesting, this potentially looks like cloudpickle is failing to serialize something.

richardliaw commented 4 years ago

closing because stale

ray-project / ray

ModuleNotFoundError #3544
