ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

ModuleNotFoundError #3544

Closed mynkpl1998 closed 4 years ago

mynkpl1998 commented 5 years ago

System information

Describe the problem

I am trying to set up a manual cluster of machines by IP address. However, when I tried to run the A2C algorithm on the cluster, I got an error from one of the workers: ModuleNotFoundError: No module named 'main'. Here the main module is my custom gym environment. It looks like Ray was not able to sync the files between the nodes. Here is the complete traceback; owl is my worker's hostname.

RUNNING trials: Error processing event.
Traceback (most recent call last):
  File "/home/mayankp/ray/python/ray/tune/trial_runner.py", line 261, in _process_events
    result = self.trial_executor.fetch_result(trial)
  File "/home/mayankp/ray/python/ray/tune/ray_trial_executor.py", line 211, in fetch_result
    result = ray.get(trial_future[0])
  File "/home/mayankp/ray/python/ray/worker.py", line 2392, in get
    raise value
ray.worker.RayTaskError: ray_A2CAgent:train() (pid=17551, host=wsl)
  File "/home/mayankp/ray/python/ray/rllib/agents/agent.py", line 278, in train
    result = Trainable.train(self)
  File "/home/mayankp/ray/python/ray/tune/trainable.py", line 146, in train
    result = self._train()
  File "/home/mayankp/ray/python/ray/rllib/agents/a3c/a3c.py", line 68, in _train
    self.optimizer.step()
  File "/home/mayankp/ray/python/ray/rllib/optimizers/sync_samples_optimizer.py", line 48, in step
    e.sample.remote() for e in self.remote_evaluators
ray.worker.RayTaskError: ray_PolicyEvaluator:sample() (pid=16361, host=owl)
  File "/home/mayank/ray/python/ray/utils.py", line 404, in _wrapper
    return orig_attr(*args, **kwargs)
  File "pyarrow/_plasma.pyx", line 556, in pyarrow._plasma.PlasmaClient.get
  File "pyarrow/serialization.pxi", line 448, in pyarrow.lib.deserialize
  File "pyarrow/serialization.pxi", line 411, in pyarrow.lib.deserialize_from
  File "pyarrow/serialization.pxi", line 262, in pyarrow.lib.SerializedPyObject.deserialize
  File "pyarrow/serialization.pxi", line 171, in pyarrow.lib.SerializationContext._deserialize_callback
ModuleNotFoundError: No module named 'main'
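The last frames of the traceback show the import failing during deserialization on host owl, so one quick check is whether the module can be imported on each node at all. The snippet below is a minimal sketch using only the standard library. The helper name missing_modules is hypothetical and not part of Ray, and the snippet assumes it is run with the same Python interpreter and working directory that the Ray workers on that node use.

```python
import importlib.util
import socket


def missing_modules(names):
    """Return the top-level module names that cannot be found on this machine."""
    missing = []
    for name in names:
        # find_spec returns None when a top-level module is not on sys.path.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # "main" is the module named in the traceback; run this on every node.
    print(socket.gethostname(), "missing:", missing_modules(["main"]))
```

If a node reports main as missing, the file defining the custom environment is probably not on that node's Python path.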

Source code / logs

Here is the algorithm configuration file I used.

local-view-multiple-lanes-improved:
    env: tsim-v0
    run: A2C
    checkpoint_freq: 500
    config:
        sample_batch_size: 20
        num_envs_per_worker: 2
        use_pytorch: false
        num_workers: 20
        num_gpus: 1
        lr: 0.0001
        gamma: 0.995
        grad_clip: 40.0
        horizon: 2000
        observation_filter: "NoFilter"
        callbacks:
            on_episode_end: None
        model:
            use_lstm: true

Commands Used

robertnishihara commented 5 years ago

Interesting, this potentially looks like cloudpickle is failing to serialize something.
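One possible reading, not confirmed in this thread: pickle stores a class defined in an importable module by reference (module name plus class name) and not by value, and cloudpickle does the same for such modules. The receiving process must then be able to import that module. The sketch below reproduces the same kind of error with the standard pickle module alone. The names demo_env_module and DemoEnv are hypothetical stand-ins for the custom environment, and removing the module from sys.path only simulates a worker node that never received the file.

```python
import importlib
import os
import pickle
import sys
import tempfile

# Hypothetical stand-in for the custom environment module; not the author's code.
MODULE_NAME = "demo_env_module"
MODULE_SOURCE = "class DemoEnv:\n    pass\n"


def roundtrip_without_module():
    """Pickle an instance, hide its module, then try to unpickle it."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, MODULE_NAME + ".py"), "w") as f:
            f.write(MODULE_SOURCE)
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            module = importlib.import_module(MODULE_NAME)
            # The payload records only "demo_env_module.DemoEnv", not the class body.
            payload = pickle.dumps(module.DemoEnv())
        finally:
            # Simulate a node that never received the file.
            sys.path.remove(tmp)
            sys.modules.pop(MODULE_NAME, None)
    importlib.invalidate_caches()
    try:
        pickle.loads(payload)
    except ModuleNotFoundError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(roundtrip_without_module())
```

If this is what happened here, the usual remedy would be to make the environment module importable on every node, for example by installing it as a package or placing it on PYTHONPATH on each machine before starting Ray there.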

richardliaw commented 4 years ago

closing because stale