Hi, could you paste the whole stack trace when the error occurs?
This is the stack trace:
2021-03-01 20:49:10,156 ERROR trial_runner.py:793 -- Trial MultiPPO_path_planning_79291_00000: Error processing event.
Traceback (most recent call last):
  File "/home/vdorbala/.conda/envs/marl/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 726, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/home/vdorbala/.conda/envs/marl/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py", line 489, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/vdorbala/.conda/envs/marl/lib/python3.8/site-packages/ray/worker.py", line 1452, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): ray::MultiPPO.train() (pid=15203, ip=10.70.51.83)
  File "python/ray/_raylet.pyx", line 443, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 477, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 481, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 482, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 436, in ray._raylet.execute_task.function_executor
  File "/home/vdorbala/.conda/envs/marl/lib/python3.8/site-packages/ray/rllib/agents/trainer_template.py", line 106, in __init__
    Trainer.__init__(self, config, env, logger_creator)
  File "/home/vdorbala/.conda/envs/marl/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 477, in __init__
    super().__init__(config, logger_creator)
  File "/home/vdorbala/.conda/envs/marl/lib/python3.8/site-packages/ray/tune/trainable.py", line 249, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/vdorbala/.conda/envs/marl/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 630, in setup
    self._init(self.config, self.env_creator)
  File "/home/vdorbala/.conda/envs/marl/lib/python3.8/site-packages/ray/rllib/agents/trainer_template.py", line 133, in _init
    self.workers = self._make_workers(
  File "/home/vdorbala/.conda/envs/marl/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 701, in _make_workers
    return WorkerSet(
  File "/home/vdorbala/.conda/envs/marl/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 87, in __init__
    self._local_worker = self._make_worker(
  File "/home/vdorbala/.conda/envs/marl/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 310, in _make_worker
    worker = cls(
  File "/home/vdorbala/.conda/envs/marl/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 456, in __init__
    self.policy_map, self.preprocessors = self._build_policy_map(
  File "/home/vdorbala/.conda/envs/marl/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1061, in _build_policy_map
    policy_map[name] = cls(obs_space, act_space, merged_conf)
  File "/home/vdorbala/.conda/envs/marl/lib/python3.8/site-packages/ray/rllib/policy/torch_policy_template.py", line 205, in __init__
    dist_class, logit_dim = ModelCatalog.get_action_dist(
  File "/home/vdorbala/.conda/envs/marl/lib/python3.8/site-packages/ray/rllib/models/catalog.py", line 146, in get_action_dist
    dist_cls = _global_registry.get(RLLIB_ACTION_DIST,
  File "/home/vdorbala/.conda/envs/marl/lib/python3.8/site-packages/ray/tune/registry.py", line 140, in get
    return pickle.loads(value)
  File "/home/vdorbala/.conda/envs/marl/lib/python3.8/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 448, in _numpy_frombuffer
    array.setflags(write=isinstance(buffer, bytearray) or not buffer.readonly)
AttributeError: 'bytes' object has no attribute 'readonly'
== Status ==
Memory usage on this node: 17.4/31.1 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/32 CPUs, 0/2 GPUs, 0.0/10.55 GiB heap, 0.0/3.61 GiB objects (0/1.0 accelerator_type:RTX)
Result logdir: /home/vdorbala/ray_results/MultiPPO_2021-03-01_20-49-06
Number of trials: 1/1 (1 ERROR)
+------------------------------------+----------+-------+
| Trial name                         | status   | loc   |
|------------------------------------+----------+-------|
| MultiPPO_path_planning_79291_00000 | ERROR    |       |
+------------------------------------+----------+-------+
Number of errored trials: 1
+------------------------------------+--------------+------------------------------------------------------------------------------------------------------+
| Trial name                         |   # failures | error file                                                                                           |
|------------------------------------+--------------+------------------------------------------------------------------------------------------------------|
| MultiPPO_path_planning_79291_00000 |            1 | /home/vdorbala/ray_results/MultiPPO_2021-03-01_20-49-06/MultiPPO_path_planning_79291_00000/error.txt |
+------------------------------------+--------------+------------------------------------------------------------------------------------------------------+
Traceback (most recent call last):
File "/home/vdorbala/.conda/envs/marl/bin/train_policy", line 33, in
Interesting, this seems to be a version incompatibility. I didn't think it would be a problem to use a newer version, but can you try Python 3.7 instead of 3.8? I also fixed another minor bug, so you have to pull from master.
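For context, here is a tiny sketch of what the failing line trips over (an illustration of the attribute only, not the exact cloudpickle code path): the deserializer expects a buffer object that exposes `.readonly`, such as a `memoryview`, but in the mismatched setup it receives plain `bytes`, which has no such attribute.

```python
# Illustration only: the failing line in cloudpickle_fast.py reads
# `buffer.readonly`, which exists on memoryview-style buffers but not on bytes.
buf_expected = memoryview(b"\x00\x01\x02")
print(buf_expected.readonly)             # True

buf_actual = b"\x00\x01\x02"
print(hasattr(buf_actual, "readonly"))   # False -> why _numpy_frombuffer raises AttributeError
```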
I have created a minimal example to demonstrate training of NNs with differentiable communication channels (i.e. GNNs) in this repository. It's a cleaned-up version of the trainer used here that supports continuous action spaces and the most recent version of Ray/RLlib. In fact, the current Ray master has to be used - I might switch this repository to the updated trainer once the next Ray version is released. Let me know if you have any more questions!
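For anyone landing here without the repo checked out, the sketch below shows the generic RLlib pattern the example is built around (register a custom torch model and launch PPO through tune). Everything in it is a placeholder: `TinyModel` is a plain MLP rather than the repository's GNN, and CartPole is used only so the snippet runs anywhere.

```python
# Generic RLlib custom-model pattern (placeholder, not the repo's actual code).
import ray
from ray import tune
from ray.rllib.models import ModelCatalog
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
import torch.nn as nn


class TinyModel(TorchModelV2, nn.Module):
    """Placeholder MLP standing in for a GNN-based comm model."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs,
                              model_config, name)
        nn.Module.__init__(self)
        self.policy_net = nn.Sequential(
            nn.Linear(obs_space.shape[0], 64), nn.Tanh(),
            nn.Linear(64, num_outputs))
        self.value_net = nn.Linear(obs_space.shape[0], 1)
        self._last_obs = None

    def forward(self, input_dict, state, seq_lens):
        obs = input_dict["obs_flat"].float()
        self._last_obs = obs
        return self.policy_net(obs), state

    def value_function(self):
        return self.value_net(self._last_obs).squeeze(1)


ModelCatalog.register_custom_model("tiny_model", TinyModel)

ray.init()
tune.run(
    "PPO",
    stop={"training_iteration": 2},
    config={
        "env": "CartPole-v0",          # placeholder env, just so this runs
        "framework": "torch",
        "num_workers": 1,
        "model": {"custom_model": "tiny_model"},
    },
)
```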
Thank you. The previous issue has been resolved by changing the Python version to 3.7.x. However, we are now getting a socket error when we run the same training command. Training proceeds for a while, but then it fails with the error below. Any suggestions as to what could be causing this?
This is the stack trace:
E0309 23:48:07.580282 14882 15869 task_manager.cc:323] Task failed: IOError: 14: Socket closed: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=ray.rllib.agents.trainer_template, class_name=MultiPPO, function_name=train, function_hash=}, task_id=cb230a572350ff44df5a1a8201000000, task_name=MultiPPO.train(), job_id=01000000, num_args=0, num_returns=2, actor_task_spec={actor_id=df5a1a8201000000, actor_caller_id=ffffffffffffffffffffffff01000000, actor_counter=0}
2021-03-09 23:48:07,832 WARNING worker.py:1091 -- A worker died or was killed while executing task ffffffffffffffffdf5a1a8201000000.
2021-03-09 23:48:07,909 ERROR trial_runner.py:793 -- Trial MultiPPO_coverage_7e3f7_00000: Error processing event.
Traceback (most recent call last):
File "/home/vdorbala/.conda/envs/marl/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 726, in _process_trial
result = self.trial_executor.fetch_result(trial)
File "/home/vdorbala/.conda/envs/marl/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 489, in fetch_result
result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
File "/home/vdorbala/.conda/envs/marl/lib/python3.7/site-packages/ray/worker.py", line 1454, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
E0309 23:48:09.769163 14882 15869 task_manager.cc:323] Task failed: IOError: cancelling all pending tasks of dead actor: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=ray.rllib.agents.trainer_template, class_name=MultiPPO, function_name=stop, function_hash=}, task_id=7bbd90284b71e599df5a1a8201000000, task_name=MultiPPO.stop(), job_id=01000000, num_args=0, num_returns=2, actor_task_spec={actor_id=df5a1a8201000000, actor_caller_id=ffffffffffffffffffffffff01000000, actor_counter=1}
2021-03-09 23:48:09,844 WARNING util.py:140 -- The process_trial operation took 2.031278610229492 seconds to complete, which may be a performance bottleneck.
== Status ==
Memory usage on this node: 4.7/31.1 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/32 CPUs, 0.0/2 GPUs, 0.0/15.87 GiB heap, 0.0/5.47 GiB objects (0/1.0 accelerator_type:RTX)
Result logdir: /home/vdorbala/ray_results/MultiPPO_2021-03-09_23-45-52
Number of trials: 1/1 (1 ERROR)
+-------------------------------+----------+-------+
| Trial name | status | loc |
|-------------------------------+----------+-------|
| MultiPPO_coverage_7e3f7_00000 | ERROR | |
+-------------------------------+----------+-------+
Number of errored trials: 1
+-------------------------------+--------------+-------------------------------------------------------------------------------------------------+
| Trial name | # failures | error file |
|-------------------------------+--------------+-------------------------------------------------------------------------------------------------|
| MultiPPO_coverage_7e3f7_00000 | 1 | /home/vdorbala/ray_results/MultiPPO_2021-03-09_23-45-52/MultiPPO_coverage_7e3f7_00000/error.txt |
+-------------------------------+--------------+-------------------------------------------------------------------------------------------------+
Traceback (most recent call last):
File "/home/vdorbala/.conda/envs/marl/bin/train_policy", line 33, in
Hi @anubhavparas! Excuse my late reply. I have been waiting for the release of Ray 1.3.0, which finally happened this week. I updated the repository. Everything seems to work for me - can you have another try?
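As a quick sanity check before re-running (just a suggestion, not part of the repo), it is worth confirming the environment actually picked up the new release:

```python
# Verify the installed Ray version matches the one the repo was updated for (1.3.0).
import ray
print(ray.__version__)
```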
We are trying to set up this multi-agent training and ran
train_policy coverage -t 20
to train the model, but we are getting this error:

..conda/envs/marl/lib/python3.8/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 448, in _numpy_frombuffer
    array.setflags(write=isinstance(buffer, bytearray) or not buffer.readonly)
AttributeError: 'bytes' object has no attribute 'readonly'
The requirements were installed as per the requirements.txt file using the command
pip install -e .
We checked the cloudpickle_fast.py file but could not find the specific line where the error is thrown.
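In case it is useful, this is a sketch of how to locate the exact file and function named in the traceback (assuming the same Ray install that produced the error; `_numpy_frombuffer` is the name reported there):

```python
# Print the path of Ray's bundled cloudpickle_fast.py and the source of the
# function that raised the AttributeError, as reported in the traceback.
import inspect
import ray.cloudpickle.cloudpickle_fast as cloudpickle_fast

print(cloudpickle_fast.__file__)
print(inspect.getsource(cloudpickle_fast._numpy_frombuffer))
```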
Any help would be appreciated.
Thank you.