Hi, could you paste the whole stack trace when the error occurs?
This is the stack trace:
2021-03-01 20:49:10,156 ERROR trial_runner.py:793 -- Trial MultiPPO_path_planning_79291_00000: Error processing event.
Traceback (most recent call last):
  File "/home/vdorbala/.conda/envs/marl/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 726, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/home/vdorbala/.conda/envs/marl/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py", line 489, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/vdorbala/.conda/envs/marl/lib/python3.8/site-packages/ray/worker.py", line 1452, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): ray::MultiPPO.train() (pid=15203, ip=10.70.51.83)
  File "python/ray/_raylet.pyx", line 443, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 477, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 481, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 482, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 436, in ray._raylet.execute_task.function_executor
  File "/home/vdorbala/.conda/envs/marl/lib/python3.8/site-packages/ray/rllib/agents/trainer_template.py", line 106, in __init__
    Trainer.__init__(self, config, env, logger_creator)
  File "/home/vdorbala/.conda/envs/marl/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 477, in __init__
    super().__init__(config, logger_creator)
  File "/home/vdorbala/.conda/envs/marl/lib/python3.8/site-packages/ray/tune/trainable.py", line 249, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/vdorbala/.conda/envs/marl/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 630, in setup
    self._init(self.config, self.env_creator)
  File "/home/vdorbala/.conda/envs/marl/lib/python3.8/site-packages/ray/rllib/agents/trainer_template.py", line 133, in _init
    self.workers = self._make_workers(
  File "/home/vdorbala/.conda/envs/marl/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 701, in _make_workers
    return WorkerSet(
  File "/home/vdorbala/.conda/envs/marl/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 87, in __init__
    self._local_worker = self._make_worker(
  File "/home/vdorbala/.conda/envs/marl/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 310, in _make_worker
    worker = cls(
  File "/home/vdorbala/.conda/envs/marl/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 456, in __init__
    self.policy_map, self.preprocessors = self._build_policy_map(
  File "/home/vdorbala/.conda/envs/marl/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1061, in _build_policy_map
    policy_map[name] = cls(obs_space, act_space, merged_conf)
  File "/home/vdorbala/.conda/envs/marl/lib/python3.8/site-packages/ray/rllib/policy/torch_policy_template.py", line 205, in __init__
    dist_class, logit_dim = ModelCatalog.get_action_dist(
  File "/home/vdorbala/.conda/envs/marl/lib/python3.8/site-packages/ray/rllib/models/catalog.py", line 146, in get_action_dist
    dist_cls = _global_registry.get(RLLIB_ACTION_DIST,
  File "/home/vdorbala/.conda/envs/marl/lib/python3.8/site-packages/ray/tune/registry.py", line 140, in get
    return pickle.loads(value)
  File "/home/vdorbala/.conda/envs/marl/lib/python3.8/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 448, in _numpy_frombuffer
    array.setflags(write=isinstance(buffer, bytearray) or not buffer.readonly)
AttributeError: 'bytes' object has no attribute 'readonly'
== Status ==
Memory usage on this node: 17.4/31.1 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/32 CPUs, 0/2 GPUs, 0.0/10.55 GiB heap, 0.0/3.61 GiB objects (0/1.0 accelerator_type:RTX)
Result logdir: /home/vdorbala/ray_results/MultiPPO_2021-03-01_20-49-06
Number of trials: 1/1 (1 ERROR)
+------------------------------------+----------+-------+
| Trial name                         | status   | loc   |
|------------------------------------+----------+-------|
| MultiPPO_path_planning_79291_00000 | ERROR    |       |
+------------------------------------+----------+-------+
Number of errored trials: 1
+------------------------------------+--------------+------------------------------------------------------------------------------------------------------+
| Trial name                         |   # failures | error file                                                                                           |
|------------------------------------+--------------+------------------------------------------------------------------------------------------------------|
| MultiPPO_path_planning_79291_00000 |            1 | /home/vdorbala/ray_results/MultiPPO_2021-03-01_20-49-06/MultiPPO_path_planning_79291_00000/error.txt |
+------------------------------------+--------------+------------------------------------------------------------------------------------------------------+
Traceback (most recent call last):
File "/home/vdorbala/.conda/envs/marl/bin/train_policy", line 33, in
Interesting, this seems to be a version incompatibility. I didn't think it would be a problem to use a newer version, but can you try Python 3.7 instead of 3.8? I also fixed another minor bug, so you have to pull from master.
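For context, here is a tiny sketch of what the failing line trips over (an illustration of the attribute only, not the exact cloudpickle code path): the deserializer expects a buffer object that exposes `.readonly`, such as a `memoryview`, but in the mismatched setup it receives plain `bytes`, which has no such attribute.

```python
# Illustration only: the failing line in cloudpickle_fast.py reads
# `buffer.readonly`, which exists on memoryview-style buffers but not on bytes.
buf_expected = memoryview(b"\x00\x01\x02")
print(buf_expected.readonly)             # True

buf_actual = b"\x00\x01\x02"
print(hasattr(buf_actual, "readonly"))   # False -> why _numpy_frombuffer raises AttributeError
```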
I have created a minimal example to demonstrate training of NNs with differentiable communication channels (i.e. GNNs) in this repository. It's a cleaned-up version of the trainer used here that supports continuous action spaces and the most recent version of Ray/RLlib. In fact, the current Ray master has to be used - I might switch this repository to the updated trainer once the next Ray version is released. Let me know if you have any more questions!
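For anyone landing here without the repo checked out, the sketch below shows the generic RLlib pattern the example is built around (register a custom torch model and launch PPO through tune). Everything in it is a placeholder: `TinyModel` is a plain MLP rather than the repository's GNN, and CartPole is used only so the snippet runs anywhere.

```python
# Generic RLlib custom-model pattern (placeholder, not the repo's actual code).
import ray
from ray import tune
from ray.rllib.models import ModelCatalog
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
import torch.nn as nn


class TinyModel(TorchModelV2, nn.Module):
    """Placeholder MLP standing in for a GNN-based comm model."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs,
                              model_config, name)
        nn.Module.__init__(self)
        self.policy_net = nn.Sequential(
            nn.Linear(obs_space.shape[0], 64), nn.Tanh(),
            nn.Linear(64, num_outputs))
        self.value_net = nn.Linear(obs_space.shape[0], 1)
        self._last_obs = None

    def forward(self, input_dict, state, seq_lens):
        obs = input_dict["obs_flat"].float()
        self._last_obs = obs
        return self.policy_net(obs), state

    def value_function(self):
        return self.value_net(self._last_obs).squeeze(1)


ModelCatalog.register_custom_model("tiny_model", TinyModel)

ray.init()
tune.run(
    "PPO",
    stop={"training_iteration": 2},
    config={
        "env": "CartPole-v0",          # placeholder env, just so this runs
        "framework": "torch",
        "num_workers": 1,
        "model": {"custom_model": "tiny_model"},
    },
)
```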
Thank you. The previous issue has been resolved by changing the Python version to 3.7.x. However, we are now getting a socket error when we run the same training command. Training proceeds for a while, but then it fails with the error below. Any suggestions as to what could be causing this?
This is the stack trace:
E0309 23:48:07.580282 14882 15869 task_manager.cc:323] Task failed: IOError: 14: Socket closed: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=ray.rllib.agents.trainer_template, class_name=MultiPPO, function_name=train, function_hash=}, task_id=cb230a572350ff44df5a1a8201000000, task_name=MultiPPO.train(), job_id=01000000, num_args=0, num_returns=2, actor_task_spec={actor_id=df5a1a8201000000, actor_caller_id=ffffffffffffffffffffffff01000000, actor_counter=0}
2021-03-09 23:48:07,832 WARNING worker.py:1091 -- A worker died or was killed while executing task ffffffffffffffffdf5a1a8201000000.
2021-03-09 23:48:07,909 ERROR trial_runner.py:793 -- Trial MultiPPO_coverage_7e3f7_00000: Error processing event.
Traceback (most recent call last):
File "/home/vdorbala/.conda/envs/marl/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 726, in _process_trial
result = self.trial_executor.fetch_result(trial)
File "/home/vdorbala/.conda/envs/marl/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 489, in fetch_result
result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
File "/home/vdorbala/.conda/envs/marl/lib/python3.7/site-packages/ray/worker.py", line 1454, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
E0309 23:48:09.769163 14882 15869 task_manager.cc:323] Task failed: IOError: cancelling all pending tasks of dead actor: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=ray.rllib.agents.trainer_template, class_name=MultiPPO, function_name=stop, function_hash=}, task_id=7bbd90284b71e599df5a1a8201000000, task_name=MultiPPO.stop(), job_id=01000000, num_args=0, num_returns=2, actor_task_spec={actor_id=df5a1a8201000000, actor_caller_id=ffffffffffffffffffffffff01000000, actor_counter=1}
2021-03-09 23:48:09,844 WARNING util.py:140 -- The process_trial operation took 2.031278610229492 seconds to complete, which may be a performance bottleneck.
== Status ==
Memory usage on this node: 4.7/31.1 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/32 CPUs, 0.0/2 GPUs, 0.0/15.87 GiB heap, 0.0/5.47 GiB objects (0/1.0 accelerator_type:RTX)
Result logdir: /home/vdorbala/ray_results/MultiPPO_2021-03-09_23-45-52
Number of trials: 1/1 (1 ERROR)
+-------------------------------+----------+-------+
| Trial name | status | loc |
|-------------------------------+----------+-------|
| MultiPPO_coverage_7e3f7_00000 | ERROR | |
+-------------------------------+----------+-------+
Number of errored trials: 1
+-------------------------------+--------------+-------------------------------------------------------------------------------------------------+
| Trial name | # failures | error file |
|-------------------------------+--------------+-------------------------------------------------------------------------------------------------|
| MultiPPO_coverage_7e3f7_00000 | 1 | /home/vdorbala/ray_results/MultiPPO_2021-03-09_23-45-52/MultiPPO_coverage_7e3f7_00000/error.txt |
+-------------------------------+--------------+-------------------------------------------------------------------------------------------------+
Traceback (most recent call last):
File "/home/vdorbala/.conda/envs/marl/bin/train_policy", line 33, in
Hi @anubhavparas! Excuse my late reply. I have been waiting for the release of Ray 1.3.0, which finally happened this week. I updated the repository. Everything seems to work for me - can you have another try?
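As a quick sanity check before re-running (just a suggestion, not part of the repo), it is worth confirming the environment actually picked up the new release:

```python
# Verify the installed Ray version matches the one the repo was updated for (1.3.0).
import ray
print(ray.__version__)
```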
We are trying to set up this multi-agent training and ran
train_policy coverage -t 20
to train the model, but we are getting this error:

..conda/envs/marl/lib/python3.8/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 448, in _numpy_frombuffer
    array.setflags(write=isinstance(buffer, bytearray) or not buffer.readonly)
AttributeError: 'bytes' object has no attribute 'readonly'
The requirements were installed as per the requirements.txt file using the command
pip install -e .
We checked the cloudpickle_fast.py file but could not find the specific line where the error is thrown.
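In case it is useful, this is a sketch of how to locate the exact file and function named in the traceback (assuming the same Ray install that produced the error; `_numpy_frombuffer` is the name reported there):

```python
# Print the path of Ray's bundled cloudpickle_fast.py and the source of the
# function that raised the AttributeError, as reported in the traceback.
import inspect
import ray.cloudpickle.cloudpickle_fast as cloudpickle_fast

print(cloudpickle_fast.__file__)
print(inspect.getsource(cloudpickle_fast._numpy_frombuffer))
```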
Any help would be appreciated.
Thank you.