Closed: janblumenkamp closed this issue 4 years ago
Thanks for filing this issue @janblumenkamp! Could you try this PR and see whether it fixes your problem? https://github.com/ray-project/ray/pull/7445
Thanks for the quick PR! Unfortunately it does not work, since the torch tensors in the stats dict are contained in numpy arrays (e.g. {"t": np.array([torch.tensor([0])], dtype=object)}), which are not traversed by the dm-tree package :( If a standard Python list were used instead of the numpy array of objects, this would work. I tried checking whether item in the mapping function is a numpy array and, if so, returning it as a list, but apparently map_structure does not allow modifying structures on the fly.
Is there any reason why stats uses a numpy array of objects instead of a list? Can this be changed?
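For illustration (a sketch of my own, not from the PR): dm-tree treats any numpy array as a single leaf, so tensors nested inside an object-dtype array never reach the mapping function individually. One workaround would be to unwrap such arrays manually inside the mapping function itself:
import numpy as np
import torch
import tree  # dm-tree

stats = {"t": np.array([torch.tensor([0])], dtype=object)}

def mapping(item):
    # dm-tree hands the whole object array over as one leaf, so we unwrap it
    # ourselves and convert the tensors it contains.
    if isinstance(item, np.ndarray) and item.dtype == object:
        return [mapping(x) for x in item]
    if isinstance(item, torch.Tensor):
        return item.cpu().item() if len(item.size()) == 0 else item.cpu().numpy()
    return item

converted = tree.map_structure(mapping, stats)
# converted["t"] is now a plain Python list holding numpy arrays.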
Cool, thanks for checking. Yeah, this doesn't look right. We shouldn't use np.array of objects. There is no reason for that. I'll check again.
Ok, so I ran the entire example you provided, and after starting the client I do get lots of actions displayed on the screen. Are you sure you have the latest version of ray (pip install -U ray ray[rllib], plus pip install -U [nightly wheel for your platform])?
I'm on MacOS: pytorch==1.4.0 python==3.7.6 numpy==1.17.4 ray==0.9.0dev0
Server output:
...
127.0.0.1 - - [04/Mar/2020 15:41:21] "POST / HTTP/1.1" 200 -
127.0.0.1 - - [04/Mar/2020 15:41:21] "POST / HTTP/1.1" 200 -
127.0.0.1 - - [04/Mar/2020 15:41:21] "POST / HTTP/1.1" 200 -
...
Client output:
...
{'a_0': 3, 'a_1': 0, 'a_2': 4}
{'a_0': 2, 'a_1': 4, 'a_2': 0}
{'a_0': 3, 'a_1': 1, 'a_2': 3}
{'a_0': 4, 'a_1': 3, 'a_2': 1}
{'a_0': 2, 'a_1': 1, 'a_2': 3}
{'a_0': 4, 'a_1': 2, 'a_2': 1}
{'a_0': 0, 'a_1': 1, 'a_2': 0}
{'a_0': 4, 'a_1': 4, 'a_2': 3}
{'a_0': 1, 'a_1': 2, 'a_2': 0}
{'a_0': 0, 'a_1': 2, 'a_2': 4}
{'a_0': 0, 'a_1': 0, 'a_2': 2}
{'a_0': 2, 'a_1': 4, 'a_2': 1}
{'a_0': 1, 'a_1': 3, 'a_2': 4}
{'a_0': 0, 'a_1': 2, 'a_2': 0}
{'a_0': 0, 'a_1': 1, 'a_2': 1}
{'a_0': 0, 'a_1': 4, 'a_2': 2}
{'a_0': 0, 'a_1': 2, 'a_2': 2}
{'a_0': 4, 'a_1': 4, 'a_2': 0}
So what I meant was: After you upgrade to the latest RLlib, add this PR from today: https://github.com/ray-project/ray/pull/7445
I may have used an RLlib version that was one or two days old, sorry. With the latest RLlib (not adding your pull request) in a newly set up virtualenv, I get a segmentation fault:
127.0.0.1 - - [04/Mar/2020 16:20:00] "POST / HTTP/1.1" 200 -
*** Aborted at 1583338800 (unix time) try "date -d @1583338800" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x1) received by PID 30454 (TID 0x7f70d6e60740) from PID 1; stack trace: ***
@ 0x7f70d68c2f20 (unknown)
@ 0x7f70d6a125a1 (unknown)
@ 0x7f70d68e14d3 _IO_vfprintf
@ 0x7f70d690c910 vsnprintf
@ 0x7f707f176154 torch::formatMessage()
@ 0x7f707f176476 torch::TypeError::TypeError()
@ 0x7f707f469cee torch::utils::(anonymous namespace)::new_with_tensor()
@ 0x7f707f46cff7 torch::utils::legacy_tensor_ctor()
@ 0x7f707f2ba496 THPVariable_pynew()
@ 0x551b15 (unknown)
@ 0x5aa6ec _PyObject_FastCallKeywords
@ 0x50abb3 (unknown)
@ 0x50c5b9 _PyEval_EvalFrameDefault
@ 0x508245 (unknown)
@ 0x50a080 (unknown)
@ 0x50aa7d (unknown)
@ 0x50c5b9 _PyEval_EvalFrameDefault
@ 0x508245 (unknown)
@ 0x509642 _PyFunction_FastCallDict
@ 0x595311 (unknown)
@ 0x54a6ff (unknown)
@ 0x551b81 (unknown)
@ 0x5a067e PyObject_Call
@ 0x50d966 _PyEval_EvalFrameDefault
@ 0x508245 (unknown)
@ 0x50a080 (unknown)
@ 0x50aa7d (unknown)
@ 0x50c5b9 _PyEval_EvalFrameDefault
@ 0x508245 (unknown)
@ 0x50a080 (unknown)
@ 0x50aa7d (unknown)
@ 0x50d390 _PyEval_EvalFrameDefault
Segmentation fault (core dumped)
But I doubt it has anything to do with this issue. I will try Python 3.7 then.
Ok, please keep us posted. Yeah, this went in just a few days ago, so it's important you really use the very latest build.
Nope, no matter what I try, I keep getting this segmentation fault (the one from my previous comment) with the most recent builds, both with Python 3.6 and 3.7. Could someone else who uses Linux verify that? Maybe something else is wrong with my setup?
Confirmed on my setup; I built Ray 'recently' (March 3, 2020) at /ray:
user@hostname$ uname -a
Linux hostname 5.3.0-40-generic #32~18.04.1-Ubuntu SMP Mon Feb 3 14:05:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
user@hostname$ python --version
Python 3.7.4
Python package versions:
from importlib import import_module
import yaml

version_lookup = {'torch': '', 'tensorflow': '', 'numpy': '', 'ray': ''}
for module in version_lookup.keys():
    imprt = import_module(module)
    version_lookup[module] = imprt.__version__
print(yaml.dump(version_lookup))
numpy: 1.16.6
ray: 0.9.0.dev0
tensorflow: 1.14.1
torch: 1.4.0
Configuration/setup:
import ray
from ray.rllib.agents.ppo.ddppo import DDPPOTrainer, DEFAULT_CONFIG
import yaml

config = DEFAULT_CONFIG.copy()
with open('/ray/rllib/tuned_examples/atari-ddppo.yaml', 'r') as f:
    tuned_example = yaml.safe_load(f)
for key, val in tuned_example['atari-ddppo']['config'].items():
    config[key] = val
config['num_workers'] = 1
Ray calls:
ray.init(address="hostIP:6379", redis_password='redis_password', ignore_reinit_error=True)
agent = DDPPOTrainer(config, 'BreakoutNoFrameskip-v4')
agent.train()
Ray errors:
2020-03-05 08:57:26,745 INFO trainer.py:423 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
2020-03-05 08:57:26,759 INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
---------------------------------------------------------------------------
RayTaskError(TypeError) Traceback (most recent call last)
<ipython-input-2-b53e91f7fd27> in <module>
1 ray.init(address="hostIP:6379", redis_password='redis_password', ignore_reinit_error=True)
2 agent = DDPPOTrainer(config, 'BreakoutNoFrameskip-v4')
----> 3 agent.train()
/ray/python/ray/rllib/agents/trainer.py in train(self)
495 "continue training without the failed worker, set "
496 "`'ignore_worker_failures': True`.")
--> 497 raise e
498 except Exception as e:
499 time.sleep(0.5) # allow logs messages to propagate
/ray/python/ray/rllib/agents/trainer.py in train(self)
484 for _ in range(1 + MAX_WORKER_FAILURE_RETRIES):
485 try:
--> 486 result = Trainable.train(self)
487 except RayError as e:
488 if self.config["ignore_worker_failures"]:
/ray/python/ray/tune/trainable.py in train(self)
252 """
253 start = time.time()
--> 254 result = self._train()
255 assert isinstance(result, dict), "_train() needs to return a dict."
256
/ray/python/ray/rllib/agents/trainer_template.py in _train(self)
137 start = time.time()
138 while True:
--> 139 fetches = self.optimizer.step()
140 if after_optimizer_step:
141 after_optimizer_step(self, fetches)
/ray/python/ray/rllib/optimizers/torch_distributed_data_parallel_optimizer.py in step(self)
64 self.expected_batch_size, self.num_sgd_iter,
65 self.sgd_minibatch_size, self.standardize_fields)
---> 66 for w in self.workers.remote_workers()
67 ])
68 for info, count in results:
/ray/python/ray/worker.py in get(object_ids, timeout)
1502 worker.core_worker.dump_object_store_memory_usage()
1503 if isinstance(value, RayTaskError):
-> 1504 raise value.as_instanceof_cause()
1505 else:
1506 raise value
RayTaskError(TypeError): ray::RolloutWorker.sample_and_learn() (pid=4004, ip=IP)
File "python/ray/_raylet.pyx", line 448, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 426, in ray._raylet.execute_task.function_executor
File "/ray/python/ray/rllib/evaluation/rollout_worker.py", line 652, in sample_and_learn
batch = self.sample()
File "/ray/python/ray/rllib/evaluation/rollout_worker.py", line 489, in sample
batches = [self.input_reader.next()]
File "/ray/python/ray/rllib/evaluation/sampler.py", line 53, in next
batches = [self.get_data()]
File "/ray/python/ray/rllib/evaluation/sampler.py", line 96, in get_data
item = next(self.rollout_provider)
File "/ray/python/ray/rllib/evaluation/sampler.py", line 316, in _env_runner
soft_horizon, no_done_at_end)
File "/ray/python/ray/rllib/evaluation/sampler.py", line 462, in _process_observations
episode.batch_builder.postprocess_batch_so_far(episode)
File "/ray/python/ray/rllib/evaluation/sample_batch_builder.py", line 153, in postprocess_batch_so_far
pre_batch, other_batches, episode)
File "/ray/python/ray/rllib/policy/torch_policy_template.py", line 109, in postprocess_trajectory
episode)
File "/ray/python/ray/rllib/agents/ppo/ppo_tf_policy.py", line 191, in postprocess_ppo_gae
use_gae=policy.config["use_gae"])
File "/ray/python/ray/rllib/evaluation/postprocessing.py", line 45, in compute_advantages
traj[key] = np.stack(rollout[key])
File "/usr/local/lib/python3.7/site-packages/numpy/core/shape_base.py", line 410, in stack
arrays = [asanyarray(arr) for arr in arrays]
File "/usr/local/lib/python3.7/site-packages/numpy/core/shape_base.py", line 410, in <listcomp>
arrays = [asanyarray(arr) for arr in arrays]
File "/usr/local/lib/python3.7/site-packages/numpy/core/numeric.py", line 591, in asanyarray
return array(a, dtype, copy=False, order=order, subok=True)
File "/usr/local/lib/python3.7/site-packages/torch/tensor.py", line 486, in __array__
return self.numpy()
TypeError: can't convert CUDA tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
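For reference, the final TypeError can be reproduced in isolation with a minimal sketch (my own, requires a CUDA device): np.stack() ends up calling Tensor.__array__(), which refuses to convert a tensor that still lives on the GPU:
import numpy as np
import torch

if torch.cuda.is_available():
    t = torch.tensor([1.0], device="cuda")
    try:
        np.stack([t, t])  # TypeError: can't convert CUDA tensor to numpy ...
    except TypeError as e:
        print(e)
    # Works once the tensors are copied back to host memory first:
    np.stack([t.cpu().numpy(), t.cpu().numpy()])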
Could you post the contents of your lines 100-120 in rllib/policy/torch_policy_template.py? Do you have calls to convert_to_non_torch_type in there?
In that function, we move the tensors to cpu() and then numpy'ize them:
def convert_to_non_torch_type(stats):
    # The mapping function used to numpyize torch Tensors.
    def mapping(item):
        if isinstance(item, torch.Tensor):
            return item.cpu().item() if len(item.size()) == 0 else \
                item.cpu().numpy()
        else:
            return item

    return tree.map_structure(mapping, stats)
The above changes were done in PR #7445 and have not been merged into master yet! https://github.com/ray-project/ray/pull/7445
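As a quick illustration (my own usage sketch, not part of the PR), the helper above turns a stats dict of torch tensors into plain Python / numpy values:
# Assumes torch, tree, and the convert_to_non_torch_type definition above are in scope.
stats_dict = {
    "loss": torch.tensor(0.25),   # 0-d tensor -> Python scalar
    "logits": torch.zeros(2, 3),  # n-d tensor -> numpy array
    "lr": 5e-4,                   # non-tensors pass through unchanged
}
out = convert_to_non_torch_type(stats_dict)
# out == {"loss": 0.25, "logits": np.zeros((2, 3), dtype=np.float32), "lr": 0.0005}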
The most recent commit in master that does not throw the segmentation fault for me is fb1c1e2d27348b139bf1b686a1e81b94049d5190. If I apply your patch to that commit, the exception persists, since the 'action_prob' and 'action_logp' keys consist of numpy arrays of tensors, which can't be traversed by the tree library. I can verify whether your PR fixes the problem for later commits once the segmentation fault is no longer thrown.
@sven1977 -- Here are my lines 100-120 (maybe not what you're looking for): Call:
user@hostname:/ray/rllib/policy$ head -n 120 torch_policy_template.py | tail -n 20
Result:
                              episode=None):
        if not postprocess_fn:
            return sample_batch
        # Do all post-processing always with no_grad().
        # Not using this here will introduce a memory leak (issue #6962).
        with torch.no_grad():
            return postprocess_fn(self, sample_batch, other_agent_batches,
                                  episode)

    @override(TorchPolicy)
    def extra_grad_process(self):
        if extra_grad_process_fn:
            return extra_grad_process_fn(self)
        else:
            return TorchPolicy.extra_grad_process(self)

    @override(TorchPolicy)
    def extra_action_out(self, input_dict, state_batches, model,
                         action_dist=None):
OTOH... Call:
user@hostname:/ray/rllib/policy$ grep -n convert_to_non_torch_type torch_policy_template.py
Result:
8:from ray.rllib.utils.torch_ops import convert_to_non_torch_type
130: return convert_to_non_torch_type(stats_dict)
146: return convert_to_non_torch_type(stats_dict)
Here are the lines preceding the line 130 reference:
    @override(TorchPolicy)
    def extra_action_out(self, input_dict, state_batches, model,
                         action_dist=None):
        with torch.no_grad():
            if extra_action_out_fn:
                stats_dict = extra_action_out_fn(
                    self, input_dict, state_batches, model, action_dist)
            else:
                stats_dict = TorchPolicy.extra_action_out(
                    self, input_dict, state_batches, model, action_dist)
            return convert_to_non_torch_type(stats_dict)  # THIS IS LINE 130
and the lines preceding the line 146 reference:
    @override(TorchPolicy)
    def extra_grad_info(self, train_batch):
        with torch.no_grad():
            if stats_fn:
                stats_dict = stats_fn(self, train_batch)
            else:
                stats_dict = TorchPolicy.extra_grad_info(self, train_batch)
            return convert_to_non_torch_type(stats_dict)  # THIS IS LINE 146
Now it works for me (meanwhile the PR has been merged into master). Thanks a lot, Sven!
Hi, I hate to bump this thread, but I am getting the said error with the PPOTrainer, but not the PGTrainer...
Seconded @maulberto3
Hi, I made more tests, and the interesting thing is that there seems to be no problem when using the rllib CLI commands, while there is when scripting... Also, somewhat related, the num_gpus parameter does not seem to be working well, even when setting num_gpus=0... Hope this helps.
I got it working by installing the latest nightly (this morning), so it seems this has been addressed for PPO, but not released.
What is the problem?
I am using a multi-agent setup with PPO and PyTorch. I set up a basic environment and now want to run serving in this environment. This works fine with TensorFlow, but when using PyTorch the exception
can't convert CUDA tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first
is thrown, which seems to originate from the Torch PPO Policy implementation.
Reproduction (REQUIRED)
Mock environment issue_world.py:
Custom Torch model issue_model.py:
Model training script train.py (just run it for one iteration):
Serving script serving_server.py based on this (after the first checkpoint was created by the previous script, put the correct checkpoint path in CHECKPOINT_FILE):
And the client script serving_client.py based on this (run it after the last script has successfully started):
Interestingly, the client receives exactly one action and then the server interrupts the connection. The output of the server window is
For reference, this is the TensorFlow model issue_model_tf.py where the issue does not occur: