ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[rllib] Pytorch missing custom loss #8507

Closed · waldroje closed this issue 4 years ago

waldroje commented 4 years ago

What is the problem?

Running the rllib/examples/custom_loss.py example, the TensorFlow implementation displays the custom loss, but PyTorch does not. In the process, I also had to rename the model's method in rllib/examples/models/custom_loss_model.py from custom_stats to metrics.

Ray version and other system information (Python version, TensorFlow version, OS):
- Ray version: latest wheel
- Python: 3.6.8
- TF: 2.1
- OS: RHEL 7.7

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

Use the existing example script in RLlib.
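
Concretely, the reproduction is just the existing example run with the torch framework flag (the same invocation referenced later in this thread):

```
python rllib/examples/custom_loss.py --torch
```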

If we cannot run your script, we cannot fix your issue.

janblumenkamp commented 4 years ago

#8193

sven1977 commented 4 years ago

Sorry, this may have slipped through the cracks. I'll take a look and provide the missing example.

sven1977 commented 4 years ago

Got it, ModelV2.custom_loss() is not supported for PyTorch yet at all (ignored). Will fix this.

Yes, metrics is the new name (ModelV2) for the old custom_stats (Model(V1)).

waldroje commented 4 years ago

I'm looking closer at the current TF policies, and I'm actually struggling to see where model.custom_loss has been implemented in TF IMPALA either… Can you highlight the part of the code where this is done, for my benefit? In the case of IMPALA, I know it calls the model.custom_loss method, as I can see the custom loss metrics I've produced, but what is not clear is whether it actually makes it into the gradient calculations… build_vtrace_loss seems to just call the VTraceLoss class, where I don't see any references to the output of model.custom_loss, and dynamic_tf_policy seems to just return the output of this loss function… so I'm struggling to see where it comes into play.

janblumenkamp commented 4 years ago

Looks like it is only used once here and from there on it's accessed through self._loss.

waldroje commented 4 years ago

Thanks for that, not sure how that eluded me… On further review, my issue on the TF side was that I was using the 1.x-style tf.losses.get_regularization_loss rather than model.losses. Once I switched, everything works for TF models.
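
For anyone hitting the same thing, here is a minimal sketch of that fix (the helper name is hypothetical; keras_model stands for whatever tf.keras model backs the ModelV2, and the list-vs-tensor handling is an assumption about how the policy loss is passed in):

```python
import tensorflow as tf


def add_regularization_loss(policy_loss, keras_model):
    # keras_model.losses holds the per-layer regularization terms registered
    # via kernel_regularizer etc. Summing them replaces the TF1-style
    # tf.losses.get_regularization_loss() collection lookup.
    if not keras_model.losses:
        return policy_loss
    reg_loss = tf.add_n(keras_model.losses)
    # Add the regularization term to the policy loss(es) and return them in
    # the same shape (single tensor or list) they were passed in.
    if isinstance(policy_loss, (list, tuple)):
        return [loss + reg_loss for loss in policy_loss]
    return policy_loss + reg_loss
```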

cedros23 commented 4 years ago

> Got it, ModelV2.custom_loss() is not supported for PyTorch yet at all (ignored). Will fix this.
>
> Yes, metrics is the new name (ModelV2) for the old custom_stats (Model(V1)).

Hi! Is there any update on this? If an update is not expected soon, I might give it a shot, starting from the TensorFlow side. Many thanks.

sven1977 commented 4 years ago

Hey, @cedros23 . Taking another look right now.

sven1977 commented 4 years ago

This PR (hopefully) fixes this issue: https://github.com/ray-project/ray/pull/9142. I confirmed via our now-working example script: rllib/examples/custom_loss.py --torch.

Note that you need to either:
a) return an altered list of policy losses (same length) from the custom_loss function (no extra custom-loss optimizer; the custom loss simply gets, e.g., added to the policy loss(es)), or
b) return a list that contains the policy losses as well as the custom loss(es) (this list must have the same length as the number of optimizers in the torch policy).

See rllib/examples/models/custom_loss_model.py::TorchCustomLossModel for an example of both cases.
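
As a minimal sketch of what option (a) can look like on the torch side (illustrative only, not the actual rllib/examples/models/custom_loss_model.py code; the L2 term is just a stand-in for a real custom loss, and the flat-observation layers are assumptions):

```python
import torch
import torch.nn as nn

from ray.rllib.models.torch.torch_modelv2 import TorchModelV2


class SketchCustomLossModel(TorchModelV2, nn.Module):
    """Illustrative torch ModelV2 overriding custom_loss() and metrics()."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs,
                              model_config, name)
        nn.Module.__init__(self)
        self.policy_layer = nn.Linear(obs_space.shape[0], num_outputs)
        self.value_layer = nn.Linear(obs_space.shape[0], 1)
        self._last_obs = None
        self._custom_loss = None

    def forward(self, input_dict, state, seq_lens):
        self._last_obs = input_dict["obs"].float()
        return self.policy_layer(self._last_obs), state

    def value_function(self):
        return self.value_layer(self._last_obs).squeeze(1)

    def custom_loss(self, policy_loss, loss_inputs):
        # Stand-in custom term: L2 regularization over all model parameters.
        self._custom_loss = sum(torch.sum(p ** 2) for p in self.parameters())
        # Option (a): add the custom term to every policy loss and return the
        # list with unchanged length (one entry per optimizer in the policy).
        return [loss + 0.01 * self._custom_loss for loss in policy_loss]

    def metrics(self):
        # metrics() (the ModelV2 name for the old custom_stats) surfaces the
        # custom loss in the learner stats.
        return {"custom_loss": self._custom_loss}
```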

Closing this issue. Feel free to re-open should this not solve the problem on your end.

sven1977 commented 4 years ago

@janblumenkamp

sven1977 commented 4 years ago

@cedros23

cedros23 commented 4 years ago

Hello again,

I am trying to test and expand the example file rllib/examples/models/custom_loss_model.py (I build Ray from source, master). I can confirm it works in the standard CartPole example with the --torch flag. However, if I change the PG method to DQN, it gives the following error (the error is particular to torch; TensorFlow works fine with DQN):

File "/home/eatron/Sources/GymStar/ray_source/ray/python/ray/rllib/agents/trainer_template.py", line 97, in __init__
    Trainer.__init__(self, config, env, logger_creator)
  File "/home/eatron/Sources/GymStar/ray_source/ray/python/ray/rllib/agents/trainer.py", line 469, in __init__
    super().__init__(config, logger_creator)
  File "/home/eatron/Sources/GymStar/ray_source/ray/python/ray/tune/trainable.py", line 231, in __init__
    self._setup(copy.deepcopy(self.config))
  File "/home/eatron/Sources/GymStar/ray_source/ray/python/ray/rllib/agents/trainer.py", line 643, in _setup
    self._init(self.config, self.env_creator)
  File "/home/eatron/Sources/GymStar/ray_source/ray/python/ray/rllib/agents/trainer_template.py", line 122, in _init
    self.config["num_workers"])
  File "/home/eatron/Sources/GymStar/ray_source/ray/python/ray/rllib/agents/trainer.py", line 713, in _make_workers
    logdir=self.logdir)
  File "/home/eatron/Sources/GymStar/ray_source/ray/python/ray/rllib/evaluation/worker_set.py", line 67, in __init__
    RolloutWorker, env_creator, policy, 0, self._local_config)
  File "/home/eatron/Sources/GymStar/ray_source/ray/python/ray/rllib/evaluation/worker_set.py", line 296, in _make_worker
    extra_python_environs=extra_python_environs)
  File "/home/eatron/Sources/GymStar/ray_source/ray/python/ray/rllib/evaluation/rollout_worker.py", line 416, in __init__
    policy_dict, policy_config)
  File "/home/eatron/Sources/GymStar/ray_source/ray/python/ray/rllib/evaluation/rollout_worker.py", line 971, in _build_policy_map
    policy_map[name] = cls(obs_space, act_space, merged_conf)
  File "/home/eatron/Sources/GymStar/ray_source/ray/python/ray/rllib/policy/torch_policy_template.py", line 118, in __init__
    self, obs_space, action_space, config)
  File "/home/eatron/Sources/GymStar/ray_source/ray/python/ray/rllib/agents/dqn/dqn_torch_policy.py", line 112, in build_q_model_and_distribution
    add_layer_norm=add_layer_norm)
  File "/home/eatron/Sources/GymStar/ray_source/ray/python/ray/rllib/models/catalog.py", line 364, in get_model_v2
    model_config, name, **model_kwargs)
  File "/home/eatron/Sources/GymStar/ray_source/ray/python/ray/rllib/agents/dqn/dqn_torch_model.py", line 51, in __init__
    num_outputs, model_config, name)
TypeError: __init__() missing 1 required positional argument: 'input_files'
Traceback (most recent call last):
  File "/home/eatron/Sources/GymStar/ray_source/ray/rllib/examples/custom_loss.py", line 67, in <module>
    tune.run("DQN", config=config, stop=stop)
  File "/home/eatron/Sources/GymStar/ray_source/ray/python/ray/tune/tune.py", line 356, in run
    raise TuneError("Trials did not complete", incomplete_trials)
ray.tune.error.TuneError: ('Trials did not complete', [DQN_CartPole-v0_82195_00000])

cedros23 commented 4 years ago

Quick update: I have compared the custom_loss models of TF and torch. It appears that the error raised above is due to the fact that the PyTorch custom model gets the input file as a positional argument during object creation, whereas TensorFlow reads it via the model config dictionary.

If I copy the TF style over to the torch model as follows and remove the input_files arg from __init__, the custom_loss example starts working with DQN: self.reader = JsonReader(self.model_config["custom_model_config"]["input_files"])
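
To make that concrete, here is a sketch of the adjusted __init__ (hypothetical class name; the point is only that input_files comes from the model config rather than an extra positional argument, mirroring the TF custom-loss model):

```python
import torch.nn as nn

from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.offline import JsonReader


class TorchCustomLossModelSketch(TorchModelV2, nn.Module):
    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        # Standard ModelV2 signature only -- no extra `input_files` positional
        # argument, so wrappers such as the DQN torch model can construct it.
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs,
                              model_config, name)
        nn.Module.__init__(self)
        # Read the offline-data path from the custom model config instead,
        # the same way the TF custom-loss model does.
        self.reader = JsonReader(
            self.model_config["custom_model_config"]["input_files"])
```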

I will keep testing to see whether this leads to new problems. I would appreciate your opinions.

Thanks & Regards

sven1977 commented 4 years ago

Yeah, that makes perfect sense. We'll fix this as well. Thanks so much for your investigation into this! @cedros23