[RLlib] Bugs in PyTorch version of DDPG

raphaelavalos commented 4 years ago

Hello,

I spotted some bugs in the implementation of DDPG in PyTorch.

The gradient clipping is not implemented correctly, it uses the 'grad_norm_clipping' parameter instead of 'grad_clip' and the function ray.rllib.utils.torch_ops.minimize_and_clip which is not correct. All the others algorithms seems to use ray.rllib.agents.a3c.a3c_torch_policy.apply_grad_clipping. I propose to replace minimize_and_clip by the A3C function in torch_ops.
In the PyTorch model of DDPG the parameters to bound the action space (range and minimum) are not tracked by PyTorch as they are not registered as parameters. This means that they are not converted to cuda tensors resulting in an error.
The target model is placed on the gpu even if ray was not configure to use the gpu.

I will make a PR with everything. But I don't know if I should replace minimize_and_clip.

[x] I have verified my script runs in a clean environment and reproduces the issue.
[x] I have verified the issue also occurs with the latest wheels.

duburcqa commented 4 years ago

I would add other issue: clamp is used to clamp the action at several places here and here, while min/max combinaison should be used instead. Indeed, clamp does not support element-wise operation and this limitation is currently circumvent by clamping with respect to the first low/high bounds of the action space, which does not make any sense.

raphaelavalos commented 4 years ago

I will add that :)

duburcqa commented 4 years ago

Nice ! Thank you :+1: Personally I'm doing this (for example):

policy_tp1_smoothed = torch.min(torch.max(policy_tp1 + clipped_normal_sample,
    torch.tensor(policy.action_space.low, dtype=torch.float32, device=policy_tp1.device)),
    torch.tensor(policy.action_space.high, dtype=torch.float32, device=policy_tp1.device))

duburcqa commented 4 years ago

Note that the problem appears here

sven1977 commented 4 years ago

I would add other issue: clamp is used to clamp the action at several places here and here, while min/max combinaison should be used instead. Indeed, clamp does not support element-wise operation and this limitation is currently circumvent by clamping with respect to the first low/high bounds of the action space, which does not make any sense.

Yeah, but one instance of clamping is in the Exploration component (which could be used in any other algorithm) and the other is part of the DDPG algorithm. I think this one is ok here.

sven1977 commented 4 years ago

Note that the problem appears here

Perfect, thanks! Will fix this.

sven1977 commented 4 years ago

Looking into minimize_and_clip issue. ... I think I changed it so it's more concise because we had a lot of duplicate logic in the grad-clipping code prior to adding eager support.

sven1977 commented 4 years ago

The target model is placed on the gpu even if ray was not configure to use the gpu. This should be fixed in the current master.

duburcqa commented 4 years ago

Yeah, but one instance of clamping is in the Exploration component (which could be used in any other algorithm) and the other is part of the DDPG algorithm. I think this one is ok here.

What do you mean ? Why would it be ok to clamp wrt to first element bounds only in some cases ?

duburcqa commented 4 years ago

The target model is placed on the gpu even if ray was not configure to use the gpu.

This should be fixed in the current master.

There is the same problem for the model itself at least for 0.86, and on master as far as I know (since torch.cuda.is_gpu_available is used to choose the device). Is it also fixed ?

sven1977 commented 4 years ago

Yeah, we just made a fix for this a week ago or so. I created this PR here to fix all the other problems described above. https://github.com/ray-project/ray/pull/9680

sven1977 commented 4 years ago

Please let me know, if I'm missing anything. Leaving this open until merged (probably later today).

sven1977 commented 4 years ago

On the GPU: Yes, we are no longer checking, whether we have a GPU, but whether the config actually says: num_gpus > 0 and only then place the model on the GPU.

sven1977 commented 4 years ago

@duburcqa @raphaelavalos Btw, we are very close to setting up daily automatic GPU + "heavy" regression tests (Atari, MuJoCo) to catch these things much earlier than we do right now. Hopefully, this will eliminate issues like the GPU one here altogether.

raphaelavalos commented 4 years ago

@sven1977 we did things differently so I don't know if it is relevant but in my case the tests were failing because we have two new parameters on the model. You can check my last commit (above) it fixed the tests in my case.

sven1977 commented 4 years ago

Yeah, makes sense. If your PR is ready, feel free to ping me here so we can approve and merge it.

duburcqa commented 4 years ago

I spotted yet another issue : the device may be wrong for the noise, it can be on 'cpu' device while the action is on 'gpu' device in some case (I don't know exactly when though, but it happens to me in 0.8.6). It appears here, here and here.

duburcqa commented 4 years ago

@sven1977 @raphaelavalos Other bug: Pytorch state_dict method is used to get a dict of model parameters, which is wrong. state_dict returns something way more complex than that, including attributes. named_parameters should be used instead. Moreover, it is sometimes returning a dict instead of an OrderedDict as it should be. The issue occurs here for DDPG. It should be instead:

    def policy_variables(self, as_dict=False):
        """Return the list of variables for the policy net."""
        if as_dict:
            return OrderedDict(self.policy_model.named_parameters())
        return list(self.policy_model.parameters())

    def q_variables(self, as_dict=False):
        """Return the list of variables for Q / twin Q nets."""
        if as_dict:
            return OrderedDict((
                *self.q_model.named_parameters(),
                *(self.twin_q_model.named_parameters() if self.twin_q_model else [])
            ))
        return [*self.q_model.parameters(),
                *(self.twin_q_model.parameters() if self.twin_q_model else [])]

raphaelavalos commented 4 years ago

@duburcqa I have made to change to parameters but I don't think OrderedDict is needed here.

duburcqa commented 4 years ago

@duburcqa I have made to change to parameters but I don't think OrderedDict is needed here.

Yes it is not a big deal in practice.

duburcqa commented 4 years ago

Why not to fix in this PR the clamping issue for the exploration noise I mentioned ? Because it is a more general issue than DDPG alone ?

duburcqa commented 4 years ago

Why not to fix in this PR the clamping issue for the exploration noise I mentioned ? Because it is a more general issue than DDPG alone ?

@sven1977 Thank you for having taken care of this !

With those 2 PRs, I think we are good.

raphaelavalos commented 4 years ago

@sven1977 would it be possible to keep the issue open until #9683 is merged ?

ray-project / ray

[RLlib] Bugs in PyTorch version of DDPG #9667