sebascuri / hucrl

experiments on mujoco Pusher #2

Closed yesiam-png closed 3 years ago

yesiam-png commented 3 years ago

Hi Sebastian, first, thanks for your excellent code and paper! However, the BPTT and Data_Augmentation agents fail to accomplish the Pusher task in simulation and output a very low return, e.g., -416.11. I have only tried these two agents in the Pusher environment, so I am not sure whether I am running them correctly. For the BPTT agent, for example, I run: python exps/mujoco/run.py --environment MBPusher-v0 --agent BPTT --config-file exps/mujoco/config/bptt.yaml

sebascuri commented 3 years ago

Hi Yesiam,

Thanks! I will look into it. Two questions: is that return the last one or the maximum over the runs? Following Chua et al., I plot the accumulated maximum in the paper. Also, could you check MPC too? MPC usually performs best, and there could be some bugs right now in the termination of the simulation.
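For reference, a minimal sketch (plain NumPy, with made-up return values) of the accumulated-maximum metric:

```python
# Running maximum of the episode returns, as plotted in the paper (values are made up).
import numpy as np

episode_returns = np.array([-520.0, -416.1, -455.3, -390.2, -401.7])
accumulated_max = np.maximum.accumulate(episode_returns)
print(accumulated_max)  # [-520.  -416.1 -416.1 -390.2 -390.2]
```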

Thanks, Seb.

yesiam-png commented 3 years ago

Hi Seb, thanks for your reply! The algorithm code seems to be fine; do you mean bugs in the MuJoCo simulation rendering code? Also, I can only run MPC in the MBHopper environment. The MPC agent gets stuck at the end of the 5th epoch in all other environments (and terminates after a long time). Have you encountered this? The machine I am using has 32 GB of memory.

sebascuri commented 3 years ago

Hi Yesiam,

No, I never encountered such problems, but my implementation of MPC is definitely very slow; that is why I suggested using the other algorithms (although I should have checked the Pusher). Maybe you can try a smaller num_iter or num_particles in the MPC solver.

I meant bugs in the dimensions of the tensors. Sometimes PyTorch silently expands (broadcasts) tensors instead of raising an error, which yields bad results. As I'm working on other projects, I sometimes modify rllib, which H-UCRL depends on.
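For illustration (this snippet is not from rllib), the kind of silent expansion I mean:

```python
# Broadcasting quietly succeeds instead of raising an error, and the loss ends up
# with the wrong shape, which yields bad results without any crash.
import torch

returns = torch.zeros(128)       # shape (128,)
values = torch.zeros(128, 1)     # shape (128, 1), e.g. an unsqueezed critic output
loss = (returns - values) ** 2   # broadcasts to shape (128, 128) instead of erroring
print(loss.shape)                # torch.Size([128, 128])
```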

Let me know if you have any other questions.

yesiam-png commented 3 years ago

Thanks Seb! Sorry to bother you again, but I still have some questions.

  1. In the MuJoCo experiments, is MPC the agent type that you report in the paper, e.g., Figs. 3 and 4? Could you suggest the settings for reproducing the MuJoCo results?
  2. After changing num_iter and num_particles, the MPC agent still gets stuck in the 5th episode. By reducing the planning horizon from 50 to a smaller number, e.g., 30, the training can continue. However, the return is still low, e.g., -350 for Pusher. I don't know whether horizon=50 matters.
  3. For the MPO algorithm, why are the KL divergence constraints ignored in the M-step? The corresponding code is in your rllib repo, rllib/algorithms/mpo.py, line 90.
sebascuri commented 3 years ago

Hi Yesiam,

  1. It is the MPC agent, but I'm pretty sure the other agents should work too. I used the same settings as in Chua et al. for it! Their implementation may be more efficient; you could use it instead without having to change much.
  2. What do you mean by "it gets stuck"? Is it too slow, or does it crash?
  3. I don't ignore them! In line 280 of https://github.com/sebascuri/rllib/blob/master/rllib/algorithms/abstract_algorithm.py, the kl_loss is computed. Then, in line 318 of the same file, I add all the losses together (see the schematic sketch below).
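Schematically (this is not the actual rllib code; the names and multipliers are illustrative), the pattern is to keep the KL terms as penalties inside the single scalar loss that gets optimized:

```python
# Schematic sketch only, not rllib's implementation: the M-step KL constraints are
# handled as penalty terms added into the combined loss, with illustrative
# multipliers eta_mean / eta_var, rather than being dropped.
def combined_loss(policy_loss, critic_loss, kl_mean, kl_var, eta_mean=1.0, eta_var=1.0):
    """Fold the KL regularizers into one scalar objective."""
    return policy_loss + critic_loss + eta_mean * kl_mean + eta_var * kl_var
```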
yesiam-png commented 3 years ago

Thanks for your clarification! MPC crashes on my PC; I will try your code on a GPU, as well as the PETS code. I'll close this issue since it has been clarified. Thanks, Seb!

yesiam-png commented 3 years ago

Hi Seb, when I run in the MBReacher3d_v0 environment, an error is raised:

_File "hucrl/rllib/policy/nn_policy.py", line 148, in forward state = self._preprocess_state(state) File "hucrl/rllib/policy/nn_policy.py", line 142, in _preprocessstate state = torch.cat((state, goal), dim=-1) TypeError: expected Tensor as element 1 in argument 0, but got numpy.ndarray

When I replace the goal with goal = torch.tensor(self.goal, dtype=torch.float), another error is raised:

_File "hucrl/rllib/util/neural_networks/neural_networks.py", line 153, in forward x = self.hidden_layers(x) File "anaconda3/envs/hucrl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl result = self.forward(*input, *kwargs) File "anaconda3/envs/hucrl/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward input = module(input) File "anaconda3/envs/hucrl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl result = self.forward(input, **kwargs) File "anaconda3/envs/hucrl/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 93, in forward return F.linear(input, self.weight, self.bias) File "anaconda3/envs/hucrl/lib/python3.8/site-packages/torch/nn/functional.py", line 1692, in linear output = input.matmul(weight.t()) RuntimeError: mat1 and mat2 shapes cannot be multiplied (1x20 and 17x200)_

When I print self.goal at https://github.com/sebascuri/rllib/blob/master/rllib/policy/nn_policy.py#L126, it outputs None.
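A minimal reproduction outside the repo (the dimensions are inferred from the traceback: the first layer seems to expect a 17-dim state but receives 17 + 3 = 20 after concatenating the goal):

```python
# Reproducing the shape mismatch: a policy layer built for the state alone cannot
# accept the state concatenated with the 3-d goal.
import torch
import torch.nn as nn

state = torch.zeros(1, 17)
goal = torch.zeros(3)
layer = nn.Linear(17, 200)  # built for the 17-dim state only

x = torch.cat((state, goal.unsqueeze(0)), dim=-1)  # shape (1, 20)
try:
    layer(x)
except RuntimeError as err:
    print(err)  # mat1 and mat2 shapes cannot be multiplied (1x20 and 17x200)
```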

sebascuri commented 3 years ago

Hi yesiam,

Thanks for catching it! I see what the problem is: I changed the goal from a parameter to extra state dimensions. I have already fixed this bug. As a side effect, the model learning is not working as expected, but I'm working on fixing this. I will revert to treating the goal as a parameter asap.

Thanks, Seb.

yesiam-png commented 3 years ago

Thanks Seb! I find that the training return of the H-UCRL agents on the inverted pendulum task is pretty good! But after several days of experiments, I still can't reproduce your results on the MuJoCo tasks with the older version of rllib (the one that raises the error in the Reacher task). For example, with optimistic exploration, BPTT performs as follows: [training-return plot attached]. Is the above result normal? Also, are you running the experiments on CPU only? I didn't find a CUDA option in your rllib code.

sebascuri commented 3 years ago

Hi Yesiam,

Could you try now? Yes, I run on CPU only.

yesiam-png commented 3 years ago

Thanks for your reply, Seb! I still can't get the expected results on the HalfCheetah task (training return < 1000 for the BPTT, DataAugmentation, and MVE agents within 300 episodes). The two repositories I use (hucrl and your rllib) are up to date. I'd really appreciate it if you could check it!

sebascuri commented 3 years ago

Do you still get the same train returns as before? Are you using the default agents?

yesiam-png commented 3 years ago

Yes, the training returns are the same as in the picture above. The command I use, e.g., for the DataAugmentation agent, is: python exps/mujoco/run.py --agent DataAugmentation --env-config-file exps/mujoco/config/envs/half-cheetah.yaml --agent-config-file exps/mujoco/config/agents/data_augmentation.yaml --train-episodes 400. Below is what I get: [training-return plot attached]

sebascuri commented 3 years ago

Hi Yesiam, this is way better than before! Also note that in the paper we plot the maximum cumulative return, following Chua et al. 2018, not the current training return. I think 2500 is already decent performance for HalfCheetah. However, if you want even higher performance, I could only achieve it with MPC.

yesiam-png commented 3 years ago

Thanks for your reply, Seb! Without H-UCRL, the DataAugmentation agent can achieve a return of 5000. Would you suggest some possible modifications to the hucrl_DataAugmentation agent to make them comparable? For the MPC agent, do we need num_samples as large as 500? Currently, I can't get the default MPC settings to run on my PC. After changing num_samples from 500 to 50, the training return is 1300 (and it is still much slower than DataAugmentation or BPTT, even with num_samples=50).

sebascuri commented 3 years ago

Hi Yesiam,

That is for the default action cost, right? Essentially, when \beta = 0, the H-UCRL agent should perform like the expected (non-optimistic) agent, so you can tune \beta if you want to.

I think MPC performs better because, for H-UCRL to work, one needs to solve an optimization problem. MPC does so approximately at every step, whereas DataAugmentation and BPTT only do so partially (they take a few gradient steps rather than fully optimizing the policy). It is possible that expanding the action space from n_action to n_action + n_states means the policy needs more iterations to converge, which could explain this issue; MPC does not have this problem.
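For intuition, here is a minimal sketch (simplified, not the repo code) of the hallucinated dynamics behind the enlarged action space:

```python
# Optimistic (hallucinated) transition, simplified. The policy outputs the true
# action `a` plus a hallucinated action `eta` in [-1, 1]^n_states that steers the
# epistemic uncertainty of the learned model; with beta = 0 this reduces to the
# expected model. `model` is assumed to return (mean, epistemic_std).
import torch

def hallucinated_step(model, state, action, eta, beta=1.0):
    mean, epistemic_std = model(state, action)
    return mean + beta * epistemic_std * torch.clamp(eta, -1.0, 1.0)
```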

Regarding num_samples: more samples means better optimization, since it is a shooting method. You can try 400-500 samples but with a shorter horizon, to see if that works better for you.
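As a rough illustration of the trade-off (a generic random-shooting sketch, not the repo's MPC solver), the cost grows with num_samples * horizon, and more samples give a better approximate optimum:

```python
# Generic random-shooting planner: sample candidate action sequences, roll them out
# through the (batched) model, and execute the first action of the best sequence.
import torch

def random_shooting(dynamics, reward_fn, state, action_dim,
                    num_samples=400, horizon=30):
    actions = torch.rand(num_samples, horizon, action_dim) * 2 - 1  # in [-1, 1]
    states = state.expand(num_samples, -1)
    returns = torch.zeros(num_samples)
    for t in range(horizon):
        returns = returns + reward_fn(states, actions[:, t])
        states = dynamics(states, actions[:, t])
    return actions[torch.argmax(returns), 0]
```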

yesiam-png commented 3 years ago

Thanks for your explanation! I'll try your suggestions!

sebascuri commented 3 years ago

Hi @yesiam-png, I realized that the default model is sometimes too big and, on some computers, it massively slows down the execution of the hidden layers. If you reduce the depth from (200, 200, 200, 200, 200) to (200, 200, 200), performance is not hindered by much and MPC runs faster.
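For example, with a generic PyTorch MLP builder (not the repo's model class), the change amounts to:

```python
# Dropping from five to three 200-unit hidden layers removes two 200x200 matmuls
# per forward pass, which is what speeds up the MPC rollouts.
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=(200, 200, 200)):  # previously (200,) * 5
    layers, last = [], in_dim
    for width in hidden:
        layers += [nn.Linear(last, width), nn.ReLU()]
        last = width
    layers.append(nn.Linear(last, out_dim))
    return nn.Sequential(*layers)
```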