Closed · anh-nn01 closed this issue 4 months ago
Hi @anh-nn01, thanks for reaching out! I don't know exactly what the characteristics of this task are, but a couple of things could make learning tricky:

1. Episode length and discount factor (these may or may not differ from your PPO implementation).
2. Action repeat (off-policy algos often require more temporal abstraction than on-policy algos).
3. Early termination: make sure to either handle it appropriately or disable it (our main branch does not consider termination signals, but the episodic-rl branch does).
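For points (2) and (3), a minimal wrapper sketch along these lines could help (not from the TDMPC-2 codebase; it assumes a Gymnasium-style environment, and the class names and defaults are illustrative):

```python
# Illustrative sketch only -- not part of the TDMPC-2 codebase. Assumes a
# Gymnasium-style environment; class names and defaults are hypothetical.
import gymnasium as gym


class ActionRepeatWrapper(gym.Wrapper):
    """Repeat each action `repeat` times and sum the rewards (point 2)."""

    def __init__(self, env, repeat: int = 2):
        super().__init__(env)
        self.repeat = repeat

    def step(self, action):
        total_reward = 0.0
        for _ in range(self.repeat):
            obs, reward, terminated, truncated, info = self.env.step(action)
            total_reward += reward
            if terminated or truncated:
                break
        return obs, total_reward, terminated, truncated, info


class FixedHorizonWrapper(gym.Wrapper):
    """Run every episode to a fixed length, ignoring early termination (point 3)."""

    def __init__(self, env, max_steps: int = 1000):
        super().__init__(env)
        self.max_steps = max_steps
        self._t = 0

    def reset(self, **kwargs):
        self._t = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self._t += 1
        # Never pass `terminated` through; only truncate at the fixed horizon.
        return obs, reward, False, self._t >= self.max_steps, info
```

The fixed-horizon wrapper matches the fixed-episode-length assumption of our main branch; if you want to keep the termination signal instead, use the episodic-rl branch.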
Hi @nicklashansen, thank you so much for your response! I resolved the issue by (1) clipping the action space to its valid range and (2) reducing the simulation timestep. These changes significantly stabilized the simulation and made learning easier.
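For reference, a rough sketch of those two changes, assuming the env is built from the standard inverted_pendulum.xml and exposes the underlying mujoco.MjModel; the exact timestep value is illustrative:

```python
# Rough sketch of the two fixes, not my exact code; the XML path and timestep
# value are illustrative.
import jax.numpy as jnp
import mujoco
from mujoco import mjx

mj_model = mujoco.MjModel.from_xml_path("inverted_pendulum.xml")

# (2) Reduce the simulation timestep to stabilize the physics.
mj_model.opt.timestep = 0.002  # e.g. halved from the previous value

mjx_model = mjx.put_model(mj_model)
mjx_data = mjx.put_data(mj_model, mujoco.MjData(mj_model))


def clipped_step(mjx_data, action):
    # (1) Clip the action to the actuator control range before stepping.
    low, high = mj_model.actuator_ctrlrange.T
    action = jnp.clip(action, low, high)
    mjx_data = mjx_data.replace(ctrl=action)
    return mjx.step(mjx_model, mjx_data)
```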
I am trying to train TDMPC-2 as a benchmark on my modified JAX-based differentiable MuJoCo environment (MJX). To verify that my environment implementation is correct, I tried training TDMPC-2 on a very simple MJX environment: Inverted Pendulum. However, for some reason, it does not learn well and performs poorly despite the environment's simplicity.
Specifically, after seed pretraining, the model learned decently well at first, reaching a score of 200/1000 after ~30k steps. However, its performance dropped severely afterward and stayed stuck at an average reward of 5-10. I have also trained PPO in the same environment, and it had no trouble converging to a good policy. Therefore, I am fairly confident that the issue TDMPC-2 is having is not caused by the environment implementation.
Could you let me know if you encounter the same issue on the MuJoCo Inverted Pendulum environment? My environment is identical to the original MuJoCo environment in terms of observation space, action space, reward function, and termination conditions. The only key difference is that the environment is differentiable, which I did not make use of when training TDMPC-2.
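For context, this is roughly the kind of check I ran to compare my env against the reference Gymnasium environment; `MJXInvertedPendulum` and the module it is imported from are hypothetical stand-ins for my actual class:

```python
# Quick sanity check that the custom MJX env matches the reference Gymnasium
# env. `my_mjx_envs.MJXInvertedPendulum` is a hypothetical stand-in.
import gymnasium as gym
import numpy as np

from my_mjx_envs import MJXInvertedPendulum  # hypothetical module/class

ref = gym.make("InvertedPendulum-v4")
mjx_env = MJXInvertedPendulum()

# Observation and action spaces should match exactly.
assert ref.observation_space.shape == mjx_env.observation_space.shape
assert np.allclose(ref.action_space.low, mjx_env.action_space.low)
assert np.allclose(ref.action_space.high, mjx_env.action_space.high)

# Reward should be +1 per alive step, as in the reference environment.
obs, _ = mjx_env.reset(seed=0)
obs, reward, terminated, truncated, _ = mjx_env.step(mjx_env.action_space.sample())
assert reward == 1.0
```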