Closed: spsingh37 closed this issue 5 months ago
@SuryaPratapSingh37 Hi, thanks for the issue. Generally speaking, deep RL training does not usually converge because of its nonstationary nature, so it is what it is. Also, the TD error is not a particularly good metric; people don't usually use it to measure training progress.
@takuseno Thanks for your reply. Could you please guide me on exactly how I should change the above code to make it converge? I felt CartPole is a pretty simple environment, so at the very least the loss should have been decreasing (even if it were overfitting). Secondly, if not the TD error, what else should I use here to examine training progress (to find out whether it's overfitting or not)?
Sadly, particularly in offline deep RL, it's very difficult to prevent divergence, so my recommendation is to give up on convergence. Also, in offline RL there are no good metrics yet for measuring policy performance. I'd direct you to this documentation and a paper:
Offline deep RL still needs a lot of invention to make it practical :sweat:
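If it helps, here is a minimal sketch of what I mean by using online rollouts rather than TD error as the training signal. This assumes d3rlpy's v2-style config API (`DQNConfig`, `EnvironmentEvaluator`); the numbers are placeholders, adjust for your version and setup:

```python
import d3rlpy

# CartPole dataset bundled with d3rlpy; also returns the Gym environment.
dataset, env = d3rlpy.datasets.get_cartpole()

dqn = d3rlpy.algos.DQNConfig().create(device="cpu")

# Instead of watching the TD error, roll out the current policy in the
# environment after every epoch; the average return is a more meaningful
# progress signal than the loss.
dqn.fit(
    dataset,
    n_steps=100000,
    n_steps_per_epoch=1000,
    evaluators={"environment": d3rlpy.metrics.EnvironmentEvaluator(env)},
)
```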
Ohh... do you know whether the transitions are sampled randomly from the replay buffer during training (and if not, how can I randomly shuffle the transitions)?
Yes, the mini-batch is uniformly sampled from the buffer.
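Conceptually it is nothing more than this (an illustrative sketch, not d3rlpy's actual implementation):

```python
import random

def sample_minibatch(transitions, batch_size):
    # Every stored transition is drawn with equal probability (with
    # replacement), so no manual shuffling of the dataset is needed.
    return random.choices(transitions, k=batch_size)
```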
Another suggestion to prevent divergence is to use an offline RL algorithm. For now, it looks like you're using DQN, which is designed for online training. If you use DiscreteCQL instead, you might get better results in the offline setting.
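Roughly, the switch is just swapping the algorithm class (again a sketch assuming the v2-style API, with DiscreteCQL's hyperparameters left at their defaults):

```python
import d3rlpy

# Same pipeline as before, only the algorithm changes: DiscreteCQL adds a
# conservative penalty on top of DQN that discourages inflated Q-values on
# actions the offline dataset never takes.
dataset, env = d3rlpy.datasets.get_cartpole()
cql = d3rlpy.algos.DiscreteCQLConfig().create(device="cpu")
cql.fit(
    dataset,
    n_steps=100000,
    n_steps_per_epoch=1000,
    evaluators={"environment": d3rlpy.metrics.EnvironmentEvaluator(env)},
)
```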
Please let me close this issue since this is simply the nature of offline RL.
Hi, I was just getting started with this amazing d3rlpy library and wanted to train a very simple policy with DQN on the CartPole environment. But I'm not sure why the loss and TD errors (both validation and training) keep increasing. I tried increasing n_steps and n_steps_per_epoch, but with no success. Even if it had been over-fitting, at least the loss and training TD error should have been decreasing. Can you please help?
Attaching the code & plots below