takuseno / d3rlpy

An offline deep reinforcement learning library
https://takuseno.github.io/d3rlpy
MIT License

[QUESTION] Continuously increasing loss and TD error #387

Closed spsingh37 closed 5 months ago

spsingh37 commented 6 months ago

Hi, I was just getting started with this amazing d3rlpy library and wanted to train a very simple policy with DQN on the CartPole environment. But I'm not sure why the loss and the TD errors (both validation and training) keep increasing. I tried increasing n_steps and n_steps_per_epoch, but with no success. Even if it had been over-fitting, at least the loss and the training TD error should have been decreasing. Can you please help?

Attaching the code and plots below.

import d3rlpy
from d3rlpy.datasets import get_cartpole  # CartPole-v0 dataset
from d3rlpy.dataset import create_infinite_replay_buffer
from d3rlpy.algos import DQNConfig
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split

dataset, env = get_cartpole()

# hold out 20% of the episodes for validation
train_episodes, test_episodes = train_test_split(dataset.episodes, test_size=0.2)
train_dataset = create_infinite_replay_buffer(episodes=train_episodes)

dqn = DQNConfig().create()

# Track validation TD error on the held-out episodes
val_td_scorer = d3rlpy.metrics.TDErrorEvaluator(episodes=test_episodes)
# Track training TD error (evaluated on the training data)
train_td_scorer = d3rlpy.metrics.TDErrorEvaluator()

dqn.fit(
    train_dataset,
    n_steps=100000,
    n_steps_per_epoch=10000,
    evaluators={
        "val_td_scorer": val_td_scorer,
        "train_td_scorer": train_td_scorer,
    },
)

# the log directory name contains a timestamp, so it differs per run
df = pd.read_csv('d3rlpy_logs/DQN_20240418233428/loss.csv', header=None)
df2 = pd.read_csv('d3rlpy_logs/DQN_20240418233428/val_td_scorer.csv', header=None)
df3 = pd.read_csv('d3rlpy_logs/DQN_20240418233428/train_td_scorer.csv', header=None)

# columns are (epoch, step, value); plot value against step
plt.plot(df[1], df[2])
plt.plot(df2[1], df2[2])
plt.plot(df3[1], df3[2])
plt.legend(['loss', 'val_td_scorer', 'train_td_scorer'])
plt.show()

[Plot: cartpole_train_results]

takuseno commented 6 months ago

@SuryaPratapSingh37 Hi, thanks for the issue. Generally speaking, deep RL training doesn't usually converge because of its nonstationary nature, so it is what it is. Also, TD error is not a really good metric; people don't usually use it to measure training progress.

spsingh37 commented 6 months ago

@takuseno Thanks for your reply. Could you please guide me on how exactly I should change the above code to make it converge? I felt CartPole is a pretty simple environment, so at least the loss should have been decreasing (if not overfitting). And secondly, if not the TD error, what else should I use here to examine training (to find out whether it's overfitting or not)?

takuseno commented 6 months ago

Sadly, particularly in offline deep RL, it's very difficult to prevent divergence, so my recommendation is to give up on convergence. Also, in offline RL there are no good metrics to measure policy performance yet. I'd direct you to this documentation and a paper:

Offline deep RL still needs a lot of inventions to make it practical :sweat:
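If you want at least a sanity check of policy quality during training, one option is to roll the greedy policy out in the environment returned by get_cartpole(). This is only a sketch, assuming d3rlpy v2's EnvironmentEvaluator and the variables from your script (env, dqn, train_dataset, and the two TD evaluators); it obviously isn't available in a truly offline setting where no simulator exists:

import d3rlpy

# roll the current greedy policy out in the live environment and report the
# average return; only possible here because get_cartpole() also returns env
env_scorer = d3rlpy.metrics.EnvironmentEvaluator(env)

dqn.fit(
    train_dataset,
    n_steps=100000,
    n_steps_per_epoch=10000,
    evaluators={
        "environment": env_scorer,
        "val_td_scorer": val_td_scorer,
        "train_td_scorer": train_td_scorer,
    },
)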

spsingh37 commented 6 months ago

Ohh... do you know whether the transitions are sampled randomly from the replay buffer during training (and if not, how can I randomly shuffle them)?

takuseno commented 6 months ago

Yes, the mini-batch is uniformly sampled from the buffer.
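Conceptually, each gradient step draws a mini-batch uniformly at random from all stored transitions, so no manual shuffling is needed; the batch size comes from the algorithm config (e.g. batch_size in DQNConfig). A toy sketch of the idea, not d3rlpy's actual internals:

import numpy as np

def sample_uniform_minibatch(transitions, batch_size=32):
    # uniform sampling (with replacement) over the whole buffer
    indices = np.random.randint(0, len(transitions), size=batch_size)
    return [transitions[i] for i in indices]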

takuseno commented 6 months ago

Another suggestion for preventing divergence is to use an offline RL algorithm. Right now it looks like you're using DQN, which is designed for online training. If you use DiscreteCQL instead, you might get better results in the offline setting.
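For example, a minimal change to your script (keeping everything else the same, and assuming the same d3rlpy v2 API used above) would be:

from d3rlpy.algos import DiscreteCQLConfig

# DiscreteCQL adds a conservative penalty on top of DQN-style Q-learning,
# which is meant to keep Q-values from blowing up on out-of-distribution actions
cql = DiscreteCQLConfig().create()
cql.fit(
    train_dataset,
    n_steps=100000,
    n_steps_per_epoch=10000,
    evaluators={
        "val_td_scorer": val_td_scorer,
        "train_td_scorer": train_td_scorer,
    },
)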

takuseno commented 5 months ago

Please let me close this issue since it's simply the nature of offline RL.