takuseno / d3rlpy

An offline deep reinforcement learning library
https://takuseno.github.io/d3rlpy
MIT License

Reproduce MOPO results #101

Open jdchang1 opened 3 years ago

jdchang1 commented 3 years ago

Hi @takuseno,

I have been trying to reproduce the MOPO results with your library but have had trouble. I followed your MOPO script in the reproduce directory and experimented with training the dynamics ensemble for much longer, but even after more than 500 epochs I cannot get an evaluation score higher than 7 on Hopper-medium (d4rl). Could you provide any pointers?

One metric that seems suspicious is the SAC critic loss, which appears to blow up during training. Thanks!

takuseno commented 3 years ago

@jdchang1 Hello, thank you for the issue. I have not spent much time checking MOPO's performance recently. However, the d4rl dataset conversion was fixed very recently: https://github.com/takuseno/d3rlpy/commit/8e141c043db7a551875791c2c76db89cc140038f It might be worth trying the same experiment with the latest master branch.

TakuyaHiraoka commented 2 years ago

Hi @takuseno @jdchang1,

The MOPO implementation in d3rlpy does not terminate its model rollouts at terminal states. The original MOPO checks whether a generated state is terminal by calling the termination function of the true environment and stops the rollout at terminal states (lines 408 -- 446 in [1]).

I found that terminating model rollouts this way improves d3rlpy MOPO's performance and brings it closer to the original. (d3rlpy MOPO with a quick patch for the rollout termination, along with its evaluation results, is available at [2].)

[1] https://github.com/tianheyu927/mopo/blob/master/mopo/algorithms/mopo.py
[2] https://drive.google.com/file/d/1GvHWJj3sU1wl7NGxMibee-ZbIOt65Smb/view?usp=sharing
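
For illustration, here is a minimal sketch of the idea (not the actual patch in [2]): a hand-coded Hopper termination rule in the spirit of MOPO's static termination functions, used to cut off model rollouts. The `dynamics.predict` interface and `rollout_step` helper are hypothetical, not d3rlpy API.

```python
import numpy as np

def hopper_is_terminal(next_obs):
    # Hand-coded Hopper termination rule, approximating the static
    # termination function used by the original MOPO (for illustration
    # only, not the exact code from [1] or the patch in [2]).
    height, angle = next_obs[0], next_obs[1]
    healthy = (
        np.all(np.isfinite(next_obs))
        and np.all(np.abs(next_obs[1:]) < 100.0)
        and height > 0.7
        and abs(angle) < 0.2
    )
    return not healthy

def rollout_step(dynamics, obs, action):
    # Hypothetical helper: query a learned dynamics model, then override
    # the terminal flag with the hand-coded check so that rollouts stop
    # at genuinely terminal states.
    next_obs, reward = dynamics.predict(obs, action)  # assumed interface
    return next_obs, reward, hopper_is_terminal(next_obs)
```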

takuseno commented 2 years ago

@TakuyaHiraoka Thanks for the info! I never realized they used such a tricky hack. It doesn't seem very practical in general; I would rather add a classifier trained to estimate terminal flags. This probably explains the COMBO issue too.
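
As a rough sketch of that alternative (assumed PyTorch code, not d3rlpy API): a small binary classifier trained on the offline dataset's (next_observation, terminal) pairs, which could then replace the hand-coded termination check during model rollouts.

```python
import torch
import torch.nn as nn

class TerminalClassifier(nn.Module):
    """Predicts the terminal flag (as a logit) from the next observation."""

    def __init__(self, obs_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):
        return self.net(obs).squeeze(-1)

def train_terminal_classifier(model, observations, terminals, epochs=10, lr=1e-3):
    # observations: (N, obs_dim) next-state array, terminals: (N,) 0/1 flags
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    obs = torch.as_tensor(observations, dtype=torch.float32)
    term = torch.as_tensor(terminals, dtype=torch.float32)
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(obs), term)
        loss.backward()
        opt.step()
    return model
```

One caveat with this approach: terminal transitions are rare in most d4rl datasets, so in practice the training loss would likely need class weighting or oversampling of terminal states.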

IantheChan commented 2 years ago

Actually, the official implementation of MOReL uses the same trick. Maybe using a known termination function is common in model-based offline RL?

ZishunYu commented 2 years ago

I agree with @IantheChan. To my knowledge, several model-based RL works use this trick, and it seems to be quite critical for model-based RL. It also makes sense for robot learning, since checking whether a robot's sensor readings are physically feasible is usually not too difficult.