vwxyzjn / cleanrl

High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG)
http://docs.cleanrl.dev

Dreamer v1 / v2 [Model-based RL] #345

Closed dosssman closed 8 months ago

dosssman commented 1 year ago

Description

Types of changes

Checklist:

If you are adding new algorithm variants or your change could result in performance difference, you may need to (re-)run tracked experiments. See https://github.com/vwxyzjn/cleanrl/pull/137 as an example PR.

vwxyzjn commented 1 year ago

FWIW: https://www.reddit.com/r/reinforcementlearning/comments/108t325/dreamv3_mastering_diverse_domains_through_world/

dosssman commented 1 year ago

Preliminary results for Atari Pong and Breakout are available on WANDB (I did not want to clutter the cleanrl project too much), but I will add runs once we have converged on a more or less final structure for the code.

While the code in dreamerv2_atari.py is pretty much done and working, I was planning to ask for a review once I finish up the documentation, which justifies some of the design choices. However, I would appreciate some early feedback, especially on the parts that do not match CleanRL's coding style.

Otherwise, I will try to finish the docs and add the baseline comparison plots ASAP. Thanks.

sai-prasanna commented 1 year ago

@dosssman How can I help? I have experience with Dreamer and have ported a replica of dreamerv2 to torch. I would very much love to have a CleanRL implementation of it (or preferably dreamerv3 for future experimentation).

dosssman commented 1 year ago

@sai-prasanna Thanks a lot for chiming in.

Right now, there is a functional implementation of v1 and v2 here, albeit not necessarily as simple as what CleanRL aims to provide.

Would greatly appreciate another pair of eyes going over it, even if only summarily, to

  1. Point out the parts that are not really CleanRL-ish, due to design choices owing to the MBRL nature of the algorithm
  2. Point out the parts that might be hard to understand

Currently working on the documentation while accounting for the differences in implementation compared to the baseline. For example, this implementation uses Truncated Backpropagation Through Time (TBPTT), which samples batches of sequential trajectories. I went for TBPTT over the default version because the subsequent memory-maze paper showed that TBPTT works better than sampling non-contiguous batches of trajectories for training. Furthermore, it just feels more logical to do.
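For illustration, here is a minimal, self-contained sketch of the TBPTT sampling scheme (the shapes and names below are placeholders, not the actual code in this branch): each batch row keeps reading the same episode in contiguous chunks of length T, and the recurrent state from the previous chunk is carried over (and detached) rather than reset to zeros.

import torch
import torch.nn as nn

B, T, H = 50, 20, 256              # batch size, chunk (truncation) length, deterministic state size
gru = nn.GRUCell(input_size=32, hidden_size=H)

# Stand-in for a replay buffer of long episodes: one long feature sequence per batch row.
episodes = torch.randn(B, 1000, 32)
cursor = 0                         # where each row currently is inside its episode
h = torch.zeros(B, H)              # recurrent state carried across consecutive chunks

for update in range(3):
    chunk = episodes[:, cursor:cursor + T]   # contiguous T-step slice per row
    h = h.detach()                           # gradients are truncated at the chunk boundary
    for t in range(T):                       # sequential RSSM-style unroll over time
        h = gru(chunk[:, t], h)
    # ... world-model / actor-critic losses would be computed from the unrolled states ...
    cursor += T                              # the next update continues where this one stopped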

A rough outline of the documentation that expands a bit more on the design choices is available here. Apologies in advance for the roughness.

Once we can agree on a standard for MBRL methods, or at least Dreamer-type agents, we can hopefully use this as a basis to extend toward Dreamer v3, which feels more like a pack of practical implementation details and well-generalizing hyperparameter sets, without much change to the underlying logic of the algorithm.

Overall, I think Dreamer v1 and v2 are probably better for understanding the different parts of the algorithm than Dreamer v3. The latter is more oriented toward squeezing out performance across tasks using practical / implementation tweaks, which I think are orthogonal to the core theory underlying the algorithm. Of course, it will be a nice addition later, on top of v1 and v2.

Another thing that might be worth doing is adding DMC / Mujoco task support based on dreamer_atari.py. The required changes would essentially be to add support for the continuous control environments themselves, as well as adapting the ActorHead method here. The rest should work out of the box.
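Roughly, that adaptation could look like the sketch below (not the actual ActorHead in the branch, just an illustration of the two distribution heads; feat_dim and action_dim are placeholders): the discrete case keeps a straight-through one-hot categorical, while the continuous case would switch to a tanh-squashed Gaussian.

import torch
import torch.nn as nn
import torch.distributions as td

class ActorHead(nn.Module):
    def __init__(self, feat_dim: int, action_dim: int, discrete: bool = True):
        super().__init__()
        self.discrete = discrete
        out_dim = action_dim if discrete else 2 * action_dim  # logits, or mean and pre-std
        self.net = nn.Sequential(nn.Linear(feat_dim, 400), nn.ELU(), nn.Linear(400, out_dim))

    def forward(self, feat):
        out = self.net(feat)
        if self.discrete:
            # Atari: straight-through one-hot categorical over the action set
            return td.OneHotCategoricalStraightThrough(logits=out)
        # DMC / Mujoco: tanh-squashed Gaussian over action_dim dimensions
        mean, pre_std = out.chunk(2, dim=-1)
        std = nn.functional.softplus(pre_std) + 0.1   # keep the scale strictly positive
        dist = td.TransformedDistribution(td.Normal(mean, std), td.TanhTransform(cache_size=1))
        return td.Independent(dist, 1)                # one log-prob per action vector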

dosssman commented 1 year ago

Another aspect that could benefit from improvement is the slow training speed of the algorithm as is. I did my best to cut out most bottlenecks, but training still takes a long time, mainly owing to the for loop over the batch length for the dynamics estimation by the RSSM (GRU cell).

It would be nice to find a way to either a) improve the training speed in Pytorch (JIT, functorch, etc.?), or b) port it to JAX once we have converged on the final Pytorch version. A last option c) would be to deviate from the original hyperparameters by using a shorter batch length T=20 instead of T=50 (default) to reduce the RNN-related bottleneck while still getting good enough results thanks to TBPTT. I did some preliminary tests on Atari Pong, and using B=50 and T=20 does not seem to hurt that much.
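As a rough illustration of option a), one thing to try is scripting the unroll loop with TorchScript; the module below is a toy stand-in for the deterministic path of the RSSM (not the actual code in this branch), just to show where the JIT would be applied.

import torch
import torch.nn as nn

class RSSMUnroll(nn.Module):
    def __init__(self, in_dim: int = 32, hid_dim: int = 256):
        super().__init__()
        self.cell = nn.GRUCell(in_dim, hid_dim)

    def forward(self, inputs: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # inputs: (T, B, in_dim); this per-step loop is the bottleneck discussed above
        for t in range(inputs.shape[0]):
            h = self.cell(inputs[t], h)
        return h

unroll = torch.jit.script(RSSMUnroll())  # TorchScript can trim some per-step Python overhead
h = unroll(torch.randn(20, 50, 32), torch.zeros(50, 256))  # T=20, B=50 as discussed above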

I think it is important to reduce the training time of this algorithm to make it more affordable to experiment with.

dosssman commented 1 year ago

In case getting the code under this fork / branch running poses some problems, here is how I set up the environment. It is probably best to check out the latest commit of this PR's branch.

For recent GPUs (RTX family and CUDA >= 11.6):

conda create -n cleanrl-mbrl-dreamer python=3.9 -y
conda activate cleanrl-mbrl-dreamer
# Poetry install inside conda
pip install poetry
export PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring
# Atari support
poetry install --with atari
# Depending on the GPU used, overriding Pytorch and CUDA version can help train faster
# Namely, for GTX 1080 Ti and similar use Torch 1.10.2 and CUDA 10.2 instead of 11.3
# conda install pytorch=1.10.2 torchvision torchaudio cudatoolkit=10.2 -c pytorch -y
# Jupyter kernel support
pip install ipykernel
# Video logging support with TensorboardX
pip install moviepy torchinfo

For older GPUs such as GTX 1080 Ti:

# Older Python 3.8 + CUDA 10.2 and Pytorch 1.10.2 for compatibility
conda create -n cleanrl-mbrl-cleanrl-10.2 python=3.8 -y
conda activate cleanrl-mbrl-cleanrl-10.2
# (poetry installed inside conda as in the block above)
poetry install --with atari
conda install pytorch=1.10.2 torchvision torchaudio cudatoolkit=10.2 -c pytorch -y
pip install moviepy

Hope it helps.

sai-prasanna commented 1 year ago

@dosssman Thanks for your well-thought-out, detailed plan of action! I will ease my way into these tasks, starting with a review of the existing code soon. After that I will take a stab at Atari / continuous action space support.

Yep, TBPTT makes sense for long-horizon credit assignment; for short horizons, using zeros as the hidden start state potentially acts as a regularizer. But as long as there is no performance difference, I think TBPTT is the best default.

I agree with your point on Dreamer v3. It's going to be purely a few changes to reward scaling, hyperparameters, and the value function implementation (they use a distributional RL type value network prediction).

sai-prasanna commented 1 year ago

@dosssman Sorry, I couldn't do even the little I planned to. Taking a high-level look at the code, I am not sure if encapsulating the training code in the world model and the actor-critic is "clean-rl"-like. But that's purely from comparing with the model-free algorithms, where there is only a single training block without too many abstractions.

If we are going for a single training code block, we should simplify the world model & actor-critic into dumb torch modules, and extract the train/imagine functions out, either into a single train function or into small pure functions.
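To make the suggestion concrete, something like the following layout (all names here are hypothetical, just to illustrate the shape of a single-block version, not working code):

# Modules only define networks; all learning logic lives in free functions in the single file,
# closer to CleanRL's model-free scripts.

def train_world_model(world_model, optimizer, batch):
    # one gradient step on reconstruction / reward / KL losses; returns posterior states
    ...

def imagine(world_model, actor, start_states, horizon):
    # roll the latent dynamics forward `horizon` steps using the actor's actions
    ...

def train_actor_critic(actor, critic, optimizers, imagined_trajectories):
    # lambda-return targets and policy / value gradient steps on imagined rollouts
    ...

# main training block, analogous to the single loop in the model-free scripts:
# for step in range(total_steps):
#     batch = sample_tbptt_chunk(replay_buffer)
#     posteriors = train_world_model(world_model, wm_optimizer, batch)
#     trajectories = imagine(world_model, actor, posteriors, horizon=15)
#     train_actor_critic(actor, critic, ac_optimizers, trajectories)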

But if you think the original dreamer approach is cleaner, then there isn't much to do.

dosssman commented 1 year ago

@sai-prasanna No worries, I know life gets in the way haha. Thanks a lot for the feedback. Will try to make a draft of the variant with the training and imagine functions as a block. It might well be easier to understand, especially for the actor-critic part.

The current Dreamer-like approach is actually based on some of my research projects, where I need to easily swap different world model and actor-critic types, but this is probably not that relevant in this case.