vwxyzjn / cleanrl

High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG)
http://docs.cleanrl.dev

Add PPO + Transformer-XL #459

Closed MarcoMeter closed 2 weeks ago

MarcoMeter commented 5 months ago

Description

Implementation of PPO with Transformer-XL as episodic memory. Based on this repo and paper.


vercel[bot] commented 5 months ago

The latest updates on your projects.

| Name | Status | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| cleanrl | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Sep 18, 2024 4:49am |

MarcoMeter commented 5 months ago

pre-commit

pre-commit fails because of two "obsolete" (seemingly unused) imports: memory_gym and PoMEnv. Without those imports, the environments are not registered with gymnasium.
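The reason the linter is wrong here is that the registration happens as an import side effect. A minimal stand-in sketch of the mechanism (the `REGISTRY` dict and `register` helper below are illustrative stand-ins, not gymnasium's actual internals); the conventional fix is a `# noqa: F401` comment on the import:

```python
# Stand-in for gymnasium's internal registry, to show why an "unused"
# import is still load-bearing: registration runs at import time.
REGISTRY = {}

def register(env_id: str, entry_point: str) -> None:
    """Record an environment under its id (stand-in for gymnasium.register)."""
    REGISTRY[env_id] = entry_point

# A module like pom_env.py calls register(...) at the top level, so merely
# importing the module populates the registry:
register("ProofofMemory-v0", "pom_env:PoMEnv")

# In the single-file script, the pre-commit-friendly form would be:
#   import pom_env  # noqa: F401  -- imported only for its side effect
```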

enjoy.py

I added a script to load a trained model and then watch an episode.
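Such a script typically follows a load-then-rollout skeleton. This is a hedged sketch of that skeleton only; `DummyEnv` and the greedy lambda policy are placeholders, not the PR's actual agent, checkpoint loading, or Transformer-XL memory handling:

```python
class DummyEnv:
    """Placeholder environment so the rollout loop is runnable;
    the real script would construct the gymnasium environment instead."""
    def __init__(self, horizon: int = 5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0  # observation

    def step(self, action):
        self.t += 1
        reward = 1.0
        done = self.t >= self.horizon
        return 0.0, reward, done

def enjoy(env, policy):
    """Roll out one episode with a loaded policy; return the episodic return."""
    obs = env.reset()
    episodic_return, done = 0.0, False
    while not done:
        action = policy(obs)  # real script: query the trained agent (plus memory)
        obs, reward, done = env.step(action)
        episodic_return += reward
    return episodic_return

episodic_return = enjoy(DummyEnv(), policy=lambda obs: 0)
```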

ProofofMemory-v0 and MiniGrid-MemoryS9-v0

These environments require memory and converge quickly, which is why I included them initially. MemoryGym environments take more time and resources (especially GPU memory, due to Transformer-XL's cached hidden states).
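The GPU-memory cost of the cached hidden states can be estimated with a back-of-envelope calculation: one hidden vector per layer, per cached timestep, per parallel environment. The configuration values below are illustrative assumptions, not the PR's actual hyperparameters:

```python
def trxl_cache_bytes(num_layers: int, memory_length: int, hidden_dim: int,
                     num_envs: int, bytes_per_elem: int = 4) -> int:
    """Approximate size of a Transformer-XL hidden-state cache in bytes:
    layers x cached timesteps x hidden dim x parallel envs x element size."""
    return num_layers * memory_length * hidden_dim * num_envs * bytes_per_elem

# Illustrative configuration (assumed, not taken from the PR). Note this is
# the live cache only; storing per-sample memories in the rollout buffer
# multiplies the footprint further.
cache_gib = trxl_cache_bytes(num_layers=3, memory_length=119,
                             hidden_dim=384, num_envs=32) / 2**30
```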

TODO

I still have to run the benchmarks and write the documentation. Besides that, the single-file implementation is basically done. I tried to stay close to ppo_atari_lstm.py.
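Like ppo_atari_lstm.py carries LSTM state across steps, Transformer-XL attends over a sliding window of cached hidden states. A minimal sketch of the window indexing (a sketch of the general technique, not this file's exact code):

```python
def memory_window_indices(t: int, memory_length: int) -> list:
    """Indices of cached timesteps that step t may attend to:
    the most recent `memory_length` steps before t."""
    start = max(0, t - memory_length)
    return list(range(start, t))

# Early in an episode the window is short, then it slides:
# memory_window_indices(2, 4)  -> [0, 1]
# memory_window_indices(10, 4) -> [6, 7, 8, 9]
```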

roger-creus commented 2 months ago

Hey! This looks pretty impressive! Just curious, what is the state of this PR?

MarcoMeter commented 2 months ago

Hi @roger-creus, the benchmarks just completed. The next step is to prepare the reports and then write the docs.

roger-creus commented 2 months ago

Nice! Looking forward to the results

MarcoMeter commented 2 months ago

It reproduces the results of my paper: https://arxiv.org/abs/2309.17207

and this is the original implementation: https://github.com/MarcoMeter/neroRL

roger-creus commented 2 months ago

I'm curious about how it performs in other environments (e.g. Atari).

MarcoMeter commented 3 weeks ago

IMHO, here are the remaining TODOs of this PR:

@roger-creus I don't have results on Atari.

vwxyzjn commented 2 weeks ago

> Keep or remove the Proof of Memory environment (cleanrl/ppo_trxl/pom_env.py)?

Feel free to keep it.

Do you know why the wandb chart looks like this?

[wandb chart screenshot]
MarcoMeter commented 2 weeks ago

> Do you know why the wandb chart looks like this?

What are you referring to? This is how I created the report:

```bat
@echo off
python -m openrlbenchmark.rlops ^
    --filters "?we=openrlbenchmark&wpn=cleanRL&ceik=env_id&cen=exp_name&metric=episode/r_mean" ^
    "ppo_trxl?cl=PPO-TrXL" ^
    --env-ids MortarMayhem-Grid-v0 MortarMayhem-v0 Endless-MortarMayhem-v0 MysteryPath-Grid-v0 MysteryPath-v0 Endless-MysteryPath-v0 SearingSpotlights-v0 Endless-SearingSpotlights-v0 ^
    --no-check-empty-runs ^
    --pc.ncols 3 ^
    --pc.ncols-legend 3 ^
    --rliable ^
    --rc.score_normalization_method maxmin ^
    --rc.normalized_score_threshold 1.0 ^
    --rc.sample_efficiency_plots ^
    --rc.sample_efficiency_and_walltime_efficiency_method Median ^
    --rc.performance_profile_plots ^
    --rc.aggregate_metrics_plots ^
    --rc.sample_efficiency_num_bootstrap_reps 10 ^
    --rc.performance_profile_num_bootstrap_reps 10 ^
    --rc.interval_estimates_num_bootstrap_reps 10 ^
    --output-filename memgym/compare ^
    --scan-history ^
    --report
```

Thanks for your feedback =)

vwxyzjn commented 2 weeks ago

Oh I meant the error bar (shadow region) is very large for some reason, but it’s fine. I have added you to the list of contributors. Feel free to merge after CI passes.
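Wide shaded regions in rliable-style plots typically come from bootstrapped interval estimates over a small number of seeds (and the report script above uses only 10 bootstrap reps). A minimal sketch of a percentile bootstrap of the mean, with illustrative scores rather than the benchmark's data:

```python
import random

def bootstrap_interval(scores, reps=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI of the mean: resample with replacement,
    record each resample's mean, take the alpha/2 and 1-alpha/2 quantiles."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(reps)
    )
    lo = means[int(reps * alpha / 2)]
    hi = means[int(reps * (1 - alpha / 2)) - 1]
    return lo, hi

# With only a handful of seeds and high run-to-run variance,
# the resulting interval (shaded region) is wide:
lo, hi = bootstrap_interval([0.2, 0.9, 0.4, 0.8, 0.3])
```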

MarcoMeter commented 2 weeks ago

It seems that other reports have this as well, like: https://wandb.ai/openrlbenchmark/cleanrl/reports/CleanRL-PPG-vs-PPO-results--VmlldzoyMDY2NzQ5

MarcoMeter commented 2 weeks ago

I did some refinements:

My last step before merging is to make sure that poetry and the dependencies blend well.

MarcoMeter commented 2 weeks ago

> My last step before merging is to make sure that poetry and the dependencies blend well.

Done.