tinkoff-ai / CORL

High-quality single-file implementations of SOTA Offline and Offline-to-Online RL algorithms: AWAC, BC, CQL, DT, EDAC, IQL, SAC-N, TD3+BC, LB-SAC, SPOT, Cal-QL, ReBRAC
https://arxiv.org/abs/2210.07105
Apache License 2.0

CORL (Clean Offline Reinforcement Learning)


🧵 CORL is an Offline Reinforcement Learning library that provides high-quality, easy-to-follow single-file implementations of SOTA ORL algorithms. Each implementation is backed by a research-friendly codebase that lets you run or tune thousands of experiments. CORL is heavily inspired by cleanrl, which takes the same approach for online RL; check it out too!


Getting started

git clone https://github.com/tinkoff-ai/CORL.git && cd CORL
pip install -r requirements/requirements_dev.txt

# alternatively, you could use docker
docker build -t <image_name> .
docker run --gpus=all -it --rm --name <container_name> <image_name>

Algorithms Implemented

| Algorithm | Variants Implemented | Wandb Report |
| --- | --- | --- |
| Offline and Offline-to-Online | | |
| Conservative Q-Learning for Offline Reinforcement Learning (CQL) | offline/cql.py, finetune/cql.py | Offline, Offline-to-online |
| Accelerating Online Reinforcement Learning with Offline Datasets (AWAC) | offline/awac.py, finetune/awac.py | Offline, Offline-to-online |
| Offline Reinforcement Learning with Implicit Q-Learning (IQL) | offline/iql.py, finetune/iql.py | Offline, Offline-to-online |
| Offline-to-Online only | | |
| Supported Policy Optimization for Offline Reinforcement Learning (SPOT) | finetune/spot.py | Offline-to-online |
| Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning (Cal-QL) | finetune/cal_ql.py | Offline-to-online |
| Offline only | | |
| Behavioral Cloning (BC) | offline/any_percent_bc.py | Offline |
| Behavioral Cloning-10% (BC-10%) | offline/any_percent_bc.py | Offline |
| A Minimalist Approach to Offline Reinforcement Learning (TD3+BC) | offline/td3_bc.py | Offline |
| Decision Transformer: Reinforcement Learning via Sequence Modeling (DT) | offline/dt.py | Offline |
| Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble (SAC-N) | offline/sac_n.py | Offline |
| Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble (EDAC) | offline/edac.py | Offline |
| Revisiting the Minimalist Approach to Offline Reinforcement Learning (ReBRAC) | offline/rebrac.py | Offline |
| Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch Size (LB-SAC) | offline/lb_sac.py | Offline Gym-MuJoCo |

D4RL Benchmarks

You can check the links above for learning curves and details. Here, we report the reproduced last (final) and best scores. Note that the two differ by a significant margin, and papers do not always state explicitly which reporting methodology they use. If you want to re-collect our results in a more structured or nuanced manner, see results.
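To make the distinction concrete, here is a minimal sketch of how a "last" and a "best" score could be derived from per-seed evaluation curves. The curves are made-up numbers, and the definitions used (last = the seed-averaged score at the final evaluation, best = the maximum of the seed-averaged curve over training) are assumptions for illustration, not necessarily the exact procedure behind the tables.

```python
from statistics import mean, pstdev

# Hypothetical normalized evaluation scores: one list per seed,
# one entry per evaluation checkpoint (made-up numbers, not CORL results).
eval_curves = [
    [10.0, 40.0, 55.0, 50.0],
    [12.0, 44.0, 61.0, 48.0],
    [8.0, 36.0, 58.0, 52.0],
]


def seed_stats(curves):
    """Return a (mean, std) pair across seeds for every checkpoint."""
    per_checkpoint = list(zip(*curves))  # transpose: checkpoint -> seed scores
    return [(mean(c), pstdev(c)) for c in per_checkpoint]


def last_score(curves):
    """Assumed definition: statistics at the final checkpoint."""
    return seed_stats(curves)[-1]


def best_score(curves):
    """Assumed definition: statistics at the checkpoint with the highest mean."""
    return max(seed_stats(curves), key=lambda ms: ms[0])


last_mean, last_std = last_score(eval_curves)
best_mean, best_std = best_score(eval_curves)
print(f"last: {last_mean:.2f} ± {last_std:.2f}")
print(f"best: {best_mean:.2f} ± {best_std:.2f}")
```

Under these definitions the best score can never be lower than the last score, because it picks the most favorable checkpoint. This is one reason the two reporting styles are not directly comparable.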

Offline

Last Scores

Gym-MuJoCo
Task-Name BC BC-10% TD3+BC AWAC CQL IQL ReBRAC SAC-N EDAC DT
halfcheetah-medium-v2 42.40 ± 0.19 42.46 ± 0.70 48.10 ± 0.18 49.46 ± 0.62 47.04 ± 0.22 48.31 ± 0.22 64.04 ± 0.68 68.20 ± 1.28 67.70 ± 1.04 42.20 ± 0.26
halfcheetah-medium-replay-v2 35.66 ± 2.33 23.59 ± 6.95 44.84 ± 0.59 44.70 ± 0.69 45.04 ± 0.27 44.46 ± 0.22 51.18 ± 0.31 60.70 ± 1.01 62.06 ± 1.10 38.91 ± 0.50
halfcheetah-medium-expert-v2 55.95 ± 7.35 90.10 ± 2.45 90.78 ± 6.04 93.62 ± 0.41 95.63 ± 0.42 94.74 ± 0.52 103.80 ± 2.95 98.96 ± 9.31 104.76 ± 0.64 91.55 ± 0.95
hopper-medium-v2 53.51 ± 1.76 55.48 ± 7.30 60.37 ± 3.49 74.45 ± 9.14 59.08 ± 3.77 67.53 ± 3.78 102.29 ± 0.17 40.82 ± 9.91 101.70 ± 0.28 65.10 ± 1.61
hopper-medium-replay-v2 29.81 ± 2.07 70.42 ± 8.66 64.42 ± 21.52 96.39 ± 5.28 95.11 ± 5.27 97.43 ± 6.39 94.98 ± 6.53 100.33 ± 0.78 99.66 ± 0.81 81.77 ± 6.87
hopper-medium-expert-v2 52.30 ± 4.01 111.16 ± 1.03 101.17 ± 9.07 52.73 ± 37.47 99.26 ± 10.91 107.42 ± 7.80 109.45 ± 2.34 101.31 ± 11.63 105.19 ± 10.08 110.44 ± 0.33
walker2d-medium-v2 63.23 ± 16.24 67.34 ± 5.17 82.71 ± 4.78 66.53 ± 26.04 80.75 ± 3.28 80.91 ± 3.17 85.82 ± 0.77 87.47 ± 0.66 93.36 ± 1.38 67.63 ± 2.54
walker2d-medium-replay-v2 21.80 ± 10.15 54.35 ± 6.34 85.62 ± 4.01 82.20 ± 1.05 73.09 ± 13.22 82.15 ± 3.03 84.25 ± 2.25 78.99 ± 0.50 87.10 ± 2.78 59.86 ± 2.73
walker2d-medium-expert-v2 98.96 ± 15.98 108.70 ± 0.25 110.03 ± 0.36 49.41 ± 38.16 109.56 ± 0.39 111.72 ± 0.86 111.86 ± 0.43 114.93 ± 0.41 114.75 ± 0.74 107.11 ± 0.96
locomotion average 50.40 69.29 76.45 67.72 78.28 81.63 89.74 83.52 92.92 73.84
Maze2d
Task-Name BC BC-10% TD3+BC AWAC CQL IQL ReBRAC SAC-N EDAC DT
maze2d-umaze-v1 0.36 ± 8.69 12.18 ± 4.29 29.41 ± 12.31 82.67 ± 28.30 -8.90 ± 6.11 42.11 ± 0.58 106.87 ± 22.16 130.59 ± 16.52 95.26 ± 6.39 18.08 ± 25.42
maze2d-medium-v1 0.79 ± 3.25 14.25 ± 2.33 59.45 ± 36.25 52.88 ± 55.12 86.11 ± 9.68 34.85 ± 2.72 105.11 ± 31.67 88.61 ± 18.72 57.04 ± 3.45 31.71 ± 26.33
maze2d-large-v1 2.26 ± 4.39 11.32 ± 5.10 97.10 ± 25.41 209.13 ± 8.19 23.75 ± 36.70 61.72 ± 3.50 78.33 ± 61.77 204.76 ± 1.19 95.60 ± 22.92 35.66 ± 28.20
maze2d average 1.13 12.58 61.99 114.89 33.65 46.23 96.77 141.32 82.64 28.48
Antmaze
Task-Name BC BC-10% TD3+BC AWAC CQL IQL ReBRAC SAC-N EDAC DT
antmaze-umaze-v2 55.25 ± 4.15 65.75 ± 5.26 70.75 ± 39.18 57.75 ± 10.28 92.75 ± 1.92 77.00 ± 5.52 97.75 ± 1.48 0.00 ± 0.00 0.00 ± 0.00 57.00 ± 9.82
antmaze-umaze-diverse-v2 47.25 ± 4.09 44.00 ± 1.00 44.75 ± 11.61 58.00 ± 7.68 37.25 ± 3.70 54.25 ± 5.54 83.50 ± 7.02 0.00 ± 0.00 0.00 ± 0.00 51.75 ± 0.43
antmaze-medium-play-v2 0.00 ± 0.00 2.00 ± 0.71 0.25 ± 0.43 0.00 ± 0.00 65.75 ± 11.61 65.75 ± 11.71 89.50 ± 3.35 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00
antmaze-medium-diverse-v2 0.75 ± 0.83 5.75 ± 9.39 0.25 ± 0.43 0.00 ± 0.00 67.25 ± 3.56 73.75 ± 5.45 83.50 ± 8.20 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00
antmaze-large-play-v2 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00 20.75 ± 7.26 42.00 ± 4.53 52.25 ± 29.01 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00
antmaze-large-diverse-v2 0.00 ± 0.00 0.75 ± 0.83 0.00 ± 0.00 0.00 ± 0.00 20.50 ± 13.24 30.25 ± 3.63 64.00 ± 5.43 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00
antmaze average 17.21 19.71 19.33 19.29 50.71 57.17 78.42 0.00 0.00 18.12
Adroit
Task-Name BC BC-10% TD3+BC AWAC CQL IQL ReBRAC SAC-N EDAC DT
pen-human-v1 71.03 ± 6.26 26.99 ± 9.60 -3.88 ± 0.21 81.12 ± 13.47 13.71 ± 16.98 78.49 ± 8.21 103.16 ± 8.49 6.86 ± 5.93 5.07 ± 6.16 67.68 ± 5.48
pen-cloned-v1 51.92 ± 15.15 46.67 ± 14.25 5.13 ± 5.28 89.56 ± 15.57 1.04 ± 6.62 83.42 ± 8.19 102.79 ± 7.84 31.35 ± 2.14 12.02 ± 1.75 64.43 ± 1.43
pen-expert-v1 109.65 ± 7.28 114.96 ± 2.96 122.53 ± 21.27 160.37 ± 1.21 -1.41 ± 2.34 128.05 ± 9.21 152.16 ± 6.33 87.11 ± 48.95 -1.55 ± 0.81 116.38 ± 1.27
door-human-v1 2.34 ± 4.00 -0.13 ± 0.07 -0.33 ± 0.01 4.60 ± 1.90 5.53 ± 1.31 3.26 ± 1.83 -0.10 ± 0.01 -0.38 ± 0.00 -0.12 ± 0.13 4.44 ± 0.87
door-cloned-v1 -0.09 ± 0.03 0.29 ± 0.59 -0.34 ± 0.01 0.93 ± 1.66 -0.33 ± 0.01 3.07 ± 1.75 0.06 ± 0.05 -0.33 ± 0.00 2.66 ± 2.31 7.64 ± 3.26
door-expert-v1 105.35 ± 0.09 104.04 ± 1.46 -0.33 ± 0.01 104.85 ± 0.24 -0.32 ± 0.02 106.65 ± 0.25 106.37 ± 0.29 -0.33 ± 0.00 106.29 ± 1.73 104.87 ± 0.39
hammer-human-v1 3.03 ± 3.39 -0.19 ± 0.02 1.02 ± 0.24 3.37 ± 1.93 0.14 ± 0.11 1.79 ± 0.80 0.24 ± 0.24 0.24 ± 0.00 0.28 ± 0.18 1.28 ± 0.15
hammer-cloned-v1 0.55 ± 0.16 0.12 ± 0.08 0.25 ± 0.01 0.21 ± 0.24 0.30 ± 0.01 1.50 ± 0.69 5.00 ± 3.75 0.14 ± 0.09 0.19 ± 0.07 1.82 ± 0.55
hammer-expert-v1 126.78 ± 0.64 121.75 ± 7.67 3.11 ± 0.03 127.06 ± 0.29 0.26 ± 0.01 128.68 ± 0.33 133.62 ± 0.27 25.13 ± 43.25 28.52 ± 49.00 117.45 ± 6.65
relocate-human-v1 0.04 ± 0.03 -0.14 ± 0.08 -0.29 ± 0.01 0.05 ± 0.03 0.06 ± 0.03 0.12 ± 0.04 0.16 ± 0.30 -0.31 ± 0.01 -0.17 ± 0.17 0.05 ± 0.01
relocate-cloned-v1 -0.06 ± 0.01 -0.00 ± 0.02 -0.30 ± 0.01 -0.04 ± 0.04 -0.29 ± 0.01 0.04 ± 0.01 1.66 ± 2.59 -0.01 ± 0.10 0.17 ± 0.35 0.16 ± 0.09
relocate-expert-v1 107.58 ± 1.20 97.90 ± 5.21 -1.73 ± 0.96 108.87 ± 0.85 -0.30 ± 0.02 106.11 ± 4.02 107.52 ± 2.28 -0.36 ± 0.00 71.94 ± 18.37 104.28 ± 0.42
adroit average 48.18 42.69 10.40 56.75 1.53 53.43 59.39 12.43 18.78 49.21
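The average rows appear to be plain arithmetic means over the tasks in each domain. As a quick sanity check, the snippet below recomputes the antmaze average for BC and IQL from the Last Scores rows above. The variable names are illustrative only.

```python
from statistics import mean

# Last-score means copied from the Antmaze table above, in row order:
# umaze, umaze-diverse, medium-play, medium-diverse, large-play, large-diverse.
antmaze_last = {
    "BC": [55.25, 47.25, 0.00, 0.75, 0.00, 0.00],
    "IQL": [77.00, 54.25, 65.75, 73.75, 42.00, 30.25],
}

# Unweighted mean over tasks, rounded to two decimals like the table.
averages = {algo: round(mean(scores), 2) for algo, scores in antmaze_last.items()}
print(averages)
```

Both values agree with the "antmaze average" row (17.21 for BC and 57.17 for IQL).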

Best Scores

Gym-MuJoCo
Task-Name BC BC-10% TD3+BC AWAC CQL IQL ReBRAC SAC-N EDAC DT
halfcheetah-medium-v2 43.60 ± 0.14 43.90 ± 0.13 48.93 ± 0.11 50.06 ± 0.50 47.62 ± 0.03 48.84 ± 0.07 65.62 ± 0.46 72.21 ± 0.31 69.72 ± 0.92 42.73 ± 0.10
halfcheetah-medium-replay-v2 40.52 ± 0.19 42.27 ± 0.46 45.84 ± 0.26 46.35 ± 0.29 46.43 ± 0.19 45.35 ± 0.08 52.22 ± 0.31 67.29 ± 0.34 66.55 ± 1.05 40.31 ± 0.28
halfcheetah-medium-expert-v2 79.69 ± 3.10 94.11 ± 0.22 96.59 ± 0.87 96.11 ± 0.37 97.04 ± 0.17 95.38 ± 0.17 108.89 ± 1.20 111.73 ± 0.47 110.62 ± 1.04 93.40 ± 0.21
hopper-medium-v2 69.04 ± 2.90 73.84 ± 0.37 70.44 ± 1.18 97.90 ± 0.56 70.80 ± 1.98 80.46 ± 3.09 103.19 ± 0.16 101.79 ± 0.20 103.26 ± 0.14 69.42 ± 3.64
hopper-medium-replay-v2 68.88 ± 10.33 90.57 ± 2.07 98.12 ± 1.16 100.91 ± 1.50 101.63 ± 0.55 102.69 ± 0.96 102.57 ± 0.45 103.83 ± 0.53 103.28 ± 0.49 88.74 ± 3.02
hopper-medium-expert-v2 90.63 ± 10.98 113.13 ± 0.16 113.22 ± 0.43 103.82 ± 12.81 112.84 ± 0.66 113.18 ± 0.38 113.16 ± 0.43 111.24 ± 0.15 111.80 ± 0.11 111.18 ± 0.21
walker2d-medium-v2 80.64 ± 0.91 82.05 ± 0.93 86.91 ± 0.28 83.37 ± 2.82 84.77 ± 0.20 87.58 ± 0.48 87.79 ± 0.19 90.17 ± 0.54 95.78 ± 1.07 74.70 ± 0.56
walker2d-medium-replay-v2 48.41 ± 7.61 76.09 ± 0.40 91.17 ± 0.72 86.51 ± 1.15 89.39 ± 0.88 89.94 ± 0.93 91.11 ± 0.63 85.18 ± 1.63 89.69 ± 1.39 68.22 ± 1.20
walker2d-medium-expert-v2 109.95 ± 0.62 109.90 ± 0.09 112.21 ± 0.06 108.28 ± 9.45 111.63 ± 0.38 113.06 ± 0.53 112.49 ± 0.18 116.93 ± 0.42 116.52 ± 0.75 108.71 ± 0.34
locomotion average 70.15 80.65 84.83 85.92 84.68 86.28 93.00 95.60 96.36 77.49
Maze2d
Task-Name BC BC-10% TD3+BC AWAC CQL IQL ReBRAC SAC-N EDAC DT
maze2d-umaze-v1 16.09 ± 0.87 22.49 ± 1.52 99.33 ± 16.16 136.61 ± 11.65 92.05 ± 13.66 50.92 ± 4.23 162.28 ± 1.79 153.12 ± 6.49 149.88 ± 1.97 63.83 ± 17.35
maze2d-medium-v1 19.16 ± 1.24 27.64 ± 1.87 150.93 ± 3.89 131.50 ± 25.38 128.66 ± 5.44 122.69 ± 30.00 150.12 ± 4.48 93.80 ± 14.66 154.41 ± 1.58 68.14 ± 12.25
maze2d-large-v1 20.75 ± 6.66 41.83 ± 3.64 197.64 ± 5.26 227.93 ± 1.90 157.51 ± 7.32 162.25 ± 44.18 197.55 ± 5.82 207.51 ± 0.96 182.52 ± 2.68 50.25 ± 19.34
maze2d average 18.67 30.65 149.30 165.35 126.07 111.95 169.98 151.48 162.27 60.74
Antmaze
Task-Name BC BC-10% TD3+BC AWAC CQL IQL ReBRAC SAC-N EDAC DT
antmaze-umaze-v2 68.50 ± 2.29 77.50 ± 1.50 98.50 ± 0.87 78.75 ± 6.76 94.75 ± 0.83 84.00 ± 4.06 100.00 ± 0.00 0.00 ± 0.00 42.50 ± 28.61 64.50 ± 2.06
antmaze-umaze-diverse-v2 64.75 ± 4.32 63.50 ± 2.18 71.25 ± 5.76 88.25 ± 2.17 53.75 ± 2.05 79.50 ± 3.35 96.75 ± 2.28 0.00 ± 0.00 0.00 ± 0.00 60.50 ± 2.29
antmaze-medium-play-v2 4.50 ± 1.12 6.25 ± 2.38 3.75 ± 1.30 27.50 ± 9.39 80.50 ± 3.35 78.50 ± 3.84 93.50 ± 2.60 0.00 ± 0.00 0.00 ± 0.00 0.75 ± 0.43
antmaze-medium-diverse-v2 4.75 ± 1.09 16.50 ± 5.59 5.50 ± 1.50 33.25 ± 16.81 71.00 ± 4.53 83.50 ± 1.80 91.75 ± 2.05 0.00 ± 0.00 0.00 ± 0.00 0.50 ± 0.50
antmaze-large-play-v2 0.50 ± 0.50 13.50 ± 9.76 1.25 ± 0.43 1.00 ± 0.71 34.75 ± 5.85 53.50 ± 2.50 68.75 ± 13.90 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00
antmaze-large-diverse-v2 0.75 ± 0.43 6.25 ± 1.79 0.25 ± 0.43 0.50 ± 0.50 36.25 ± 3.34 53.00 ± 3.00 69.50 ± 7.26 0.00 ± 0.00 0.00 ± 0.00 0.00 ± 0.00
antmaze average 23.96 30.58 30.08 38.21 61.83 72.00 86.71 0.00 7.08 21.04
Adroit
Task-Name BC BC-10% TD3+BC AWAC CQL IQL ReBRAC SAC-N EDAC DT
pen-human-v1 99.69 ± 7.45 59.89 ± 8.03 9.95 ± 8.19 121.05 ± 5.47 58.91 ± 1.81 106.15 ± 10.28 127.28 ± 3.22 56.48 ± 7.17 35.84 ± 10.57 77.83 ± 2.30
pen-cloned-v1 99.14 ± 12.27 83.62 ± 11.75 52.66 ± 6.33 129.66 ± 1.27 14.74 ± 2.31 114.05 ± 4.78 128.64 ± 7.15 52.69 ± 5.30 26.90 ± 7.85 71.17 ± 2.70
pen-expert-v1 128.77 ± 5.88 134.36 ± 3.16 142.83 ± 7.72 162.69 ± 0.23 14.86 ± 4.07 140.01 ± 6.36 157.62 ± 0.26 116.43 ± 40.26 36.04 ± 4.60 119.49 ± 2.31
door-human-v1 9.41 ± 4.55 7.00 ± 6.77 -0.11 ± 0.06 19.28 ± 1.46 13.28 ± 2.77 13.52 ± 1.22 0.27 ± 0.43 -0.10 ± 0.06 2.51 ± 2.26 7.36 ± 1.24
door-cloned-v1 3.40 ± 0.95 10.37 ± 4.09 -0.20 ± 0.11 12.61 ± 0.60 -0.08 ± 0.13 9.02 ± 1.47 7.73 ± 6.80 -0.21 ± 0.10 20.36 ± 1.11 11.18 ± 0.96
door-expert-v1 105.84 ± 0.23 105.92 ± 0.24 4.49 ± 7.39 106.77 ± 0.24 59.47 ± 25.04 107.29 ± 0.37 106.78 ± 0.04 0.05 ± 0.02 109.22 ± 0.24 105.49 ± 0.09
hammer-human-v1 12.61 ± 4.87 6.23 ± 4.79 2.38 ± 0.14 22.03 ± 8.13 0.30 ± 0.05 6.86 ± 2.38 1.18 ± 0.15 0.25 ± 0.00 3.49 ± 2.17 1.68 ± 0.11
hammer-cloned-v1 8.90 ± 4.04 8.72 ± 3.28 0.96 ± 0.30 14.67 ± 1.94 0.32 ± 0.03 11.63 ± 1.70 48.16 ± 6.20 12.67 ± 15.02 0.27 ± 0.01 2.74 ± 0.22
hammer-expert-v1 127.89 ± 0.57 128.15 ± 0.66 33.31 ± 47.65 129.66 ± 0.33 0.93 ± 1.12 129.76 ± 0.37 134.74 ± 0.30 91.74 ± 47.77 69.44 ± 47.00 127.39 ± 0.10
relocate-human-v1 0.59 ± 0.27 0.16 ± 0.14 -0.29 ± 0.01 2.09 ± 0.76 1.03 ± 0.20 1.22 ± 0.28 3.70 ± 2.34 -0.18 ± 0.14 0.05 ± 0.02 0.08 ± 0.02
relocate-cloned-v1 0.45 ± 0.31 0.74 ± 0.45 -0.02 ± 0.04 0.94 ± 0.68 -0.07 ± 0.02 1.78 ± 0.70 9.25 ± 2.56 0.10 ± 0.04 4.11 ± 1.39 0.34 ± 0.09
relocate-expert-v1 110.31 ± 0.36 109.77 ± 0.60 0.23 ± 0.27 111.56 ± 0.17 0.03 ± 0.10 110.12 ± 0.82 111.14 ± 0.23 -0.07 ± 0.08 98.32 ± 3.75 106.49 ± 0.30
adroit average 58.92 54.58 20.51 69.42 13.65 62.62 69.71 27.49 33.88 52.60

Offline-to-Online

Scores

Task-Name AWAC CQL IQL SPOT Cal-QL
antmaze-umaze-v2 52.75 ± 8.67 → 98.75 ± 1.09 94.00 ± 1.58 → 99.50 ± 0.87 77.00 ± 0.71 → 96.50 ± 1.12 91.00 ± 2.55 → 99.50 ± 0.50 76.75 ± 7.53 → 99.75 ± 0.43
antmaze-umaze-diverse-v2 56.00 ± 2.74 → 0.00 ± 0.00 9.50 ± 9.91 → 99.00 ± 1.22 59.50 ± 9.55 → 63.75 ± 25.02 36.25 ± 2.17 → 95.00 ± 3.67 32.00 ± 27.79 → 98.50 ± 1.12
antmaze-medium-play-v2 0.00 ± 0.00 → 0.00 ± 0.00 59.00 ± 11.18 → 97.75 ± 1.30 71.75 ± 2.95 → 89.75 ± 1.09 67.25 ± 10.47 → 97.25 ± 1.30 71.75 ± 3.27 → 98.75 ± 1.64
antmaze-medium-diverse-v2 0.00 ± 0.00 → 0.00 ± 0.00 63.50 ± 6.84 → 97.25 ± 1.92 64.25 ± 1.92 → 92.25 ± 2.86 73.75 ± 7.29 → 94.50 ± 1.66 62.00 ± 4.30 → 98.25 ± 1.48
antmaze-large-play-v2 0.00 ± 0.00 → 0.00 ± 0.00 28.75 ± 7.76 → 88.25 ± 2.28 38.50 ± 8.73 → 64.50 ± 17.04 31.50 ± 12.58 → 87.00 ± 3.24 31.75 ± 8.87 → 97.25 ± 1.79
antmaze-large-diverse-v2 0.00 ± 0.00 → 0.00 ± 0.00 35.50 ± 3.64 → 91.75 ± 3.96 26.75 ± 3.77 → 64.25 ± 4.15 17.50 ± 7.26 → 81.00 ± 14.14 44.00 ± 8.69 → 91.50 ± 3.91
antmaze average 18.12 → 16.46 48.38 → 95.58 56.29 → 78.50 52.88 → 92.38 53.04 → 97.33
pen-cloned-v1 88.66 ± 15.10 → 86.82 ± 11.12 -2.76 ± 0.08 → -1.28 ± 2.16 84.19 ± 3.96 → 102.02 ± 20.75 6.19 ± 5.21 → 43.63 ± 20.09 -2.66 ± 0.04 → -2.68 ± 0.12
door-cloned-v1 0.93 ± 1.66 → 0.01 ± 0.00 -0.33 ± 0.01 → -0.33 ± 0.01 1.19 ± 0.93 → 20.34 ± 9.32 -0.21 ± 0.14 → 0.02 ± 0.31 -0.33 ± 0.01 → -0.33 ± 0.01
hammer-cloned-v1 1.80 ± 3.01 → 0.24 ± 0.04 0.56 ± 0.55 → 2.85 ± 4.81 1.35 ± 0.32 → 57.27 ± 28.49 3.97 ± 6.39 → 3.73 ± 4.99 0.25 ± 0.04 → 0.17 ± 0.17
relocate-cloned-v1 -0.04 ± 0.04 → -0.04 ± 0.01 -0.33 ± 0.01 → -0.33 ± 0.01 0.04 ± 0.04 → 0.32 ± 0.38 -0.24 ± 0.01 → -0.15 ± 0.05 -0.31 ± 0.05 → -0.31 ± 0.04
adroit average 22.84 → 21.76 -0.72 → 0.22 21.69 → 44.99 2.43 → 11.81 -0.76 → -0.79
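Each cell in the table above has the form "a → b". A reasonable reading, stated here as an assumption, is that a is the score after offline pretraining and b is the score after online fine-tuning. Under that reading, the sketch below computes the change per algorithm from the antmaze average row.

```python
# Antmaze average row from the table above: (before, after) per algorithm.
antmaze_avg = {
    "AWAC": (18.12, 16.46),
    "CQL": (48.38, 95.58),
    "IQL": (56.29, 78.50),
    "SPOT": (52.88, 92.38),
    "Cal-QL": (53.04, 97.33),
}

# Change in score, assuming "a → b" means offline score → fine-tuned score.
gains = {
    algo: round(after - before, 2)
    for algo, (before, after) in antmaze_avg.items()
}

# Print algorithms from largest to smallest gain.
for algo, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{algo:7s} {gain:+.2f}")
```

On this row CQL shows the largest gain, Cal-QL reaches the highest final average (97.33), and AWAC declines slightly.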

Regrets

Task-Name AWAC CQL IQL SPOT Cal-QL
antmaze-umaze-v2 0.04 ± 0.01 0.02 ± 0.00 0.07 ± 0.00 0.02 ± 0.00 0.01 ± 0.00
antmaze-umaze-diverse-v2 0.88 ± 0.01 0.09 ± 0.01 0.43 ± 0.11 0.22 ± 0.07 0.05 ± 0.01
antmaze-medium-play-v2 1.00 ± 0.00 0.08 ± 0.01 0.09 ± 0.01 0.06 ± 0.00 0.04 ± 0.01
antmaze-medium-diverse-v2 1.00 ± 0.00 0.08 ± 0.00 0.10 ± 0.01 0.05 ± 0.01 0.04 ± 0.01
antmaze-large-play-v2 1.00 ± 0.00 0.21 ± 0.02 0.34 ± 0.05 0.29 ± 0.07 0.13 ± 0.02
antmaze-large-diverse-v2 1.00 ± 0.00 0.21 ± 0.03 0.41 ± 0.03 0.23 ± 0.08 0.13 ± 0.02
antmaze average 0.82 0.11 0.24 0.15 0.07
pen-cloned-v1 0.46 ± 0.02 0.97 ± 0.00 0.37 ± 0.01 0.58 ± 0.02 0.98 ± 0.01
door-cloned-v1 1.00 ± 0.00 1.00 ± 0.00 0.83 ± 0.03 0.99 ± 0.01 1.00 ± 0.00
hammer-cloned-v1 1.00 ± 0.00 1.00 ± 0.00 0.65 ± 0.10 0.98 ± 0.01 1.00 ± 0.00
relocate-cloned-v1 1.00 ± 0.00 1.00 ± 0.00 1.00 ± 0.00 1.00 ± 0.00 1.00 ± 0.00
adroit average 0.86 0.99 0.71 0.89 0.99
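The table does not spell out how regret is computed. One common definition in the offline-to-online literature, assumed here purely for illustration, is the average shortfall from a perfect success rate over all evaluations performed during online fine-tuning. The success rates below are made up.

```python
def cumulative_regret(success_rates):
    """Mean of (1 - success rate) over online evaluations.

    Assumed definition for illustration; success rates are fractions in [0, 1].
    """
    if not success_rates:
        raise ValueError("need at least one evaluation")
    return sum(1.0 - s for s in success_rates) / len(success_rates)


# Hypothetical evaluation curves during online fine-tuning (not CORL data).
fast_learner = [0.80, 0.90, 1.00, 1.00]
never_solves = [0.00, 0.00, 0.00, 0.00]

print(round(cumulative_regret(fast_learner), 3))
print(round(cumulative_regret(never_solves), 3))
```

Under this definition lower is better, and a policy that never succeeds has a regret of 1.00. That is consistent with the AWAC antmaze rows, where the score stays at 0.00 before and after fine-tuning and the reported regret is 1.00.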

Citing CORL

If you use CORL in your work, please cite it using the following BibTeX entry:

@inproceedings{
tarasov2022corl,
  title={{CORL}: Research-oriented Deep Offline Reinforcement Learning Library},
  author={Denis Tarasov and Alexander Nikulin and Dmitry Akimov and Vladislav Kurenkov and Sergey Kolesnikov},
  booktitle={3rd Offline RL Workshop: Offline RL as a ''Launchpad''},
  year={2022},
  url={https://openreview.net/forum?id=SyAS49bBcv}
}