tinkoff-ai / CORL

High-quality single-file implementations of SOTA Offline and Offline-to-Online RL algorithms: AWAC, BC, CQL, DT, EDAC, IQL, SAC-N, TD3+BC, LB-SAC, SPOT, Cal-QL, ReBRAC
https://arxiv.org/abs/2210.07105
Apache License 2.0

Some questions on reproducibility of IQL #33

Closed: typoverflow closed this issue 1 year ago

typoverflow commented 1 year ago

Hi, thanks for providing CORL for fairly benchmarking offline algorithms. I have some questions about the details. In IQL, the authors' implementation normalizes only the rewards (see https://github.com/ikostrikov/implicit_q_learning/blob/09d700248117881a75cb21f0adb95c6c8a694cb2/train_offline.py#L35), while CORL normalizes the observations for the halfcheetah, hopper, and walker2d tasks and leaves the rewards unchanged. Is it necessary to match CORL's normalization strategy to the one used in the original implementation? Please correct me if there is any misunderstanding =)
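
For concreteness, the two strategies look roughly like this (a minimal sketch based on my reading of both codebases, not verbatim code; the 1000.0 scale and the 1e-3 epsilon are my assumptions about the respective defaults):

```python
import numpy as np

def normalize_rewards_like_original_iql(dataset, scale=1000.0):
    # Reward scaling in the spirit of the original IQL repo: split the data
    # into trajectories, compute each trajectory return, and rescale rewards
    # by scale / (max_return - min_return). The original splits on its fixed
    # done flags; plain terminals are used here for brevity.
    returns, ret = [], 0.0
    for r, d in zip(dataset["rewards"], dataset["terminals"]):
        ret += r
        if d:
            returns.append(ret)
            ret = 0.0
    returns.append(ret)  # last, possibly unfinished, trajectory
    dataset["rewards"] = dataset["rewards"] * scale / (max(returns) - min(returns))
    return dataset

def normalize_states_like_corl(dataset, eps=1e-3):
    # State normalization as in CORL's locomotion configs: z-score the
    # observations with the dataset mean/std and leave the rewards untouched.
    mean = dataset["observations"].mean(0)
    std = dataset["observations"].std(0) + eps
    dataset["observations"] = (dataset["observations"] - mean) / std
    dataset["next_observations"] = (dataset["next_observations"] - mean) / std
    return dataset
```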

typoverflow commented 1 year ago

After some digging I found more differences between the original implementation and CORL:

  1. It seems that the original implementation applies a termination fix when loading the offline dataset (see https://github.com/ikostrikov/implicit_q_learning/blob/09d700248117881a75cb21f0adb95c6c8a694cb2/dataset_utils.py#L85), while CORL does not (a sketch of this, together with (2), follows the list);
  2. The original implementation squashes the policy output with tanh, while CORL simply clips the output to [-1, 1];
  3. Also, I noticed that the original implementation uses a tanh-normal policy everywhere, while CORL uses a deterministic policy for hopper-medium-replay-v2. Could you elaborate on this choice?
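
For reference, here is how I understand (1) and (2); treat this as a rough PyTorch/NumPy sketch of my reading of the two codebases rather than exact code:

```python
import numpy as np
import torch

def compute_dones_with_termination_fix(dataset, atol=1e-6):
    # (1) Termination fix as I understand it from the original dataset_utils:
    # D4RL terminals do not mark timeout boundaries, so trajectory ends are
    # detected by comparing next_observations[i] with observations[i + 1].
    n = dataset["rewards"].shape[0]
    dones_float = np.zeros(n, dtype=np.float32)
    for i in range(n - 1):
        boundary = np.linalg.norm(
            dataset["observations"][i + 1] - dataset["next_observations"][i]
        ) > atol
        dones_float[i] = 1.0 if boundary or dataset["terminals"][i] == 1.0 else 0.0
    dones_float[-1] = 1.0
    return dones_float

def squash_with_tanh(raw_action: torch.Tensor) -> torch.Tensor:
    # (2a) Original implementation: tanh-squashed policy output.
    return torch.tanh(raw_action)

def clip_action(raw_action: torch.Tensor) -> torch.Tensor:
    # (2b) CORL: hard clipping of the unsquashed policy output.
    return torch.clamp(raw_action, -1.0, 1.0)
```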

Thanks for your time.

Scitator commented 1 year ago

Hi @typoverflow, nice catch! Since the original IQL implementation is based on JAX, some implementation details are bound to differ from our PyTorch version. I'm just wondering if you've checked the results after applying changes (1), (2), and (3).

typoverflow commented 1 year ago

Not really. I'll see if I can launch some experiments ablating the normalization strategy and (1); a rough grid is sketched below. For (2), I don't think it matters much in offline settings; and for (3), considering that CORL only uses a deterministic policy for hopper-medium-replay-v2, I'd like to know whether you ablated this choice when benchmarking.
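
The kind of ablation grid I have in mind looks roughly like this; `train_iql` and the config keys below are hypothetical placeholders, not CORL's actual entry point:

```python
from itertools import product

def run_ablation(train_iql, base_config):
    # Hypothetical grid over the two points above: the normalization strategy
    # and whether the dataset termination fix (1) is applied.
    for normalization, termination_fix in product(
        ["state", "reward", "none"],
        [True, False],
    ):
        config = dict(
            base_config,
            normalization=normalization,
            termination_fix=termination_fix,
        )
        train_iql(config)
```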

DT6A commented 1 year ago

Hello @typoverflow!

Talking about (3): yes, we ablated this choice, because our goal was to provide algorithm implementations and configs that perform as well as they do in the original papers. I agree that using a deterministic policy in some cases seems weird, but without it the algorithm would not achieve the reported score. From my point of view, the core problem is that our implementations use PyTorch rather than JAX, so some hyperparameters and design choices may work differently. If you have any suggestions for improving our implementation and are ready to check them, we are happy to consider them and re-benchmark the improved implementation.
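
For concreteness, the two policy heads being compared look roughly like this in PyTorch; this is a minimal sketch, not our exact modules (weight initialization, log-std handling, and evaluation-time action selection differ in the real code):

```python
import torch
import torch.nn as nn

class DeterministicPolicy(nn.Module):
    # Outputs a single tanh-bounded action, no sampling.
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim), nn.Tanh(),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


class TanhGaussianPolicy(nn.Module):
    # Outputs a tanh-squashed Gaussian over actions, sampled at rollout time.
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden_dim, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state: torch.Tensor):
        mean = self.mu(self.net(state))
        base = torch.distributions.Normal(mean, self.log_std.exp())
        return torch.distributions.TransformedDistribution(
            base, torch.distributions.TanhTransform(cache_size=1)
        )
```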

typoverflow commented 1 year ago

Hi @DT6A, thanks for your explanation. From my recent experiments with CORL, as well as my own implementations of IQL and XQL, I have come to a similar conclusion: keeping the same configs as in the original paper fails to reach the reported scores. I agree that the choice of autograd framework leads to different preferences for hyperparameters and design choices, and I found that the most significant (or important) differences are:

Thanks again. I will see if I can check the JAX implementation once I fix my dependency issues with JAX =( Closing this issue.