Closed: typoverflow closed this issue 1 year ago
After some digging I found more differences between the original implementation and CORL. For example, the original implementation squashes the policy output with tanh, while CORL simply clips the output to [-1, 1]. Thanks for your time.
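To make the tanh-vs-clipping point concrete, here is a minimal PyTorch sketch of the two variants as I understand them (the class and argument names are mine, not taken from either codebase):

```python
import torch
import torch.nn as nn


class GaussianActor(nn.Module):
    """Illustrative Gaussian policy head; not copied from either repo."""

    def __init__(self, state_dim: int, action_dim: int, squash_with_tanh: bool):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(action_dim))
        self.squash_with_tanh = squash_with_tanh

    @torch.no_grad()
    def act(self, state: torch.Tensor, deterministic: bool = False) -> torch.Tensor:
        mean = self.net(state)
        if self.squash_with_tanh:
            # Original Jax implementation (as I understand it): the mean is
            # passed through tanh, so it is smoothly bounded in (-1, 1).
            mean = torch.tanh(mean)
        if deterministic:
            action = mean
        else:
            action = torch.distributions.Normal(mean, self.log_std.exp()).sample()
        if not self.squash_with_tanh:
            # CORL-style alternative: leave the mean unbounded and hard-clip
            # the action to the action-space bounds instead.
            action = action.clamp(-1.0, 1.0)
        return action
```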
Hi @typoverflow, nice catch! Since the original implementation of IQL is based on Jax, some discrepancies like these are possible. I'm just wondering whether you've checked the results after making changes (1), (2), and (3).
Not really. I'll see if I can launch some experiments ablating the normalization strategy and (1). For (2), I don't think it matters much in offline settings; and for (3), considering that CORL uses a deterministic policy only for hopper-medium-replay-v2, I'd like to know whether you ablated this choice when benchmarking.
Hello @typoverflow!
Talking about (3): yes, we ablated this choice, because our goal was to provide algorithm implementations and configs that perform as well as they do in the original papers. I agree that using a deterministic policy in some cases seems weird, but without it the algorithm wouldn't achieve the reported score. From my point of view, the core problem here is that our implementations use PyTorch and not Jax, so some of the hyperparameters and design choices may work differently. If you have any suggestions for improving our implementation and you are ready to check them, we are happy to consider them and re-benchmark the improved implementation.
Hi @DT6A, thanks for your explanation. From my recent experiments with CORL, as well as my own implementations of IQL and XQL, I have come to a similar conclusion: keeping the same configs as the original paper fails to achieve the reported scores. I agree that a different choice of autograd framework leads to different preferences for hyperparameters and design choices, and I found that the most significant differences are:
Thanks again, and I will see if I can check the Jax implementation once I fix my Jax dependency issues =( Closing this issue.
Hi, thanks for providing CORL to fairly benchmark offline algorithms. I have some questions about the details. In IQL, the author's implementation normalizes the rewards only (see https://github.com/ikostrikov/implicit_q_learning/blob/09d700248117881a75cb21f0adb95c6c8a694cb2/train_offline.py#L35), whereas CORL normalizes observations for the halfcheetah, hopper, and walker2d tasks and leaves the rewards unchanged. Is it necessary to match CORL's normalization strategy to what was used in the original implementation? Please correct me if there is any misunderstanding =)
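To spell out the two strategies I am comparing, here is a rough NumPy sketch (the function names are mine, the reward scaling is what I understand the linked Jax code to do, and the eps value is illustrative):

```python
import numpy as np


def normalize_rewards_like_original_iql(rewards: np.ndarray, traj_returns: np.ndarray) -> np.ndarray:
    """Reward-only normalization, as I read the linked Jax code: scale rewards
    by 1000 / (best trajectory return - worst trajectory return)."""
    return rewards * 1000.0 / (traj_returns.max() - traj_returns.min())


def normalize_observations_like_corl(states: np.ndarray, eps: float = 1e-3) -> np.ndarray:
    """Observation normalization in the CORL style: standardize each state
    dimension with the dataset mean and std, leaving rewards untouched."""
    mean = states.mean(axis=0)
    std = states.std(axis=0) + eps
    return (states - mean) / std
```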