tinkoff-ai / CORL

High-quality single-file implementations of SOTA Offline and Offline-to-Online RL algorithms: AWAC, BC, CQL, DT, EDAC, IQL, SAC-N, TD3+BC, LB-SAC, SPOT, Cal-QL, ReBRAC
https://arxiv.org/abs/2210.07105
Apache License 2.0

Unable to reproduce results on hopper-medium-expert-v2 #38

Closed. odelalleau closed this issue 1 year ago.

odelalleau commented 1 year ago

The README shows a last score of 106.24±6.09 for IQL on hopper-medium-expert-v2, with the following plot found on the linked dashboard: [screenshot: reported learning curve]

However, when I try to repro a similar plot with the latest code, running python algorithms/iql.py --config=configs/iql/hopper/medium_expert_v2.yaml with 5 different seeds, I get something like this, which is clearly worse (last score is around 80): [screenshot: reproduced learning curve]
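
In case it matters, the seeds were launched roughly like this (a minimal sketch; it assumes the script accepts a --seed override on top of the config, which may differ from the exact launcher I used):

```python
# Minimal sketch: run the 5 seeds sequentially (assumes iql.py accepts --seed).
import subprocess

for seed in range(5):
    subprocess.run(
        [
            "python", "algorithms/iql.py",
            "--config=configs/iql/hopper/medium_expert_v2.yaml",
            f"--seed={seed}",
        ],
        check=True,
    )
```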

Any idea where this discrepancy may be coming from?

vkurenkov commented 1 year ago

Hi,

I compared the config in the wandb and the provided one, and they look similar. Did you use the same set of seeds (0, 1, 2, 3, 4)?

odelalleau commented 1 year ago

Yes I used exactly those seeds

vkurenkov commented 1 year ago

Hm, this is worrying. At the moment, there are two possible reasons: (1) we broke something while refactoring the code 😿, or (2) seed instability. For the first one, I could not spot any significant difference so far.

I uploaded the exact same source code we used for running the experiments; maybe you could run it with the parameters saved by the wandb (or check for major differences) and tell us if it helped?

odelalleau commented 1 year ago

> I uploaded the exact same source code we used for running the experiments; maybe you could run it with the parameters saved by the wandb (or check for major differences) and tell us if it helped?

Thanks, can you please also share the exact config you are using? (the new code you shared isn't compatible with the config from the repo -- I could fix it manually but having the true config would be even better)

vkurenkov commented 1 year ago

You can use the config values directly from the wandb.
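
For example, something along these lines should export them (a rough sketch using the public wandb API; the run path below is a placeholder, not the actual run id):

```python
# Rough sketch: dump a wandb run's config back into a local YAML file.
import yaml
import wandb

api = wandb.Api()
run = api.run("<entity>/<project>/<run_id>")  # placeholder run path

# drop wandb's internal keys and keep only the training parameters
config = {k: v for k, v in run.config.items() if not k.startswith("_")}

with open("iql_hopper_medium_expert_v2.yaml", "w") as f:
    yaml.safe_dump(config, f)
```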

odelalleau commented 1 year ago

> You can use the config values directly from the wandb.

Got it, thanks! Experiment is running, I'll report back when I have results.

In the meantime, FYI I noticed a couple of small mistakes in the wandb config you linked:

vkurenkov commented 1 year ago

Yeah, the first one is a typo and does not affect the training (as the filename is only used for checkpoint saving in the older version). For the second one, it is a typo in the first implementation as well (but the config should be fine and match).

Waiting for the results, thank you very much for reporting this.

vkurenkov commented 1 year ago

By the way, another thing that can work out is increasing the batch size. We recently had a paper on how scaling the batch size speeds up the convergence of ensemble-based methods; however, it also works for TD3+BC, IQL, and AWAC (we did not test any others). Overall, large batches stabilize the training procedure for most of the datasets.

For this exact dataset, you can reuse the original config but change the following values (you will also need to add learning_rate as a config parameter and pass it to the optimizers, as sketched at the end of this comment; this is not on the main branch yet):

You should get something like this:

[screenshot: learning curve with the larger batch size]
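
The optimizer part of that change would look roughly like this (an illustrative sketch with stand-in networks, not the exact code from the branch):

```python
# Illustrative sketch: thread a single learning_rate config value into all three
# optimizers used by IQL (actor, Q-function, value function).
import torch
import torch.nn as nn

learning_rate = 3e-4  # read this from the config instead of hard-coding it

# stand-ins for the actor / Q / V networks (hopper: 11-dim state, 3-dim action)
actor = nn.Linear(11, 3)
q_network = nn.Linear(11 + 3, 1)
v_network = nn.Linear(11, 1)

actor_optimizer = torch.optim.Adam(actor.parameters(), lr=learning_rate)
q_optimizer = torch.optim.Adam(q_network.parameters(), lr=learning_rate)
v_optimizer = torch.optim.Adam(v_network.parameters(), lr=learning_rate)
```
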
odelalleau commented 1 year ago

I'll post the final result tomorrow, but 3 seeds in, it doesn't look good (it's actually even worse than what I had in my original post):

[screenshot: learning curve after 3 seeds]

I used the iql.py file you shared, with the following config which should match the W&B run you pointed me to:

```yaml
batch_size: 256
beta: 6.0
device: cuda
discount: 0.99
env: hopper-medium-expert-v2
eval_freq: 5000
file_name: IQL_hopper-medium-expert-v2
iql_determenistic: false
iql_tau: 0.5
load_model: ''
max_timesteps: 1000000
n_seeds: 5
normalize: true
policy: IQL
save_model: false
save_path: "./models/iql"
seed: 0
tau: 0.005
use_wandb: true
```

Thanks for sharing the tip on the batch size, this is definitely interesting and I'll have a look! (though right now my priority is to get it to work as intended with a smaller batch, for comparison with other methods)

odelalleau commented 1 year ago

Final result with 5 seeds: the last two seeds helped a bit, but this is still not as good as your W&B runs:

[screenshot: learning curve with all 5 seeds]

Would you be able to push a branch at commit 9eb4e1f7d97bc6765bf620f229d95fa1a5b02443 by any chance? (this is the commit used according to W&B, but it doesn't seem to be available in the repo right now)

vkurenkov commented 1 year ago

The file I've attached is from that commit. Not sure what the reason could be at the moment (we will probably need to re-run IQL). However, I've checked the installed dependencies in the wandb runs, and there seem to be discrepancies with the current requirements. Several of them could potentially affect the performance:

odelalleau commented 1 year ago

> there seem to be discrepancies with the current requirements. Several of them could potentially affect the performance:

Ok I'll try with the older versions as well.

Could d4rl also be a potential suspect? (I see it's grabbing the master version)

NB: I'm using requirements_dev.txt as per the README

odelalleau commented 1 year ago

Just as an FYI, this is the plot with a 10K batch size and lr=0.0018. Definitely better, though not as good as the one you showed:

[screenshot: learning curve with batch size 10K]

vkurenkov commented 1 year ago

Yep, this looks like an issue with some of the dependencies, as the graph I've shown above was obtained just a week ago. Have you tried running in an isolated environment using our Dockerfile?

odelalleau commented 1 year ago

The re-run with the versions of gym / torch / numpy you mentioned (back to batch size 256 and lr=3e-4) is not 100% done yet, but at this point it's safe to say the issue is not resolved:

[screenshot: learning curve with older dependency versions]

Would you be able to dump here the output of pip list on a working env by any chance? There might be secondary dependencies (e.g., mujoco) that could differ as well.

(I'm afraid I can't easily try a docker image as docker is not supported on my cluster)

vkurenkov commented 1 year ago

Here are the requirements: requirements_lb_iql.txt

odelalleau commented 1 year ago

> Here are the requirements: requirements_lb_iql.txt

Thanks, I set up a new env with these requirements (had to tweak a few things to get it to install on my machine, but nothing major I think). It didn't make any difference (it's still worse).

I wonder if there may be some hardware difference that could impact the results, in spite of using the same seeds. I'll run a different batch of 10 seeds (reverting back to the original code / env) just to see what happens.
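
For context, the seeding is roughly the following (a sketch close to what the training script does; the exact code may differ), and even with identical seeds, results are generally not bit-identical across different GPUs / CUDA builds:

```python
# Sketch of typical seeding in an offline RL training script (names illustrative).
import os
import random

import numpy as np
import torch


def set_seed(seed: int, env=None, deterministic_torch: bool = False):
    # seed the environment and its action space if one is provided (old gym API)
    if env is not None:
        env.seed(seed)
        env.action_space.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.use_deterministic_algorithms(deterministic_torch)
```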

odelalleau commented 1 year ago

> I wonder if there may be some hardware difference that could impact the results, in spite of using the same seeds. I'll run a different batch of 10 seeds (reverting back to the original code / env) just to see what happens.

Still the same (this was with seeds 10 to 19):

[screenshot: learning curve with seeds 10-19]

At this point I'm not sure what to try, besides figuring out a way to run the Docker image. It'd be good if someone else was able to run an independent test to confirm whether the problem is on my end.

typoverflow commented 1 year ago

Hi, I ran IQL from CORL for five seeds some time ago, and the same issue happened for me. Below is the performance curve, and you can see that the main issue is instability. The log is available at https://wandb.ai/gaochenxiao/test_iql.

[screenshots: performance curves of the CORL IQL runs]

However, I tested my own implementation of IQL with CORL's config, and the result is fine:

[screenshot: performance curve of my own IQL implementation]

So I suspect there may be something subtle, for example in the architectural design, where CORL differs from the original implementation. It would be better to check.

odelalleau commented 1 year ago

> However, I tested my own implementation of IQL with CORL's config, and the result is fine

Thanks for confirming! Would this implementation be available publicly by any chance?

typoverflow commented 1 year ago

@odelalleau Sure, the implementation is at https://github.com/typoverflow/OfflineRL-Lib/blob/master/reproduce/iql/run_iql_offline.py

@vkurenkov After a rough check, I found a tiny mismatch between CORL's IQL and the original one: in the Gaussian actor, the mean should be squashed by tanh before being used to construct the multivariate normal. Note that we are squashing the mean before the Normal rather than squashing the sample after the Normal, so there is no need to correct the log-prob. I made this modification to CORL last night and launched one experiment on hopper-medium-expert-v2, and the curve looks much better:

[screenshot: learning curve with the squashed-mean fix]
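
For concreteness, here is a minimal sketch of the squashed-mean actor (assuming PyTorch; the class name and layer sizes are illustrative, not CORL's exact code):

```python
# Minimal sketch: a Gaussian actor that squashes the MEAN with tanh before
# building the distribution. The sample itself is not squashed, so no
# log-prob correction (as in SAC-style tanh-squashed sampling) is needed.
import torch
import torch.nn as nn
from torch.distributions import Normal


class SquashedMeanGaussianActor(nn.Module):
    def __init__(self, state_dim: int, action_dim: int,
                 hidden_dim: int = 256, max_action: float = 1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(action_dim))
        self.max_action = max_action

    def forward(self, state: torch.Tensor) -> Normal:
        mean = torch.tanh(self.net(state))  # squash the mean, not the sample
        std = torch.exp(self.log_std.clamp(-5.0, 2.0))
        # diagonal Gaussian stand-in for the multivariate normal used in CORL
        return Normal(self.max_action * mean, std)
```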

Maybe we can add this trick and re-benchmark IQL?

vkurenkov commented 1 year ago

@typoverflow this is a nice catch, thank you!

Can you make a PR with the fix? I will run a battery of experiments based on it to update the scores

typoverflow commented 1 year ago

Sure, PR is at #41

vkurenkov commented 1 year ago

@odelalleau you can try the IQL from this PR https://github.com/tinkoff-ai/CORL/pull/42

We will rerun all of the datasets within a couple of weeks, but preliminary results suggest that your issue should be resolved. Let us know if it helped