google-research / rliable

[NeurIPS'21 Outstanding Paper] Library for reliable evaluation on RL and ML benchmarks, even with only a handful of seeds.
https://agarwl.github.io/rliable
Apache License 2.0

RAD results may be incorrect. #5

Closed TaoHuang13 closed 2 years ago

TaoHuang13 commented 2 years ago

Hi @agarwl. I found that the 'step' in RAD's 'eval.log' refers to the policy step, while the 'step' in 'xxx--eval_scores.npy' refers to the environment step. We know that environment steps = policy steps * action_repeat.

Here comes the problem: if you take the results at 100k steps from 'eval.log', you are actually evaluating the scores at 100k*action_repeat environment steps. This leads to an overestimation of RAD. I wonder whether you did such an incorrect evaluation, or whether you took the results from 'xxx--eval_scores.npy', which are correct in terms of steps. You may refer to a similar question in https://github.com/MishaLaskin/rad/issues/15.

I reproduced the results of RAD locally and found that my results are much worse than the ones reported in your paper. I list them in the figure below. (attached figure: reproduced RAD results)

I compared the means for each task. There is clearly a huge gap, and my results are close to the ones reported by the DrQ authors (see the table in https://github.com/MishaLaskin/rad/issues/1). I suspect you may have evaluated the scores at incorrect environment steps. Could you please share more details on how RAD was evaluated? Thanks :)

agarwl commented 2 years ago

Hi,

Thanks for using rliable! The RAD scores reported in the paper correspond to the RAD results in the NeurIPS'21 paper Tactical Optimism and Pessimism for Deep Reinforcement Learning (see Table 2). These scores were provided by @jparkerholder and @tedmoskovitz, who confirmed that they used the RAD codebase released by the original authors, the only differences being a batch size of 128 instead of 512 and reporting results over 10 seeds. They also confirmed that these results are much better than the results originally reported in the RAD paper.

(attached image: RAD scores from Table 2 of the TOP paper)

Please let me know if you have any other questions.

TaoHuang13 commented 2 years ago

Thank you @agarwl! I will run the experiments again with a batch size of 128.

TaoHuang13 commented 2 years ago

I found a paper on arXiv, KSL, that re-ran RAD with a batch size of 128. Their results are very close to ours.

Specifically, the RAD results at 100k in the previous Table 2 from TOP correspond to policy steps in KSL's terms. For Walker and Finger, 100k steps in TOP = 200k env steps in KSL. For Cheetah, Cup, and Reacher, 100k steps in TOP = 400k env steps in KSL. For Cartpole, 100k steps in TOP = 800k env steps in KSL.
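
For reference, here is a minimal sketch of that conversion (my own illustration, assuming the standard DMC action_repeat settings of 2 for Walker/Finger, 4 for Cheetah/Cup/Reacher, and 8 for Cartpole):

```python
# Illustrative only: standard action_repeat settings assumed for these DMC tasks.
ACTION_REPEAT = {
    'walker_walk': 2, 'finger_spin': 2,
    'cheetah_run': 4, 'ball_in_cup_catch': 4, 'reacher_easy': 4,
    'cartpole_swingup': 8,
}

def env_steps(policy_steps, task):
    """Environment steps = policy steps * action_repeat."""
    return policy_steps * ACTION_REPEAT[task]

# 100k policy steps on cartpole_swingup correspond to 800k environment steps.
assert env_steps(100_000, 'cartpole_swingup') == 800_000
```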

For further evidence, you can also refer to the appendix of the DrQ paper, where the batch size is also set to 128. We find that simply reducing the batch size degrades performance. Moreover, their 100k results (with batch size 128) are consistent with the results reported in KSL. This further suggests that the RAD results in rliable may be imprecise.

I attached my reproduced results with a batch size of 128 below. Note that "step" here denotes policy steps, so the logs below run up to 100k environment steps (= policy steps * action_repeat).

Reacher {"episode": 0.0, "episode_reward": 40.7, "eval_time": 64.27814173698425, "mean_episode_reward": 40.7, "best_episode_reward": 323.0, "step": 0} {"episode": 20.0, "episode_reward": 1.8, "eval_time": 68.5985255241394, "mean_episode_reward": 1.8, "best_episode_reward": 13.0, "step": 5000} {"episode": 40.0, "episode_reward": 288.4, "eval_time": 70.13117337226868, "mean_episode_reward": 288.4, "best_episode_reward": 983.0, "step": 10000} {"episode": 60.0, "episode_reward": 283.2, "eval_time": 67.94298577308655, "mean_episode_reward": 283.2, "best_episode_reward": 1000.0, "step": 15000} {"episode": 80.0, "episode_reward": 392.7, "eval_time": 68.72386384010315, "mean_episode_reward": 392.7, "best_episode_reward": 976.0, "step": 20000} {"episode": 100.0, "episode_reward": 475.1, "eval_time": 71.93403220176697, "mean_episode_reward": 475.1, "best_episode_reward": 974.0, "step": 25000}

Cartpole {"episode": 0.0, "episode_reward": 27.587172624025794, "eval_time": 72.19608211517334, "mean_episode_reward": 27.587172624025794, "best_episode_reward": 28.264482867147944, "step": 0} {"episode": 20.0, "episode_reward": 151.79273327000502, "eval_time": 57.46160936355591, "mean_episode_reward": 151.79273327000496, "best_episode_reward": 219.5658238098457, "step": 2500} {"episode": 40.0, "episode_reward": 240.08731928402068, "eval_time": 61.189491987228394, "mean_episode_reward": 240.08731928402062, "best_episode_reward": 272.81721446561255, "step": 5000} {"episode": 60.0, "episode_reward": 272.05539358415115, "eval_time": 61.13726568222046, "mean_episode_reward": 272.05539358415115, "best_episode_reward": 347.1847985789354, "step": 7500} {"episode": 80.0, "episode_reward": 267.24770799821965, "eval_time": 61.74809670448303, "mean_episode_reward": 267.24770799821965, "best_episode_reward": 314.2134339353986, "step": 10000} {"episode": 100.0, "episode_reward": 281.9628418673293, "eval_time": 61.60150122642517, "mean_episode_reward": 281.9628418673293, "best_episode_reward": 352.25938864664744, "step": 12500}

Cheetah {"episode": 0.0, "episode_reward": 0.19523336669009633, "eval_time": 138.58351063728333, "mean_episode_reward": 0.19523336669009633, "best_episode_reward": 0.31638607161909527, "step": 0} {"episode": 20.0, "episode_reward": 36.724014494203736, "eval_time": 131.28322434425354, "mean_episode_reward": 36.724014494203736, "best_episode_reward": 91.59489000832528, "step": 5000} {"episode": 40.0, "episode_reward": 131.06965528251175, "eval_time": 125.38234829902649, "mean_episode_reward": 131.06965528251175, "best_episode_reward": 193.58535004428364, "step": 10000} {"episode": 60.0, "episode_reward": 279.8614716839785, "eval_time": 130.90317034721375, "mean_episode_reward": 279.8614716839785, "best_episode_reward": 333.89358538232943, "step": 15000} {"episode": 80.0, "episode_reward": 271.49715969091426, "eval_time": 119.71757483482361, "mean_episode_reward": 271.49715969091426, "best_episode_reward": 371.8324838952295, "step": 20000} {"episode": 100.0, "episode_reward": 305.84320230923447, "eval_time": 126.42067694664001, "mean_episode_reward": 305.84320230923447, "best_episode_reward": 412.6889181170833, "step": 25000}

Cup {"episode": 0.0, "episode_reward": 395.7, "eval_time": 82.90255808830261, "mean_episode_reward": 395.7, "best_episode_reward": 995.0, "step": 0} {"episode": 20.0, "episode_reward": 98.5, "eval_time": 98.97223258018494, "mean_episode_reward": 98.5, "best_episode_reward": 985.0, "step": 5000} {"episode": 40.0, "episode_reward": 98.8, "eval_time": 89.55685305595398, "mean_episode_reward": 98.8, "best_episode_reward": 988.0, "step": 10000} {"episode": 60.0, "episode_reward": 0.0, "eval_time": 83.97820687294006, "mean_episode_reward": 0.0, "best_episode_reward": 0.0, "step": 15000} {"episode": 80.0, "episode_reward": 0.0, "eval_time": 95.45060968399048, "mean_episode_reward": 0.0, "best_episode_reward": 0.0, "step": 20000} {"episode": 100.0, "episode_reward": 396.6, "eval_time": 86.55982041358948, "mean_episode_reward": 396.6, "best_episode_reward": 1000.0, "step": 25000}.

Therefore, I think the current RAD scores are evaluated at policy steps rather than environment steps. This would explain why RAD greatly outperforms DrQ (which I would expect to perform similarly) at 100k on DMC in the paper. If this is the case, do we need to evaluate RAD again with the proper protocol?

agarwl commented 2 years ago

Thanks @TaoHuang13 for the update! This is indeed confusing, as I had previously confirmed with the TOP-RAD authors that their results were reported at 100k and 500k environment steps rather than agent steps. I'd like to wait for their clarification on the discrepancy between their reported RAD results and yours.

In the meantime, if you could provide your raw RAD results for at least 5 seeds (preferably 10) on the 6 DMC tasks at 100k and 500k, that would be great. This will allow me to update the figures / RAD scores in our NeurIPS paper.

Regarding reporting, I'd suggest using the results you can replicate in your setup, together with the protocols in rliable (performance profiles, aggregate metrics with CIs, probability of improvement, etc.).
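
For concreteness, a minimal sketch of that protocol, following the rliable README usage; `score_dict` here is a placeholder for a dict mapping method names to (runs x tasks) arrays of normalized scores at 100k env steps:

```python
import numpy as np
from rliable import library as rly
from rliable import metrics

# score_dict is an assumed input: {'RAD': runs x tasks array, 'DrQ': ..., ...}
# containing normalized per-task scores at 100k environment steps.
aggregate_func = lambda scores: np.array([
    metrics.aggregate_median(scores),
    metrics.aggregate_iqm(scores),
    metrics.aggregate_mean(scores),
    metrics.aggregate_optimality_gap(scores),
])
point_estimates, interval_estimates = rly.get_interval_estimates(
    score_dict, aggregate_func, reps=50000)

# Probability of improvement uses paired score arrays, e.g.:
# pairs = {'TOP-RAD,RAD': (top_rad_scores, rad_scores)}
# probs, prob_cis = rly.get_interval_estimates(
#     pairs, metrics.probability_of_improvement, reps=2000)
```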

TaoHuang13 commented 2 years ago

Sure! I'd be happy to share my raw results with you in a few days, since we have only tested DMC 100k so far. Meanwhile, please let me know if there are any updates about the performance discrepancy :)

tedmoskovitz commented 2 years ago

Hi @TaoHuang13 and @agarwl --

Thanks for bringing this to our attention! We've just looked at the code again, and short answer: we think you're right, and we sincerely apologize to both of you for the confusion.

To briefly explain: we built TOP on top of the original RAD code: https://github.com/tedmoskovitz/TOP/blob/master/dmc/top_train.py. When @agarwl originally asked about environment vs. policy steps, I looked at lines 271 and 322, saw that the environment took a step for each training step, and figured there was a 1:1 correspondence, forgetting that the environment is placed in a wrapper (a part of the original code which we did not modify) that repeats the action. Clearly a case of carelessness on my part -- I take full responsibility and apologize. We will adjust our reported results as soon as possible. On a positive note, TOP-RAD is compared to RAD within exactly the same framework, so their relative performance holds.
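
For readers following along, the step mismatch comes from exactly this kind of wrapper: the training loop counts one "step" per policy action, while a wrapper roughly like the sketch below (my own simplification, not the actual RAD/TOP code) advances the simulator action_repeat times per call:

```python
# Simplified illustration of an action-repeat wrapper; not the actual RAD/TOP code.
class ActionRepeatWrapper:
    def __init__(self, env, action_repeat):
        self._env = env
        self._action_repeat = action_repeat

    def step(self, action):
        # One call from the training loop = `action_repeat` simulator steps,
        # with rewards summed across the repeated actions.
        total_reward = 0.0
        for _ in range(self._action_repeat):
            obs, reward, done, info = self._env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info
```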

Thank you for your understanding, and apologies once again for the confusion.

agarwl commented 2 years ago

Thanks for the clarification, Ted! Since you are going to update the results, can you please also provide the updated scores and upload the raw scores here?

tedmoskovitz commented 2 years ago

Thank you very much for understanding, and absolutely--I'm not at home at the moment but I'll post them by the end of the day.

tedmoskovitz commented 2 years ago

Hi all, so I've attached the raw result files for 0-100k agent steps to this post. It turns out that I logged evaluation every 10k agent steps, so I don't actually have saved results for exactly 12.5k/25k agent steps (aka 100k env steps for walker and finger). The scores for those ranges do seem much more consistent with what you've found, @TaoHuang13. Each file is a 10 x 11 csv, where each row is a run and each run contains the score for 0, 10k, 20k, ...,100k agent steps. I'll have to re-run the code for the other environments to get exactly 100k environment steps, and I'll post those results when I have them, though it may be a few days due to the holidays. Thank you once again for bringing this to our attention!

rad_finger.csv rad_cartpole.csv rad_walker.csv rad_cup.csv rad_reacher.csv rad_cheetah.csv

TaoHuang13 commented 2 years ago

Thanks @tedmoskovitz for the clarification! It is indeed hard to notice the step setting in the original RAD code; we only found it because of our computational limitations, hah. Now there is more room for improving performance. We look forward to your new results. But for now, let's have a good holiday :) Merry Christmas @agarwl @tedmoskovitz~

tedmoskovitz commented 2 years ago

Hi @TaoHuang13 and @agarwl -- Thanks so much for your patience! I'm attaching the results for cartpole, cheetah, cup, and reacher to this comment. Each csv contains a 10 (seeds) x 6 array with eval results for 0, 100k, ..., 500k environment steps. I will post again once we've updated our paper as well. Happy holidays to both of you, and thank you once again! rad0_cartpole.csv rad0_cheetah.csv rad0_cup.csv rad0_reacher.csv

agarwl commented 2 years ago

Thanks @tedmoskovitz! If you can also post the results for walker and finger DMC tasks, that would be great.

tedmoskovitz commented 2 years ago

Right--here they are! Apologies for the delay. In this case, the arrays are 10 x 26, where each row contains the eval results for 0, 20k, 40k, ..., 100k, ..., 500k environment steps. rad0_finger.csv rad0_walker.csv
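
To illustrate how these files line up, here is a hypothetical way to assemble the 100k env-step scores into the (runs x tasks) matrix rliable expects, assuming each CSV loads as a plain numeric array with the column layouts described above:

```python
import numpy as np

# Column holding the 100k env-step scores in each posted CSV (assumed layouts):
# cartpole/cheetah/cup/reacher are 10 x 6 (0, 100k, ..., 500k)  -> column 1;
# finger/walker are 10 x 26 (0, 20k, 40k, ..., 500k)            -> column 5.
files_100k_col = {
    'rad0_cartpole.csv': 1, 'rad0_cheetah.csv': 1,
    'rad0_cup.csv': 1, 'rad0_reacher.csv': 1,
    'rad0_finger.csv': 5, 'rad0_walker.csv': 5,
}

# Stack per-task columns into a (num_runs x num_tasks) matrix of raw scores.
rad_scores_100k = np.stack(
    [np.loadtxt(path, delimiter=',')[:, col] for path, col in files_100k_col.items()],
    axis=1)
```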

TaoHuang13 commented 2 years ago

Thank you @tedmoskovitz! The current results seem more reasonable:)

agarwl commented 2 years ago

Updated aggregate results attached below. (attached figure: updated aggregate metrics)

tedmoskovitz commented 2 years ago

Cool! Thanks, Rishabh. I'm attaching the updated TOP-RAD results, by the way, in case either of you is interested. Results relative to regular RAD are quite analogous to the original ones, and actually even better (relative to RAD) in the 100k regime, so that's nice. We'll be updating the paper in the next few days; we just need to re-run some ablations.

top-rad_cartpole.csv top-rad_cheetah.csv top-rad_cup.csv top-rad_finger.csv top-rad_reacher.csv top-rad_walker.csv

TaoHuang13 commented 2 years ago

Hi @tedmoskovitz! Thank you for sharing your results. We are considering whether to add TOP-RAD as a baseline. One quick question: which batch size did you use for RAD and TOP-RAD (128 or 512)?

tedmoskovitz commented 2 years ago

Of course! That would be great. We used a batch size of 128 for both.

tedmoskovitz commented 2 years ago

Hi guys-- just wanted to let you know that we've posted the updated paper for TOP: https://arxiv.org/abs/2102.03765. Thank you once again @TaoHuang13 for finding this mistake (we've added you in the acknowledgments), and @agarwl thank you for your understanding!