google-research / rliable

[NeurIPS'21 Outstanding Paper] Library for reliable evaluation on RL and ML benchmarks, even with only a handful of seeds.
https://agarwl.github.io/rliable
Apache License 2.0

Urgent question about data aggregates #4

Closed: slerman12 closed this issue 2 years ago

slerman12 commented 2 years ago

Hi, we compiled the Atari 100k results from DrQ, CURL, and DER, and the mean and median human-normalized scores are well below those reported in prior work, including work by co-authors of the rliable paper.

We have median human-norm scores all around 0.10 - 0.12.

Is this accurate? Of all of these, DER (the oldest of the algs) has the highest mean human-norm score.

agarwl commented 2 years ago

That doesn't seem right -- the aggregate scores should match those in the figure below (which uses 10 runs). You can reproduce them with the colab at bit.ly/statistical_precipice_colab:

[image]
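For readers who want to sanity-check their own numbers, here is a minimal sketch of how human-normalized scores are commonly computed and aggregated: normalize each raw score against a random and a human reference, average each game over runs, then take the median or mean across games. This is not the colab's code or rliable's API. The helper names, game names, and all scores below are hypothetical and chosen only for illustration, and it assumes every run covers the same set of games.

```python
from statistics import mean, median


def human_normalize(score, random_score, human_score):
    """Human-normalized score: 0.0 matches random play, 1.0 matches the human reference."""
    return (score - random_score) / (human_score - random_score)


def per_game_means(runs):
    """Average each game's normalized score over runs.

    `runs` is a list of runs; each run is a list with one score per game,
    in the same game order for every run.
    """
    num_games = len(runs[0])
    return [mean(run[g] for run in runs) for g in range(num_games)]


def aggregate_median(runs):
    """Median across games of the per-game mean over runs."""
    return median(per_game_means(runs))


def aggregate_mean(runs):
    """Mean across games of the per-game mean over runs."""
    return mean(per_game_means(runs))


# Hypothetical (random_score, human_score) references; NOT real Atari values.
REFERENCE = {
    "game_a": (0.0, 100.0),
    "game_b": (10.0, 50.0),
    "game_c": (-20.0, 20.0),
}

# Hypothetical raw scores: one dict per run (seed).
RAW_RUNS = [
    {"game_a": 10.0, "game_b": 30.0, "game_c": 0.0},
    {"game_a": 30.0, "game_b": 50.0, "game_c": -10.0},
]

GAMES = sorted(REFERENCE)
normalized_runs = [
    [human_normalize(run[game], *REFERENCE[game]) for game in GAMES]
    for run in RAW_RUNS
]

if __name__ == "__main__":
    print("median:", aggregate_median(normalized_runs))
    print("mean:", aggregate_mean(normalized_runs))
```

Two things worth checking when aggregates look too low: that the normalization uses the same random and human reference scores as the work being compared against, and that the median is taken across games after averaging over runs rather than over all run-game pairs at once.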