Hi, we compiled the Atari 100k results for DrQ, CURL, and DER, and the mean and median human-normalized scores are well below those reported in prior work, including work by co-authors of the rliable paper.
Our median human-normalized scores are all around 0.10 to 0.12.
Is this accurate? Of the three, DER (the oldest of the algorithms) has the highest mean human-normalized score.
That doesn't seem right. The aggregate scores should match those in the figure below (which uses 10 runs); you can reproduce them using the colab at bit.ly/statistical_precipice_colab.
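For reference, here is a minimal sketch of how mean and median human-normalized scores are usually computed, in case the discrepancy comes from the normalization or from the axis being aggregated over. This is not the colab's code: the function name `human_normalized` and all numbers below are made-up placeholders rather than real Atari reference scores, and the sketch assumes raw scores are stored as a (runs x games) array.

```python
import numpy as np


def human_normalized(raw, random_score, human_score):
    """Map raw scores so that random play is 0.0 and human play is 1.0."""
    return (raw - random_score) / (human_score - random_score)


# Placeholder per-game reference scores (NOT real Atari values).
random_scores = np.array([0.0, 10.0, -20.0])
human_scores = np.array([100.0, 110.0, 20.0])

# Placeholder raw scores with shape (num_runs, num_games).
raw = np.array([
    [50.0, 20.0, -10.0],
    [30.0, 40.0, 0.0],
])

# Normalize each game with its own random/human scores (broadcast over runs).
norm = human_normalized(raw, random_scores, human_scores)

# Average over runs first to get one score per game ...
per_game = norm.mean(axis=0)

# ... then aggregate across games.
agg_mean = per_game.mean()
agg_median = np.median(per_game)

print("per-game:", per_game)
print("mean:", agg_mean, "median:", agg_median)
```

If your numbers come out much lower than published ones, it may be worth checking that each game is normalized with its own random and human scores and that the median is taken across games rather than across runs.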