rll-research / url_benchmark


Values used for normalized score calculation #1

Closed Randl closed 2 years ago

Randl commented 2 years ago

I couldn't find the values used for the normalized score calculation either in the paper or in the repo. It would be convenient to be able to compare new methods using the same metric (mean normalized return). Also, the values themselves do not appear anywhere in the paper, only in the figures, which is a bit confusing.

MishaLaskin commented 2 years ago

Good catch! Here are the expert scores; we'll update the repo and paper to include these numbers.

| Task | Expert score |
| --- | --- |
| walker_stand | 984 |
| walker_walk | 971 |
| walker_run | 796 |
| walker_flip | 799 |
| quadruped_walk | 866 |
| quadruped_run | 888 |
| quadruped_stand | 920 |
| quadruped_jump | 888 |
| jaco_reach_top_left | 191 |
| jaco_reach_top_right | 223 |
| jaco_reach_bottom_left | 193 |
| jaco_reach_bottom_right | 203 |
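
For reference, a minimal sketch of how these numbers could be used to compute the mean normalized return; the dictionary just copies the scores above, while `raw_returns` and the function name are hypothetical:

```python
# Expert scores copied from the table above.
EXPERT_SCORES = {
    "walker_stand": 984, "walker_walk": 971, "walker_run": 796, "walker_flip": 799,
    "quadruped_walk": 866, "quadruped_run": 888, "quadruped_stand": 920, "quadruped_jump": 888,
    "jaco_reach_top_left": 191, "jaco_reach_top_right": 223,
    "jaco_reach_bottom_left": 193, "jaco_reach_bottom_right": 203,
}

def mean_normalized_return(raw_returns: dict) -> float:
    """Average over tasks of (raw return / expert score)."""
    normalized = [raw_returns[task] / EXPERT_SCORES[task] for task in raw_returns]
    return sum(normalized) / len(normalized)
```
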
Randl commented 2 years ago

@MishaLaskin Also, how are the error bars for Fig. 3 calculated? I tried to use the std over tasks for https://paperswithcode.com/task/unsupervised-reinforcement-learning but it's definitely too large. Did you take the std of the expert scores into account? Then, under a normal approximation for the ratio distribution, probably the usual std of a sum can be used? In that case, can you share the stds too?

MishaLaskin commented 2 years ago

The error bars are standard errors (so they take the number of seeds run into account to get a tighter estimate of the mean). Because the expert scores are only there for normalization purposes, we just divide by the expert score without considering its standard deviation.
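
A minimal sketch of that computation, assuming `seed_scores` holds one normalized return per seed (the array and function name are made up for illustration):

```python
import numpy as np

def standard_error(seed_scores: np.ndarray) -> float:
    """Sample std across seeds divided by sqrt(# of seeds)."""
    return seed_scores.std(ddof=1) / np.sqrt(len(seed_scores))
```
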

Randl commented 2 years ago

I think the expert std should be taken into account (if it has a very high std, then clearly the estimates are less confident) -- https://en.wikipedia.org/wiki/Ratio_distribution#Uncorrelated_noncentral_normal_ratio says the distribution of the ratio can be approximated by a normal one, taking both stds into account (and converging to the normalized std if the expert std is very low).
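
As a sketch of the normal approximation being referenced, assuming the agent and expert scores are independent and roughly normal (all names here are hypothetical):

```python
import numpy as np

def normalized_score_std(mu_agent, sd_agent, mu_expert, sd_expert):
    """First-order approximation to the std of the ratio agent / expert:
    Var(R) ~ (mu_a/mu_e)^2 * (sd_a^2/mu_a^2 + sd_e^2/mu_e^2).
    As sd_expert -> 0 this reduces to sd_agent / mu_expert, i.e. the plain normalized std."""
    ratio = mu_agent / mu_expert
    return abs(ratio) * np.sqrt((sd_agent / mu_agent) ** 2 + (sd_expert / mu_expert) ** 2)
```
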

MishaLaskin commented 2 years ago

That's a fair point, but we use the expert scores only as a way to display results (it's just a scaling factor); we could instead just report the raw scores. Additionally, for all envs considered, the expert scores, which come from running supervised RL for 2M steps, have low variance (see @denisyarats's pytorch_sac and the DrQ / DrQv2 repos), so this shouldn't be an issue.