google-research / rliable

[NeurIPS'21 Outstanding Paper] Library for reliable evaluation on RL and ML benchmarks, even with only a handful of seeds.
https://agarwl.github.io/rliable
Apache License 2.0

bootstrapped ci (shows no variance) vs std (shows high variance) #20

Closed: MarcoMeter closed this issue 2 months ago

MarcoMeter commented 1 year ago

Hey folks!

I frequently follow rliable's guidelines to plot sample efficiency curves. I have now come across results where the 5 seeds of one experiment show large variance, yet the bootstrapped confidence interval suggests little to no variance. Here are two plots to visualize the issue:

[Image: comparison(1)]

The number of bootstrap replications is set to 50000. Here is a Colab notebook to reproduce these plots: https://colab.research.google.com/drive/1hFtmCX-TLUcPuDKZZlTPq34R7bDz_NWI?usp=sharing

It would be great to hear your intuitions about this. Do you think this is just a coincidence or a bug?


MarcoMeter commented 1 year ago

[Image: emm]

The grey curve is the most problematic one. The IQM already shows strong volatility, while the stratified bootstrapped confidence interval is very narrow.

Using less data or further lowering the number of bootstrap replications does not seem to affect the intervals.
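For readers unfamiliar with the metric discussed above: the IQM (interquartile mean) is the mean of the middle 50% of the scores, discarding the bottom and top 25%. The following is a minimal stdlib-only sketch, not rliable's own code; the function name `interquartile_mean` and the example scores are made up for illustration.

```python
def interquartile_mean(scores):
    """Mean of the middle 50% of a non-empty list of scores (illustrative helper)."""
    ordered = sorted(scores)
    # Drop the lowest and highest 25% of the values (rounded down).
    cut = int(0.25 * len(ordered))
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


# Hypothetical scores with one extreme outlier (5.0).
example_scores = [0.0, 0.1, 0.5, 0.6, 0.7, 0.8, 0.9, 5.0]
iqm = interquartile_mean(example_scores)
plain_mean = sum(example_scores) / len(example_scores)
print(f"IQM: {iqm:.3f}, mean: {plain_mean:.3f}")
```

Because the outlier is trimmed away, the IQM stays close to the bulk of the scores while the plain mean is pulled upward.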

agarwl commented 1 year ago

I'd have to take a closer look sometime next week, but this issue usually comes from bootstrapping over the wrong axis (the README specifies the expected shape of the data). I think you want to swap the task and seed axes to fix this.
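To illustrate why the resampling axis matters, here is a small stdlib-only sketch. It is not rliable code and does not reproduce the plots in this issue: the data are synthetic, the statistic is a plain mean rather than the IQM, and the helper names (`bootstrap_over_seeds`, `bootstrap_over_episodes`, `percentile_interval`) are hypothetical. It compares a percentile bootstrap that resamples whole seeds with one that treats every episode as an independent sample.

```python
import random
import statistics


def percentile_interval(samples, alpha=0.05):
    """Percentile interval of a list of bootstrap statistics."""
    ordered = sorted(samples)
    low = ordered[int((alpha / 2) * (len(ordered) - 1))]
    high = ordered[int((1 - alpha / 2) * (len(ordered) - 1))]
    return low, high


def bootstrap_over_seeds(scores, reps, rng):
    """Resample whole seeds (rows) with replacement."""
    stats = []
    for _ in range(reps):
        rows = [rng.choice(scores) for _ in scores]
        stats.append(statistics.fmean([x for row in rows for x in row]))
    return percentile_interval(stats)


def bootstrap_over_episodes(scores, reps, rng):
    """Pool all episodes and resample them as if they were independent."""
    pooled = [x for row in scores for x in row]
    stats = []
    for _ in range(reps):
        stats.append(statistics.fmean(rng.choices(pooled, k=len(pooled))))
    return percentile_interval(stats)


rng = random.Random(0)
# Synthetic data: 5 seeds that differ a lot, 30 nearly identical episodes each.
seed_levels = [0.1, 0.3, 0.5, 0.7, 0.9]
scores = [[level + rng.gauss(0, 0.02) for _ in range(30)] for level in seed_levels]

seed_low, seed_high = bootstrap_over_seeds(scores, 1000, rng)
ep_low, ep_high = bootstrap_over_episodes(scores, 1000, rng)
seed_width = seed_high - seed_low
episode_width = ep_high - ep_low
print(f"CI width when resampling seeds:    {seed_width:.3f}")
print(f"CI width when resampling episodes: {episode_width:.3f}")
```

With data like these, resampling episodes yields a much narrower interval than resampling seeds, even though the seed-to-seed spread is large. That is the kind of mismatch a wrong axis order can produce.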

If you have a single task, you can set task_bootstrap=True so that you don't have to worry about shape-related issues: https://github.com/google-research/rliable/blob/master/rliable/library.py#L215
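As a rough mental model of the two modes, the sketch below draws one bootstrap resample from a runs x tasks score table. It is a simplified stdlib-only illustration, not rliable's implementation: the function `stratified_resample` and the toy scores are assumptions, and it only borrows the `task_bootstrap` name from the flag mentioned above. Runs are resampled separately within each task, and with `task_bootstrap=True` the tasks are resampled as well.

```python
import random


def stratified_resample(scores, rng, task_bootstrap=False):
    """Return one bootstrap resample of a runs x tasks table (illustrative).

    Runs are drawn with replacement separately for every task (the strata).
    With task_bootstrap=True, tasks are also drawn with replacement.
    """
    num_runs, num_tasks = len(scores), len(scores[0])
    if task_bootstrap:
        task_ids = [rng.randrange(num_tasks) for _ in range(num_tasks)]
    else:
        task_ids = list(range(num_tasks))
    resampled = [[0.0] * num_tasks for _ in range(num_runs)]
    for col, task in enumerate(task_ids):
        for row in range(num_runs):
            # Each cell gets the score of a randomly chosen run on this task.
            resampled[row][col] = scores[rng.randrange(num_runs)][task]
    return resampled


rng = random.Random(0)
# Hypothetical scores: 3 runs (rows) x 2 tasks (columns).
scores = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
runs_only = stratified_resample(scores, rng)
runs_and_tasks = stratified_resample(scores, rng, task_bootstrap=True)
print("runs only:     ", runs_only)
print("runs and tasks:", runs_and_tasks)
```

In the runs-only resample, every column still contains scores from its original task. In the runs-and-tasks resample, a column may be filled from either task, so the same task can appear twice.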

MarcoMeter commented 1 year ago

Thanks for your reply @agarwl

My current approach is to shape the data as (5 runs, 150 episodes, 101 checkpoints). In your terms: checkpoints = frames, episodes = games, runs = training repetitions = tasks.

This is the result with task_bootstrap = False:

[Image: trxl_gt_ci_0]

And this is the result with task_bootstrap = True. The intervals with task bootstrapping are more pronounced.

[Image: trxl_gt_ci_1]

With task_bootstrap = True and a shape of (750 episodes, 101 checkpoints), the CIs are messed up.

[Image: trxl_gt_ci_2]

Given the same shape of (750, 101) and task_bootstrap = False, the plot seems equivalent to the first one.

[Image: trxl_gt_ci_3]