How to measure RL sample efficiency

Hi,

My question relates to the RL results in Table 3 of the paper. I’m trying to use the iclr19 branch to generate at least 10 such results for each level, to get stable estimates of the mean and variance. The train_rl.py script seems to do almost everything required. But at the bottom of that file, after the success rate is calculated over the test episodes (512 by default), the success rate is not actually logged. The mean return is logged instead.

Adding the following line (right after the calculation of success_rate) seems to log the missing number:

logger.info("Success rate {: .4f} reached after {} training episodes".format(success_rate, status['num_episodes']))

Also, it seems that the default save_interval of 1000 is too large for some of the easier levels. For instance, to get sufficiently frequent tests on GoToRedBallGrey, I call the script like this:

python scripts/train_rl.py --env BabyAI-GoToRedBallGrey-v0 --save-interval 10

Then to obtain the sample efficiency, I just look in the log for the first success rate to exceed 0.99, and take the number of training episodes up to that point. For seed=1, it happens on this line:

main: 2019-06-17 01:27:36,671: Success rate 0.9922 reached after 30769 training episodes
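To avoid reading the log by eye, the lookup could be scripted. This is only a rough sketch: it assumes the log contains the "Success rate ... reached after ... training episodes" line added above, and the function name and regular expression are my own, not part of the babyai code.

```python
import re

# Matches the extra log line added above, e.g.
# "Success rate 0.9922 reached after 30769 training episodes".
# \s+ tolerates the extra space produced by the "{: .4f}" format.
PATTERN = re.compile(
    r"Success rate\s+(\d+(?:\.\d+)?) reached after (\d+) training episodes"
)


def episodes_to_threshold(log_lines, threshold=0.99):
    """Return the episode count of the first logged success rate above
    `threshold`, or None if the threshold is never exceeded.

    Hypothetical helper, not part of babyai.
    """
    for line in log_lines:
        match = PATTERN.search(line)
        if match and float(match.group(1)) > threshold:
            return int(match.group(2))
    return None
```

It accepts any iterable of lines, so an open log file can be passed directly, for example `episodes_to_threshold(open(path_to_log))`, where `path_to_log` is wherever the run's log file ends up.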

Is this the right way to generate more RL results like those in Table 3? Or is there an easier way?
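For the aggregation step, this is roughly what I have in mind once there is one episode count per seed, read off the logs as described above. It is a sketch only: the helper name is hypothetical, and I am assuming the sample variance (n - 1 in the denominator) is the appropriate one.

```python
import statistics


def summarize(episode_counts):
    """Return (mean, sample variance) of per-seed episode counts.

    Hypothetical helper; `episode_counts` holds one number per seed,
    each being the training episodes needed to exceed 0.99 success.
    """
    if len(episode_counts) < 2:
        raise ValueError("need at least two seeds to estimate variance")
    mean = statistics.mean(episode_counts)
    variance = statistics.variance(episode_counts)  # sample variance (n - 1)
    return mean, variance
```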

Thank you for this excellent environment!
