Closed pengzhenghao closed 4 years ago
For final evaluations, we report average returns across 10k episodes, where each episode uniformly samples a level from the appropriate distribution (train or test). In practice, this is probably more episodes than necessary (I believe the std dev is already quite low after 1k episodes). Note that when evaluating training performance, different episodes may use the same level, since the training sets often have fewer than 10k levels.
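For anyone who wants to reproduce this, here is a minimal sketch of the procedure described above. It is not the actual evaluation code from the repo: `evaluate` and `run_episode` are hypothetical names, and `run_episode` stands in for whatever rolls out your policy on a given level and returns the episode return. I am also assuming "std dev" refers to the uncertainty of the reported mean, so the sketch computes a standard error.

```python
import random
import statistics
from typing import Callable, Tuple


def evaluate(
    run_episode: Callable[[int], float],  # hypothetical: level id -> episode return
    num_levels: int,                      # size of the level set being sampled from
    num_episodes: int = 10_000,
    seed: int = 0,
) -> Tuple[float, float, int]:
    """Average return over episodes whose levels are sampled uniformly."""
    rng = random.Random(seed)
    returns = []
    levels_seen = set()
    for _ in range(num_episodes):
        # Sample uniformly with replacement, so levels can repeat.
        level = rng.randrange(num_levels)
        levels_seen.add(level)
        returns.append(run_episode(level))
    mean = statistics.fmean(returns)
    # Standard error of the mean; it shrinks as 1 / sqrt(num_episodes).
    stderr = statistics.stdev(returns) / len(returns) ** 0.5
    return mean, stderr, len(levels_seen)


# Toy stand-in for a real rollout: the return depends only on the level id.
mean, stderr, distinct = evaluate(lambda level: float(level % 10), num_levels=200)
```

With 200 levels and 10k episodes, `distinct` can never exceed 200, which is the repetition mentioned above for training sets smaller than the episode count.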
Hi there! Thanks for this interesting environment. I am wondering how many environments you use during evaluation, because I can't find it in either the paper or the code. It seems the number of environments (num_eval) is set to 20, but I am a little confused.
Could you please clarify how many distinct environments are used during evaluation? Thanks!