Not able to reproduce results using the code provided

agarwl commented 4 years ago

I tried replicating the results provided in the paper, however, I am getting much higher performance for normal PPO (something around 44.1%) while a lower performance with Rand Conv (53.34%). Can you please either provide your trained models or provide the complete code so that the results can be replicated?

I am using 1000 levels with 10 episodes per level and 10 MC samples for evaluating the rand conv agent.

kibok90 commented 4 years ago

Here is a tarball containing the trained models used to get performances in our paper. (25 checkpoints 3 trials 2 methods (vanilla and ours) = 150 files) https://drive.google.com/open?id=1Nm7QLD13_NGSA-ZIPU2ne2q3-NcJPE9M

We are sorry that the full code is not shareable because it contains some unpublished and private ideas. We are currently working on extracting the MC inference part for test and will update this repository soon.

agarwl commented 4 years ago

Thanks for your quick response as well as sharing the tarball. I understand your concern around sharing code with unpublished research and wish best of luck for your future research.

agarwl commented 4 years ago

I am still getting a lower performance using the checkpoints provided (50% for the first checkpoint) with 10 MC samples. Here is the config which I is being using during evaluation):

num_envs : 32
entropy_coeff : 0.01
real_thres : 0.9
rep : 10
learning_rate : 0.0005
use_batch_norm : 0
num_minibatches : 8
is_high_res : False
test_eval : True
dropout : 0.0
game_type : standard
train_flag : 1
run_id : tmp
num_steps : 256
use_black_white : 0
train_eval : False
set_seed : -1
save_interval : 10
use_lstm : 2
use_data_augmentation : 0
state_n_as_level_seed : 0
paint_vel_info : -1
num_levels : 1000
epsilon_greedy : 0.0
restore_id : try1_rand_fm_250M
ppo_epochs : 3
use_inversion : 0
test : False
num_eval : 1000
l2_weight : 0.0
fm_coeff : 0.002
high_difficulty : True
use_color_transform : 0
gamma : 0.999
frame_stack : 1
architecture : impala

Here's the output I get:

scores [5. 4. 3. 8. 5. 5. 5. 6. 6. 5. 3. 4. 5. 6. 3. 5. 6. 4. 4. 5. 6. 6. 4. 6.
 5. 8. 6. 3. 4. 5. 4. 3. 4. 7. 4. 6. 4. 5. 3. 7. 6. 3. 3. 4. 5. 5. 4. 4.
 5. 7. 6. 4. 3. 6. 6. 8. 7. 4. 5. 4. 6. 5. 4. 5. 4. 7. 3. 6. 5. 6. 4. 4.
 6. 7. 3. 2. 4. 6. 9. 5. 7. 5. 3. 9. 4. 4. 5. 6. 3. 5. 2. 6. 2. 7. 6. 6.
 6. 4. 7. 3. 3. 7. 6. 4. 5. 5. 5. 3. 4. 5. 7. 5. 4. 4. 3. 5. 4. 7. 6. 4.
 6. 5. 6. 3. 6. 4. 6. 2. 6. 6. 5. 5. 5. 6. 6. 3. 5. 5. 7. 5. 5. 6. 4. 6.
 5. 3. 4. 4. 4. 6. 3. 5. 7. 4. 3. 3. 7. 8. 7. 6. 6. 6. 3. 2. 8. 6. 5. 4.
 3. 6. 4. 5. 3. 5. 7. 6. 7. 5. 7. 5. 5. 5. 3. 4. 2. 3. 9. 7. 4. 3. 6. 6.
 5. 6. 7. 3. 4. 8. 4. 5. 5. 9. 3. 5. 5. 5. 5. 8. 6. 3. 6. 4. 6. 3. 4. 4.
 5. 4. 6. 5. 7. 6. 7. 1. 6. 4. 6. 7. 5. 5. 3. 5. 3. 4. 5. 6. 5. 5. 2. 6.
 3. 3. 6. 7. 6. 3. 6. 6. 4. 4. 8. 6. 3. 5. 5. 7. 5. 6. 8. 3. 5. 6. 4. 5.
 7. 5. 5. 7. 7. 7. 7. 3. 4. 4. 7. 6. 4. 5. 6. 7. 3. 5. 7. 5. 5. 5. 5. 2.
 2. 6. 4. 7. 7. 6. 6. 5. 6. 8. 6. 6. 4. 6. 5. 7. 1. 7. 5. 5. 6. 7. 4. 6.
 6. 9. 5. 7. 6. 7. 6. 6. 7. 6. 6. 7. 5. 6. 6. 3. 3. 3. 7. 6. 7. 5. 4. 1.
 3. 5. 4. 1. 4. 8. 4. 6. 4. 0. 3. 2. 6. 5. 5. 2. 6. 4. 5. 6. 5. 4. 5. 4.
 5. 4. 4. 3. 4. 4. 1. 7. 4. 5. 5. 5. 6. 6. 5. 8. 7. 7. 4. 5. 6. 5. 5. 6.
 5. 7. 4. 4. 5. 5. 4. 4. 7. 4. 5. 6. 4. 4. 4. 4. 5. 4. 5. 3. 6. 5. 7. 6.
 7. 4. 4. 6. 5. 6. 7. 5. 9. 6. 7. 4. 2. 4. 6. 5. 5. 8. 6. 5. 6. 4. 6. 5.
 6. 6. 6. 8. 3. 6. 4. 3. 3. 3. 6. 8. 4. 3. 5. 4. 7. 4. 7. 4. 5. 7. 5. 9.
 6. 3. 6. 4. 5. 6. 4. 2. 4. 5. 6. 5. 5. 9. 5. 6. 8. 5. 7. 6. 5. 4. 5. 5.
 3. 4. 4. 7. 6. 5. 5. 4. 6. 4. 6. 5. 3. 9. 8. 4. 2. 6. 5. 8. 6. 4. 7. 4.
 3. 3. 5. 6. 5. 8. 8. 5. 5. 4. 4. 5. 5. 5. 3. 8. 3. 6. 6. 4. 5. 3. 5. 6.
 7. 7. 8. 6. 2. 6. 5. 8. 3. 6. 6. 7. 6. 8. 6. 2. 5. 4. 6. 3. 5. 5. 4. 6.
 4. 4. 5. 6. 4. 5. 5. 5. 6. 5. 7. 4. 5. 1. 3. 5. 3. 4. 6. 4. 6. 4. 5. 3.
 6. 4. 6. 4. 4. 2. 4. 4. 5. 5. 7. 4. 4. 7. 5. 8. 9. 6. 8. 6. 6. 3. 6. 2.
 6. 5. 8. 6. 7. 3. 4. 4. 5. 7. 5. 4. 5. 4. 5. 6. 2. 7. 4. 2. 7. 5. 6. 1.
 5. 5. 4. 5. 5. 6. 5. 4. 3. 5. 6. 7. 4. 4. 7. 5. 6. 6. 4. 5. 7. 9. 4. 4.
 7. 5. 4. 4. 6. 4. 6. 5. 4. 4. 4. 8. 6. 4. 6. 8. 6. 5. 4. 6. 5. 6. 5. 4.
 5. 3. 6. 3. 6. 5. 3. 3. 6. 7. 6. 5. 6. 5. 4. 8. 3. 5. 6. 3. 4. 3. 7. 6.
 4. 7. 8. 2. 5. 6. 5. 6. 5. 7. 6. 6. 7. 5. 4. 5. 3. 7. 6. 7. 6. 5. 4. 9.
 3. 7. 5. 5. 4. 3. 5. 7. 4. 5. 6. 5. 7. 3. 6. 5. 7. 5. 5. 7. 3. 4. 6. 4.
 5. 5. 3. 7. 8. 2. 5. 4. 6. 6. 5. 5. 3. 4. 7. 5. 4. 6. 5. 2. 6. 7. 6. 4.
 8. 5. 2. 6. 5. 4. 3. 3. 7. 3. 4. 4. 4. 7. 5. 6. 5. 6. 6. 7. 4. 7. 5. 5.
 5. 1. 3. 4. 5. 7. 2. 5. 3. 4. 3. 8. 4. 7. 7. 6. 6. 6. 7. 5. 6. 3. 3. 4.
 7. 4. 7. 3. 6. 6. 7. 3. 6. 8. 6. 4. 6. 5. 3. 4. 6. 6. 2. 4. 6. 5. 4. 5.
 5. 3. 4. 5. 4. 4. 5. 4. 4. 4. 5. 2. 5. 6. 5. 6. 5. 5. 7. 4. 4. 7. 6. 3.
 3. 4. 4. 4. 4. 8. 5. 6. 4. 1. 6. 6. 2. 6. 3. 3. 5. 5. 3. 5. 8. 3. 6. 5.
 6. 2. 6. 5. 5. 4. 7. 5. 3. 3. 4. 1. 4. 3. 2. 5. 6. 5. 5. 4. 5. 6. 2. 6.
 5. 5. 5. 4. 6. 4. 5. 6. 2. 7. 7. 3. 5. 3. 3. 6. 4. 4. 4. 4. 4. 4. 5. 4.
 7. 5. 7. 4. 2. 5. 5. 5. 7. 4. 7. 8. 3. 8. 7. 4. 4. 8. 6. 3. 5. 4. 5. 5.
 4. 5. 3. 4. 6. 4. 4. 5. 4. 5. 4. 5. 4. 6. 6. 3. 4. 7. 3. 5. 7. 5. 6. 3.
 8. 8. 8. 4. 4. 3. 4. 5. 6. 5. 4. 3. 5. 6. 6. 5.]
mean_score 5.026
max idx 78
mpi_mean [5.026]

Can you confirm these checkpoints are getting the result you specified in the paper (can you please provide your config settings)?

kibok90 commented 4 years ago

I found some difference in your evaluation setting: rep : 10 -> 5 num_eval : 1000 -> 200 (edit: I mistakenly put 2 here; 200 is correct) num_levels : 1000 -> 0

If "num_levels" is positive, random numbers generated with a seed "set_seed" are kept in a list, and they are used for generating maps. So with your setting, rep*num_eval = 10k tests are conducted, and the map for each trial is sampled from this pre-determined 1000 map layouts.

One issue I found here is that it seems you didn't change the set_seed. Because of this, your test environments contain 500 seen map layouts and 500 unseen map layouts (but themes are unseen). "high_difficulty" may also affect the RNG, so my explanation would not be the same with the actual behavior. You may figure out it by checking the coinrun game code: https://github.com/openai/coinrun/blob/master/coinrun/coinrun.cpp

However, I don't think it explains why you got only around 50%. I got around 60% with the first model trained over 250M iterations. Maybe your MC inference is different from ours?

Anyway, with the fix I made above, 1000 unseen map layouts will be tested, and each map layout is tested only once.

agarwl commented 4 years ago

Edit: It seems that the MC inference was implemented differently than mentioned in the paper and fixing it does improve the result (I obtain something around 58.7% which seems much closer to what is reported in the paper). That said, I still don't understand some of the config changes recommended by you.

For MC, I simply compute the agent.pi multiple times for a given observation and average these to compute the logits used for computing the average policy and sampling an action from this policy.

Also, why is num_eval set to 2, doesn't this decide how many different environments are created for evaluation?

kibok90 commented 4 years ago

Good to hear that you were able to replicate the number!

We tried logit averaging (I believe this is what you did) and voting action (choose action per MC sample & take the most frequently chosen action) for MC sampling, but couldn't see a significant difference.

I am sorry that I put a wrong number there, any number satisfying rep*num_eval = 1000 is fine in principle. I used rep=5 and num_eval=200.

rep*num_eval is the number of environments used for test, if you set num_levels=0.

If num_levels is a positive number, then the number of environment is limited to that number. In this case, some environment would be revisited as I explained above.

agarwl commented 4 years ago

It seems that the third vanilla checkpoint you provided is quite bad (the other checkpoints seem fine and lead to a performance of 3.15 and 3.6 but the try3_vanilla_250M_0 ) leads to a performance of only 0.88. It would be great if you can provide the correct checkpoint (otherwise I can run a PPO agent and use it along with the two checkpoints you provided).

kibok90 commented 4 years ago

As you observed, one of the trial with the vanilla method has a lower performance than the others. You may take a look at the performance of it (solid blue line in Figure 5(a)), which has a larger variance than the others.

I would not be surprised if the vanilla method has a higher variance than other methods, because the test setting is basically zero-shot learning. Since what a deep RL agent learns is different over trials, we actually don't know whether it can either luckily adapt to new environment well or unfortunately fail to do so. The methods including ours are supposed to fill the gap between the seen and unseen environments, such that the effect of introducing them is not only to increase the performance but also reduce the variance of it.

You can run experiments in your side and take the performance you obtained. You know, the purpose of this repository is to help you to reproduce our experiments.

KaiyangZhou commented 4 years ago

@kibok90 @theSparta Hello, I've a question regarding the hardware requirement for running the coinrun experiment. How many gpus do I need in order to reproduce the results and how long does it take to finish one run? (p.s. I have one gtx 2080 ti for now so i've such concern)

pokaxpoka / netrand

Not able to reproduce results using the code provided #4