Closed KaiyangZhou closed 4 years ago
This should be possible with the `plots.py` script in the `coinrun` folder. You'll have to adapt the `path` variable to point to the folder in which all the results for the different runs are saved. Then you can use the `experiments` dictionary to specify which runs (identified by their run-id) correspond to which algorithm, i.e. each entry in the `experiments` dictionary will result in one plotted line, with the key used in the legend. The values and std for each line are the result of averaging over all the run-ids specified in the list, which is the value for the corresponding key in `experiments` (hopefully that becomes clearer from the examples in `plots.py`).
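To make the shape of that configuration concrete, here is a minimal sketch of what the `path` variable and `experiments` dictionary might look like. The run-ids below are made up for illustration; the real ones come from your own training runs.

```python
# Hypothetical configuration in the style of plots.py (run-ids invented):
path = "./results"  # folder containing the saved results of all runs

# Each key becomes one plotted line (and its legend entry); each value is
# a list of run-ids whose curves are averaged, with the std taken across them.
experiments = {
    "IBAC-SNI": ["run-a_{}", "run-b_{}", "run-c_{}"],
    "Baseline": ["run-d_{}", "run-e_{}", "run-f_{}"],
}

for label, run_ids in experiments.items():
    print(f"{label}: averaging over {len(run_ids)} runs")
```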
Thanks! Your explanation of `plots.py` is very clear.
I have another question: does Fig. 3 (left) in the paper record the performance on the train or test (unseen) environments? From this line https://github.com/microsoft/IBAC-SNI/blob/master/coinrun/coinrun/ppo2.py#L395 it seems that `rew_mean`, obtained here https://github.com/microsoft/IBAC-SNI/blob/master/coinrun/plots.py#L194, refers to the performance on the training environments?
I'm a bit confused now. I thought we had to run `enjoy.py` by loading the model saved at every 10M timesteps to get the test-performance curve on unseen environments. Could you clarify this?
Ah, good question. The code is based on the OpenAI Baselines implementation, which is somewhat different from many other frameworks. By using `RCALL_NUM_GPU=4 mpiexec -n 4 python3 -m coinrun.train_agent ...` to start the experiments (see the `README.MD`), we're actually starting 4 different processes on 4 different GPUs. Processes 0, 2 and 3 use the training environments and update the policy parameters, while process 1 runs the test environments.
That means `rew_mean` is the training performance if it's extracted from a file ending in `_0` (i.e. coming from process 0), and the test performance if it's extracted from a file ending in `_1`.
As you might have noticed, in `plots.py` the specified run-ids actually end with `_{}` (which I forgot to mention before). That's because in this line I'm replacing the `{}` with either `0` or `1`, depending on whether I want the training or test performance.
`enjoy.py` is not used at all; instead, we are indeed evaluating the test performance concurrently with training, saving it in files ending in `_1`.
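The suffix substitution described above can be sketched as follows (this is an illustration of the mechanism, not the actual code from `plots.py`; the helper name is invented):

```python
# Sketch: filling in the trailing `_{}` placeholder of a run-id to pick
# training (process 0) vs. test (process 1) results.
def resolve_run_id(run_id_template: str, test: bool) -> str:
    """Process 0 writes training results (suffix _0);
    process 1 writes test results (suffix _1)."""
    return run_id_template.format(1 if test else 0)

train_file = resolve_run_id("my-run_{}", test=False)  # -> "my-run_0"
test_file = resolve_run_id("my-run_{}", test=True)    # -> "my-run_1"
```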
Cool, now I'm clear.
In my case, I'm using 1 GPU per job, so I have to run `enjoy.py` for every saved checkpoint in order to get the test performance (am I doing it wrong?).
One more thing: what values do you use/suggest for `-num-eval N -rep K`?
P.S. I just recalled that we had a conversation back in https://github.com/openai/coinrun/issues/7 (I was wondering why your account looked familiar).
I haven't really worked with `enjoy.py`, so I'm not sure. I'd just try out a few values and see what the variance is when you run it multiple times.
Just to warn you: I'm not sure the results will be the same when you only run on one GPU with the same hyperparameters. Usually, the gradients from the 3 training processes are averaged, effectively tripling the batch size. On the other hand, just tripling the batch size on one GPU might be infeasible due to memory constraints (although I haven't tried that).
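The batch-size point above can be illustrated with a toy example (synthetic numbers, not the MPI code from Baselines): averaging the mean gradients of three equal-sized batches gives the same result as the mean gradient over the combined, 3x larger batch.

```python
import numpy as np

# Toy illustration: three equal-sized batches of per-sample "gradients".
rng = np.random.default_rng(0)
batches = [rng.normal(size=(8, 4)) for _ in range(3)]

# Average of the three per-process mean gradients...
avg_of_means = np.mean([b.mean(axis=0) for b in batches], axis=0)
# ...equals the mean gradient over the concatenated (3x) batch.
combined_mean = np.concatenate(batches).mean(axis=0)

assert np.allclose(avg_of_means, combined_mean)
```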
Btw, if you're looking for an implementation of IBAC-SNI on the whole ProcGen suite, see here. The new ProcGen suite also implements "easy" versions of the environments, which might be more feasible to run with just one GPU.
Got it. Thanks again!
`plots.py` is great and I've successfully used it to produce Fig. 3 (left) from the paper.
Just wondering: do you have the code to produce Fig. 3 (middle), for the generalization gap?
Another question: when drawing the score curve (for RL tasks), is using a moving average to smooth out the scores a common practice in visualization? (I don't read many RL papers, so I'm curious about this.)
Great, glad it worked!
I've looked, but I don't think I have the code anymore; I'm not sure what happened to it, maybe it got deleted when I cleaned up the repo for publication.
However, it should be fairly straightforward to re-implement based on `plots.py`, as it already reads in the required data and one only needs to subtract the test from the train performance.
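As a starting point for such a re-implementation, here is a minimal sketch of the subtraction, assuming the train (`_0`) and test (`_1`) curves for a run have already been loaded as arrays (the function name and toy numbers are invented):

```python
import numpy as np

# Hypothetical generalization-gap computation: once the train and test
# curves for the same run are loaded, the gap is their per-timestep difference.
def generalization_gap(train_scores, test_scores):
    train = np.asarray(train_scores, dtype=float)
    test = np.asarray(test_scores, dtype=float)
    return train - test  # gap > 0: the agent does better on training levels

# Toy curves, not real data; gap per timestep is 1.0, 1.5, 2.0 here.
gap = generalization_gap([5.0, 6.0, 7.0], [4.0, 4.5, 5.0])
```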
Unfortunately, I don't have the time at the moment, but if you decide to implement it, please consider submitting a PR; it would be great to have the functionality in the codebase.
Thanks!
@maximilianigl How about the moving-average question?
Ah, sorry about that, completely overlooked that question. Yes, I think so. RL results are typically quite noisy (stochastic policy, potentially stochastic environment) and evaluation is costly (running entire episodes), so smoothing is a way to keep the plots readable. The std shown in RL plots is usually over random seeds, as there is also a lot of variation.
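The kind of smoothing described above can be sketched in a few lines (a generic moving average, not code from this repo):

```python
import numpy as np

# Simple moving-average smoothing, as commonly applied to noisy RL
# learning curves before plotting.
def moving_average(scores, window=3):
    scores = np.asarray(scores, dtype=float)
    kernel = np.ones(window) / window
    # mode="valid" avoids edge artifacts; the smoothed curve is
    # shorter than the input by window - 1 points.
    return np.convolve(scores, kernel, mode="valid")

smoothed = moving_average([0.0, 3.0, 0.0, 3.0, 0.0], window=3)
# each point is the mean of 3 neighbours: 1.0, 2.0, 1.0
```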
@maximilianigl Hi, I just wanted to thank you for your help with the code.
My work has been accepted to ICLR'21, and the RL code is based on yours :).
https://github.com/KaiyangZhou/mixstyle-release/tree/master/rl
Congratulations! Just had a look and it's a cool idea! Looking forward to reading the paper.
Hello,
I'm wondering how to draw a figure like Fig. 3 in the paper with your code (without rewriting the `enjoy.py` script)? Or does the original code already support this and I somehow missed it? Could you please point it out? Thanks!