Anya28 closed this issue 5 years ago
The learning dynamics and hyperparameters are very different, so it is natural for the curves to differ as well. You'll need to tune the hyperparameters to get better results there.
Re: the async flag, it uses a background thread for sampling and gives only a slight perf improvement, and only for A3C. I wouldn't use it for anything else.
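As a toy illustration of why background sampling can make samples slightly off-policy, the sketch below models a sampler that starts collecting the next batch before the learner's latest update has landed. It is a deterministic simulation, not RLlib code; `policy_lag` and its `sample_async` argument are hypothetical names that only borrow the flag's name.

```python
def policy_lag(num_steps, sample_async):
    """Return, per training step, how many updates behind the batch's policy is.

    Hypothetical toy model: with sample_async=True the next batch is
    prefetched with whatever weights exist *before* the current update
    is applied, so it is one policy version stale when consumed.
    """
    version = 0          # learner's current policy version
    prefetched = 0       # policy version used for the prefetched batch
    lags = []
    for _ in range(num_steps):
        if sample_async:
            batch_version = prefetched
            # The background sampler immediately starts the next batch
            # with the current (soon to be outdated) weights.
            prefetched = version
        else:
            # Synchronous sampling: collect with the latest weights.
            batch_version = version
        lags.append(version - batch_version)
        version += 1     # learner applies one update
    return lags


print(policy_lag(5, sample_async=False))  # [0, 0, 0, 0, 0]
print(policy_lag(5, sample_async=True))   # [0, 1, 1, 1, 1]
```

In this model the throughput gain comes from never waiting for the sampler, and the cost is a constant one-update lag between the policy that produced the data and the policy being trained.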
I see. So without vtrace, when the algorithm should behave like A3C, using a different optimizer would not be an issue?
There's no inherent issue with using the IMPALA optimizer without V-trace. However, without the correction, it is more likely the asynchrony of sample collection will impact learning. How much this matters depends on the hyperparameters and the particular env.
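To make "the correction" concrete, here is a minimal NumPy sketch of V-trace value targets, written from the definition in the IMPALA paper rather than taken from RLlib's implementation. The function name `vtrace_targets` and its argument names are assumptions made for this example.

```python
import numpy as np


def vtrace_targets(behaviour_logp, target_logp, rewards, values,
                   bootstrap_value, gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-trace targets for one trajectory (sketch, not RLlib's code).

    behaviour_logp / target_logp: log-probabilities of the taken actions
    under the sampling policy and the learner's current policy.
    """
    behaviour_logp = np.asarray(behaviour_logp, dtype=float)
    target_logp = np.asarray(target_logp, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)

    # Truncated importance weights: rho scales the TD error,
    # c controls how far corrections propagate back in time.
    ratios = np.exp(target_logp - behaviour_logp)
    rhos = np.minimum(rho_bar, ratios)
    cs = np.minimum(c_bar, ratios)

    next_values = np.append(values[1:], bootstrap_value)
    deltas = rhos * (rewards + gamma * next_values - values)

    # Backward recursion: v_t - V_t = delta_t + gamma * c_t * (v_{t+1} - V_{t+1})
    corrections = np.zeros_like(values)
    acc = 0.0
    for t in reversed(range(len(rewards))):
        acc = deltas[t] + gamma * cs[t] * acc
        corrections[t] = acc
    return values + corrections
```

When the two policies agree, every ratio is 1 and the targets reduce to ordinary n-step returns. With "vtrace"=False no such reweighting is applied, so stale samples are treated as if they were on-policy.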
Ok, thank you for the clarifications!
Describe the problem
I have a question regarding the IMPALA implementation. As specified in the file "impala.py", it uses an AsyncSamplesOptimizer. The config also has a flag "vtrace": if true, IMPALA uses the vtrace policy graph; otherwise it uses the a3c policy graph. Hence, setting "vtrace"=False in "impala.py" should make the algorithm run as A3C, but with the AsyncSamplesOptimizer. However, "a3c.py" specifies that A3C uses the AsyncGradientOptimizer. In fact, when I run IMPALA on the Qbert environment, for example, with the "vtrace" flag set to false, the results diverge significantly from the A3C ones. In the attached plot, the orange curve is the A3C run and the grey one is the IMPALA run with "vtrace"=False. Am I going wrong somewhere, or is this case of IMPALA not fully implemented?
As part of this question, a more general one: A3C uses the flag "sample_async"=True and the AsyncGradientOptimizer. The previous IMPALA implementation had the flag "sample_async"=False, which has now been removed and is treated as False by default, and used the AsyncSamplesOptimizer. In RLlib this flag has the comment "Whether to compute samples asynchronously in the background, which improves throughput but can cause samples to be slightly off-policy". What is the use of the async flag? I thought that both algorithms are off-policy. Thank you!
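To make the comparison explicit, the sketch below lays the setups out as plain Python dicts. It only restates what is described in this thread (the "vtrace" and "sample_async" flags and the two optimizer names); the dict layout, the `SETUPS` name, and the `differing_settings` helper are hypothetical and are not RLlib's config format.

```python
# Summary of the setups as described in this thread (not RLlib config objects).
SETUPS = {
    "a3c": {
        "policy_graph": "a3c",
        "optimizer": "AsyncGradientOptimizer",
        "sample_async": True,
    },
    "impala_no_vtrace": {   # IMPALA with "vtrace"=False
        "policy_graph": "a3c",
        "optimizer": "AsyncSamplesOptimizer",
        "sample_async": False,
    },
    "impala_vtrace": {      # IMPALA with "vtrace"=True
        "policy_graph": "vtrace",
        "optimizer": "AsyncSamplesOptimizer",
        "sample_async": False,
    },
}


def differing_settings(a, b):
    """Return {setting: (value_in_a, value_in_b)} for settings that differ."""
    return {
        key: (SETUPS[a][key], SETUPS[b][key])
        for key in SETUPS[a]
        if SETUPS[a][key] != SETUPS[b][key]
    }


print(differing_settings("a3c", "impala_no_vtrace"))
```

So the two runs share a policy graph but differ in the optimizer and in how samples are collected.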