ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Inconsistency in IMPALA implementation #3993

Closed Anya28 closed 5 years ago

Anya28 commented 5 years ago

System information

Describe the problem

I have a question regarding the IMPALA implementation. As specified in "impala.py", it uses an AsyncSamplesOptimizer. Its config also has a "vtrace" flag: if true, the v-trace policy graph is used; otherwise the A3C policy graph is used. Hence, setting "vtrace": False in "impala.py" should make the algorithm run as A3C, but with the AsyncSamplesOptimizer. However, "a3c.py" specifies that A3C uses the AsyncGradientsOptimizer.

In fact, running IMPALA on the Qbert environment with the "vtrace" flag set to False, the results diverge significantly from the A3C ones. In the attached plot, the orange curve is the A3C run and the grey one is the IMPALA run with "vtrace": False. Am I going wrong somewhere, or is this case of IMPALA not fully implemented?

As part of this question, a more general one is: A3C uses "sample_async": True and the AsyncGradientsOptimizer. A previous IMPALA implementation had the flag "sample_async": False; that flag has since been removed (it defaults to False), and the optimizer is the AsyncSamplesOptimizer. In RLlib this flag carries the comment "Whether to compute samples asynchronously in the background, which improves throughput but can cause samples to be slightly off-policy". What is the use of the async flag, given that I thought both of the algorithms are off-policy? Thank you!
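For concreteness, here is a minimal sketch of the dispatch described above. The function and string names are hypothetical stand-ins (they follow the issue text, not necessarily the actual RLlib source); the point is that the "vtrace" flag selects the policy graph while the optimizer stays tied to the algorithm:

```python
# Hypothetical sketch of the behaviour described in the question;
# the class names mirror the issue text, not verified RLlib internals.

def choose_policy_graph(config):
    # "vtrace": True -> v-trace policy graph; otherwise the A3C graph.
    if config.get("vtrace", True):
        return "VTracePolicyGraph"
    return "A3CPolicyGraph"

def choose_optimizer(algorithm):
    # The optimizer depends on the algorithm, not on the "vtrace" flag:
    # IMPALA keeps AsyncSamplesOptimizer even with "vtrace": False.
    return {
        "IMPALA": "AsyncSamplesOptimizer",
        "A3C": "AsyncGradientsOptimizer",
    }[algorithm]

# IMPALA with "vtrace": False uses the A3C policy graph...
assert choose_policy_graph({"vtrace": False}) == "A3CPolicyGraph"
# ...but still a different optimizer than A3C, hence different dynamics.
assert choose_optimizer("IMPALA") != choose_optimizer("A3C")
```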

[Attached plot "impala_a3c": episode reward on Qbert; orange = A3C, grey = IMPALA with "vtrace"=False]

ericl commented 5 years ago

The learning dynamics and hyperparameters will be very different, so it is natural for the curves to also differ. You'll need to tune the hyperparameters to get better results there.

Re: async flag, that uses a background thread for sampling and is only a slight perf improvement, and only for A3C. I wouldn't use it for anything else.
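A toy illustration (not RLlib code) of why background-thread sampling makes data "slightly off-policy": the sampler thread keeps emitting samples tagged with whatever policy version it last saw, while the main "learner" loop keeps advancing the policy, so some consumed samples were generated under an older policy.

```python
# Toy sketch of asynchronous background sampling; all names are
# illustrative, nothing here comes from RLlib itself.
import queue
import threading
import time

policy_version = 0
sample_q = queue.Queue()
stop = threading.Event()

def sampler():
    # Tags each sample with the policy version it was generated under.
    while not stop.is_set():
        sample_q.put(("obs", policy_version))
        time.sleep(0.001)

t = threading.Thread(target=sampler, daemon=True)
t.start()

stale = 0
total = 0
for _ in range(50):
    obs, version = sample_q.get()
    if version < policy_version:
        stale += 1          # sample was drawn under an older policy
    total += 1
    policy_version += 1     # "learner" updates the policy

stop.set()
t.join()
print(f"{stale}/{total} samples were off-policy by at least one update")
```

With "sample_async": False, sampling happens in step with the learner, so this staleness (and the small throughput gain) goes away.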

Anya28 commented 5 years ago

> The learning dynamics and hyperparameters will be very different, so it is natural for the curves to be also different. You'll need to tune the hyperparameters to get better results there. Re: async flag, that's using a background thread for sampling and is only a slight perf improvement only for a3c. I wouldn't use it for anything else.

I see. So in the case of no v-trace, when the algorithm should behave like A3C, using a different optimizer would not be an issue?

ericl commented 5 years ago

There's no inherent issue with using the IMPALA optimizer without V-trace. However, without the correction, it is more likely the asynchrony of sample collection will impact learning. How much this matters depends on the hyperparameters and the particular env.
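For context, the correction referred to here is the v-trace target from the IMPALA paper (Espeholt et al., 2018). For a behaviour policy \(\mu\) (which generated the samples) and target policy \(\pi\) (being learned), the n-step v-trace target for \(V(x_s)\) is

```latex
v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s}
      \Big( \prod_{i=s}^{t-1} c_i \Big) \delta_t V,
\qquad
\delta_t V = \rho_t \big( r_t + \gamma V(x_{t+1}) - V(x_t) \big),
```

with truncated importance weights \(\rho_t = \min\!\big(\bar\rho,\ \tfrac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)}\big)\) and \(c_i = \lambda \min\!\big(\bar c,\ \tfrac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)}\big)\). With "vtrace": False these weights are absent, so samples collected asynchronously under a stale policy enter the update without any off-policy correction, which is why the asynchrony can hurt more in that mode.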

Anya28 commented 5 years ago

Ok, thank you for the clarifications!