This may be difficult, since I'm not sure that having multiple workers each do their own learning is compatible with our current predictor setup.
The predictor could use the same gradient averaging method that we use in PPO, but that would likely require a major refactor and would break parallel_trpo.
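
For reference, here is a rough sketch of what gradient averaging across workers means in general. It is not our PPO code: `average_gradients` and `sgd_step` are hypothetical names, and it assumes each worker reports its predictor gradients as plain lists of floats keyed by parameter name.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flat list of values.
Gradients = Dict[str, List[float]]


def average_gradients(worker_grads: List[Gradients]) -> Gradients:
    """Average each parameter's gradient element-wise across all workers."""
    if not worker_grads:
        raise ValueError("need gradients from at least one worker")
    n = len(worker_grads)
    averaged: Gradients = {}
    for name in worker_grads[0]:
        # Group the i-th element of this parameter's gradient from every worker.
        columns = zip(*(grads[name] for grads in worker_grads))
        averaged[name] = [sum(column) / n for column in columns]
    return averaged


def sgd_step(params: Gradients, grads: Gradients, lr: float) -> Gradients:
    """Apply one plain SGD update; every worker applies the same averaged step."""
    return {
        name: [p - lr * g for p, g in zip(params[name], grads[name])]
        for name in params
    }


# Two workers compute gradients on their own batches, then share one update.
worker_grads = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
shared = average_gradients(worker_grads)
new_params = sgd_step({"w": [1.0, 1.0]}, shared, lr=0.5)
```

The point of the sketch is that every worker ends up with identical predictor weights after each step, which is the part that would need to be threaded through the current predictor setup.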