Hello again,

I have been using this repo for studying and testing for some time now (using the navigation_graph scenario), and earlier this week I ran into an interesting issue.
To make testing/debugging easier, I set n_rollout_threads to 1 to remove the parallel processing and make it a little easier to track environment data over time.
However, when you do this, the network completely loses its ability to converge towards any kind of optimal policy. I don't understand why, since parallel rollout collection should only affect the speed of training, not its effectiveness.
For more context, here is the command I used to remove parallel processing, where n_rollout_threads=1 and auto_mini_batch_size has been removed in order to maintain the target batch size of 128:
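For what it's worth, here is my mental model of how the thread count interacts with the batch size. This is only a sketch; the function names and the episode_length parameter are my assumptions, not this repo's actual code:

```python
# Hypothetical illustration of on-policy data collection (names are my
# assumptions, not this repo's actual implementation).

def samples_per_update(n_rollout_threads: int, episode_length: int) -> int:
    """Transitions collected across all parallel envs before each update."""
    return n_rollout_threads * episode_length

def num_mini_batches(n_rollout_threads: int, episode_length: int,
                     target_batch_size: int) -> int:
    """How many mini-batches of target_batch_size fit in one update's data."""
    total = samples_per_update(n_rollout_threads, episode_length)
    return max(1, total // target_batch_size)

# With 8 threads and episode_length 200: 1600 samples -> 12 mini-batches of 128.
# With 1 thread: only 200 samples -> a single mini-batch per update, and the
# data is far less decorrelated, which can destabilize on-policy training.
print(num_mini_batches(8, 200, 128))  # -> 12
print(num_mini_batches(1, 200, 128))  # -> 1
```

If the update logic works anything like this, then dropping to one thread doesn't just slow collection; it also shrinks and correlates the data each gradient update sees, which might explain the behavior I'm observing.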
Here is the command I typically use, where training converges successfully:
Using the first command, the average reward over time isn't even oscillating; it's just random.
Do you have any insight on why this is happening?