Open — miguelsuau opened this issue 5 years ago
Did you figure out an answer @miguelsuau ?
Bump @ikostrikov
I started a discussion in the PyTorch forums and used PyTorchViz to visualize the backpropagation graph of this implementation and my own, but so far I haven't gained any meaningful insights.
I am fairly certain gradients are backpropagated through the whole sequence (128 steps), but it would be good if @ikostrikov could confirm this.
I printed the shapes of the inputs to the GRU layer and observed that the sequence lengths vary (probably depending on episode length), with a maximum length of 128.
Sorry @miguelsuau, I've just noticed the issue. Yes, it backpropagates through the whole sequence (128 in this case).
In the case of running:
python main.py --env-name "PongNoFrameskip-v4" --algo ppo --use-gae --lr 2.5e-4 --clip-param 0.1 --value-loss-coef 0.5 --num-processes 8 --num-steps 512 --num-mini-batch 4 --log-interval 1 --use-linear-lr-decay --entropy-coef 0.01 --recurrent-policy
and printing `x[start_idx:end_idx].size()` at model.py:154, I get:
torch.Size([325, 2, 512])
torch.Size([28, 2, 512])
torch.Size([159, 2, 512])
I suppose the first dimension (sequence length) correlates with episode length and the second dimension (batch size) with num-processes, since there are two processes in each mini-batch. However, that doesn't quite make sense, as two processes are unlikely to have the same episode length.
Could you shed some light on this, @ikostrikov?
@MarcoMeter Any update on this?
I was also wondering whether this makes sense with a high number of steps, since backpropagating through long sequences makes learning with a GRU or LSTM more difficult, right? Perhaps clipping the length of `start_idx:end_idx` would help.
I pretty much abandoned this repository to work on my own implementation with more comments and documentation. Still WIP. https://github.com/MarcoMeter/neroRL/tree/update/sequence_buffer_masked_loss
@MarcoMeter, I guess you already figured this out but just for reference, I think the sequences can contain experiences from different episodes. The gradients are just zeroed (using masks) so they are not backpropagated from one episode to another.
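The mask-based reset described above can be sketched like this (a minimal sketch, not the repository's exact code; it follows the convention in this repo where a mask of 0 marks the first step of a new episode):

```python
import torch
import torch.nn as nn

# Toy dimensions: 2 parallel processes, hidden size 4.
gru = nn.GRUCell(input_size=3, hidden_size=4)
hxs = torch.zeros(2, 4)       # one hidden state per process
obs = torch.randn(5, 2, 3)    # 5 time steps of observations
# masks[t, i] == 0 means process i starts a new episode at step t
masks = torch.ones(5, 2, 1)
masks[2, 0] = 0.0             # process 0's episode ended before step 2

for t in range(obs.size(0)):
    # Multiplying by the mask zeroes the hidden state at episode
    # boundaries, so neither information nor gradients cross episodes.
    hxs = gru(obs[t], hxs * masks[t])
```

Because the zeroed hidden state has no dependence on earlier steps, the backward pass through it contributes nothing across the boundary.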
@a-z-e-r-i-l-a, it depends on the environment: if the agent needs to memorize events as distant in the past as the whole sequence, then you need the gradients to backpropagate through the entire sequence. If not, you can shorten the sequence length and see whether that improves the sample complexity.
@MarcoMeter I also used PyTorchViz to check whether the gradient of the recurrent policy was correctly backpropagated. At first I got the same result you showed in the PyTorch forum post: the GRU module appeared only once. After some debugging, I found that PyTorchViz does not seem to render the GRU's recurrence correctly when PyTorch runs on the GPU. After forcing the CPU, I found that the gradient does get backpropagated through the whole episode.
Regarding the unequal episode lengths: the first dimension represents the length of one episode, and the first dimensions of these three tensors add up to the num-steps parameter. The purpose of this splitting is to reset the hidden state when an episode completes, so the hidden state is never passed across episode boundaries.
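The splitting described above can be illustrated in plain Python (a simplified sketch of the boundary logic, not the repository's exact code): the rollout is cut wherever a mask is zero, and the chunk lengths sum to num-steps.

```python
def split_points(masks):
    """Given per-step masks (0.0 marks the first step of a new episode),
    return the boundaries at which the rollout must be split so that
    the GRU's hidden state can be reset between episodes."""
    zeros = [t for t, m in enumerate(masks) if m == 0.0]
    # Always include the start and the end of the rollout.
    return [0] + zeros + [len(masks)]

# A 512-step rollout with episode boundaries at steps 325 and 353
masks = [1.0] * 512
masks[325] = 0.0
masks[353] = 0.0

bounds = split_points(masks)
chunks = [(s, e) for s, e in zip(bounds[:-1], bounds[1:])]
lengths = [e - s for s, e in chunks]
print(lengths)  # -> [325, 28, 159]
```

With these boundary positions, the chunk lengths reproduce the three sequence lengths printed earlier in this thread (325 + 28 + 159 = 512 = num-steps).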
@binaryoung Thanks for sharing your findings!
A couple of weeks ago I published a baseline/reference implementation that does truncated bptt. https://github.com/MarcoMeter/recurrent-ppo-truncated-bptt
Hi Ilya,
First of all thanks for sharing your code. It has been very useful to me lately. This is more of a question rather than an issue:
When you update the recurrent policy, for how many steps are the gradients backpropagated? I am not very familiar with PyTorch, but in TensorFlow this is normally specified with the sequence_length parameter. From what I can see in your code, you update the model using the entire sequence, so I am guessing the gradients are backpropagated for 128 steps?
Thanks in advance,
Miguel