ikostrikov / pytorch-a2c-ppo-acktr-gail

PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL).
MIT License

Question about backpropagation through time #210

Open miguelsuau opened 5 years ago

miguelsuau commented 5 years ago

Hi Ilya,

First of all, thanks for sharing your code. It has been very useful to me lately. This is more of a question than an issue:

When you update the recurrent policy, how many steps are the gradients backpropagated? I am not very familiar with PyTorch, but in TensorFlow this is normally specified with the sequence_length parameter. From what I could see in your code, you update the model using the entire sequence, so I am guessing the gradients are backpropagated 128 steps?

Thanks in advance,

Miguel

MarcoMeter commented 4 years ago

Did you figure out an answer @miguelsuau ?

Bump @ikostrikov

I started a discussion in the PyTorch forums and used PyTorchViz to visualize the backpropagation graph of this implementation and my own, but so far I have not gained any meaningful insights.

miguelsuau commented 4 years ago

I am quite certain gradients get backpropagated through the whole sequence (128 steps) but it would be good if @ikostrikov could confirm this.

MarcoMeter commented 4 years ago

I printed the shape of the inputs to the GRU layer and observed that the sequence lengths vary (probably depending on the episode length). So the max length of the sequences is 128.

ikostrikov commented 4 years ago

Sorry @miguelsuau, I've just noticed the issue. Yes, it backpropagates through the whole sequence (128 in this case).
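
A minimal, self-contained sketch (illustrative only, not code from this repository) of why that is the default behaviour in PyTorch: feeding the whole rollout to an nn.GRU in a single call unrolls it over every timestep, so backward() sends gradients through all of them. The sizes below are example values.

```python
import torch
import torch.nn as nn

T, N, H = 128, 8, 64        # num-steps, num-processes, hidden size (example values)
gru = nn.GRU(H, H)

x = torch.randn(T, N, H)    # the whole rollout in a single call
h0 = torch.zeros(1, N, H)

out, hT = gru(x, h0)        # the GRU is unrolled over all T timesteps internally
loss = out[-1].sum()        # dummy loss on the last timestep
loss.backward()             # gradients flow back through all 128 steps
```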

MarcoMeter commented 4 years ago

In the case of running:

python main.py --env-name "PongNoFrameskip-v4" --algo ppo --use-gae --lr 2.5e-4 --clip-param 0.1 --value-loss-coef 0.5 --num-processes 8 --num-steps 512 --num-mini-batch 4 --log-interval 1 --use-linear-lr-decay --entropy-coef 0.01 --recurrent-policy

and adding print(x[start_idx:end_idx].size()) at model.py:154, the output is:

torch.Size([325, 2, 512])
torch.Size([28, 2, 512])
torch.Size([159, 2, 512])

I suppose the first dimension (sequence length) is related to the episode length and the second dimension (batch size) to num-processes, since there are two processes in each mini-batch. However, that doesn't quite add up, as two processes are unlikely to have the same episode length.

Could you shed some light into this @ikostrikov ?

a-z-e-r-i-l-a commented 3 years ago

@MarcoMeter Any update on this?

I was also wondering whether this makes sense when using a high number of steps, since long sequences make learning with a GRU or LSTM more difficult, right? Perhaps clipping the length of "start_idx:end_idx" would help.

MarcoMeter commented 3 years ago

I pretty much abandoned this repository to work on my own implementation with more comments and documentation. Still WIP. https://github.com/MarcoMeter/neroRL/tree/update/sequence_buffer_masked_loss

recurrent policy doc

miguelsuau commented 3 years ago

@MarcoMeter, I guess you already figured this out, but just for reference: I think the sequences can contain experiences from different episodes. The gradients are simply zeroed out (using masks) so they are not backpropagated from one episode to another.

@a-z-e-r-i-l-a, it depends on the environment. If the agent needs to memorize events that are as far in the past as the whole sequence, then you need the gradients to backpropagate through the entire sequence. If not, you can shorten the sequence length and see whether this improves the sample complexity.
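
As a rough sketch of the masking idea mentioned above (hypothetical code, not copied from the repo): multiplying the recurrent state by a 0/1 done-mask at each step both resets the state at an episode boundary and blocks gradients from flowing into the previous episode, because the upstream gradient is multiplied by zero there.

```python
import torch
import torch.nn as nn

gru_cell = nn.GRUCell(16, 32)                 # example sizes: obs dim 16, hidden dim 32
h = torch.zeros(4, 32)                        # hidden state for 4 parallel environments

obs_seq = torch.randn(8, 4, 16)               # a fake 8-step rollout
done_seq = torch.zeros(8, 4)
done_seq[3, 1] = 1.0                          # pretend env 1 starts a new episode at step 3

for obs, done in zip(obs_seq, done_seq):
    mask = (1.0 - done).unsqueeze(1)          # 0 at an episode boundary, 1 otherwise
    h = gru_cell(obs, h * mask)               # h * 0 resets the state and cuts the gradient path

h.sum().backward()                            # for env 1, gradients stop at the masked step
```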

binaryoung commented 3 years ago

@MarcoMeter I also used PyTorchViz to check whether the gradient of the recurrent policy was correctly backpropagated. At first, I got the same result as you showed in the PyTorch forum post: the GRU module only appeared once. After some debugging, I found that PyTorchViz does not seem to show the GRU recursion correctly when PyTorch runs on the GPU. After forcing it onto the CPU, I found that the gradient did get backpropagated through the whole episode.

Regarding the unequal episode lengths: the first dimension represents the length of one episode segment, and the first dimensions of these three tensors add up to the num-steps parameter (325 + 28 + 159 = 512). The point of splitting is to reset the hidden state when an episode ends, so the hidden state is not passed across episode boundaries.
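
A simplified sketch of that splitting logic (paraphrased, not an exact copy of model.py): the rollout is cut at every timestep where some environment's mask is zero, the GRU is run over each chunk, and the hidden state is multiplied by the mask at the chunk boundary. This is why the first dimensions of the printed tensors add up to num-steps.

```python
import torch
import torch.nn as nn

T, N, H = 512, 2, 512                    # num-steps, processes per mini-batch, hidden size
gru = nn.GRU(H, H)

x = torch.randn(T, N, H)                 # features for one mini-batch
masks = torch.ones(T, N)
masks[325, 0] = 0.0                      # pretend an episode ended here...
masks[353, 1] = 0.0                      # ...and another one here
hxs = torch.zeros(1, N, H)

# timesteps where at least one environment starts a new episode
has_zeros = (masks == 0.0).any(dim=-1).nonzero().squeeze(-1).tolist()
boundaries = sorted(set([0] + has_zeros + [T]))

outputs = []
for start_idx, end_idx in zip(boundaries[:-1], boundaries[1:]):
    # reset the hidden state only for environments whose episode just ended
    out, hxs = gru(x[start_idx:end_idx], hxs * masks[start_idx].view(1, -1, 1))
    outputs.append(out)                  # chunk lengths: 325, 28, 159 -> they sum to 512

x = torch.cat(outputs, dim=0)            # back to shape [T, N, H]
```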

MarcoMeter commented 3 years ago

@binaryoung Thanks for sharing your findings!

A couple of weeks ago I published a baseline/reference implementation that does truncated BPTT: https://github.com/MarcoMeter/recurrent-ppo-truncated-bptt
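
For context, the core idea behind truncated BPTT (a generic sketch under assumed shapes, not code from that repository) is to cut the rollout into fixed-length sequences and detach the hidden state between them, so gradients flow over at most seq_len steps:

```python
import torch
import torch.nn as nn

T, N, H, seq_len = 128, 8, 64, 16      # rollout length, envs, hidden size, truncation length
gru = nn.GRU(H, H)

x = torch.randn(T, N, H)
h = torch.zeros(1, N, H)

losses = []
for start in range(0, T, seq_len):
    h = h.detach()                      # cut the gradient path at the truncation boundary
    out, h = gru(x[start:start + seq_len], h)
    losses.append(out.pow(2).mean())    # stand-in for the actual policy/value loss

torch.stack(losses).sum().backward()    # each segment backpropagates at most seq_len steps
```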