While studying your Mario PPO code, https://github.com/uvipen/Super-mario-bros-PPO-pytorch/blob/master/train.py, I find the following code hard to understand:
################################################################################
values = torch.cat(values).detach()  # torch.Size([4096])
states = torch.cat(states)
gae = 0
R = []
for value, reward, done in list(zip(values, rewards, dones))[::-1]:  # len(list(zip(values, rewards, dones))[::-1]) is 512
    gae = gae * opt.gamma * opt.tau
    gae = gae + reward + opt.gamma * next_value.detach() * (1 - done) - value.detach()
    next_value = value
    R.append(gae + value)
##################################################################################
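For reference, this is how I understand textbook GAE over the full rollout, with every collected value visited once per step (just a sketch with my own variable names and [num_steps, num_processes] shapes, not code from your repo):
################################################################################
import torch

def compute_gae(rewards, values, dones, next_value, gamma, tau):
    # rewards, values, dones: [num_steps, num_processes]; next_value: [num_processes]
    returns = torch.zeros_like(rewards)
    gae = torch.zeros_like(next_value)
    for t in reversed(range(rewards.size(0))):
        not_done = 1.0 - dones[t]
        # one-step TD residual at step t
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # exponentially weighted sum of residuals
        gae = delta + gamma * tau * not_done * gae
        returns[t] = gae + values[t]
        next_value = values[t]
    return returns
################################################################################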
Question: with --num_local_steps=512 and --num_processes=8, after "values = torch.cat(values).detach()", values.shape is torch.Size([4096]). But the list "list(zip(values, rewards, dones))[::-1]" has length 512, which means only the first 512 items of "values" are used in the for loop; the rest are discarded.
So, in every 512 local steps, only the values of the first 64 (= 512/8) steps are used to compute GAE and R. Is this a problem, or am I misunderstanding something?
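Here is a minimal repro of the length mismatch, using dummy tensors in the shapes described above (my own stand-in setup, not your actual rollout data):
################################################################################
import torch

num_local_steps, num_processes = 512, 8

# values after torch.cat: one scalar per process per step, flattened
values = torch.zeros(num_local_steps * num_processes)  # torch.Size([4096])
# rewards/dones as collected: one entry per local step, each covering all processes
rewards = [torch.zeros(num_processes) for _ in range(num_local_steps)]
dones = [torch.zeros(num_processes) for _ in range(num_local_steps)]

pairs = list(zip(values, rewards, dones))[::-1]
print(len(pairs))  # 512 -- zip stops at the shortest input, so values[512:] are never visited
################################################################################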
Looking forward to your answer, thanks!