Computing value of terminal state

openai / spinningup

An educational resource to help anyone learn deep reinforcement learning.

https://spinningup.openai.com/

MIT License

9.96k stars 2.19k forks source link

Computing value of terminal state #257

Open Ankur-Deka opened 4 years ago

Ankur-Deka commented 4 years ago

In the Pytorch implementation of the algorithms, there is a slight issue with computing value of terminal state. Let's look at the example of PPO:

Line 321 in ppo.py: if timeout or epoch_ended: _,v,_ = ac.step(o) else: v = 0

I think the condition should rather be: if not d

It might so happen that d and epoch_ended occur at the same time. We would want to check whether it's the end of the episode when assigning v to terminal state.

yuanzhi0515 commented 4 years ago

Same question in PyTorch implemenation of VPG.And I find that tensorflow implementation of vpg is different with PyTorch.

Alberto-Hache commented 2 years ago

This same issue has been raised and a solution proposed in pull request 'Fix value bootstrapping' #233

However, that PR suggests using v for timeout conditions too:

timeout = ep_len == max_ep_len
...
if d and not timeout:
    v = 0
else:
    _, v, _ = ac.step(torch.as_tensor(o, dtype=torch.float32))

I'd favor @Ankur-Deka 's solution since max_ep_len can be reached in non terminal states (unrelated to environment's dynamics).