ghliu / pytorch-ddpg

Implementation of the Deep Deterministic Policy Gradient (DDPG) using PyTorch
Apache License 2.0

Why append additional (s a r) pair to the replay buffer after one episode is done? #8

Open Hanrui-Wang opened 5 years ago

Hanrui-Wang commented 5 years ago

Hi Guan-Horng,

Thanks for your great implementation! I am wondering why we append an additional (s, a, r) pair to the replay buffer after an episode is done. The reward in that pair is zero, and I don't think this is mentioned in the original paper.

https://github.com/ghliu/pytorch-ddpg/blob/e9db328ca70ef9daf7ab3d4b44975076ceddf088/main.py#L64

Thank you!

zhihanyang2022 commented 3 years ago

I think this is weird, too.

agent.memory.append(
    observation,
    agent.select_action(observation),
    0., False
)

Also, done is set to False in this tuple, which is more perplexing.
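
For contrast, here is a minimal sketch of how a terminal transition is usually stored when the buffer keeps full (s, a, r, s', done) tuples. This is not the repo's code; env, agent, and replay_buffer are assumed names:

# Hypothetical training loop: the buffer stores complete transitions,
# so the episode boundary is marked by done=True on the last real step
# and no extra padding tuple is appended afterwards.
obs = env.reset()
done = False
while not done:
    action = agent.select_action(obs)
    next_obs, reward, done, _ = env.step(action)
    replay_buffer.append(obs, action, reward, next_obs, done)
    obs = next_obs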

zhihanyang2022 commented 3 years ago

Having said that, I think this probably has a negligible effect on learning, given that the replay buffer is so big, but it would be good for the author to check on this @ghliu .

friedmainfunction commented 2 years ago

In the buffer's code, I guess the terminal state can be used to divide transitions from each episode; given that, I think maybe it's a bug.
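
One reason the done flag matters is that it is typically used to mask bootstrapping in the critic's TD target, so a padding tuple stored with done=False could, if it were ever sampled as a real transition, bootstrap across the episode boundary. A minimal sketch of the standard target computation (function and variable names are assumptions, not the repo's code):

import torch

# Standard DDPG critic target for a sampled batch (hypothetical tensors):
#   y = r + gamma * (1 - done) * Q_target(s', mu_target(s'))
# done == 1 zeroes the bootstrap term, which is why marking terminal
# transitions correctly (and not storing padding with done=False) matters.
def critic_target(reward, next_state, done, target_actor, target_critic, gamma=0.99):
    with torch.no_grad():
        next_action = target_actor(next_state)
        next_q = target_critic(next_state, next_action)
        return reward + gamma * (1.0 - done) * next_q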