Hi Sebastian,
I am working on a project implementing BPTT (backpropagation through time). I see that in your implementation the states used for policy updates are sampled from the replay buffer. According to the RL objective J(\theta) = E_{s_0 \sim \rho_0}[V^{\pi_\theta}(s_0)], where \rho_0 is the initial state distribution, shouldn't we sample the starting states from the initial distribution instead?
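
To make sure I'm describing the same thing, here is a minimal sketch of the two options I have in mind. Everything below (the toy linear dynamics, the names `step`, `bptt_update`, and the stand-in start-state batches) is my own placeholder for illustration, not your repo's actual API:

```python
import torch

# Toy differentiable dynamics: s' = A s + B a, reward = -||s'||^2.
# These dynamics and dimensions are made up purely for illustration.
A = torch.eye(2) * 0.9
B = torch.eye(2)

def step(s, a):
    s_next = s @ A.T + a @ B.T
    reward = -(s_next ** 2).sum(dim=-1)
    return s_next, reward

policy = torch.nn.Linear(2, 2)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def bptt_update(start_states, horizon=16):
    """One BPTT policy update: roll out from `start_states` and
    backpropagate the return through the differentiable dynamics."""
    s, total = start_states, 0.0
    for _ in range(horizon):
        a = policy(s)
        s, r = step(s, a)
        total = total + r.mean()
    loss = -total
    optimizer.zero_grad()
    loss.backward()   # gradients flow through the whole rollout (BPTT)
    optimizer.step()

# Option A (what I see in the code): start states drawn from a replay
# buffer, i.e. states visited mid-trajectory under earlier policies.
replay_states = torch.randn(32, 2)  # stand-in for buffer samples
bptt_update(replay_states)

# Option B (what the objective suggests): start states drawn from the
# initial distribution rho_0, here a stand-in s_0 ~ N(0, I).
init_states = torch.randn(32, 2)
bptt_update(init_states)
```

Is Option A a deliberate choice (e.g. for sample efficiency or state coverage), even though it optimizes the value averaged over the buffer's state distribution rather than over rho_0?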
Thanks for your wonderful code!
Shenao