sebascuri / rllib

MIT License

Data sampling of BPTT #6

Open shenao-zhang opened 2 years ago

shenao-zhang commented 2 years ago

Hi Sebastian,

I am working on a project that implements BPTT. I see in your implementation that the states used for policy updates are sampled from the replay buffer. Since the RL objective is J(\theta) = E_{s ~ \rho_0}[V^\pi(s)], where \rho_0 is the initial-state distribution, shouldn't the states be sampled from \rho_0 instead?
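To make the distinction concrete, here is a toy numeric sketch (not taken from rllib; the value function and both distributions are made up for illustration). It Monte-Carlo-estimates E_{s ~ dist}[V(s)] once with states drawn from a hypothetical initial-state distribution and once with states whose distribution has drifted, as a replay buffer's typically does, toward the states the behaviour policy visits. The two estimates of the objective differ, which is exactly why the choice of sampling distribution matters for the policy update:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D example: suppose the value function under the current
# policy is V(s) = -(s - 1)^2, peaking at s = 1.
def value(s):
    return -(s - 1.0) ** 2

# States drawn from an assumed initial-state distribution rho_0 = N(0, 1).
initial_states = rng.normal(loc=0.0, scale=1.0, size=10_000)

# States drawn from a replay buffer whose distribution has drifted toward
# the states the behaviour policy actually visits, here N(1, 0.5).
replay_states = rng.normal(loc=1.0, scale=0.5, size=10_000)

# Monte-Carlo estimates of E_{s ~ dist}[V(s)] under each sampling choice.
j_initial = value(initial_states).mean()
j_replay = value(replay_states).mean()

print(f"objective estimated from initial distribution: {j_initial:.3f}")
print(f"objective estimated from replay buffer:        {j_replay:.3f}")
```

In this toy setup the replay-buffer estimate is noticeably higher than the initial-distribution estimate (roughly -0.25 vs -2.0 analytically), so a policy gradient computed from buffer states optimizes a different objective than J(\theta) = E_{s ~ \rho_0}[V(s)].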

Thanks for your wonderful code!

Shenao