Closed: UniverseFly closed this discussion 9 months ago
Hey there 👋. Indeed, it's computationally impossible to enumerate all trajectories. But there's another reason too: we might not know the state distribution, since it's part of the environment dynamics, which we may not have access to.
The idea instead is, as you mention, to estimate the gradient from the trajectories you sampled.
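To make the sampling-based estimate concrete, here is a minimal sketch (a toy example of my own, not the course's code) of the REINFORCE-style Monte Carlo estimator. The point to notice: the agent only ever *samples* from the environment; the transition probabilities never appear in the gradient computation itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def sample_episode(theta, horizon=5):
    """Roll out one episode in a hypothetical 2-state, 2-action MDP.

    theta has shape (2, 2): a tabular softmax policy, one row per state.
    Returns the summed score function grad log pi(a_t|s_t) and the return.
    """
    s = 0
    score_sum = np.zeros_like(theta)
    ret = 0.0
    for _ in range(horizon):
        probs = softmax(theta[s])
        a = rng.choice(2, p=probs)
        # Gradient of log softmax w.r.t. theta[s]: one_hot(a) - probs
        g = -probs
        g[a] += 1.0
        score_sum[s] += g
        # Toy dynamics: the agent samples from them but never
        # differentiates them (they don't involve theta at all).
        s = (s + a) % 2
        ret += 1.0 if s == 1 else 0.0
    return score_sum, ret

def policy_gradient_estimate(theta, n_episodes=1000):
    """Score-function (REINFORCE) estimate of grad_theta J(theta)."""
    grads = np.zeros_like(theta)
    for _ in range(n_episodes):
        score_sum, ret = sample_episode(theta)
        grads += score_sum * ret
    return grads / n_episodes

theta = np.zeros((2, 2))  # uniform initial policy
grad = policy_gradient_estimate(theta)
```

Nothing in `policy_gradient_estimate` needs a model of `s = (s + a) % 2`; swap in any black-box simulator and the estimator is unchanged.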
Thanks, Thomas, for your explanation! I fully understand that we need to estimate the gradient. However, I am still a little confused about whether the formula of the policy gradient depends on knowledge of the environment dynamics in a mathematical sense, as the proof from the optional chapter seems to still hold if we replace the summation with an integral.
Thanks for this great course! I really enjoy it. However, I have a question about the passage stating that being unable to differentiate the environment dynamics makes it impossible to compute the policy gradient (https://huggingface.co/learn/deep-rl-course/unit4/policy-gradient):
From my understanding after reading some other articles, the policy gradient theorem says that the objective is differentiable regardless of the environment dynamics: its gradient can be written without any derivatives of the dynamics. We resort to estimating the gradient only because it is computationally infeasible to enumerate all possible trajectories, not because the gradient is undefined. So my guess is that the statement above may not be very accurate.
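Concretely, the step I have in mind is the standard log-derivative trick (notation is my own; $\mu$ is the initial-state distribution, $P$ the transition kernel, $R(\tau)$ the return of trajectory $\tau$):

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\nabla_\theta \log P(\tau;\theta)\, R(\tau)\right],
\qquad
\log P(\tau;\theta)
  = \log \mu(s_0)
  + \sum_t \log P(s_{t+1}\mid s_t, a_t)
  + \sum_t \log \pi_\theta(a_t\mid s_t).
```

Since $\mu$ and $P$ do not depend on $\theta$, their terms vanish under $\nabla_\theta$, leaving $\nabla_\theta \log P(\tau;\theta) = \sum_t \nabla_\theta \log \pi_\theta(a_t\mid s_t)$. The dynamics therefore never need to be differentiated; they only enter through the distribution we sample trajectories from.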
I am just a beginner in RL; please feel free to let me know if I am wrong.