huggingface / deep-rl-class

This repo contains the syllabus of the Hugging Face Deep Reinforcement Learning Course.

[QUESTION] Is the differentiability of state distribution really necessary for deriving policy gradient? #385

Closed UniverseFly closed 9 months ago

UniverseFly commented 11 months ago

Thanks for this great course! I really enjoy it. However, I have a question about the content saying that being unable to differentiate the environment dynamics makes it impossible to compute the policy gradient (https://huggingface.co/learn/deep-rl-course/unit4/policy-gradient):

We have another problem that I explain in the next optional section. To differentiate this objective function, we need to differentiate the state distribution, called the Markov Decision Process dynamics. This is attached to the environment. It gives us the probability of the environment going into the next state, given the current state and the action taken by the agent. The problem is that we can’t differentiate it because we might not know about it.

From my understanding after reading some other articles, the policy gradient theorem shows that the objective is differentiable with respect to the policy parameters regardless of whether the environment dynamics are known. We estimate the gradient only because it is computationally infeasible to enumerate all possible trajectories. So my guess is that the statement above may not be very accurate.

I am just a beginner in RL; please feel free to let me know if I am wrong.
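For concreteness, here is the derivation step I have in mind (a sketch in my own notation, not necessarily the course's): the log-derivative trick rewrites the gradient of the objective as an expectation over trajectories, which is what we then estimate by sampling.

```latex
J(\theta) = \mathbb{E}_{\tau \sim P(\tau;\theta)}\big[R(\tau)\big]
          = \sum_{\tau} P(\tau;\theta)\, R(\tau)

\nabla_\theta J(\theta)
  = \sum_{\tau} \nabla_\theta P(\tau;\theta)\, R(\tau)
  = \sum_{\tau} P(\tau;\theta)\, \nabla_\theta \log P(\tau;\theta)\, R(\tau)
  = \mathbb{E}_{\tau \sim P(\tau;\theta)}\big[\nabla_\theta \log P(\tau;\theta)\, R(\tau)\big]
```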

simoninithomas commented 11 months ago

Hey there 👋. Indeed, it's computationally impossible to enumerate all trajectories, but it's also because we might not know the state distribution: it's part of the environment dynamics, and we might not have access to it.

The idea instead is, as you mention, to estimate the gradient from the trajectories you sampled.
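Concretely, a minimal sketch of that Monte Carlo estimate (PyTorch-style; the function and variable names here are illustrative, not from the course notebooks, and assume the action log-probabilities were recorded while sampling):

```python
import torch

def reinforce_gradient_step(policy_optimizer, trajectories, gamma=0.99):
    """One Monte Carlo policy-gradient (REINFORCE) update.

    `trajectories` is assumed to be a list of (log_probs, rewards) pairs,
    where `log_probs` are the log pi_theta(a_t | s_t) tensors recorded while
    acting, and `rewards` the per-step rewards (floats).
    """
    losses = []
    for log_probs, rewards in trajectories:
        # Discounted return G_t for every step, computed backwards in time.
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.append(g)
        returns.reverse()
        returns = torch.tensor(returns)

        # Surrogate loss whose gradient is the sampled policy gradient:
        # -sum_t log pi_theta(a_t | s_t) * G_t. No transition model appears.
        losses.append(-(torch.stack(log_probs) * returns).sum())

    policy_optimizer.zero_grad()
    torch.stack(losses).mean().backward()
    policy_optimizer.step()
```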

UniverseFly commented 10 months ago

Thanks, Thomas, for your explanation! I fully understand we need to estimate the gradient. However, I am still a little confused about whether the formula of the policy gradient depends on knowledge of the environment dynamics in a mathematical sense, as it seems the proof from the optional chapter still holds by replacing the summation with an integral.
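Concretely, writing $\mu$ for the start-state distribution and $P$ for the transition kernel (my notation, not the course's), the dynamics terms seem to vanish as soon as we take the gradient of the trajectory log-probability:

```latex
P(\tau;\theta) = \mu(s_0) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)

\nabla_\theta \log P(\tau;\theta)
  = \underbrace{\nabla_\theta \log \mu(s_0)}_{0}
    + \sum_{t=0}^{T-1}\Big(\underbrace{\nabla_\theta \log P(s_{t+1} \mid s_t, a_t)}_{0}
    + \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big)
  = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
```

So if I understand correctly, the dynamics are still needed to sample trajectories, but the gradient formula itself never requires differentiating (or even knowing) them.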