huggingface / deep-rl-class

This repo contains the syllabus of the Hugging Face Deep Reinforcement Learning Course.
Apache License 2.0

[QUESTION] How does P(τ;θ) disappear when estimating the gradient from trajectory samples? #495

Open ritwikmishra opened 8 months ago

ritwikmishra commented 8 months ago

I am referring to the gradient derivation here.

In the paragraph where the instructor claims "we can approximate the likelihood ratio policy gradient with a sample-based estimate", the term P(τ;θ) (the probability of trajectory τ given the parameters θ) disappears from the subsequent summation. Why?

I asked the same question on the discord study-group (here) but got no response.

simoninithomas commented 8 months ago

Hey there 👋

So P(τ;θ) is the probability of a trajectory, but we can't compute it directly, since that would require knowing the environment dynamics (the state transition distribution).

If you look at the formulas that follow, what we do is replace the sum over all trajectories, weighted by P(τ;θ), with an average over m trajectories sampled from the policy itself.

Don't hesitate to take a piece of paper and write out each part step by step to understand it better. That's how I did it.
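In other words, a sketch of the omitted step (using the same symbols as the course notes): the weighted sum over all trajectories is an expectation, and a Monte Carlo estimate of an expectation is a plain average over samples drawn from that distribution, so P(τ;θ) is absorbed into the act of sampling τ⁽ⁱ⁾ from the policy:

```latex
\nabla_\theta J(\theta)
  = \sum_{\tau} P(\tau;\theta)\, \nabla_\theta \log P(\tau;\theta)\, R(\tau)
  = \mathbb{E}_{\tau \sim P(\tau;\theta)}\!\left[\nabla_\theta \log P(\tau;\theta)\, R(\tau)\right]
  \approx \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)};\theta)\, R(\tau^{(i)})
```

The trajectories τ⁽ⁱ⁾ are generated by running the current policy, so frequently occurring trajectories already show up more often in the sample; multiplying by P(τ⁽ⁱ⁾;θ) again would count that probability twice.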

ritwikmishra commented 8 months ago

@simoninithomas I am sorry, but it is still unclear to me. My doubt is: how did we jump from this (first attached formula, the sum weighted by P(τ;θ))

to this (second attached formula, the plain average over sampled trajectories)?

Shouldn't it be as follows (third attached formula), with P(τ⁽ⁱ⁾;θ) still appearing inside the summation?
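For intuition, here is a toy Python sketch (with invented probabilities and returns, not from the course) showing why sampling makes the explicit P(τ;θ) factor unnecessary, and why keeping it inside the sum, as the third formula suggests, gives the wrong answer:

```python
import random

# Hypothetical toy "trajectories" with known probabilities and returns
# (illustrative values only, not from the course material).
probs = {"a": 0.5, "b": 0.3, "c": 0.2}
returns = {"a": 1.0, "b": 2.0, "c": 4.0}

# Exact expectation: sum over all trajectories of P(tau) * R(tau).
exact = sum(probs[t] * returns[t] for t in probs)  # 0.5*1 + 0.3*2 + 0.2*4 = 1.9

# Draw m trajectories FROM the distribution, as the policy would.
random.seed(0)
m = 200_000
samples = random.choices(list(probs), weights=list(probs.values()), k=m)

# Correct Monte Carlo estimate: the sampling already reflects P(tau),
# so we just average R(tau) -- no explicit P(tau) factor needed.
mc = sum(returns[t] for t in samples) / m

# Incorrect estimate: multiplying by P(tau) again counts it twice,
# which is what the proposed third formula would compute.
wrong = sum(probs[t] * returns[t] for t in samples) / m

print(exact)  # 1.9
print(mc)     # close to 1.9
print(wrong)  # noticeably smaller than 1.9
```

Frequent trajectories already appear more often among the samples, which is exactly what the P(τ;θ) weight encoded; that is why it vanishes from the sample-based estimate.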