hkproj / rlhf-ppo

Notes and commented code for RLHF (PPO)

question: how is the gradient of the log probs calculated? #1

Open letitfly opened 4 months ago

letitfly commented 4 months ago

Hi Umar, what an awesome free lecture! I cannot thank you enough for your service to all of us developers.

Sorry to borrow this space for a question. On page 17 of the "RLHF and PPO" slides, it says: "This is an expectation, which means we can approximate it with a sample mean by collecting a set D of trajectories."

As I currently understand it, we sample the trajectories, but what we get from them is the log probs. My question is: how do we go from there to calculating the gradient of the log probs?

Thanks in advance!

letitfly commented 3 months ago

Never mind, I read the code and figured it out.
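
For anyone landing here with the same question, here is a minimal sketch of one way to get from sampled log probs to a gradient estimate. It is not taken from this repository's code. It assumes a toy one-step problem with a softmax policy over three actions, and the names `grad_log_prob`, `policy_gradient_estimate`, and `exact_gradient` are hypothetical. The point is that the log prob is a differentiable function of the policy parameters, so its gradient can be computed for each sampled action, weighted by the reward, and averaged over the samples. In an autograd framework the same thing is usually done by building the surrogate loss (the negative mean of log prob times reward) and backpropagating through it; here the gradient is written out by hand so the example only needs the standard library.

```python
import math
import random


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_prob(logits, action):
    # log pi(action) under a softmax policy parameterised by the logits.
    return math.log(softmax(logits)[action])


def grad_log_prob(logits, action):
    # Analytic gradient of log softmax(logits)[action] w.r.t. each logit:
    # d/d logits[k] = 1[k == action] - softmax(logits)[k]
    probs = softmax(logits)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]


def policy_gradient_estimate(logits, rewards, num_samples, rng):
    # Sample-mean estimate of E_{a ~ pi}[ R(a) * grad log pi(a) ].
    probs = softmax(logits)
    actions = list(range(len(logits)))
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        a = rng.choices(actions, weights=probs)[0]  # sample a one-step "trajectory"
        g = grad_log_prob(logits, a)
        for k in range(len(logits)):
            grad[k] += rewards[a] * g[k] / num_samples
    return grad


def exact_gradient(logits, rewards):
    # Closed form for this toy case: J = sum_a pi(a) R(a),
    # so dJ/d logits[k] = pi(k) * (R(k) - J).
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - expected) for p, r in zip(probs, rewards)]


if __name__ == "__main__":
    logits = [0.5, -0.2, 0.1]   # assumed toy policy parameters
    rewards = [1.0, 0.0, 2.0]   # assumed reward for each action
    rng = random.Random(0)
    print("estimate:", policy_gradient_estimate(logits, rewards, 20000, rng))
    print("exact:   ", exact_gradient(logits, rewards))
```

With enough samples the estimate should approach the exact gradient, which is the sense in which the expectation on the slide is approximated by a sample mean over the collected set D. For multi-step trajectories the per-sample term would instead sum the log prob gradients over the time steps of each trajectory.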