Hi Umar, what an awesome free lecture! I cannot thank you enough for your service to all of us developers.
Sorry to borrow this space for a question. On page 17 of the "RLHF and PPO" slides, it says: "This is an expectation, which means we can approximate it with a sample mean by collecting a set D of trajectories."
As I currently understand it, we sample the trajectories, but what we get from them is the log probs. My question is: how do we go from there to computing the gradient of the log probs?
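To make the question concrete, here is a minimal sketch of how I picture the sample-mean estimate. It is not taken from the slides. It assumes a toy softmax policy over three actions, one-step trajectories, and made-up rewards, and the names `grad_log_prob` and `estimate_policy_gradient` are hypothetical. The gradient of each log prob is written out by hand here; my assumption is that in a framework with automatic differentiation the same quantity would come from backpropagating through the mean of log prob times reward.

```python
import math
import random


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(logits, action):
    # Gradient of log softmax(logits)[action] with respect to the logits:
    # one_hot(action) - softmax(logits).
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def estimate_policy_gradient(logits, rewards, num_trajectories, seed=0):
    # Sample mean over a set D of one-step trajectories of
    # grad(log pi(a)) * R(a), i.e. the score-function estimator.
    rng = random.Random(seed)
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_trajectories):
        action = rng.choices(range(len(logits)), weights=probs)[0]
        g = grad_log_prob(logits, action)
        for i in range(len(logits)):
            grad[i] += g[i] * rewards[action] / num_trajectories
    return grad


# Toy setup (assumed for illustration): uniform policy, only action 0 is rewarded.
logits = [0.0, 0.0, 0.0]
rewards = [1.0, 0.0, 0.0]
estimate = estimate_policy_gradient(logits, rewards, num_trajectories=20000)

# Exact gradient of the expected reward J = sum_a p_a * r_a for comparison:
# dJ/dlogit_i = p_i * (r_i - J).
probs = softmax(logits)
expected_reward = sum(p * r for p, r in zip(probs, rewards))
exact = [p * (r - expected_reward) for p, r in zip(probs, rewards)]
print(estimate, exact)
```

If this picture is right, the sampled log probs are never differentiated as plain numbers. Each one is a function of the policy parameters, and the gradient is taken through that function before averaging.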
Thanks in advance!