ikostrikov / pytorch-trpo

PyTorch implementation of Trust Region Policy Optimization
MIT License

It seems that the importance sampling code part is wrong. #22

Open yhy258 opened 1 year ago

yhy258 commented 1 year ago

https://github.com/ikostrikov/pytorch-trpo/blob/e200eb8a23b3c7941a0091efb9750dafa4b23cbb/main.py#L108-L119

The line that computes `fixed_log_prob` and the corresponding part of the `get_loss` function are exactly the same. Since the two parts are executed consecutively, the two values (`fixed_log_prob` and `log_prob`) come out identical. Is there a reason the code was written this way?
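For reference, the pattern being asked about looks roughly like the sketch below (a paraphrase for discussion, not the exact code from main.py; `policy_net`, `normal_log_density`, `states`, `actions`, and `advantages` follow the repo's naming):

```python
# Rough paraphrase of the pattern in question (not the exact repo code).
action_means, action_log_stds, action_stds = policy_net(states)

# Snapshot of the current policy's log-probabilities, detached from the graph.
fixed_log_prob = normal_log_density(actions, action_means,
                                    action_log_stds, action_stds).detach()

def get_loss():
    # Recomputed through the network, so this tensor carries a grad_fn.
    action_means, action_log_stds, action_stds = policy_net(states)
    log_prob = normal_log_density(actions, action_means,
                                  action_log_stds, action_stds)
    # Importance-sampling surrogate: exp(log pi_theta - log pi_old).
    action_loss = -advantages * torch.exp(log_prob - fixed_log_prob)
    return action_loss.mean()
```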

asyua-ye commented 10 months ago

`get_kl` also has this problem.

HaoxiangYou commented 1 month ago

Hi, I believe the code is correct. If you inspect `fixed_log_prob`, it is a constant tensor (not dependent on the network parameters), whereas `log_prob` shows `grad_fn = ...` (it does depend on the network parameters). This is exactly what we want for importance sampling: treat the current parameters as the fixed old policy. The same logic applies to `get_kl()`.
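A minimal, self-contained sketch of why this works: at the current parameters the importance ratio evaluates to exactly 1, but because only the old log-probability is detached, the ratio still has a non-zero gradient with respect to the parameters. The names below are illustrative toys, not taken from the repo:

```python
import torch

# Toy 1-D Gaussian policy: pi_theta(a) = N(mean, 1), log-prob up to a constant.
mean = torch.tensor([0.5], requires_grad=True)
action = torch.tensor([1.0])
advantage = torch.tensor([2.0])

def log_prob(mean_param):
    return -0.5 * (action - mean_param) ** 2

# "Old" policy = current policy, but detached (a constant tensor, no grad_fn).
fixed_log_prob = log_prob(mean).detach()
current_log_prob = log_prob(mean)          # carries grad_fn

ratio = torch.exp(current_log_prob - fixed_log_prob)
print(ratio)        # tensor([1.]) -- value is 1 at the current parameters

surrogate = -(advantage * ratio).mean()
surrogate.backward()
print(mean.grad)    # non-zero: -advantage * d/dtheta log pi_theta(a)
```

The same reasoning applies to `get_kl`: the KL divergence between the detached and the live distribution evaluates to zero at the current parameters, but its second derivatives (used for the Fisher-vector products in the conjugate-gradient step) do not vanish.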