ikostrikov / implicit_q_learning

MIT License
226 stars · 38 forks

The log_prob is not corrected #9

Closed typoverflow closed 1 year ago

typoverflow commented 1 year ago

Hi, thanks for releasing the code. I noticed that in the policy network you squash the mean with tanh without correcting the log-probability, as SAC does, for example, in its policy parameterization. Will this bias the estimate of the policy gradient? https://github.com/ikostrikov/implicit_q_learning/blob/09d700248117881a75cb21f0adb95c6c8a694cb2/policy.py#L56

I'm debugging my implementations of IQL and XQL, and I'm not sure whether this causes the performance gap. Please correct me if I've misunderstood anything.
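For reference, the SAC-style correction the question refers to is a change-of-variables term that applies when the sampled action itself is squashed, i.e. a = tanh(u): log π(a|s) = log μ(u|s) − Σᵢ log(1 − tanh(uᵢ)²). A minimal NumPy sketch of that correction (the function name and `eps` clamp are illustrative, not from this repo):

```python
import numpy as np

def tanh_corrected_log_prob(u, base_log_prob, eps=1e-6):
    """SAC-style log-prob correction for a = tanh(u).

    u             : pre-tanh sample(s), shape (..., action_dim)
    base_log_prob : log mu(u|s) under the pre-tanh (e.g. Gaussian) density
    eps           : small clamp to avoid log(0) when tanh saturates
    """
    a = np.tanh(u)
    # log |det da/du| = sum_i log(1 - tanh(u_i)^2)
    correction = np.sum(np.log(1.0 - a**2 + eps), axis=-1)
    return base_log_prob - correction
```

At u = 0 the Jacobian of tanh is 1, so the correction vanishes and the corrected log-prob equals the base log-prob; the further u is into the saturated region, the larger the (subtracted, negative) correction becomes.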

typoverflow commented 1 year ago

Sorry, it seems that IQL is not using tanh-squashed distributions, so no correction is needed. Closing this issue =)
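For contrast with the SAC case: if tanh is applied only to the *mean* of the Gaussian and the sampled action is not squashed, the action distribution is still an ordinary Gaussian, so its log-density is valid as-is and no Jacobian term is needed. A minimal sketch under that reading (names are illustrative, not taken from `policy.py`):

```python
import numpy as np

def gaussian_log_prob(a, mean, std):
    # Standard diagonal-Gaussian log-density. No correction term appears
    # because the sample `a` is never passed through tanh; only the mean was.
    return np.sum(
        -0.5 * ((a - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2.0 * np.pi),
        axis=-1,
    )

raw_mean = np.array([1.5])
mean = np.tanh(raw_mean)          # tanh squashes the mean only
lp = gaussian_log_prob(np.array([0.8]), mean, 0.5)
```

The trade-off is that such a policy has unbounded support (samples can leave the action box and must be clipped externally), whereas the tanh-squashed parameterization bounds actions by construction at the cost of the correction term above.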