ikostrikov / implicit_q_learning

MIT License
226 stars · 38 forks

The log_prob is not corrected #9

Closed typoverflow closed 1 year ago

typoverflow commented 1 year ago

Hi, thanks for releasing the code. I noticed that in the policy network you squash the mean with tanh without correcting the log-probability, as SAC does, for example, in its policy parameterization. Will this bias the estimate of the policy gradient? https://github.com/ikostrikov/implicit_q_learning/blob/09d700248117881a75cb21f0adb95c6c8a694cb2/policy.py#L56

I'm debugging my implementations of IQL and XQL, and I'm not sure whether this causes the performance gap. Please correct me if I've misunderstood anything.
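For reference, the SAC-style correction the question refers to is a change-of-variables term that applies when the sampled action itself is squashed, i.e. a = tanh(u): log π(a|s) = log μ(u|s) − Σᵢ log(1 − tanh(uᵢ)²). A minimal NumPy sketch of that correction (the function name and `eps` clamp are illustrative, not from this repo):

```python
import numpy as np

def tanh_corrected_log_prob(u, base_log_prob, eps=1e-6):
    """SAC-style log-prob correction for a = tanh(u).

    u             : pre-tanh sample(s), shape (..., action_dim)
    base_log_prob : log mu(u|s) under the pre-tanh (e.g. Gaussian) density
    eps           : small clamp to avoid log(0) when tanh saturates
    """
    a = np.tanh(u)
    # log |det da/du| = sum_i log(1 - tanh(u_i)^2)
    correction = np.sum(np.log(1.0 - a**2 + eps), axis=-1)
    return base_log_prob - correction
```

At u = 0 the Jacobian of tanh is 1, so the correction vanishes and the corrected log-prob equals the base log-prob; the further u is into the saturated region, the larger the (subtracted, negative) correction becomes.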

typoverflow commented 1 year ago

Sorry, it seems that IQL is not using tanh-squashed distributions, so no correction is needed. Closing this issue =)
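For contrast with the SAC case: if tanh is applied only to the *mean* of the Gaussian and the sampled action is not squashed, the action distribution is still an ordinary Gaussian, so its log-density is valid as-is and no Jacobian term is needed. A minimal sketch under that reading (names are illustrative, not taken from `policy.py`):

```python
import numpy as np

def gaussian_log_prob(a, mean, std):
    # Standard diagonal-Gaussian log-density. No correction term appears
    # because the sample `a` is never passed through tanh; only the mean was.
    return np.sum(
        -0.5 * ((a - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2.0 * np.pi),
        axis=-1,
    )

raw_mean = np.array([1.5])
mean = np.tanh(raw_mean)          # tanh squashes the mean only
lp = gaussian_log_prob(np.array([0.8]), mean, 0.5)
```

The trade-off is that such a policy has unbounded support (samples can leave the action box and must be clipped externally), whereas the tanh-squashed parameterization bounds actions by construction at the cost of the correction term above.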