First of all, thanks for this amazing repo!
I am trying to clarify why the log_prob of the action taken by the policy is calculated as in this line: https://github.com/pranz24/pytorch-soft-actor-critic/blob/847edf58a5e5f206ff2ea5e2d993c08972729a15/model.py#L103
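For reference, this is my (possibly simplified) reading of what the `sample` method does around that line; I've abbreviated it and assumed a small `epsilon` constant, so please treat it as a sketch rather than the exact code:

```python
import torch
from torch.distributions import Normal

epsilon = 1e-6  # small constant assumed for numerical stability

def sample_sketch(mean, log_std, action_scale, action_bias):
    """Sketch of a tanh-squashed Gaussian policy sample (my reading of model.py)."""
    std = log_std.exp()
    normal = Normal(mean, std)
    x_t = normal.rsample()                      # reparameterized Gaussian sample
    y_t = torch.tanh(x_t)                       # squash to (-1, 1)
    action = y_t * action_scale + action_bias   # rescale to the env's action range
    log_prob = normal.log_prob(x_t)
    # The line I'm asking about: change-of-variables correction for tanh + scaling,
    # which makes log_prob depend on action_scale.
    log_prob -= torch.log(action_scale * (1 - y_t.pow(2)) + epsilon)
    log_prob = log_prob.sum(1, keepdim=True)
    return action, log_prob
```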
When this resulting log_prob is used in the loss that updates alpha, there seems to be an imbalance between it and the way target_entropy is computed: the target only depends on the dimensionality of the action vector, while the log_prob is also affected by the action_scale.
In the end, aren't we just comparing a target entropy with the entropy of the policy? And since, for a Gaussian, the latter is essentially determined by the standard deviation, couldn't we return that entropy in place of the log_pi returned by policy.sample(state), or simply the sum of the elements of normal.log_prob(x_t), as if the line indicated above were removed?
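To make the comparison I mean more concrete, this is roughly how I understand the temperature update (paraphrased, not the exact code from the repo):

```python
import torch

# Paraphrase of how I understand the alpha update, not the exact repo code.
action_dim = 3                        # e.g. a 3-dimensional action space
target_entropy = -float(action_dim)   # target depends only on action dimensionality

log_alpha = torch.zeros(1, requires_grad=True)
alpha_optim = torch.optim.Adam([log_alpha], lr=3e-4)

def alpha_update(log_pi):
    """log_pi comes from policy.sample(state) and already includes the action_scale term."""
    alpha_loss = -(log_alpha * (log_pi + target_entropy).detach()).mean()
    alpha_optim.zero_grad()
    alpha_loss.backward()
    alpha_optim.step()
    return log_alpha.exp()
```

So the target side ignores action_scale, while log_pi does not, which is the mismatch I'm asking about.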
Thanks in advance. Sorry if I've said something stupid, but I am confused and would really appreciate some help understanding what's going on.