wertyuilife2 opened this issue 3 months ago (Open)
Thanks for pointing out this discrepancy! The original motivation for this implementation was to account for masked out action dimensions in the multi-task case, but I agree that should ideally be addressed in a way that does not alter the log probabilities when no masking is applied. I'll try to issue a fix soon. Either way, I don't believe this discrepancy should affect results in any meaningful way.
Thanks for the explanation! I also agree that this isn’t a fundamental issue—I was just genuinely curious about the reasoning behind this design choice.
However, based on my previous experiments, at least on the Dog-Run task in DMControl, multiplying by the action dim does seem to bring a noticeable performance improvement compared to not doing so. I suggest paying attention to the performance on this task when applying the fix.
Interesting! Could it be that the scaling increases policy entropy a bit for tasks with large action spaces? I wonder if scaling the entropy coef with the action space dim would yield the same result.
Scaling the entropy coef and scaling the log prob don't seem mathematically identical due to the additional squash() step, but I believe they will have similar effects.
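To make this concrete, here's a minimal self-contained sketch of what I mean, under the assumption that the action-dim factor multiplies the Gaussian term before the squash() correction is applied (the variable names and constants below are made up for illustration, not the repository code):

```python
import math
import torch

torch.manual_seed(0)

action_dim = 38              # e.g. dog-run
entropy_coef = 1e-4          # arbitrary value, for illustration only

mu = torch.zeros(action_dim)
log_std = torch.full((action_dim,), -0.5)
eps = torch.randn(action_dim)
u = mu + eps * log_std.exp()           # pre-squash Gaussian sample
a = torch.tanh(u)                      # squashed action

# Diagonal-Gaussian log density of u, summed over dimensions.
log_normal = (-0.5 * eps.pow(2) - log_std - 0.5 * math.log(2 * math.pi)).sum()
# Tanh change-of-variables correction (the squash() term).
squash_corr = torch.log(1 - a.pow(2) + 1e-6).sum()

# (a) scale only the Gaussian term by action_dim, then apply the squash correction
log_pi_scaled = action_dim * log_normal - squash_corr
loss_a = entropy_coef * log_pi_scaled

# (b) keep the standard log prob and scale the entropy coefficient instead
log_pi_standard = log_normal - squash_corr
loss_b = (entropy_coef * action_dim) * log_pi_standard

# The two differ by entropy_coef * (action_dim - 1) * squash_corr, so they only
# coincide when action_dim == 1 or the squash correction happens to be zero.
print(loss_a.item(), loss_b.item())
```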
I'm also curious about this. I wouldn't expect to have to scale the entropy/logprobs at all since they're already proportional to the action dim, but as @wertyuilife2 said I've also seen performance drops with the corrected equations.
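To spell out what I mean by "already proportional" (a back-of-the-envelope check, ignoring the squash correction, which is itself a per-dimension sum): for a diagonal Gaussian the entropy is

$$H(\pi) = \sum_{i=1}^{d} \tfrac{1}{2}\log\!\left(2\pi e\,\sigma_i^2\right),$$

so both the entropy and the expected log prob already grow linearly with the action dimension $d$, and multiplying the log prob by $d$ on top of that makes the scaling effectively quadratic in $d$.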
And because @nicklashansen wanted to test this, here's the dog-run training reward for the original implementation (gray) and scaling the entropy coefficient directly with the updated equation (blue):
Interesting, thanks for running this! I wonder if this is more of a one-off phenomenon for Dog or if it applies to high-dimensional action spaces in general. I would caution against drawing conclusions from a small set of experiments since it's still RL after all. I'll see if I can run some experiments on this soon :-)
I have some questions about the way logprob is calculated in TDMPC2.
The code I ran shows that the logprob result in TDMPC2 seems to differ from the one computed by torchrl and by standard SAC implementations.
Specifically, TDMPC2 multiplies the logprob result by the action dimension during calculation. Could you please explain why this is done?
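For reference, here is a minimal sketch of the comparison I have in mind (illustrative only: the values are arbitrary and the reference log prob is computed with plain torch.distributions rather than either library's exact code):

```python
import torch
from torch.distributions import Independent, Normal, TransformedDistribution
from torch.distributions.transforms import TanhTransform

torch.manual_seed(0)

action_dim = 6
mu = torch.zeros(action_dim)
std = torch.full((action_dim,), 0.5)

# Standard tanh-squashed Gaussian, the convention used by torchrl / typical SAC code:
# log pi(a) = log N(u; mu, std) - sum_i log(1 - tanh(u_i)^2).
dist = TransformedDistribution(Independent(Normal(mu, std), 1), TanhTransform(cache_size=1))
a = dist.rsample()
standard_logprob = dist.log_prob(a)

# What I understand TDMPC2 to be doing: roughly the same quantity, but additionally
# multiplied by the action dimension.
tdmpc2_style_logprob = action_dim * standard_logprob

print(standard_logprob.item(), tdmpc2_style_logprob.item())
```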