With the author's implementation, the policy loss goes below -350, whereas with accel it can't even reach -300, which leads to slower and unstable learning.
The unstable performance seems to be caused by my treating timeout frames as done frames in the preprocessing here.
I'll fix the line later, but the high policy loss issue still needs to be investigated.
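For reference, a minimal sketch of the timeout handling mentioned above. It is not the accel code: `split_done_flags` and `td_target` are hypothetical helper names, and it assumes the dataset exposes separate `terminals` and `timeouts` flag arrays, as D4RL datasets do as far as I know. The idea is that only true terminations should stop bootstrapping, while timeouts only mark episode boundaries.

```python
import numpy as np


def split_done_flags(terminals, timeouts):
    """Hypothetical helper: separate 'stop bootstrapping' from 'episode ended'."""
    terminals = np.asarray(terminals, dtype=bool)
    timeouts = np.asarray(timeouts, dtype=bool)
    # Only real terminations zero out the next-state value in the TD target.
    bootstrap_mask = 1.0 - terminals.astype(np.float32)
    # Either flag ends the trajectory, so both mark an episode boundary.
    episode_end = terminals | timeouts
    return bootstrap_mask, episode_end


def td_target(reward, next_q, bootstrap_mask, gamma=0.99):
    """One-step TD target; a timeout frame still bootstraps from next_q."""
    return reward + gamma * bootstrap_mask * next_q


if __name__ == "__main__":
    # Frame 1 is a timeout, frame 2 is a true termination.
    mask, ends = split_done_flags([0, 0, 1, 0], [0, 1, 0, 0])
    print(mask, ends)
    print(td_target(1.0, 10.0, mask, gamma=0.9))
```

Treating the timeout frame as done would drop the `gamma * next_q` term for it, which biases the Q target downward on frames where the episode was only cut off. This sketch assumes the next observation stored for a timeout frame is usable; if it isn't, dropping those transitions is another option.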
`policy_loss` in SAC_CQL is significantly higher than in the official implementation when tested with `hopper-expert-v0` in d4rl.
https://github.com/waffoo/accel/blob/af3f511ea816b2dd80346fe5a0b5e2b395c190ad/accel/agents/sac_cql.py#L261