In the parallel retention code, the normalization denominator uses `.sum(dim=-1, keepdim=True).abs()`. However, in the chunkwise retention, the normalization uses `.abs().sum()`. From my perspective, `.abs().sum()` is better than `.sum().abs()` for the normalization denominator, since real values of opposite sign may cancel each other during the summation. So is it a typo here?
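
To make the concern concrete, here is a minimal sketch in plain Python rather than the repository's tensor code. The helper names `abs_of_sum` and `sum_of_abs` and the sample rows are hypothetical, chosen only to mirror `.sum().abs()` and `.abs().sum()` on a single row of scores.

```python
def abs_of_sum(row):
    # Mirrors .sum().abs(): add the signed values first, then take the magnitude.
    return abs(sum(row))


def sum_of_abs(row):
    # Mirrors .abs().sum(): take each magnitude first, then add them up.
    return sum(abs(x) for x in row)


# Hypothetical rows of retention scores with mixed signs.
partly_cancelling = [1.0, -1.0, 0.5]
fully_cancelling = [2.0, -2.0]

for row in (partly_cancelling, fully_cancelling):
    print(row, "abs_of_sum =", abs_of_sum(row), "sum_of_abs =", sum_of_abs(row))
```

For the second row the signed sum is exactly zero, so a denominator built that way would vanish unless it is clamped elsewhere, while the sum of magnitudes stays at 4.0. The two forms only agree when all values in the row share the same sign.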