With this PR, `bucket_idx` (the scalar chunk that multiplies each bucket sum) no longer depends on the value of `j` within an SMVP thread.
This lets us replace the per-thread double-and-add in SMVP with a separate running-sum thread that runs after SMVP, since each bucket sum only needs to be multiplied by its scalar chunk to give the correct result.
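
To illustrate why a running sum can stand in for the per-bucket multiplication, here is a minimal sketch, not the code in this PR. It assumes bucket `k` (1-indexed) must be multiplied by the scalar chunk `k`, and it uses plain integers in place of curve points; the function names are hypothetical. The running-sum version uses only additions, which is what makes it attractive when the operands are elliptic-curve points.

```python
def weighted_sum_naive(buckets):
    """Multiply each bucket sum by its scalar chunk (1-indexed) and add them up."""
    return sum((k + 1) * b for k, b in enumerate(buckets))


def weighted_sum_running(buckets):
    """Compute the same total using additions only (running-sum trick).

    Walking from the highest bucket down, `running` holds the sum of all
    buckets seen so far. Adding `running` to `total` at every step means
    bucket k ends up counted exactly k times.
    """
    running = 0  # stand-in for the point at infinity
    total = 0
    for b in reversed(buckets):
        running += b
        total += running
    return total


# Integers stand in for bucket sums (assumption for illustration only).
example = [5, 7, 11]
result_naive = weighted_sum_naive(example)      # 1*5 + 2*7 + 3*11
result_running = weighted_sum_running(example)
```

For `n` buckets the running-sum version takes `2n` additions and no scalar multiplications, so no double-and-add is needed per bucket.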