Fixes a bug in the position id adjustment when using kv caching and padding together. Previously, the slicing was applied inside the `attention_mask == 0` masking step, but it should have happened after the `cumsum` step. This only has an effect when kv caching and padding are used together, so it should not affect training, only generation.
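To illustrate the ordering issue, here is a minimal sketch (not the actual model code; the function name and shapes are hypothetical) of how position ids are typically derived from an attention mask: cumsum assigns positions over the full mask, padding slots get a dummy value, and only then is the result sliced down to the new tokens when a kv cache supplies the past.

```python
from itertools import accumulate

def position_ids(attention_mask, past_length=0):
    """Derive position ids for one sequence from its attention mask.

    attention_mask: list of 0/1, with 0 marking (left) padding.
    past_length: number of tokens already covered by the kv cache.
    """
    # Cumulative sum over the *full* mask so real tokens are numbered
    # consecutively, skipping padding.
    pos = [c - 1 for c in accumulate(attention_mask)]
    # Dummy value for padded slots (they are masked out of attention anyway).
    pos = [1 if m == 0 else p for m, p in zip(attention_mask, pos)]
    # Slice only *after* cumsum + masking: with a kv cache, keep just the
    # positions of the newly fed tokens. Slicing earlier would shift the
    # numbering whenever padding is present.
    if past_length > 0:
        pos = pos[past_length:]
    return pos

# Left-padded sequence: two pad tokens, then three real tokens.
mask = [0, 0, 1, 1, 1]
print(position_ids(mask))                 # full forward pass → [1, 1, 0, 1, 2]
print(position_ids(mask, past_length=4))  # one-token generation step → [2]
```

Without padding the two orderings agree, which is why the bug only surfaced during padded generation with a kv cache.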