Fixes a bug in the position id adjustment when using kv caching and padding together. Previously, the slicing was applied inside the `attention_mask == 0` masking step, but it should have happened after the `cumsum` step. This only has an effect when kv caching and padding are used together, so it should not affect training, only generation.
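To illustrate the ordering issue, here is a minimal sketch (not the actual model code; the function name and shapes are hypothetical) of how position ids are typically derived from an attention mask: cumsum assigns positions over the full mask, padding slots get a dummy value, and only then is the result sliced down to the new tokens when a kv cache supplies the past.

```python
from itertools import accumulate

def position_ids(attention_mask, past_length=0):
    """Derive position ids for one sequence from its attention mask.

    attention_mask: list of 0/1, with 0 marking (left) padding.
    past_length: number of tokens already covered by the kv cache.
    """
    # Cumulative sum over the *full* mask so real tokens are numbered
    # consecutively, skipping padding.
    pos = [c - 1 for c in accumulate(attention_mask)]
    # Dummy value for padded slots (they are masked out of attention anyway).
    pos = [1 if m == 0 else p for m, p in zip(attention_mask, pos)]
    # Slice only *after* cumsum + masking: with a kv cache, keep just the
    # positions of the newly fed tokens. Slicing earlier would shift the
    # numbering whenever padding is present.
    if past_length > 0:
        pos = pos[past_length:]
    return pos

# Left-padded sequence: two pad tokens, then three real tokens.
mask = [0, 0, 1, 1, 1]
print(position_ids(mask))                 # full forward pass → [1, 1, 0, 1, 2]
print(position_ids(mask, past_length=4))  # one-token generation step → [2]
```

Without padding the two orderings agree, which is why the bug only surfaced during padded generation with a kv cache.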