huggingface / huggingface-llama-recipes


Model's output from the prompt reuse optimization recipe does not match the non-cached generation #78

Open sannat17 opened 1 week ago

sannat17 commented 1 week ago

In performance_optimization/prompt_reuse.py, the current method of storing the cached prompt does not discard the KV cache entry for the last prompt token (it instead follows the same caching recipe that model.generate uses internally), so reusing that cache changes the logits of the first generated token.

For context, look at these comments and discussions:

After running some preliminary tests, I found that the current prompt_reuse.py recipe consistently generates different outputs than the non-cached generation, while the method from the linked GitHub issue produces outputs that match the non-cached baseline.
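
A minimal sketch of what the corrected flow could look like (the model ID, prompts, and generation settings below are placeholder assumptions; the key difference from the current recipe is cropping the last prompt token's KV out of the cache before reuse, so generate re-processes that token the same way the non-cached run does):

```python
import copy

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

INITIAL_PROMPT = "You are a helpful assistant. "
inputs_initial = tokenizer(INITIAL_PROMPT, return_tensors="pt").to(model.device)

# Pre-fill the cache over the shared prompt (same as the current recipe).
prompt_cache = DynamicCache()
with torch.no_grad():
    prompt_cache = model(**inputs_initial, past_key_values=prompt_cache).past_key_values

prompt = INITIAL_PROMPT + "Why are french fries called french fries?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Reuse a copy of the cache, but crop off the last prompt token's KV entry so
# that token is re-processed during generation; this is what makes the cached
# outputs match the non-cached baseline.
past_key_values = copy.deepcopy(prompt_cache)
past_key_values.crop(inputs_initial["input_ids"].shape[1] - 1)

cached_out = model.generate(
    **inputs, past_key_values=past_key_values, max_new_tokens=32, do_sample=False
)

# Non-cached baseline for comparison.
baseline_out = model.generate(**inputs, max_new_tokens=32, do_sample=False)

print(tokenizer.decode(cached_out[0], skip_special_tokens=True))
print(tokenizer.decode(baseline_out[0], skip_special_tokens=True))
```

With greedy decoding, the two decoded strings should be identical once the cache is cropped this way, whereas the current recipe's uncropped cache can diverge from the first generated token onward.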

sannat17 commented 6 days ago

Note: