huggingface / huggingface-llama-recipes


Model's output from the prompt reuse optimization recipe does not match the non-cached generation #78

Open sannat17 opened 1 week ago

sannat17 commented 1 week ago

In performance_optimization/prompt_reuse.py, the current method of storing the cached prompt does not discard the KV cache entry for the last prompt token (it instead follows the same caching recipe that model.generate uses internally), so reusing that cache changes the logits of the first generated token.

For context, look at these comments and discussions:

After running some preliminary tests, I found that the current prompt_reuse.py recipe consistently generates different outputs than the non-cached generation, while the method from the linked GitHub issue produces outputs that match the non-cached baseline.
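
A minimal sketch of what the corrected flow could look like (the model ID, prompts, and generation settings below are placeholder assumptions; the key difference from the current recipe is cropping the last prompt token's KV out of the cache before reuse, so generate re-processes that token the same way the non-cached run does):

```python
import copy

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

INITIAL_PROMPT = "You are a helpful assistant. "
inputs_initial = tokenizer(INITIAL_PROMPT, return_tensors="pt").to(model.device)

# Pre-fill the cache over the shared prompt (same as the current recipe).
prompt_cache = DynamicCache()
with torch.no_grad():
    prompt_cache = model(**inputs_initial, past_key_values=prompt_cache).past_key_values

prompt = INITIAL_PROMPT + "Why are french fries called french fries?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Reuse a copy of the cache, but crop off the last prompt token's KV entry so
# that token is re-processed during generation; this is what makes the cached
# outputs match the non-cached baseline.
past_key_values = copy.deepcopy(prompt_cache)
past_key_values.crop(inputs_initial["input_ids"].shape[1] - 1)

cached_out = model.generate(
    **inputs, past_key_values=past_key_values, max_new_tokens=32, do_sample=False
)

# Non-cached baseline for comparison.
baseline_out = model.generate(**inputs, max_new_tokens=32, do_sample=False)

print(tokenizer.decode(cached_out[0], skip_special_tokens=True))
print(tokenizer.decode(baseline_out[0], skip_special_tokens=True))
```

With greedy decoding, the two decoded strings should be identical once the cache is cropped this way, whereas the current recipe's uncropped cache can diverge from the first generated token onward.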

sannat17 commented 6 days ago

Note: