huggingface / huggingface-llama-recipes


Modification for Model Generation in `quantized_cache.py` #77

Open Mefisto04 opened 3 weeks ago

Mefisto04 commented 3 weeks ago

To improve the functionality and usability of the model generation code in performance_optimization/quantized_cache.py, the following enhancements are proposed:

  1. Handle Multiple Prompts: Implement a function to process multiple prompts in a single execution, allowing batch processing and improving efficiency.

  2. Control Over Output Length and Sampling: Add parameters to allow users to specify the maximum output length and whether to use sampling or greedy decoding for text generation.

  3. Batch Processing: Optimize the code to process inputs in batches, reducing the overhead of multiple calls to the model and improving performance.
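The three points above could be sketched roughly as follows. This is a minimal illustration, not the repository's code: the helper name `generate_batch`, the `quanto` backend choice, and the 4-bit setting are assumptions for the example, and the `cache_implementation="quantized"` / `cache_config` arguments assume a recent `transformers` release that supports quantized KV caches in `generate`.

```python
from typing import List


def generate_batch(model, tokenizer, prompts: List[str],
                   max_new_tokens: int = 64, do_sample: bool = False) -> List[str]:
    """Generate completions for several prompts in one batched call,
    with user control over output length and sampling vs. greedy decoding."""
    # Left-pad so each sequence's generation continues from its prompt's end.
    tokenizer.padding_side = "left"
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Tokenize all prompts together so the model is called once per batch.
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,          # cap on generated length
        do_sample=do_sample,                    # False -> greedy decoding
        cache_implementation="quantized",       # quantized KV cache
        cache_config={"backend": "quanto", "nbits": 4},  # illustrative settings
    )

    # Strip the prompt tokens so only the newly generated text is decoded.
    new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
    return tokenizer.batch_decode(new_tokens, skip_special_tokens=True)
```

A caller would load a model and tokenizer with `AutoModelForCausalLM.from_pretrained` / `AutoTokenizer.from_pretrained` as the existing script does, then pass a list of prompts plus the desired `max_new_tokens` and `do_sample` values.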

Please assign this to me so that I can contribute to it.