Open Wei-Lin-Intel opened 3 weeks ago
@ssarkar2 @libinta Please help to review it, thanks!
@Wei-Lin-Intel I verified the commands you provided, and also verified them after swapping the model name for Llama-2-7b-hf. Can you explain what the --bucket_internal and --reuse_cache options do and their expected effect on performance?
I see similar throughput, but a reduction in maximum memory allocated and in first-token latency when I use those flags for Llama.
reuse_cache means inference pre-allocates the KV cache buffer inside the model, so the entire KV cache tensor does not need to be returned at each generation step. bucket_internal is used to reduce attention computation: when combined with reuse_cache, it uses cache_idx to select the relevant slice of the KV cache instead of attending over the entire cache length.
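To make the idea concrete, here is a minimal, illustrative sketch of a pre-allocated KV cache with an index-based slice for attention. Shapes and names are assumptions for illustration only, not the optimum-habana implementation:

```python
import torch

batch, heads, max_len, head_dim = 4, 32, 2048, 128

# reuse_cache-style behavior: allocate the full KV buffers once, up front,
# instead of growing/returning cache tensors at every decoding step.
k_cache = torch.zeros(batch, heads, max_len, head_dim)
v_cache = torch.zeros(batch, heads, max_len, head_dim)

def attend(q, k_new, v_new, cache_idx):
    # Write the new token's K/V into the pre-allocated buffers in place.
    k_cache[:, :, cache_idx - 1 : cache_idx] = k_new
    v_cache[:, :, cache_idx - 1 : cache_idx] = v_new
    # bucket_internal-style selection: attend only over the first cache_idx
    # positions (the current bucket) rather than the full max_len buffer.
    k = k_cache[:, :, :cache_idx]
    v = v_cache[:, :, :cache_idx]
    scores = torch.softmax(q @ k.transpose(-1, -2) / head_dim**0.5, dim=-1)
    return scores @ v
```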
The code quality check failed; please run `make style`.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
I think we should add a test for beam search here: https://github.com/huggingface/optimum-habana/blob/main/tests/test_text_generation_example.py
Will add it
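For reference, a rough sketch of what a beam-search test entry could look like, assuming a pytest-style parametrization; the model list and the `_run_generation_test` helper are hypothetical and do not reflect the actual structure of test_text_generation_example.py:

```python
import pytest

# Hypothetical parametrization: (model_name, num_beams) pairs to cover the
# beam search + reuse_cache + bucket_internal code path fixed in this PR.
BEAM_SEARCH_MODELS = [
    ("Qwen/Qwen2-7b-Instruct", 3),
    ("meta-llama/Llama-2-7b-hf", 3),
]

@pytest.mark.parametrize("model_name, num_beams", BEAM_SEARCH_MODELS)
def test_text_generation_beam_search(model_name: str, num_beams: int):
    # Assumed helper that launches run_generation.py with the given flags
    # and checks throughput against a reference value.
    _run_generation_test(
        model_name,
        extra_args=[
            "--num_beams", str(num_beams),
            "--reuse_cache",
            "--bucket_internal",
            "--bucket_size", "128",
        ],
    )
```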
What does this PR do?
This PR fixes various errors in the beam search code path when bucket_size, reuse_cache and bucket_internal are used. Currently it only supports models of type llama and qwen2 when reuse_cache is disabled, since reorder_cache has to be added in the modeling code.
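For context, this is a minimal sketch of the generic transformers-style cache reordering that beam search relies on: the cached key/value states are permuted along the beam dimension to follow the selected hypotheses. It is a standard pattern, not the exact optimum-habana implementation (in modeling code it is usually a static method on the model class):

```python
from typing import Tuple
import torch

def reorder_cache(
    past_key_values: Tuple[Tuple[torch.Tensor, ...], ...],
    beam_idx: torch.Tensor,
) -> Tuple[Tuple[torch.Tensor, ...], ...]:
    # Beam search reorders hypotheses at every decoding step, so the cached
    # key/value tensors must be permuted along the batch*beam dimension to
    # stay aligned with the surviving beams.
    return tuple(
        tuple(
            past_state.index_select(0, beam_idx.to(past_state.device))
            for past_state in layer_past
        )
        for layer_past in past_key_values
    )
```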
Test command
python run_generation.py --model_name_or_path Qwen/Qwen2-7b-Instruct --use_hpu_graphs --use_kv_cache --trim_logits --use_flash_attention --max_input_tokens 128 --max_new_tokens 128 --batch_size 4 --limit_hpu_graphs --reuse_cache --bucket_internal --bucket_size 128 --bf16 --num_beams 3
python run_generation.py --model_name_or_path Qwen/Qwen2-7b-Instruct --use_hpu_graphs --use_kv_cache --trim_logits --use_flash_attention --max_input_tokens 128 --max_new_tokens 128 --batch_size 4 --limit_hpu_graphs --bucket_size 128 --bf16 --num_beams 3
Before submitting