Open Wei-Lin-Intel opened 3 weeks ago
@ssarkar2 @libinta Please help to review it, thanks!
@Wei-Lin-Intel I verified the commands you provided, and also verified them after swapping the model name for Llama-2-7b-hf. Can you explain what the --bucket_internal and --reuse_cache options do and their expected effect on performance?
I see similar throughput, but a reduction in maximum memory allocated and in first-token latency when I use those flags for Llama.
reuse_cache means inference pre-allocates the KV cache buffer inside the model, so the entire KV cache tensor does not need to be returned at each generation step. bucket_internal is used to reduce attention computation: when combined with reuse_cache, it uses cache_idx to select the relevant slice of the KV cache instead of attending over the entire cache length.
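To make the idea concrete, here is a minimal, illustrative sketch of a pre-allocated KV cache with an index-based slice for attention. Shapes and names are assumptions for illustration only, not the optimum-habana implementation:

```python
import torch

batch, heads, max_len, head_dim = 4, 32, 2048, 128

# reuse_cache-style behavior: allocate the full KV buffers once, up front,
# instead of growing/returning cache tensors at every decoding step.
k_cache = torch.zeros(batch, heads, max_len, head_dim)
v_cache = torch.zeros(batch, heads, max_len, head_dim)

def attend(q, k_new, v_new, cache_idx):
    # Write the new token's K/V into the pre-allocated buffers in place.
    k_cache[:, :, cache_idx - 1 : cache_idx] = k_new
    v_cache[:, :, cache_idx - 1 : cache_idx] = v_new
    # bucket_internal-style selection: attend only over the first cache_idx
    # positions (the current bucket) rather than the full max_len buffer.
    k = k_cache[:, :, :cache_idx]
    v = v_cache[:, :, :cache_idx]
    scores = torch.softmax(q @ k.transpose(-1, -2) / head_dim**0.5, dim=-1)
    return scores @ v
```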
The code quality check failed; please run `make style`.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
I think we should add a test for beam search here: https://github.com/huggingface/optimum-habana/blob/main/tests/test_text_generation_example.py
Will add it
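For reference, a rough sketch of what a beam-search test entry could look like, assuming a pytest-style parametrization; the model list and the `_run_generation_test` helper are hypothetical and do not reflect the actual structure of test_text_generation_example.py:

```python
import pytest

# Hypothetical parametrization: (model_name, num_beams) pairs to cover the
# beam search + reuse_cache + bucket_internal code path fixed in this PR.
BEAM_SEARCH_MODELS = [
    ("Qwen/Qwen2-7b-Instruct", 3),
    ("meta-llama/Llama-2-7b-hf", 3),
]

@pytest.mark.parametrize("model_name, num_beams", BEAM_SEARCH_MODELS)
def test_text_generation_beam_search(model_name: str, num_beams: int):
    # Assumed helper that launches run_generation.py with the given flags
    # and checks throughput against a reference value.
    _run_generation_test(
        model_name,
        extra_args=[
            "--num_beams", str(num_beams),
            "--reuse_cache",
            "--bucket_internal",
            "--bucket_size", "128",
        ],
    )
```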
What does this PR do?
This PR fixes various errors in the beam search code path when bucket_size, reuse_cache and bucket_internal are used. Currently it only supports models of type llama and qwen2 when reuse_cache is disabled, since reorder_cache has to be added in the modeling code.
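For context, this is a minimal sketch of the generic transformers-style cache reordering that beam search relies on: the cached key/value states are permuted along the beam dimension to follow the selected hypotheses. It is a standard pattern, not the exact optimum-habana implementation (in modeling code it is usually a static method on the model class):

```python
from typing import Tuple
import torch

def reorder_cache(
    past_key_values: Tuple[Tuple[torch.Tensor, ...], ...],
    beam_idx: torch.Tensor,
) -> Tuple[Tuple[torch.Tensor, ...], ...]:
    # Beam search reorders hypotheses at every decoding step, so the cached
    # key/value tensors must be permuted along the batch*beam dimension to
    # stay aligned with the surviving beams.
    return tuple(
        tuple(
            past_state.index_select(0, beam_idx.to(past_state.device))
            for past_state in layer_past
        )
        for layer_past in past_key_values
    )
```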
Test command
python run_generation.py --model_name_or_path Qwen/Qwen2-7b-Instruct --use_hpu_graphs --use_kv_cache --trim_logits --use_flash_attention --max_input_tokens 128 --max_new_tokens 128 --batch_size 4 --limit_hpu_graphs --reuse_cache --bucket_internal --bucket_size 128 --bf16 --num_beams 3
python run_generation.py --model_name_or_path Qwen/Qwen2-7b-Instruct --use_hpu_graphs --use_kv_cache --trim_logits --use_flash_attention --max_input_tokens 128 --max_new_tokens 128 --batch_size 4 --limit_hpu_graphs --bucket_size 128 --bf16 --num_beams 3
Before submitting