Closed: yao531441 closed this issue 3 days ago.
@yuanwu2017 is looking into it.
I can reproduce this issue. It is an OOM issue. Debugging is in progress.
In this case the warmup was not run because of incorrect startup parameters. After I corrected the startup parameters, the OOM happened during the warmup process, which means Llama-2-7B cannot run with batch_size 32; that is what causes the OOM issue. PREFILL_BATCH_BUCKET_SIZE=1, because max_prefill_batch_size = max-batch-prefill-tokens / max-input-length = 2048 / 2048 = 1.
Likewise, max_decode_batch_size = max-batch-total-tokens / max-total-tokens = 65536 / 4096 = 16, but BATCH_BUCKET_SIZE=32, so you need to set max-batch-total-tokens to 131072 (32 * 4096).
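A minimal sketch of that sizing arithmetic, using the formulas quoted above (the variable names are mine, not TGI's):

```python
# Sizing check for the launch parameters discussed above.
# Variable names are illustrative; the formulas are the ones quoted in this comment.
max_input_length = 2048
max_total_tokens = 4096
max_batch_prefill_tokens = 2048
batch_bucket_size = 32  # BATCH_BUCKET_SIZE

max_prefill_batch_size = max_batch_prefill_tokens // max_input_length    # 2048 / 2048 = 1
required_max_batch_total_tokens = batch_bucket_size * max_total_tokens   # 32 * 4096 = 131072

print(f"max_prefill_batch_size = {max_prefill_batch_size}")
print(f"required --max-batch-total-tokens = {required_max_batch_total_tokens}")
```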
The launch command with the adjusted parameters:
docker run -p 18080:80 --runtime=habana \
  -v /data/huggingface/hub:/data \
  -e HABANA_VISIBLE_DEVICES=all \
  -e HUGGING_FACE_HUB_TOKEN=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 \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  -e TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true \
  -e PREFILL_BATCH_BUCKET_SIZE=1 \
  -e BATCH_BUCKET_SIZE=32 \
  -e PAD_SEQUENCE_TO_MULTIPLE_OF=256 \
  -e ENABLE_HPU_GRAPH=true \
  -e LIMIT_HPU_GRAPH=true \
  -e USE_FLASH_ATTENTION=true \
  -e FLASH_ATTENTION_RECOMPUTE=true \
  --cap-add=sys_nice --ipc=host \
  ghcr.io/huggingface/tgi-gaudi:2.0.4 \
  --model-id meta-llama/Llama-2-7b-chat-hf \
  --max-input-length 2048 --max-total-tokens 4096 \
  --max-batch-prefill-tokens 2048 --max-batch-total-tokens 131072 \
  --max-waiting-tokens 7 --waiting-served-ratio 1.2 --max-concurrent-requests 64
Optimum-habana also cannot support batch_size=32 with max_input_length=2048.
https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation
Command:
python run_generation.py --model_name_or_path meta-llama/Llama-2-7b-chat-hf --use_hpu_graphs --use_kv_cache --max_new_tokens 2048 --max_input_tokens 2048 --do_sample --batch_size 32 --prompt "How are you?" --bf16
@regisss @mandy-li Please close this issue.
@yao-matrix
Reproduction
docker run -p 18080:80 --runtime=habana \
  -v /data/huggingface/hub:/data \
  -e HABANA_VISIBLE_DEVICES=all \
  -e HUGGING_FACE_HUB_TOKEN=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 \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  -e TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true \
  -e PREFILL_BATCH_BUCKET_SIZE=2 \
  -e BATCH_BUCKET_SIZE=32 \
  -e PAD_SEQUENCE_TO_MULTIPLE_OF=256 \
  -e ENABLE_HPU_GRAPH=true \
  -e LIMIT_HPU_GRAPH=true \
  -e USE_FLASH_ATTENTION=true \
  -e FLASH_ATTENTION_RECOMPUTE=true \
  --cap-add=sys_nice --ipc=host \
  ghcr.io/huggingface/tgi-gaudi:2.0.4 \
  --model-id meta-llama/Llama-2-7b-chat-hf \
  --max-input-length 2048 --max-total-tokens 4096 \
  --max-batch-prefill-tokens 2048 --max-batch-total-tokens 65536 \
  --max-waiting-tokens 7 --waiting-served-ratio 1.2 --max-concurrent-requests 64
Error log
Expected behavior
The TGI server should return a correct output result.
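For reference, a minimal sketch of the kind of request the reporter expects to succeed (port 18080 is taken from the reproduction command above; the payload follows the standard TGI /generate request shape, and the prompt is only an example):

```python
import requests

# Simple smoke test against the TGI server started by the reproduction command
# (host port 18080 is mapped to the container's port 80).
resp = requests.post(
    "http://localhost:18080/generate",
    json={"inputs": "How are you?", "parameters": {"max_new_tokens": 64}},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```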