Closed: alexanderdicke-webcom closed this issue 3 months ago
Thanks for the report @alexanderdicke-webcom!
After discussing it briefly with Olivier it seems linked to the torch allocator acting up. You're right that after warming up, it should be able to handle the sequences you pass to the model, so we'll need to take a look.
cc @OlivierDehaene
I got similar OOM issues (34B LLaMA on 2x A6000), which happen with sha-74b0231 but not with sha-eade737.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
System Info
TGI Version: v2.0.4
Model: mistralai/Mixtral-8x22B-Instruct-v0.1
Hardware: 4x Nvidia H100 70GB HBM3
Deployment specificities: OpenShift
Information
Tasks
Reproduction
Running TGI with
MAX_BATCH_PREFILL_TOKENS=35000
MAX_INPUT_LENGTH=35000
MAX_TOTAL_TOKENS=36864
NUM_SHARD=4
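For reference, a roughly equivalent standalone launcher invocation (outside our OpenShift deployment) would look something like the following; the image tag, port, and volume path are illustrative, and the environment variables are simply passed through to text-generation-launcher:

docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data \
  -e NUM_SHARD=4 \
  -e MAX_BATCH_PREFILL_TOKENS=35000 \
  -e MAX_INPUT_LENGTH=35000 \
  -e MAX_TOTAL_TOKENS=36864 \
  ghcr.io/huggingface/text-generation-inference:2.0.4 \
  --model-id mistralai/Mixtral-8x22B-Instruct-v0.1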
The warmup is successful:
The model successfully handles relatively small requests. When sending larger requests (close to, but still below, the maximum input length), the prefill operation crashes with an OOM:
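To illustrate the shape of such a request (not the crash log): a single long call against the standard /generate endpoint is enough to trigger it; the prompt placeholder below stands in for a real prompt that is close to, but below, 35,000 tokens:

curl -s http://localhost:8080/generate \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "<prompt of roughly 34k tokens>", "parameters": {"max_new_tokens": 1024}}'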
I experimented with setting
MAX_BATCH_SIZE=1
to make sure that the number of tokens in the batch stays below the max batch total tokens calculated by TGI. However, the error still occurs.
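If I understand the token accounting correctly, with MAX_BATCH_SIZE=1 a batch can reserve at most 1 x MAX_TOTAL_TOKENS = 36,864 KV-cache slots, and a prefill of ~35,000 input tokens plus the remaining decode budget stays within that, which is also the prefill size the warmup already ran (MAX_BATCH_PREFILL_TOKENS=35000). So the request should fit in the budget that warmup validated.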
Expected behavior
There is no OOM, since the max batch total tokens computed by TGI should be correct and the request stays within it.