Closed: RDearnaley closed this issue 10 months ago.
Confirmed that the issue goes away if I turn off bitsandbytes quantization.
System Info
TGI version: 1.1.0
LLM: Mistral 7B Instruct v0.1
Virtual hardware: Kubernetes on Azure
GPU: nvidia.com/mig-2g.20gb = A100 80GB GPU sliced down to a MIG-2g slice (2/7 processing power, 2/8 = ~20GB GPU RAM)
Operating system: Ubuntu Linux
Kubernetes version: 1.26.6
Node size: Standard_NC24ads_A100_v4
Node image version: AKSUbuntu-2204gen2containerd-202310.04.0
Kernel version: 5.15.0-1049-azure
Container runtime version: containerd://1.7.5-1
Node pool manually created with: az aks nodepool add --mode System --node-osdisk-size 512 --node-osdisk-type Managed --name --resource-group --cluster-name **** --node-count 1 --node-vm-size Standard_NC24ads_A100_v4 --gpu-instance-profile MIG2g
Information
Tasks
Reproduction
Using Kubernetes on Ubuntu:
With the current TGI version, 1.1.0, image:
Environment variables set:
Mistral 7B (the settings make use of its sliding-window attention, but the issue also occurs with MAX_TOTAL_TOKENS=4096 and MAX_INPUT_LENGTH=2048, which don't make use of the sliding window). I'm using bitsandbytes quantization, since setting the quantization to eetq produced errors saying it needed to be installed.
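As a quick sanity check on that parenthetical: Mistral-7B-v0.1 declares a 4096-token sliding window in its config.json, so the window only matters once a sequence can grow past 4096 tokens. A tiny illustrative helper (my own sketch, not TGI code):

```python
# Mistral-7B-Instruct-v0.1 ships with "sliding_window": 4096 in config.json,
# so windowed attention only kicks in for sequences longer than 4096 tokens.
SLIDING_WINDOW = 4096

def uses_sliding_window(max_total_tokens: int, window: int = SLIDING_WINDOW) -> bool:
    """True if sequences can grow past the window and exercise sliding-window attention."""
    return max_total_tokens > window

print(uses_sliding_window(4096))   # False: the MAX_TOTAL_TOKENS=4096 config never uses it
print(uses_sliding_window(8192))   # True: a larger MAX_TOTAL_TOKENS would
```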
Starting the model up produces normal-looking logs:
which have no obvious significant issues in them. Note the line
Setting max batch total tokens to 62624
The first query sent to the server reliably produces a CUDA OOM error:
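The failing request itself is nothing exotic; a minimal stand-in for it (hypothetical in-cluster address and prompt, not my exact payload) looks like this:

```python
# Minimal client-side reproduction sketch: any first /generate call against the
# freshly started server triggers the CUDA OOM on the server side.
import requests

resp = requests.post(
    "http://tgi-service/generate",                    # hypothetical in-cluster address
    json={
        "inputs": "[INST] Say hello in one sentence. [/INST]",
        "parameters": {"max_new_tokens": 64},
    },
    timeout=120,
)
print(resp.status_code)  # 5xx; the CUDA OOM shows up in the server logs
print(resp.text)
```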
Unsuccessful remediations attempted: setting MAX_BATCH_TOTAL_TOKENS to a value less than 62624 doesn't work, since this is a Flash Attention model and my value gets overridden by the inferred value (which presumably is too high). Setting CUDA_MEMORY_FRACTION doesn't work either: while it reduces the inferred max batch total tokens, it is also passed to torch.cuda.set_per_process_memory_fraction(…), which makes CUDA raise an OOM at a correspondingly lower amount of memory used, so it lowers the memory ceiling at the same rate as it reduces the memory overallocation. I briefly experimented with setting:
but was unable to find a value that avoided the problem. In any case, hitting memory fragmentation on the very first query seems unlikely, unless it happened while downloading the model and converting it to safetensors.
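To illustrate why the CUDA_MEMORY_FRACTION route is a wash, here is the coupling as I understand it, sketched as standalone Python (an approximation for illustration, not the actual TGI code):

```python
import os
import torch

fraction = float(os.getenv("CUDA_MEMORY_FRACTION", "1.0"))

# (1) Hard ceiling: allocations beyond fraction * device memory now raise OOM.
torch.cuda.set_per_process_memory_fraction(fraction, device=0)

# (2) Token budget: the inferred max batch total tokens is derived from roughly
#     the same fraction of free memory, so it shrinks in lockstep with the ceiling.
free_bytes, _total_bytes = torch.cuda.mem_get_info(0)
kv_budget_bytes = int(free_bytes * fraction)
# kv_budget_bytes is then turned into max_batch_total_tokens via the measured
# per-token cost (details omitted), so the overallocation margin never improves.
```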
Time-consuming, possibly relevant things I haven't yet tried: eliminating bitsandbytes quantization from the setup; pre-converting the Mistral 7B model to safetensors and putting that up on Hugging Face so my server replicas can download safetensors directly and skip the conversion after download; an exhaustive search of all possible max_split_size_mb values.
Presumably the inferred max batch total tokens calculation is wrong for this new model with these settings (unless we have a fast memory leak), but if so I don't understand it well enough to find the error and fix it. If the MAX_BATCH_TOTAL_TOKENS environment variable weren't overwritten by the inferred value for Flash Attention models, or if there were a variable like CUDA_MEMORY_FRACTION that applied only to the inferred max batch total tokens calculation and wasn't also passed to torch.cuda.set_per_process_memory_fraction(…), then I could manually tweak it to compensate (see the sketch below). I could hack the server Python code to implement either of these, but then I'd need to build my own Docker image with those hacks rather than relying on the official one.
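For concreteness, this is roughly the kind of hack I have in mind, written as a standalone helper rather than an actual patch to TGI; the environment variable names are hypothetical:

```python
import os

def clamp_inferred_batch_total_tokens(inferred: int) -> int:
    """Cap or scale the inferred max batch total tokens without also lowering
    the torch.cuda.set_per_process_memory_fraction(...) ceiling."""
    scale = float(os.getenv("BATCH_TOKEN_BUDGET_FRACTION", "1.0"))   # hypothetical variable
    override = os.getenv("MAX_BATCH_TOTAL_TOKENS_OVERRIDE")          # hypothetical variable
    clamped = int(inferred * scale)
    if override is not None:
        clamped = min(clamped, int(override))
    return clamped

# With the inferred 62624 from the startup logs and BATCH_TOKEN_BUDGET_FRACTION=0.8:
# clamp_inferred_batch_total_tokens(62624) -> 50099
```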
Expected behavior
Server can perform inference without an immediate CUDA OOM on the first query (or indeed any early query).