huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

VRAM not releasing #610

Closed Ichigo3766 closed 1 year ago

Ichigo3766 commented 1 year ago

Hi!

I would like to understand why the VRAM is not being released after a request completes. Something I noticed is that when I send queries the VRAM fills up, but after the answer is received, that query's memory still remains allocated.

Now what's interesting is that if you query the model but close the connection mid-response rather than letting it finish, all the VRAM gets flushed and we are left with only the model's VRAM, which is perfect. So why does that not happen automatically after each request, when closing the connection in the middle of a request clears all the VRAM?

I have tested this multiple times from the LangChain wrapper, a Flask website, and curl, and all show the same result. The model loads fine at around 40 GB, but after long conversations it climbs all the way up to 90 GB, stays there, and does not flush. The moment I do the trick of closing the connection, the VRAM resets right back to 40 GB.

bikal-netomi commented 1 year ago

Facing the same issue.

Because the VRAM is not released, the server crashes with an out-of-memory error for me after n subsequent requests.

Currently, after every n requests it crashes, I restart the Docker container, and the cycle repeats.

I monitored with watch nvidia-smi and can see the memory climbing with each request up to the point of the crash, and then the process repeats.

Here's the docker command I used

docker run --gpus all --shm-size 20g -p 8080:80 --name tiiuae/falcon-40b-instruct --log-driver=local --log-opt max-size=10m --log-opt max-file=3 -v $volume:/data --env BUILD_EXTENSIONS=False --env NCCL_SHM_DISABLE=1 ghcr.io/huggingface/text-generation-inference:0.9 --model-id tiiuae/falcon-40b-instruct --num-shard 4 --quantize "bitsandbytes" --max-best-of 1 --max-total-tokens 2048 --max-input-length 2047 --trust-remote-code

OlivierDehaene commented 1 year ago

I would like to understand why the VRAM is not being released after a request completes?

It is!

Even though nvidia-smi or other tools might not show it, PyTorch correctly destroys the tensors. This will not be reflected in nvidia-smi because PyTorch has its own allocator.

Basically, when you create a new tensor on the GPU, PyTorch allocates memory on the device. When you destroy this tensor, PyTorch keeps the memory space inside its own caching allocator and doesn't release the device memory. This is done to speed up new allocations. Only calling torch.cuda.empty_cache releases the cached blocks and makes nvidia-smi display the true free memory.

We do this in some operations to avoid memory fragmentation over time, but it is not required after each completion.

See https://pytorch.org/docs/stable/notes/cuda.html#cuda-memory-management for more info on this subject.
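
As a standalone illustration of this behavior (a minimal sketch, not TGI code), the following shows the gap between memory actually in use, memory held by PyTorch's caching allocator, and what nvidia-smi reports:

```python
import torch

# Allocate a large tensor on the GPU, then drop the only reference to it.
x = torch.empty(1024, 1024, 1024, device="cuda")  # ~4 GiB of float32
del x

# The tensor is destroyed, but the caching allocator keeps the block
# reserved for future allocations, so nvidia-smi still counts it as used.
print(torch.cuda.memory_allocated())  # ~0 bytes actually in use
print(torch.cuda.memory_reserved())   # still ~4 GiB held in the cache

# Returning the cached blocks to the driver makes nvidia-smi drop again.
torch.cuda.empty_cache()
print(torch.cuda.memory_reserved())   # back to ~0 bytes
```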

@bikal-netomi, if you are not running on 0.9.1, can you try it? I think it's better to open a new issue for this, as the problem might be related only to the bitsandbytes integration.

Ichigo3766 commented 1 year ago

Replying back as I'm still unsure about this. I'm facing the issue along with many others, and it's not just a few GB but a whole 40-45 GB worth of VRAM that is not being cleared, which ends up crashing the server. I am not using any quantization whatsoever.

Are there any suggestions that could solve this problem? I have tried lowering the token limits and everything, and the VRAM just keeps filling until it crashes unless I stop the connection mid-stream, which resets the VRAM back to normal. But as you would know, this is not an efficient thing to do. It's causing issues, as I'm having to restart the server multiple times a day due to the context load.

Thank you!

OlivierDehaene commented 1 year ago
  1. Can you add the info from the bug report template? You can open a new issue if that's easier for you.
  2. I'm having a hard time understanding when you OOM. Is it with a high RPS number, or do you always run one request at a time?

Ichigo3766 commented 1 year ago

1. The full command line used that causes issues: docker run --gpus all --shm-size 40g -p 1000:80 -v $volume:/data hug:gptq --model-id TheBloke/WizardLM-13B-V1-1-SuperHOT-8K-fp16 --num-shard 4 --trust-remote-code --max-input-length 4096 --max-total-tokens 10000 --max-batch-total-tokens 10000

OS version: Amazon Linux 2

Model being used: TheBloke/WizardLM-13B-V1-1-SuperHOT-8K-fp16

Hardware used (GPUs, how many, on which cloud) (nvidia-smi): 4 NVIDIA A10Gs totaling 96 GB of VRAM on an AWS EC2 instance

Built the Docker image locally, pulling the most recent changes as of now

  2. Depending on the context being sent, ranging from 100 to 4k tokens, the VRAM keeps going up. If I'm sending around 4k of context with each query, it crashes after about 10-20 requests. If multiple people are using it, it goes up even faster. I have tested both the multi-user scenario and solo use (one request at a time) having a conversation with the bot, and it eventually OOMs. It's not about how many requests are sent to the server at a time; it's more of a build-up where the VRAM keeps going up with each inference. The model starts with about 11 GB on each GPU. The image below shows the current state, with about 2-3 questions asked and a maximum of 70 tokens altogether.

[screenshot: nvidia-smi output showing initial memory usage]

Now I will ask multiple queries and show below how fast the VRAM goes up and stays there.

[screenshot: nvidia-smi output after multiple queries]

It took 8 queries of around 2k-ish tokens each to crash the server. The queries were sent one at a time. Note: while this memory is going up, the VRAM cannot be used by anything else, and the responses start slowing down as well.

[screenshot: nvidia-smi output at the time of the crash]

2023-07-15T04:17:14.473703Z ERROR batch{batch_size=1}:prefill:prefill{id=45 size=1}:prefill{id=45 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: Allocation on device 3 would exceed allowed memory. (out of memory)

What's interesting is how the last GPU's memory gets dropped: when the server crashes, all GPUs are maxed out, yet only the last GPU drops its memory, and the server is still unable to allocate.

Hopefully this helps!

Hugoch commented 1 year ago

Hi,

I'm encountering the same issue starting with TGI 0.9+.

System Info

OS version: Ubuntu 22.04.2 LTS

Model:

{
  "model_id": "tiiuae/falcon-40b",
  "model_sha": "561820f7eef0cc56a31ea38af15ca1acb07fab5d",
  "model_dtype": "torch.float16",
  "model_device_type": "cuda",
  "model_pipeline_tag": "text-generation",
  "max_concurrent_requests": 128,
  "max_best_of": 2,
  "max_stop_sequences": 4,
  "max_input_length": 1535,
  "max_total_tokens": 2048,
  "waiting_served_ratio": 1.2,
  "max_batch_total_tokens": 16000,
  "max_waiting_tokens": 20,
  "validation_workers": 2,
  "version": "0.9.1",
  "sha": "31b36cca21fcd0e6b7db477a7545063e1b860156",
  "docker_label": "sha-31b36cc"
}

Hardware used:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  | 00000000:5E:00.0 Off |                    0 |
| N/A   33C    P0              30W / 250W |      4MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 PCIe               On  | 00000000:86:00.0 Off |                    0 |
| N/A   53C    P0              86W / 350W |  81001MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100 PCIe               On  | 00000000:D8:00.0 Off |                    0 |
| N/A   48C    P0              78W / 310W |  81001MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    1   N/A  N/A      9284      C   /opt/conda/bin/python3.9                  80988MiB |
|    2   N/A  N/A      9285      C   /opt/conda/bin/python3.9                  80988MiB |
+---------------------------------------------------------------------------------------+

Runtime: TGI Docker 0.9.1

Reproduction

  1. Run TGI
    docker run -it --rm --gpus '"device=1,2"' --shm-size 1g -p 10000:80 -v $(pwd)/data:/data -e USE_FLASH_ATTENTION=true  ghcr.io/huggingface/text-generation-inference:0.9.1 --model-id tiiuae/falcon-40b --num-shard 2 --max-batch-total-tokens 16000 --max-input-length 1535 --max-total-tokens 2048
  2. Start benchmark
    docker exec -it 2fc7d8dba690 text-generation-benchmark --tokenizer-name tiiuae/falcon-40b -b1 -b4 -b8 -b16 -b32 -b64 -b128 -b256 -b 384 -b512
  3. The benchmark runs and keeps filling memory up to batch size 384, then OOMs. There is no issue on TGI 0.8.2. Could it be linked to the PagedAttention implementation? [screenshots: memory usage during the benchmark]

Hope it helps!

OlivierDehaene commented 1 year ago

Could you try ghcr.io/huggingface/text-generation-inference:sha-a2cf1bd and see if it solves your issue?

arlima commented 1 year ago

Hi, same problem here (sha-a2cf1bd): [screenshot: nvidia-smi output]

No problem with 0.8.2.

Ichigo3766 commented 1 year ago

@OlivierDehaene Thank you so much! I believe that fixed the issue. I will leave this open for others to try it and then close it if everything looks good!

Hugoch commented 1 year ago

@OlivierDehaene I can confirm that issue is fixed on my side too. Thanks a lot!

OlivierDehaene commented 1 year ago

That's still very weird. We shouldn't have to empty the cache manually like this. I will see if the torch folks have a better answer than this and report back to this issue.
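
For context, a manual cache flush in a serving loop would look roughly like the sketch below. This is a hypothetical illustration, not the actual patch in sha-a2cf1bd; the helper name and the flush interval are made up.

```python
import torch

def decode_step(model_forward, batch, step: int, flush_every: int = 64):
    # Run one generation step as usual (model_forward is a placeholder
    # for whatever callable produces the next tokens).
    output = model_forward(batch)

    # Hypothetical workaround: periodically hand cached blocks back to
    # the driver. This adds synchronization overhead and should normally
    # not be needed, which is why relying on it is surprising.
    if step % flush_every == 0:
        torch.cuda.empty_cache()

    return output
```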

bikal-netomi commented 1 year ago

@OlivierDehaene Thank you for the fix.