huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

AsyncInferenceClient - Unclosed client session #1701

Closed: mhillebrand closed this issue 5 months ago

mhillebrand commented 6 months ago

I'm using TGI with Flan-T5 to process thousands of text extraction requests at a time, on a 4 x A6000 machine. My client class, which uses AsyncInferenceClient, can handle 900 requests at once, but when I try to process 1,000 at once, I receive this error:

Traceback (most recent call last):
  File "/home/matt/miniconda3/envs/vllm/lib/python3.11/site-packages/aiohttp/connector.py", line 992, in _wrap_create_connection
    return await self._loop.create_connection(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/matt/miniconda3/envs/vllm/lib/python3.11/asyncio/base_events.py", line 1086, in create_connection
    raise exceptions[0]
  File "/home/matt/miniconda3/envs/vllm/lib/python3.11/asyncio/base_events.py", line 1070, in create_connection
    sock = await self._connect_sock(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/matt/miniconda3/envs/vllm/lib/python3.11/asyncio/base_events.py", line 951, in _connect_sock
    sock = socket.socket(family=family, type=type_, proto=proto)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/matt/miniconda3/envs/vllm/lib/python3.11/socket.py", line 232, in __init__
    _socket.socket.__init__(self, family, type, proto, fileno)
OSError: [Errno 24] Too many open files
...
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host 127.0.0.1:8003 ssl:default [Too many open files]
Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7fe02c2224d0>
Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7fe02c292650>
Unclosed client session
...

Here's my code:

import asyncio
from huggingface_hub import AsyncInferenceClient

class ExtractorClient:
    def __init__(self):
        # One async client pointed at the local TGI endpoint.
        self.client = AsyncInferenceClient(model='http://127.0.0.1:8003')

    async def make_requests(self, prompts, temperature, max_new_tokens):
        # One text_generation coroutine per prompt, all awaited together.
        tasks = [
            self.client.text_generation(
                prompt=prompt,
                temperature=temperature,
                max_new_tokens=max_new_tokens,
            )
            for prompt in prompts
        ]
        return await asyncio.gather(*tasks)

    def extract(self, prompts, temperature=0.001, max_new_tokens=512):
        if isinstance(prompts, str):
            prompts = [prompts]

        # asyncio.run() creates and closes a fresh event loop on every call.
        return asyncio.run(self.make_requests(prompts, temperature, max_new_tokens))
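For context, the class gets driven roughly like this (a hypothetical driver, not part of my actual code; the prompt strings are placeholders):

# Hypothetical driver: dispatch 1,000 prompts in a single call.
prompts = [f"Extract the fields from document {i}" for i in range(1000)]

client = ExtractorClient()
results = client.extract(prompts)  # all 1,000 requests are gathered at once
print(len(results))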

Docker bash script:

docker run \
   --runtime=nvidia --gpus all --shm-size 1g \
   -p 8003:80 \
   -v /opt/extract/model:/model \
   --pull always --rm -d ghcr.io/huggingface/text-generation-inference:1.4.5 \
   --model-id /model/final \
   --num-shard 4 \
   --max-concurrent-requests 4000 \
   --max-input-length 512 \
   --max-batch-total-tokens 3000000 \
   --dtype bfloat16

At first, the "Too many open files" error made me think I needed to tweak the ulimit value in the OS, but Docker should be inheriting my 1048576 limit for that, I think. Regardless, I tried adding --ulimit nofile=1048576:1048576 to my docker run script, with no luck.

Does the "Unclosed client session" error mean that I need to use a context manager with aiohttp.ClientSession somehow? If so, I'm not sure how to do that with AsyncInferenceClient. I tried adding a semaphore like this, but it didn't help:

    async def make_requests(self, prompts, temperature, max_new_tokens):
        async with asyncio.Semaphore(100):
            tasks = [self.client.text_generation(prompt=prompt, temperature=temperature, max_new_tokens=max_new_tokens) for prompt in prompts]
        return await asyncio.gather(*tasks)
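In hindsight, the semaphore above only guards building the task list; the requests themselves are awaited outside it, so it bounds nothing. A version that actually limits concurrency would acquire the semaphore inside each request, roughly like this sketch (the bounded() helper is illustrative):

    async def make_requests(self, prompts, temperature, max_new_tokens):
        # Sketch: hold the semaphore around each awaited request,
        # not just around building the task list.
        semaphore = asyncio.Semaphore(100)

        async def bounded(prompt):
            async with semaphore:
                return await self.client.text_generation(
                    prompt=prompt,
                    temperature=temperature,
                    max_new_tokens=max_new_tokens,
                )

        return await asyncio.gather(*(bounded(p) for p in prompts))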

Versions: huggingface-hub==0.22.2, text-generation-inference==1.4.5, aiohttp==3.9.3, python==3.11.8, docker==26.0.0

mhillebrand commented 6 months ago

A workaround for now is breaking up my prompts into batches of 100, but I feel like I'm doing something wrong underneath this band-aid.
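Roughly, the band-aid looks like this (a sketch of the batching described above; the rest of the class is unchanged):

    def extract(self, prompts, temperature=0.001, max_new_tokens=512):
        if isinstance(prompts, str):
            prompts = [prompts]

        # Workaround sketch: process the prompts 100 at a time instead of all at once.
        results = []
        for i in range(0, len(prompts), 100):
            batch = prompts[i:i + 100]
            results.extend(asyncio.run(self.make_requests(batch, temperature, max_new_tokens)))
        return results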

mhillebrand commented 6 months ago

I realized that sharding across multiple GPUs with such a small model actually slows things down, so now I'm creating one TGI instance per GPU. I also increased my batch size to 200, cranked max-batch-prefill-tokens up to 50,000 and enabled CUDA graphs. I'm seeing much, much better performance now. However, the "Unclosed client session" error still appears, but it's pretty rare.

maziyarpanahi commented 6 months ago

enabled CUDA graphs.

I am really interested in this part, I haven't tried it yet. Overall, are you happy with CUDA graphs feature?

sapountzis commented 5 months ago

enabled CUDA graphs.

I am really interested in this part, I haven't tried it yet. Overall, are you happy with CUDA graphs feature?

When I enable CUDA graphs, my model produces only <unk> tokens.

Has anyone else come across this issue?

OlivierDehaene commented 5 months ago

@sapountzis we had an issue in the past with cuda graphs + quantization. Were you using that by any chance? It should be fixed in v2.0.0.

sapountzis commented 5 months ago

@OlivierDehaene Yes! I was using bnb + cuda graphs. I will test again with 2.0

sapountzis commented 5 months ago

I got:

{"timestamp":"2024-04-15T09:33:05.710145Z","level":"INFO","fields":{"message":"Bitsandbytes doesn't work with cuda graphs, deactivating them"},"target":"text_generation_launcher"}

mhillebrand commented 5 months ago

@sapountzis we had an issue in the past with cuda graphs + quantization. Were you using that by any chance? It should be fixed in v2.0.0.

@OlivierDehaene What about the Unclosed client session and Too many open files problem for which this issue was created?

OlivierDehaene commented 5 months ago

@sapountzis, we are deprecating bnb in favour of eetq as it is way faster.

@mhillebrand, the "Too many open files" error is happening on the client side, so tweaking the container ulimit will have no impact. Are you sure your user doesn't have a lower hard limit?

ulimit -Ha
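Equivalently, the limits the client process actually sees can be read from Python with the stdlib resource module (a minimal sketch, run inside the same process that makes the requests):

import resource

# RLIMIT_NOFILE is the "open files" limit that ulimit -n / ulimit -Hn reports.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open files: soft={soft} hard={hard}")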

mhillebrand commented 5 months ago

@OlivierDehaene

$ ulimit -Ha
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) unlimited
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 1029631
max locked memory           (kbytes, -l) 32969336
max memory size             (kbytes, -m) unlimited
open files                          (-n) 524288
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) unlimited
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 123456789
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

mhillebrand commented 5 months ago

Hmm, I think this issue may have been caused by me misusing asyncio event loops. I've altered the way I use asyncio, and I no longer see the error.

maziyarpanahi commented 5 months ago

Hmm, I think this issue may have been caused by me misusing asyncio event loops. I've altered the way I use asyncio, and I no longer see the error.

Since asyncio usage is pretty common, could you please share what caused it and what fixed it, so others can avoid running into the same problem?

Thank you in advance

mhillebrand commented 5 months ago

@maziyarpanahi

I was foolishly calling asyncio.run() for each invocation from within a persistent client class. I'm now creating my own event loop in the __init__ method with self.loop = asyncio.get_event_loop(), and I replaced asyncio.run() with self.loop.run_until_complete().
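In code, the change looks roughly like this (a sketch of the reworked class described above; asyncio.new_event_loop() would be a stricter alternative on newer Python versions):

import asyncio
from huggingface_hub import AsyncInferenceClient

class ExtractorClient:
    def __init__(self):
        self.client = AsyncInferenceClient(model='http://127.0.0.1:8003')
        # Reuse one event loop for the lifetime of the client instead of
        # letting asyncio.run() create and close a loop on every call.
        self.loop = asyncio.get_event_loop()

    async def make_requests(self, prompts, temperature, max_new_tokens):
        tasks = [
            self.client.text_generation(
                prompt=prompt,
                temperature=temperature,
                max_new_tokens=max_new_tokens,
            )
            for prompt in prompts
        ]
        return await asyncio.gather(*tasks)

    def extract(self, prompts, temperature=0.001, max_new_tokens=512):
        if isinstance(prompts, str):
            prompts = [prompts]
        return self.loop.run_until_complete(
            self.make_requests(prompts, temperature, max_new_tokens))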

I tweaked several TGI parameters as well:

#!/usr/bin/env bash
#
# Usage:
#   ./extractor.sh [port]  (where port is 8500-8503)

port="${1:-8500}"
gpu=${port: -1}

docker run \
   --runtime=nvidia \
   --gpus device=${gpu} \
   --shm-size 1g \
   --ulimit nofile=9999:9999 \
   -p ${port}:80 \
   -v /opt/extract/model:/model \
   --pull always --rm -d ghcr.io/huggingface/text-generation-inference:2.0 \
   --sharded false \
   --model-id /model/final \
   --max-best-of 1 \
   --max-concurrent-requests 9999 \
   --max-input-tokens 400 \
   --max-total-tokens 512 \
   --max-batch-size 9999 \
   --max-batch-prefill-tokens 50000 \
   --max-batch-total-tokens 3000000 \
   --dtype bfloat16

mhillebrand commented 5 months ago

enabled CUDA graphs.

I am really interested in this part, I haven't tried it yet. Overall, are you happy with CUDA graphs feature?

@maziyarpanahi Yes, and now it's enabled by default in TGI 2.0.0.

maziyarpanahi commented 5 months ago

Many thanks @mhillebrand for confirming both questions. 👍🏼