A workaround for now is breaking up my prompts into batches of 100, but I feel like I'm doing something wrong underneath this band-aid.
I realized that sharding across multiple GPUs with such a small model actually slows things down, so now I'm creating one TGI instance per GPU. I also increased my batch size to 200, cranked `max-batch-prefill-tokens` up to 50,000, and enabled CUDA graphs. I'm seeing much, much better performance now. However, the "Unclosed client session" error still appears, but it's pretty rare.
> enabled CUDA graphs.

I am really interested in this part; I haven't tried it yet. Overall, are you happy with the CUDA graphs feature?
When I enabled CUDA graphs, my model produced only `<unk>` tokens. Has anyone else come across this issue?
@sapountzis we had an issue in the past with cuda graphs + quantization. Were you using that by any chance? It should be fixed in v2.0.0.
@OlivierDehaene Yes! I was using bnb + cuda graphs. I will test again with 2.0.
I got:

```
{"timestamp":"2024-04-15T09:33:05.710145Z","level":"INFO","fields":{"message":"Bitsandbytes doesn't work with cuda graphs, deactivating them"},"target":"text_generation_launcher"}
```
@OlivierDehaene What about the `Unclosed client session` and `Too many open files` problems for which this issue was created?
@sapountzis, we are deprecating bnb in favour of eetq as it is way faster.
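For anyone switching, a minimal sketch of what the EETQ launch might look like; the model id, port, and image tag here are placeholders, not taken from this thread:

```bash
# Hedged example: request EETQ quantization instead of bitsandbytes (bnb)
docker run --gpus all --shm-size 1g -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:2.0 \
    --model-id google/flan-t5-large \
    --quantize eetq
```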
@mhillebrand, the "Too many open files" error is happening on the client side, so tweaking the container ulimit will have no impact. Are you sure your user doesn't have a lower hard limit? Check with `ulimit -Ha`.
@OlivierDehaene
```
$ ulimit -Ha
real-time non-blocking time (microseconds, -R) unlimited
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1029631
max locked memory (kbytes, -l) 32969336
max memory size (kbytes, -m) unlimited
open files (-n) 524288
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 123456789
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
```
Hmm, I think this issue may have been caused by me misusing asyncio event loops. I've altered the way I use asyncio, and I no longer see the error.
Since using asyncio is pretty popular, could you please share what caused it and what fixed it, so others can avoid this? Thank you in advance.
@maziyarpanahi I was foolishly calling `asyncio.run()` for each invocation from within a persistent client class. I'm now creating my own event loop in the init method (`self.loop = asyncio.get_event_loop()`), and I replaced `asyncio.run()` with `self.loop.run_until_complete()`.
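A minimal sketch of that change; the class name, endpoint, and generation parameters are illustrative, not the actual extractor code (`max_new_tokens=112` just mirrors the 512 - 400 token budget in the launch script below):

```python
import asyncio

from huggingface_hub import AsyncInferenceClient


class ExtractionClient:
    def __init__(self, base_url: str = "http://localhost:8500"):  # assumed endpoint
        self.client = AsyncInferenceClient(model=base_url)
        # One event loop for the lifetime of the object. Calling asyncio.run()
        # per invocation creates and tears down a fresh loop (and its aiohttp
        # session) every time, which is what appeared to leak
        # "Unclosed client session" warnings.
        self.loop = asyncio.get_event_loop()

    def extract(self, prompts: list[str]) -> list[str]:
        async def _gather():
            return await asyncio.gather(
                *(self.client.text_generation(p, max_new_tokens=112) for p in prompts)
            )

        # Before: return asyncio.run(_gather())  # new loop on every call
        # After: reuse the persistent loop created in __init__
        return self.loop.run_until_complete(_gather())
```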
I tweaked several TGI parameters as well:
```bash
#!/usr/bin/env bash
#
# Usage:
#   ./extractor.sh [port]   (where port is 8500-8503)

port="${1:-8500}"
gpu=${port: -1}  # last digit of the port selects the GPU (8500 -> GPU 0, etc.)

docker run \
    --runtime=nvidia \
    --gpus device=${gpu} \
    --shm-size 1g \
    --ulimit nofile=9999:9999 \
    -p ${port}:80 \
    -v /opt/extract/model:/model \
    --pull always --rm -d ghcr.io/huggingface/text-generation-inference:2.0 \
    --sharded false \
    --model-id /model/final \
    --max-best-of 1 \
    --max-concurrent-requests 9999 \
    --max-input-tokens 400 \
    --max-total-tokens 512 \
    --max-batch-size 9999 \
    --max-batch-prefill-tokens 50000 \
    --max-batch-total-tokens 3000000 \
    --dtype bfloat16
```
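Given the port-to-GPU mapping in that script, launching all four instances might look like:

```bash
# one TGI instance per GPU: ports 8500-8503 map to GPUs 0-3
for port in 8500 8501 8502 8503; do
    ./extractor.sh "$port"
done
```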
> Overall, are you happy with the CUDA graphs feature?
@maziyarpanahi Yes, and now it's enabled by default in TGI 2.0.0.
Many thanks @mhillebrand for answering both questions. 👍🏼
I'm using TGI with Flan-T5 to process thousands of text extraction requests at a time, on a 4 x A6000 machine. My client class, which uses `AsyncInferenceClient`, can handle 900 requests at once, but when I try to process 1,000 at once, I receive the "Unclosed client session" / "Too many open files" error.

Here's my code:

Docker bash script:
At first, the "Too many open files" error made me think I needed to tweak the `ulimit` value in the OS, but Docker is inheriting my 1048576 value for that, I think. Regardless, I tried adding `--ulimit nofile=1048576:1048576` to my docker run script, with no luck.

Does the "Unclosed client session" error mean that I need to use a context manager with `aiohttp.ClientSession` somehow? If so, I'm not sure how to do that with `AsyncInferenceClient`. I tried adding a semaphore like this, but it didn't help.

Versions:

```
huggingface-hub==0.22.2
text-generation-inference==1.4.5
aiohttp==3.9.3
python==3.11.8
docker==26.0.0
```
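For reference, one way the semaphore cap mentioned above might look with `AsyncInferenceClient`; the endpoint, the 900 cap, and the generation parameters are assumptions, not the original snippet:

```python
import asyncio

from huggingface_hub import AsyncInferenceClient

client = AsyncInferenceClient(model="http://localhost:8500")  # assumed TGI endpoint
semaphore = asyncio.Semaphore(900)  # assumed cap: the batch size that still worked

async def generate(prompt: str) -> str:
    # Hold a slot while the request is in flight, so at most 900 run at once.
    async with semaphore:
        return await client.text_generation(prompt, max_new_tokens=112)

async def run_all(prompts: list[str]) -> list[str]:
    return await asyncio.gather(*(generate(p) for p in prompts))
```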