Open pseudotensor opened 3 months ago
65k starts to load and gets closer to working, but even that fails!
docker stop llama31-70b-tgi ; docker remove llama31-70b-tgi
source ~/h2ogpt_ops/gr_exports.sh
sudo docker run -d --restart=always --gpus '"device=0,1,2,3"' \
--shm-size 10.24gb \
-e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
-e TRANSFORMERS_CACHE="/.cache/" \
-p 5005:80 \
-v $HOME/.cache:/.cache/ \
-v $HOME/.cache/huggingface/hub/:/data \
--name llama31-70b-tgi \
ghcr.io/huggingface/text-generation-inference:2.2.0 \
--model-id meta-llama/Meta-Llama-3.1-70B-Instruct \
--max-input-length 66560 \
--max-total-tokens 74752 \
--max-stop-sequences 6 \
--num-shard 4 --sharded true &>> logs.llama3.1-70b.tgi.txt
gives:
RuntimeError: Not enough memory to handle 2 prefill tokens. You need to decrease `--max-batch-prefill-tokens`
2024-07-24T17:32:16.553191Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1101, in warmup
_, batch, _ = self.generate_token(batch)
File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1504, in generate_token
prefill_logprobs_tensor = torch.log_softmax(out, -1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 15.91 GiB. GPU has a total capacity of 79.33 GiB of which 1.41 GiB is free. Process 1404711 has 77.91 GiB memory in use. Of the allocated memory 76.30 GiB is allocated by PyTorch, and 27.88 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
RuntimeError: Not enough memory to handle 2 prefill tokens. You need to decrease `--max-batch-prefill-tokens`
2024-07-24T17:32:16.689631Z ERROR warmup{max_input_length=66560 max_prefill_tokens=66610 max_total_tokens=74752 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
2024-07-24T17:32:16.699306Z ERROR warmup{max_input_length=66560 max_prefill_tokens=66610 max_total_tokens=74752 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
2024-07-24T17:32:16.702328Z ERROR warmup{max_input_length=66560 max_prefill_tokens=66610 max_total_tokens=74752 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
2024-07-24T17:32:16.728006Z ERROR warmup{max_input_length=66560 max_prefill_tokens=66610 max_total_tokens=74752 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
2 prefill tokens? Seems like some bad math is going on.
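As a rough sanity check (napkin math only, assuming the failed allocation is the extra fp16/bf16 tensor that torch.log_softmax materializes over the full prefill logits, with Llama 3.1's 128256-token vocabulary):

# Hedged napkin math, not TGI code: size of a logits-shaped fp16 tensor
# covering the whole warmup prefill (max_prefill_tokens=66610 from the log above).
vocab_size = 128_256            # Llama 3.1 vocabulary
bytes_per_value = 2             # bf16/fp16
prefill_tokens = 66_610         # max_prefill_tokens reported in the warmup error

logits_bytes = prefill_tokens * vocab_size * bytes_per_value
print(f"{logits_bytes / 2**30:.2f} GiB")   # 15.91 GiB -- matches the failed allocation

So the 15.91 GiB the warmup tries to allocate looks like full-prefill logits, which would explain why only much smaller contexts fit.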
Only 32k actually started:
docker stop llama31-70b-tgi ; docker remove llama31-70b-tgi
source ~/h2ogpt_ops/gr_exports.sh
sudo docker run -d --restart=always --gpus '"device=0,1,2,3"' \
--shm-size 10.24gb \
-e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
-e TRANSFORMERS_CACHE="/.cache/" \
-p 5005:80 \
-v $HOME/.cache:/.cache/ \
-v $HOME/.cache/huggingface/hub/:/data \
--name llama31-70b-tgi \
ghcr.io/huggingface/text-generation-inference:2.2.0 \
--model-id meta-llama/Meta-Llama-3.1-70B-Instruct \
--max-input-length 32768 \
--max-total-tokens 40960 \
--max-stop-sequences 6 \
--num-shard 4 --sharded true &>> logs.llama3.1-70b.tgi.txt
TGI does not support it for now, and updates are so slow.
Same problem with unquantized Llama 3.1 70B on 8x A6000:
Anything above --max-input-tokens=38412 causes OOM: each GPU climbs to 36 GB used of its 48 GB VRAM during load, then the OOM happens during the warmup phase in the v2.2.0 docker image (smaller values scrape through).
After warmup, VRAM usage drops to 21 GB per GPU and it works fine, but with 384 GB of VRAM total you'd think 128k context should be possible:
sudo docker run --rm --name meta-llama_Meta-Llama-3.1-70B-Instruct \
--gpus all \
--shm-size 4g \
-p 7861:80 \
--ipc host \
-v $HOME/.cache:/.cache/ \
-v $HOME/.cache/huggingface/hub/:/data \
-e VALIDATION_WORKERS=15 \
-e FLASH_DECODING=1 \
ghcr.io/huggingface/text-generation-inference:sha-db7e043 \
--model-id meta-llama/Meta-Llama-3.1-70B-Instruct \
--hostname 0.0.0.0 \
--num-shard 8 \
--max-total-tokens 42508 \
--max-input-tokens 40460 \
--max-batch-size 1 \
--cuda-graphs 1
output:
2024-07-25T09:49:25.860540Z INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 40460
2024-07-25T09:49:25.860548Z INFO text_generation_launcher: Sharding model on 8 processes
...
2024-07-25T09:50:57.501322Z INFO text_generation_router::server: router/src/server.rs:1572: Warming up model
2024-07-25T09:51:41.686292Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1101, in warmup
_, batch, _ = self.generate_token(batch)
File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1504, in generate_token
prefill_logprobs_tensor = torch.log_softmax(out, -1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 9.67 GiB. GPU has a total capacity of 47.44 GiB of which 9.30 GiB is free. Process 1462226 has 38.13 GiB memory in use. Of the allocated memory 37.41 GiB is allocated by PyTorch, and 276.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 118, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
return await response
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 125, in Warmup
max_supported_total_tokens = self.model.warmup(batch)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1103, in warmup
raise RuntimeError(
RuntimeError: Not enough memory to handle 1 prefill tokens. You need to decrease `--max-batch-prefill-tokens`
Then when I set --max-input-tokens=38412 and --max-total-tokens=42508 it connects, but I'm not sure where it gets this max batch total tokens value of 69888 from:
2024-07-25T10:08:45.810270Z INFO text_generation_router::server: router/src/server.rs:1572: Warming up model
2024-07-25T10:09:29.103273Z INFO text_generation_launcher: Cuda Graphs are enabled for sizes [1]
2024-07-25T10:09:29.651189Z INFO text_generation_router::server: router/src/server.rs:1599: Using scheduler V3
2024-07-25T10:09:29.651204Z INFO text_generation_router::server: router/src/server.rs:1651: Setting max batch total tokens to 69888
2024-07-25T10:09:30.484670Z INFO text_generation_router::server: router/src/server.rs:1889: Connected
So something about the load and warmup is using more VRAM per GPU than it should when the context is large?
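If I'm reading the logs right (an assumption, not verified against the code), the 69888 figure looks like TGI sizing its paged KV cache from whatever VRAM is still free when warmup finishes, reported in whole 16-token blocks, which would also explain why the warmup peak matters so much:

# Hedged guess: assumes TGI's paged KV cache uses 16-token blocks and that
# "max batch total tokens" = free blocks remaining after warmup * block size.
BLOCK_SIZE = 16                       # assumed paged-attention block size
max_batch_total_tokens = 69_888       # from the router log above
print(max_batch_total_tokens // BLOCK_SIZE)   # 4368 -- an exact multiple of 16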
Having the same issue. I run into OOM errors even when running Llama 3.1 8B with a 128k context on 2x 80 GB A100s. It feels like something in the prefill is taking up more VRAM than it should.
Facing a similar issue. I'm using 4x A100 80GB, but it throws the same error when I try to set the context length to more than 40k. Is there any fix for this?
Same issue. Commenting for visibility.
Same here for 3.1 70B. Just adding that I'm using AWQ and can only run something like ~23k tokens on 2x A6000 Ada (96 GB total VRAM), while with vLLM I can run the full 128k with no issue.
Same issue on 4x A100 80GB.
I can't fit this model with 128K either; something is not playing nice here. (Tested with vLLM at 128K, no problem.)
The automatic inference of max-batch-prefill-tokens during the warmup phase exceeds the available VRAM, and there seems to be no easy way to control that automatic estimate.
Same issue on 4xA5000 (with Marlin FP8 quantization).
Hi everyone,
Sorry for such a late reply. Thanks for reporting this issue and bringing it to our attention. We're currently rewriting a bunch of things, and a fix for this is among them.
It seems that vLLM forces a prefix chunk of 32k (which TGI doesn't) which causes the discrepancy.
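For illustration only (a conceptual sketch, not vLLM or TGI internals): chunking the prefill bounds how many prompt tokens go through a single forward pass, so the peak activation/logit memory scales with the chunk size rather than with the full prompt length.

# Conceptual sketch of chunked prefill; the 32k chunk size mirrors the vLLM
# behaviour mentioned above and is not a TGI setting.
def prefill_chunks(prompt_tokens: int, chunk: int = 32_768):
    """Yield (start, end) token ranges for prefilling a prompt in fixed-size chunks."""
    for start in range(0, prompt_tokens, chunk):
        yield start, min(start + chunk, prompt_tokens)

# A 128k prompt becomes four 32k passes instead of one 131072-token pass.
print(list(prefill_chunks(131_072)))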
Any update on the timing around this?
@chuddlestonCBANC it's in the works: https://github.com/huggingface/text-generation-inference/pull/2402
@ErikKaum After #2402 got merged, I still can't fit Llama 3.1 on my 4xA6000
The log says that prefix caching is active:
tgi-llama3.1-70b-1 | 2024-08-20T11:52:04.171854Z INFO text_generation_launcher: Using prefix caching = True
tgi-llama3.1-70b-1 | 2024-08-20T11:52:04.171911Z INFO text_generation_launcher: Using Attention = flashinfer
But even with only 16k input and 32k total tokens I get a CUDA out-of-memory error. With vLLM I can get an 80k-token context length on the same server.
tgi-llama3.1-70b-1 | Traceback (most recent call last):
tgi-llama3.1-70b-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1251, in warmup
tgi-llama3.1-70b-1 | _, batch, _ = self.generate_token(batch)
tgi-llama3.1-70b-1 | File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
tgi-llama3.1-70b-1 | return func(*args, **kwds)
tgi-llama3.1-70b-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1693, in generate_token
tgi-llama3.1-70b-1 | prefill_logprobs_tensor = torch.log_softmax(out, -1)
tgi-llama3.1-70b-1 | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.93 GiB. GPU 2 has a total capacity of 47.53 GiB of which 1.18 GiB is free. Process 790346 has 46.33 GiB memory in use. Of the allocated memory 45.87 GiB is allocated by PyTorch, and 30.45 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
tgi-llama3.1-70b-1 |
tgi-llama3.1-70b-1 | The above exception was the direct cause of the following exception:
tgi-llama3.1-70b-1 |
tgi-llama3.1-70b-1 | Traceback (most recent call last):
tgi-llama3.1-70b-1 | File "/opt/conda/bin/text-generation-server", line 8, in <module>
tgi-llama3.1-70b-1 | sys.exit(app())
tgi-llama3.1-70b-1 | File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
tgi-llama3.1-70b-1 | return get_command(self)(*args, **kwargs)
tgi-llama3.1-70b-1 | File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
tgi-llama3.1-70b-1 | return self.main(*args, **kwargs)
tgi-llama3.1-70b-1 | File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
tgi-llama3.1-70b-1 | return _main(
tgi-llama3.1-70b-1 | File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
tgi-llama3.1-70b-1 | rv = self.invoke(ctx)
tgi-llama3.1-70b-1 | File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
tgi-llama3.1-70b-1 | return _process_result(sub_ctx.command.invoke(sub_ctx))
tgi-llama3.1-70b-1 | File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
tgi-llama3.1-70b-1 | return ctx.invoke(self.callback, **ctx.params)
tgi-llama3.1-70b-1 | File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
tgi-llama3.1-70b-1 | return __callback(*args, **kwargs)
tgi-llama3.1-70b-1 | File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
tgi-llama3.1-70b-1 | return callback(**use_params) # type: ignore
tgi-llama3.1-70b-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 109, in serve
tgi-llama3.1-70b-1 | server.serve(
tgi-llama3.1-70b-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 274, in serve
tgi-llama3.1-70b-1 | asyncio.run(
tgi-llama3.1-70b-1 | File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
tgi-llama3.1-70b-1 | return loop.run_until_complete(main)
tgi-llama3.1-70b-1 | File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
tgi-llama3.1-70b-1 | self.run_forever()
tgi-llama3.1-70b-1 | File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
tgi-llama3.1-70b-1 | self._run_once()
tgi-llama3.1-70b-1 | File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
tgi-llama3.1-70b-1 | handle._run()
tgi-llama3.1-70b-1 | File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
tgi-llama3.1-70b-1 | self._context.run(self._callback, *self._args)
tgi-llama3.1-70b-1 | File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
tgi-llama3.1-70b-1 | return await self.intercept(
tgi-llama3.1-70b-1 | > File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
tgi-llama3.1-70b-1 | return await response
tgi-llama3.1-70b-1 | File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
tgi-llama3.1-70b-1 | raise error
tgi-llama3.1-70b-1 | File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
tgi-llama3.1-70b-1 | return await behavior(request_or_iterator, context)
tgi-llama3.1-70b-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 123, in Warmup
tgi-llama3.1-70b-1 | max_supported_total_tokens = self.model.warmup(batch)
tgi-llama3.1-70b-1 | File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1253, in warmup
tgi-llama3.1-70b-1 | raise RuntimeError(
tgi-llama3.1-70b-1 | RuntimeError: Not enough memory to handle 2 prefill tokens. You need to decrease `--max-batch-prefill-tokens`
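For what it's worth, the failed 3.93 GiB allocation still fits the full-prefill-logits pattern from earlier in this thread (a rough estimate, assuming bf16 logits and the 128256-token vocabulary), so warmup seems to materialize logits over the whole 16k prefill even with prefix caching enabled:

# ~3.9 GiB of logits for a 16k prefill, close to the 3.93 GiB in the traceback above.
print(16_384 * 128_256 * 2 / 2**30)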
This is my docker compose file:
services:
  tgi-llama3.1-70b:
    # image: ghcr.io/huggingface/text-generation-inference
    build:
      context: .
      dockerfile: Dockerfile
    restart: always
    shm_size: 64g
    env_file: .env
    environment:
      TRUST_REMOTE_CODE: true
      MODEL_ID: meta-llama/Meta-Llama-3.1-70B-Instruct
      HUGGINGFACE_HUB_CACHE: /data
      MAX_TOTAL_TOKENS: 32768
      MAX_INPUT_TOKENS: 16384
      MAX_STOP_SEQUENCES: 5
      USE_PREFIX_CACHING: true
      FLASH_INFER: true
    volumes:
      - /data/huggingface/hub/:/data
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 4
              capabilities: [ gpu ]
And the Dockerfile:
FROM ghcr.io/huggingface/text-generation-inference
RUN pip install --no-cache-dir flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4
ENTRYPOINT ["/tgi-entrypoint.sh"]
@ErikKaum any plans to look at the OOM issues with large contexts? I still get the OOM (mentioned above) regardless of prefix caching on the latest Docker images, it seems.
@ErikKaum @freegheist Yeah, I was evaluating this and trying to do napkin math for GPU memory.
I am unable to run Llama 3.1 8B even at 64k on an A100.
This sheet from Meta seems to imply 128k should only take 16 GB of VRAM.
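For comparison, some hedged napkin math using the published Llama 3.1 8B config (32 layers, 8 KV heads via GQA, head_dim 128) suggests that 16 GB is roughly the fp16 KV cache alone at 128k, i.e. before weights and before the warmup prefill peak discussed above:

# Hedged napkin math: fp16 KV cache for Llama 3.1 8B at 128k context.
layers, kv_heads, head_dim = 32, 8, 128   # published 8B config (GQA)
seq_len, bytes_fp16 = 131_072, 2
kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_fp16   # K and V
print(f"{kv_bytes / 2**30:.0f} GiB")      # 16 GiB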
Hi @freegheist,
Sorry for being unclear: the PR was about prefix caching, but we still need prefix chunking to land. We've had some issues with it, so it's been a bit of back and forth.
Can't promise when it will be in, but we're working hard to get it out.
Same issue. Commenting to increase the priority.
Same issue here.
Same issue.
Same issue
I can't fit this model with 128K as well, something is not playing nice here. (tested the vLLM with 128K, no problem)
Next stop, vLLM!
@ErikKaum @drbh @Narsil Since many people are running into the same problem, is there any plan to prioritize this bug?
Same issue!
Is there a fix planned for this? I'm still unable to increase the context length beyond 40k. Or is there a workaround?
System Info
text-generation-inference 2.2.0 (official Docker image).
Reproduction
On 4x H100, running the docker command at the top of this issue gives the "Not enough memory to handle 2 prefill tokens" warmup OOM shown above. vLLM works fine without errors on the same hardware.
Expected behavior
Able to launch and use long contexts without error, like vLLM.