huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Can't run llama3.1-70b at full context #2301

Open pseudotensor opened 1 month ago

pseudotensor commented 1 month ago

System Info

TGI 2.2.0 (official ghcr.io Docker image)

Reproduction

On 4x H100:

docker stop llama31-70b-tgi ; docker rm llama31-70b-tgi
sudo docker run -d --restart=always --gpus '"device=0,1,2,3"' \
             --shm-size 10.24gb \
             -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
             -e TRANSFORMERS_CACHE="/.cache/" \
             -p 5005:80 \
             -v $HOME/.cache:/.cache/ \
             -v $HOME/.cache/huggingface/hub/:/data \
             --name llama31-70b-tgi \
             ghcr.io/huggingface/text-generation-inference:2.2.0 \
             --model-id meta-llama/Meta-Llama-3.1-70B-Instruct \
             --max-input-length 131072 \
             --max-total-tokens 139264 \
             --max-stop-sequences 6 \
             --num-shard 4 --sharded true &>> logs.llama3.1-70b.tgi.txt

I get:

RuntimeError: Not enough memory to handle 131122 prefill tokens. You need to decrease `--max-batch-prefill-tokens`

vLLM works fine without errors.
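For comparison, a minimal vLLM sketch of the equivalent configuration (the exact vLLM invocation isn't shown in this issue; tensor_parallel_size and max_model_len below are assumptions that mirror the TGI flags above):

# Hypothetical vLLM equivalent of the TGI flags above (offline-engine form),
# shown only to make the comparison concrete; not the command actually used.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,   # mirrors --num-shard 4 on 4x H100
    max_model_len=131072,     # mirrors --max-input-length 131072
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)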

Expected behavior

Able to launch and use the model at full context without error, as with vLLM.

pseudotensor commented 1 month ago

Dropping to ~65k gets closer (the model loads and warmup runs), but even that fails!

docker stop llama31-70b-tgi ; docker rm llama31-70b-tgi
source ~/h2ogpt_ops/gr_exports.sh
sudo docker run -d --restart=always --gpus '"device=0,1,2,3"' \
             --shm-size 10.24gb \
             -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
             -e TRANSFORMERS_CACHE="/.cache/" \
             -p 5005:80 \
             -v $HOME/.cache:/.cache/ \
             -v $HOME/.cache/huggingface/hub/:/data \
             --name llama31-70b-tgi \
             ghcr.io/huggingface/text-generation-inference:2.2.0 \
             --model-id meta-llama/Meta-Llama-3.1-70B-Instruct \
             --max-input-length 66560 \
             --max-total-tokens 74752 \
             --max-stop-sequences 6 \
             --num-shard 4 --sharded true &>> logs.llama3.1-70b.tgi.txt

gives:

RuntimeError: Not enough memory to handle 2 prefill tokens. You need to decrease `--max-batch-prefill-tokens`
2024-07-24T17:32:16.553191Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1101, in warmup
    _, batch, _ = self.generate_token(batch)
  File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1504, in generate_token
    prefill_logprobs_tensor = torch.log_softmax(out, -1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 15.91 GiB. GPU  has a total capacity of 79.33 GiB of which 1.41 GiB is free. Process 1404711 has 77.91 GiB memory in use. Of the allocated memory 76.30 GiB is allocated by PyTorch, and 27.88 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

RuntimeError: Not enough memory to handle 2 prefill tokens. You need to decrease `--max-batch-prefill-tokens`
2024-07-24T17:32:16.689631Z ERROR warmup{max_input_length=66560 max_prefill_tokens=66610 max_total_tokens=74752 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
2024-07-24T17:32:16.699306Z ERROR warmup{max_input_length=66560 max_prefill_tokens=66610 max_total_tokens=74752 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
2024-07-24T17:32:16.702328Z ERROR warmup{max_input_length=66560 max_prefill_tokens=66610 max_total_tokens=74752 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
2024-07-24T17:32:16.728006Z ERROR warmup{max_input_length=66560 max_prefill_tokens=66610 max_total_tokens=74752 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED

2 prefill tokens? It looks like some bad math is going on.
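As a sanity check on that math: the big failing allocation lines up with the prefill logprobs tensor over the full vocabulary (the torch.log_softmax(out, -1) line in the traceback). A rough back-of-the-envelope calculation, assuming Llama 3.1's 128,256-token vocab, 2-byte logits, and the prefill sizes shown in the logs above:

# Back-of-the-envelope check of the warmup allocations reported above.
# Assumptions: Llama 3.1 vocab of 128,256 tokens and 2-byte (fp16/bf16) logits;
# warmup prefills roughly max-input-length (+ ~50, per the logs) tokens in one
# batch and then takes log_softmax over a [num_prefill_tokens, vocab_size] tensor.
VOCAB = 128_256
BYTES = 2
GIB = 1024**3

for label, prefill_tokens in [
    ("128k run", 131_122),  # fails immediately
    ("65k run", 66_610),    # "Tried to allocate 15.91 GiB"
    ("32k run", 32_818),    # the only one that starts (prefill count assumed)
]:
    print(f"{label}: ~{prefill_tokens * VOCAB * BYTES / GIB:.2f} GiB just for prefill logits")

# Approx. output: 31.33 GiB, 15.91 GiB, 7.84 GiB. The 15.91 GiB matches the OOM
# above exactly, and it lands on top of roughly 33 GiB of fp16 weights per GPU
# for 70B sharded 4 ways.

If that is what is happening, the warmup peak scales linearly with max-input-length, which would explain why only the ~32k configuration fits on 80GB cards.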

pseudotensor commented 1 month ago

Only 32k actually started:

docker stop llama31-70b-tgi ; docker rm llama31-70b-tgi
source ~/h2ogpt_ops/gr_exports.sh
sudo docker run -d --restart=always --gpus '"device=0,1,2,3"' \
             --shm-size 10.24gb \
             -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
             -e TRANSFORMERS_CACHE="/.cache/" \
             -p 5005:80 \
             -v $HOME/.cache:/.cache/ \
             -v $HOME/.cache/huggingface/hub/:/data \
             --name llama31-70b-tgi \
             ghcr.io/huggingface/text-generation-inference:2.2.0 \
             --model-id meta-llama/Meta-Llama-3.1-70B-Instruct \
             --max-input-length 32768 \
             --max-total-tokens 40960 \
             --max-stop-sequences 6 \
             --num-shard 4 --sharded true &>> logs.llama3.1-70b.tgi.txt

coderchem commented 1 month ago

TGI does not support this yet; updates are slow.

freegheist commented 1 month ago

Same problem with Llama 3.1 70B unquantized on 8x A6000:

Anything above --max-input-tokens=38412 causes OOM: each GPU climbs to 36GB used VRAM (of 48GB total) during load, then the OOM hits during the warmup phase in the v2.2.0 docker image. Smaller values scrape through.

After warmup, VRAM usage drops to 21GB per GPU and it works fine (but with 384GB of VRAM total you'd think 128k context should be possible):

sudo docker run --rm --name meta-llama_Meta-Llama-3.1-70B-Instruct \
   --gpus all \
   --shm-size 4g \
   -p 7861:80 \
   --ipc host \
   -v $HOME/.cache:/.cache/ \
   -v $HOME/.cache/huggingface/hub/:/data \
   -e VALIDATION_WORKERS=15 \
   -e FLASH_DECODING=1 \
   ghcr.io/huggingface/text-generation-inference:sha-db7e043 \
   --model-id meta-llama/Meta-Llama-3.1-70B-Instruct \
   --hostname 0.0.0.0 \
   --num-shard 8 \
   --max-total-tokens 42508 \
   --max-input-tokens 40460 \
   --max-batch-size 1 \
   --cuda-graphs 1

output:

2024-07-25T09:49:25.860540Z  INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 40460
2024-07-25T09:49:25.860548Z  INFO text_generation_launcher: Sharding model on 8 processes
...
2024-07-25T09:50:57.501322Z  INFO text_generation_router::server: router/src/server.rs:1572: Warming up model
2024-07-25T09:51:41.686292Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1101, in warmup
    _, batch, _ = self.generate_token(batch)
  File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1504, in generate_token
    prefill_logprobs_tensor = torch.log_softmax(out, -1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 9.67 GiB. GPU  has a total capacity of 47.44 GiB of which 9.30 GiB is free. Process 1462226 has 38.13 GiB memory in use. Of the allocated memory 37.41 GiB is allocated by PyTorch, and 276.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 118, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
    return await response
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 125, in Warmup
    max_supported_total_tokens = self.model.warmup(batch)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1103, in warmup
    raise RuntimeError(
RuntimeError: Not enough memory to handle 1 prefill tokens. You need to decrease `--max-batch-prefill-tokens`

Then when I set `--max-input-tokens=38412` and `--max-total-tokens=42508` it connects, but I'm not sure where it gets this max batch total tokens value of 69888 from:

2024-07-25T10:08:45.810270Z  INFO text_generation_router::server: router/src/server.rs:1572: Warming up model
2024-07-25T10:09:29.103273Z  INFO text_generation_launcher: Cuda Graphs are enabled for sizes [1]
2024-07-25T10:09:29.651189Z  INFO text_generation_router::server: router/src/server.rs:1599: Using scheduler V3
2024-07-25T10:09:29.651204Z  INFO text_generation_router::server: router/src/server.rs:1651: Setting max batch total tokens to 69888
2024-07-25T10:09:30.484670Z  INFO text_generation_router::server: router/src/server.rs:1889: Connected

Something about the load and warmup seems to use more VRAM per GPU than it should when the context is large?
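A guess about where 69888 comes from (a sketch under my own assumptions, not TGI's exact code): the server appears to size its paged KV cache from whatever VRAM is still free right after the warmup prefill and then reports that capacity, in 16-token blocks, as the max batch total tokens, so a memory-hungry warmup directly shrinks it:

# Rough sketch (my assumptions, not TGI's exact code) of how a paged-KV server
# can end up reporting a "max batch total tokens" figure like 69888 after warmup.
LAYERS = 80          # Llama 3.1 70B
KV_HEADS = 8         # grouped-query attention
HEAD_DIM = 128
DTYPE_BYTES = 2      # fp16/bf16 KV cache
BLOCK_SIZE = 16      # tokens per paged KV-cache block (assumed)
NUM_SHARDS = 8       # 8x A6000; KV heads split across the shards

# K and V, per layer, for this shard's share of the heads
kv_bytes_per_token_per_gpu = 2 * LAYERS * (KV_HEADS // NUM_SHARDS) * HEAD_DIM * DTYPE_BYTES

reported_total_tokens = 69_888   # from the log above (= 4368 blocks of 16 tokens)
kv_cache_gib_per_gpu = reported_total_tokens * kv_bytes_per_token_per_gpu / 1024**3
print(f"{kv_bytes_per_token_per_gpu} B/token/GPU -> {kv_cache_gib_per_gpu:.2f} GiB of KV cache per GPU")
# ~40960 B/token/GPU -> ~2.67 GiB per GPU, i.e. only ~2-3 GiB was left free at
# the warmup peak, even though usage drops to ~21 GB per GPU afterwards.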

Jason-CKY commented 1 month ago

Having the same issue. I run into OOM errors even when running Llama 3.1 8B with 128k context on 2x 80GB A100s. It feels like something in the prefill is taking up more VRAM than expected.

rishu931997 commented 1 month ago

Facing a similar issue. I'm using 4x A100 80GB, but it throws the same error when I try to set the context length above 40k. Is there any fix for this?

nrepesh commented 1 month ago

Same issue. Commenting for visibility.

mjsteele12 commented 1 month ago

Same here for 3.1-70B. Just adding that I'm using AWQ and can only run something like ~23k tokens on 2x A6000 Ada (96GB total VRAM), while with vLLM I can run the full 128k with no issue.

weihanfeng commented 1 month ago

Same issue on 4x A100 80GB.

maziyarpanahi commented 1 month ago

I can't fit this model with 128K context either; something is not playing nicely here. (Tested vLLM with 128K: no problem.)

badrisnps commented 1 month ago

The automatic inference of `max-batch-prefill-tokens` during the warmup phase exceeds the available VRAM, and there seems to be no easy way to control that automatic estimate.

localmind-ai commented 1 month ago

Same issue on 4xA5000 (with Marlin FP8 quantization).

ErikKaum commented 1 month ago

Hi everyone 👋

Sorry for such a late reply. Thanks for reporting this issue and bringing it to our attention. We're currently rewriting a bunch of things, and a fix for this is among them 👍

It seems that vLLM forces prefix chunking at 32k (which TGI doesn't), which causes the discrepancy.
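For anyone wondering why chunking matters here: splitting a long prompt's prefill into fixed-size chunks bounds the per-step activation/logits tensors by the chunk size instead of the full prompt length. A conceptual sketch (not TGI's or vLLM's actual code; forward_with_kv_cache is a hypothetical placeholder):

# Conceptual sketch of chunked prefill -- not TGI's or vLLM's actual code.
# `forward_with_kv_cache` is a hypothetical callable standing in for one model
# forward pass that appends the chunk's keys/values to a paged KV cache.
CHUNK = 32_768  # chunk size; vLLM reportedly uses something around 32k

def chunked_prefill(prompt_ids, forward_with_kv_cache):
    last_logits = None
    for start in range(0, len(prompt_ids), CHUNK):
        chunk = prompt_ids[start:start + CHUNK]
        # Each call only materializes activations/logits for <= CHUNK tokens,
        # so prefill's peak memory no longer grows with the prompt length.
        last_logits = forward_with_kv_cache(chunk)
    return last_logits  # logits for the final position, used to sample the first new token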

chuddlestonCBANC commented 3 weeks ago

Any update on the timing around this?

ErikKaum commented 3 weeks ago

@chuddlestonCBANC it's in the works 🙌 https://github.com/huggingface/text-generation-inference/pull/2402

raimannma commented 2 weeks ago

@ErikKaum After #2402 got merged, I still can't fit Llama 3.1 70B on my 4x A6000.

The log says that prefix caching is active:

tgi-llama3.1-70b-1  | 2024-08-20T11:52:04.171854Z  INFO text_generation_launcher: Using prefix caching = True
tgi-llama3.1-70b-1  | 2024-08-20T11:52:04.171911Z  INFO text_generation_launcher: Using Attention = flashinfer

But even with only 16k input and 32k total tokens I get a CUDA out-of-memory error. With vLLM I can get an 80k-token context length on the same server.

tgi-llama3.1-70b-1  | Traceback (most recent call last):
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1251, in warmup
tgi-llama3.1-70b-1  |     _, batch, _ = self.generate_token(batch)
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
tgi-llama3.1-70b-1  |     return func(*args, **kwds)
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1693, in generate_token
tgi-llama3.1-70b-1  |     prefill_logprobs_tensor = torch.log_softmax(out, -1)
tgi-llama3.1-70b-1  | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.93 GiB. GPU 2 has a total capacity of 47.53 GiB of which 1.18 GiB is free. Process 790346 has 46.33 GiB memory in use. Of the allocated memory 45.87 GiB is allocated by PyTorch, and 30.45 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
tgi-llama3.1-70b-1  | 
tgi-llama3.1-70b-1  | The above exception was the direct cause of the following exception:
tgi-llama3.1-70b-1  | 
tgi-llama3.1-70b-1  | Traceback (most recent call last):
tgi-llama3.1-70b-1  |   File "/opt/conda/bin/text-generation-server", line 8, in <module>
tgi-llama3.1-70b-1  |     sys.exit(app())
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
tgi-llama3.1-70b-1  |     return get_command(self)(*args, **kwargs)
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
tgi-llama3.1-70b-1  |     return self.main(*args, **kwargs)
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
tgi-llama3.1-70b-1  |     return _main(
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
tgi-llama3.1-70b-1  |     rv = self.invoke(ctx)
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
tgi-llama3.1-70b-1  |     return _process_result(sub_ctx.command.invoke(sub_ctx))
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
tgi-llama3.1-70b-1  |     return ctx.invoke(self.callback, **ctx.params)
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
tgi-llama3.1-70b-1  |     return __callback(*args, **kwargs)
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
tgi-llama3.1-70b-1  |     return callback(**use_params)  # type: ignore
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 109, in serve
tgi-llama3.1-70b-1  |     server.serve(
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 274, in serve
tgi-llama3.1-70b-1  |     asyncio.run(
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
tgi-llama3.1-70b-1  |     return loop.run_until_complete(main)
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
tgi-llama3.1-70b-1  |     self.run_forever()
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
tgi-llama3.1-70b-1  |     self._run_once()
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
tgi-llama3.1-70b-1  |     handle._run()
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
tgi-llama3.1-70b-1  |     self._context.run(self._callback, *self._args)
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
tgi-llama3.1-70b-1  |     return await self.intercept(
tgi-llama3.1-70b-1  | > File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
tgi-llama3.1-70b-1  |     return await response
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
tgi-llama3.1-70b-1  |     raise error
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
tgi-llama3.1-70b-1  |     return await behavior(request_or_iterator, context)
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 123, in Warmup
tgi-llama3.1-70b-1  |     max_supported_total_tokens = self.model.warmup(batch)
tgi-llama3.1-70b-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1253, in warmup
tgi-llama3.1-70b-1  |     raise RuntimeError(
tgi-llama3.1-70b-1  | RuntimeError: Not enough memory to handle 2 prefill tokens. You need to decrease `--max-batch-prefill-tokens`
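(Side note: the failed allocation here still fits the prefill-logits pattern from the 70B runs earlier in the thread; a rough check, where 16,434 prefill tokens is my assumption of max-input-tokens 16384 plus the small warmup margin seen in earlier logs:)

# Rough check: warmup-prefill logits over the full 128,256-token vocab in fp16.
# 16,434 prefill tokens is an assumption (max-input-tokens 16384 + ~50 margin).
print(16_434 * 128_256 * 2 / 1024**3)  # ~3.93 GiB, matching "Tried to allocate 3.93 GiB"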

This is my Docker Compose file:

services:
  tgi-llama3.1-70b:
#    image: ghcr.io/huggingface/text-generation-inference
    build:
      context: .
      dockerfile: Dockerfile
    restart: always
    shm_size: 64g
    env_file: .env
    environment:
      TRUST_REMOTE_CODE: true
      MODEL_ID: meta-llama/Meta-Llama-3.1-70B-Instruct
      HUGGINGFACE_HUB_CACHE: /data
      MAX_TOTAL_TOKENS: 32768
      MAX_INPUT_TOKENS: 16384
      MAX_STOP_SEQUENCES: 5
      USE_PREFIX_CACHING: true
      FLASH_INFER: true
    volumes:
      - /data/huggingface/hub/:/data
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 4
              capabilities: [ gpu ]

And the Dockerfile:

FROM ghcr.io/huggingface/text-generation-inference

RUN pip install --no-cache-dir flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4

ENTRYPOINT ["/tgi-entrypoint.sh"]

freegheist commented 1 week ago

@ErikKaum any plans to look at the OOM issues with large contexts? I still seem to hit the OOM (mentioned above) on the latest Docker images, regardless of prefix caching.

dacox commented 2 days ago

@ErikKaum @freegheist Yeah, I was evaluating this and trying to do napkin math for GPU memory.

I am unable to run Llama 3.1 8B even at 64k context on an A100.

This sheet from Meta seems to imply 128k context should only take 16GB of VRAM.
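That 16GB figure matches the KV cache alone for Llama 3.1 8B at 128k; the weights and the prefill-time logits come on top of it. A rough check, assuming the published 8B config (32 layers, 8 KV heads, head dim 128) and fp16 everywhere:

# Napkin math for Llama 3.1 8B at 128k context (assumed config: 32 layers,
# 8 KV heads, head_dim 128, fp16 weights/cache/logits, ~8.03B parameters).
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 32, 8, 128, 2
CONTEXT = 131_072
VOCAB = 128_256
PARAMS = 8.03e9

kv_cache_gib = CONTEXT * 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES / 1024**3
weights_gib = PARAMS * DTYPE_BYTES / 1024**3
prefill_logits_gib = CONTEXT * VOCAB * DTYPE_BYTES / 1024**3  # whole prompt prefilled at once

print(f"KV cache @ 128k:            {kv_cache_gib:.1f} GiB")       # ~16.0 GiB
print(f"fp16 weights:               {weights_gib:.1f} GiB")        # ~15.0 GiB
print(f"full-prompt prefill logits: {prefill_logits_gib:.1f} GiB") # ~31.3 GiB

So weights plus a full 128k KV cache fit comfortably in 80GB, but a warmup that prefills the whole context in one shot (plus activations) can still push past what's left, which would line up with the OOMs reported above.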

ErikKaum commented 1 day ago

Hi @freegheist 👋

Sorry for being unclear: the PR was about prefix caching, but we still need prefix chunking as well. We've had some issues with it, so it's been a bit of back and forth.

Can't promise when it will land, but we're working hard to get it out 🤞