huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

RuntimeError: FlashAttention only supports Ampere GPUs or newer. #2037

Open Ansh-Sarkar opened 5 months ago

Ansh-Sarkar commented 5 months ago

System Info

Command Causing Issue:

model=microsoft/Phi-3-mini-4k-instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 8g -p 8017:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference \
    --model-id $model

OS Version: Ubuntu 22.04.4 LTS
Rust Version: cargo 1.78.0 (54d8815d0 2024-03-26)
Model Being Used: microsoft/Phi-3-mini-4k-instruct (as per the link for Phi 3 given in the docs)

Hardware Used:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000001:00:00.0 Off |                  Off |
| N/A   40C    P0             25W /   70W |    2093MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Current Version: Latest Docker Image

Information

Tasks

Reproduction

I am trying to run a Phi 3 model using TGI on my setup. I was running a script named tgi-llm.sh with the following contents:

model=microsoft/Phi-3-mini-4k-instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 8g -p 8017:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference \
    --model-id $model

The model downloads smoothly, but at the very end an error is generated as shown below:

2024-06-07T10:54:55.939456Z ERROR text_generation_launcher: Method Warmup encountered an error.

<Traceback Details . . .>

RuntimeError: FlashAttention only supports Ampere GPUs or newer.
2024-06-07T10:54:56.047988Z ERROR warmup{max_input_length=4095 max_prefill_tokens=4145 max_total_tokens=4096 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
Error: WebServer(Warmup(Generation("CANCELLED")))
2024-06-07T10:54:56.066775Z ERROR text_generation_launcher: Webserver Crashed
2024-06-07T10:54:56.066792Z  INFO text_generation_launcher: Shutting down shards
2024-06-07T10:54:56.066830Z  INFO shard-manager: text_generation_launcher: Terminating shard rank=0
2024-06-07T10:54:56.066993Z  INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=0
2024-06-07T10:54:56.367280Z  INFO shard-manager: text_generation_launcher: shard terminated rank=0

Expected behavior

I expect the server to run smoothly so its API can be consumed for testing, development, and benchmarking of LLMs. I am open to potential solutions or workarounds.

LysandreJik commented 5 months ago

Hey! I confirm I get the same error when using ghcr.io/huggingface/text-generation-inference.

However, you don't get this in the latest version: ghcr.io/huggingface/text-generation-inference:2.0.4.

Could you try and change the docker image here and let me know if it fixes your problem?
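For reference, that would just be the command from the original report with the image pinned to the 2.0.4 tag (everything else unchanged):

model=microsoft/Phi-3-mini-4k-instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 8g -p 8017:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:2.0.4 \
    --model-id $model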

Ansh-Sarkar commented 5 months ago

Sure thing. Trying this out.

robocanic commented 4 months ago

I got the same error while using TGI to run inference for Qwen-7B-Instruct. I tried the latest version of the image, but it didn't work. My GPU is a 2080 Ti (22GB memory); is this hardware unsupported by TGI? (Traceback and hardware details were attached as screenshots.)

Ansh-Sarkar commented 4 months ago

Hey! I confirm I get the same error when using ghcr.io/huggingface/text-generation-inference.

However, you don't get this in the latest version: ghcr.io/huggingface/text-generation-inference:2.0.4.

Could you try and change the docker image here and let me know if it fixes your problem?

Hi! Just a quick update: this worked for me. Triton is now working perfectly and serving models. Thanks a ton.

LysandreJik commented 4 months ago

@robocanic, is it possible for you to share the entirety of what's happening in the terminal through text? It should be easier for us to help you then.

CrazyboyQCD commented 4 months ago

@LysandreJik I had the same error with a 2080 Ti (22GB memory); is there any solution? Command:

docker run --net=host --gpus all --shm-size 1g -e HF_HUB_DISABLE_PROGRESS_BARS=1 -e HF_HUB_ENABLE_HF_TRANSFER=0 -p 0.0.0.0:8080:80 -v data:/data ghcr.io/huggingface/text-generation-inference:2.1.1 --model-id Qwen/Qwen2-7B-Instruct-GPTQ-Int4 --quantize gptq

Output:

Status: Downloaded newer image for ghcr.io/huggingface/text-generation-inference:2.1.1
WARNING: Published ports are discarded when using host network mode
2024-07-08T01:57:47.607929Z  INFO text_generation_launcher: Args {
    model_id: "Qwen/Qwen2-7B-Instruct-GPTQ-Int4",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: None,
    quantize: Some(
        Gptq,
    ),
    speculate: None,
    dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: None,
    max_total_tokens: None,
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: None,
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "docker-desktop",
    port: 80,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: Some(
        "/data",
    ),
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
    lora_adapters: None,
}
2024-07-08T01:57:47.608122Z  INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-07-08T01:57:48.209343Z  INFO text_generation_launcher: Model supports up to 32768 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using `--max-batch-prefill-tokens=32818 --max-total-tokens=32768 --max-input-tokens=32767`.
2024-07-08T01:57:48.209424Z  INFO text_generation_launcher: Default `max_input_tokens` to 4095
2024-07-08T01:57:48.209435Z  INFO text_generation_launcher: Default `max_total_tokens` to 4096
2024-07-08T01:57:48.209442Z  INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4145
2024-07-08T01:57:48.209450Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-07-08T01:57:48.209723Z  INFO download: text_generation_launcher: Starting check and download process for Qwen/Qwen2-7B-Instruct-GPTQ-Int4
2024-07-08T01:57:50.257291Z  INFO text_generation_launcher: Detected system cuda
2024-07-08T01:57:52.437100Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-07-08T01:57:53.114906Z  INFO download: text_generation_launcher: Successfully downloaded weights for Qwen/Qwen2-7B-Instruct-GPTQ-Int4
2024-07-08T01:57:53.115273Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-07-08T01:57:54.994230Z  INFO text_generation_launcher: Detected system cuda
2024-07-08T01:58:03.134759Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-07-08T01:58:07.971388Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-07-08T01:58:08.041215Z  INFO shard-manager: text_generation_launcher: Shard ready in 14.924702177s rank=0
2024-07-08T01:58:08.137082Z  INFO text_generation_launcher: Starting Webserver
2024-07-08T01:58:08.262402Z  INFO text_generation_router: router/src/main.rs:217: Using the Hugging Face API
2024-07-08T01:58:08.262460Z  INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-07-08T01:58:09.230758Z  INFO text_generation_router: router/src/main.rs:493: Serving revision a087887257f1d8f5268b0b055474cc4ce4601e6e of model Qwen/Qwen2-7B-Instruct-GPTQ-Int4
2024-07-08T01:58:09.468466Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|endoftext|>' was expected to have ID '151643' but was given ID 'None'
2024-07-08T01:58:09.468512Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|im_start|>' was expected to have ID '151644' but was given ID 'None'
2024-07-08T01:58:09.468516Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|im_end|>' was expected to have ID '151645' but was given ID 'None'
2024-07-08T01:58:09.469758Z  INFO text_generation_router: router/src/main.rs:345: Using config Some(Qwen2)
2024-07-08T01:58:09.469797Z  WARN text_generation_router: router/src/main.rs:372: Invalid hostname, defaulting to 0.0.0.0
2024-07-08T01:58:09.474463Z  INFO text_generation_router::server: router/src/server.rs:1567: Warming up model
2024-07-08T01:58:17.738892Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 106, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
    return await response
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 125, in Warmup
    max_supported_total_tokens = self.model.warmup(batch)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 985, in warmup
    _, batch, _ = self.generate_token(batch)
  File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1253, in generate_token
    out, speculative_logits = self.forward(batch, adapter_data)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1178, in forward
    logits, speculative_logits = self.model.forward(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 373, in forward
    hidden_states = self.model(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 314, in forward
    hidden_states, residual = layer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 239, in forward
    attn_output = self.self_attn(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 140, in forward
    attention(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/attention/cuda.py", line 211, in attention
    return flash_attn_2_cuda.varlen_fwd(
RuntimeError: FlashAttention only supports Ampere GPUs or newer.
2024-07-08T01:58:17.911600Z ERROR warmup{max_input_length=4095 max_prefill_tokens=4145 max_total_tokens=4096 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
Error: WebServer(Warmup(Generation("CANCELLED")))
2024-07-08T01:58:17.971924Z ERROR text_generation_launcher: Webserver Crashed
2024-07-08T01:58:17.971992Z  INFO text_generation_launcher: Shutting down shards
2024-07-08T01:58:18.057239Z  INFO shard-manager: text_generation_launcher: Terminating shard rank=0
2024-07-08T01:58:18.058052Z  INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=0
2024-07-08T01:58:18.258889Z  INFO shard-manager: text_generation_launcher: shard terminated rank=0
Error: WebserverFailed
robocanic commented 4 months ago

@robocanic, is it possible for you to share the entirety of what's happening in the terminal through text? It should be easier for us to help you then.

Here is the full log file: llm-inference-0.log

jegork commented 4 months ago

@LysandreJik I am getting the same error with 2.1.0 but not with 2.0.3 (running on a T4). Is there any way to disable flash attention?

RonanKMcGovern commented 4 months ago

Set the env variable to false:

-e USE_FLASH_ATTENTION=False 
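For example, slotted into the docker run command posted above (only the added -e USE_FLASH_ATTENTION=False flag is new; whether the flag is still honored depends on the TGI version):

docker run --net=host --gpus all --shm-size 1g \
    -e HF_HUB_DISABLE_PROGRESS_BARS=1 -e HF_HUB_ENABLE_HF_TRANSFER=0 \
    -e USE_FLASH_ATTENTION=False \
    -p 0.0.0.0:8080:80 -v data:/data \
    ghcr.io/huggingface/text-generation-inference:2.1.1 \
    --model-id Qwen/Qwen2-7B-Instruct-GPTQ-Int4 --quantize gptq
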
LysandreJik commented 4 months ago

It seems like we should not default to using flash attention in that case. I thought this was already fixed; maybe something to investigate if you have the bandwidth, @ErikKaum.

robocanic commented 4 months ago

@RonanKMcGovern I tried to set the env variable "USE_FLASH_ATTENTION=False", but got the error below:

2024-07-16T06:55:05.209523Z  INFO text_generation_launcher: Args {
    model_id: "/data/qwen/Qwen2-7B-Instruct",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: None,
    quantize: None,
    speculate: None,
    dtype: Some(
        Float16,
    ),
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: None,
    max_total_tokens: Some(
        4096,
    ),
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: None,
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "llm-inference-0",
    port: 8080,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: Some(
        "/data",
    ),
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    cors_allow_origin: [],
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
}
2024-07-16T06:55:05.209694Z  INFO text_generation_launcher: Default `max_input_tokens` to 4095
2024-07-16T06:55:05.209706Z  INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4145
2024-07-16T06:55:05.209713Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-07-16T06:55:05.210030Z  INFO download: text_generation_launcher: Starting download process.
2024-07-16T06:55:07.478729Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2024-07-16T06:55:07.857357Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-07-16T06:55:07.857850Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-07-16T06:55:10.107172Z  WARN text_generation_launcher: Could not import Flash Attention enabled models: `USE_FLASH_ATTENTION` is false.

2024-07-16T06:55:10.762898Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

Traceback (most recent call last):

  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 71, in serve
    from text_generation_server import server

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 17, in <module>
    from text_generation_server.models.pali_gemma import PaliGemmaBatch

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/pali_gemma.py", line 5, in <module>
    from text_generation_server.models.vlm_causal_lm import (

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/vlm_causal_lm.py", line 14, in <module>
    from text_generation_server.models.flash_mistral import (

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 18, in <module>
    from text_generation_server.models.custom_modeling.flash_mistral_modeling import (

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 30, in <module>
    from text_generation_server.utils import paged_attention, flash_attn

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/flash_attn.py", line 13, in <module>
    raise ImportError("`USE_FLASH_ATTENTION` is false.")

ImportError: `USE_FLASH_ATTENTION` is false.
rank=0
2024-07-16T06:55:10.860120Z ERROR text_generation_launcher: Shard 0 failed to start
2024-07-16T06:55:10.860129Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
RonanKMcGovern commented 4 months ago

Seems like there is a problem. Can you provide a full and simple reproduction of the issue? That will help the HF team address it.


ytjhai commented 3 months ago

This is also happening with Gemma 2-2b-it when trying to deploy it on Inference Endpoints

ErikKaum commented 3 months ago

Hi @ytjhai 👋

Thanks for bringing this up. Could you specify a bit more what configuration you're using on the Inference Endpoints? E.g. which version, what is the instance type and so on.

If you don't want to disclose the info in public we can also continue the debugging in private 👍

ytjhai commented 3 months ago

Hi @ytjhai 👋

Thanks for bringing this up. Could you specify a bit more what configuration you're using on the Inference Endpoints? E.g. which version, what is the instance type and so on.

If you don't want to disclose the info in public we can also continue the debugging in private 👍

Sure, I'm using the google/gemma-2-2b-it repository with a 16GB VRAM NVIDIA T4. I was expecting it to be plug and play, but that didn't work. I also tried this setup from the llama repository; while it worked for Llama 3.1, it didn't work for Gemma 2. I then tried to disable flash attention with different values for the ATTENTION env variable, but that didn't work either. Most of my efforts have been with the v2.2.0 text-generation-inference container.

Edit: It's worth mentioning that after messing around with the environment variables, I was able to deploy small/medium versions of most of the popular open-source models on Inference Endpoints, including Qwen2 (very much plug and play), Yi (also very much plug and play), Phi-3-mini (needed to set TRUST_REMOTE_CODE=true), and Llama 3.1 (following the directions in the above link). But Gemma 2 just doesn't want to deploy, and I'd rather not over-provision the hardware needed for it.

ErikKaum commented 3 months ago

Gotcha, sorry for the confusion here. I think this is a deeper issue with how Gemma 2 works and unfortunately our recommendations aren't up to date.

Long story short, Gemma 2 doesn't run on a T4 since it requires Flash Attention 2 for the sliding window and softcapping. I also think passing in things like -e USE_FLASH_ATTENTION=False won't work, since the model explicitly requires it.

Would it be possible to try on a different instance? If I remember correctly I ran it on an instance with an A10 and it worked without a problem 👍
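As a side note, a quick way to check whether a given card is Ampere or newer is its CUDA compute capability (FlashAttention 2 needs 8.0 or higher; the T4 is 7.5, the A10 is 8.6). On reasonably recent drivers:

nvidia-smi --query-gpu=name,compute_cap --format=csv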

ytjhai commented 3 months ago

@ErikKaum Ok thanks for the clarification! I didn't realize that Gemma 2 required Flash Attention 2 for inference. I was running a GGUF quantization locally that seemed fine, so I assumed there wasn't additional magic involved.

ashwincv0112 commented 1 month ago

@ErikKaum, hope you are doing well. I am trying to deploy a starcoder2-3B (pretrained and fine-tuned) model to a T4 instance in AWS (16GB NVIDIA T4 Tensor Core). While deploying, I am getting the same error as mentioned above. I am doing the deployment through an AWS SageMaker notebook using the HuggingFaceModel.deploy method. My hypothesis is that this method uses Flash Attention by default (please correct me if I am wrong).

I would really appreciate it if someone could tell me whether there is a way to disable Flash Attention.

Thanks, Ashwin