Open · Ansh-Sarkar opened 5 months ago
Hey! I confirm I get the same error when using ghcr.io/huggingface/text-generation-inference. However, you don't get this in the latest version: ghcr.io/huggingface/text-generation-inference:2.0.4. Could you try and change the docker image here and let me know if it fixes your problem?
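For example, something along these lines (a sketch only; the model ID and volume path are placeholders to adapt to your setup):
    docker run --gpus all --shm-size 1g -p 8080:80 \
        -v $PWD/data:/data \
        ghcr.io/huggingface/text-generation-inference:2.0.4 \
        --model-id microsoft/Phi-3-mini-4k-instruct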
Sure thing. Trying this out.
I got the same error while using TGI to run inference for Qwen-7B-Instruct. I tried the latest version of the image, but it didn't work. My GPU is a 2080 Ti (22 GB memory); is this hardware unsupported by TGI?
Traceback:
Hardware:
Hey! I confirm I get the same error when using ghcr.io/huggingface/text-generation-inference. However, you don't get this in the latest version: ghcr.io/huggingface/text-generation-inference:2.0.4. Could you try and change the docker image here and let me know if it fixes your problem?
Hi! Just a quick update. This worked for me. Triton is now working perfectly and serving models. Thanks a ton.
@robocanic, is it possible for you to share the entirety of what's happening in the terminal through text? It should be easier for us to help you then.
@LysandreJik I had the same error with a 2080 Ti (22 GB memory); is there any solution? Command:
docker run --net=host --gpus all --shm-size 1g -e HF_HUB_DISABLE_PROGRESS_BARS=1 -e HF_HUB_ENABLE_HF_TRANSFER=0 -p 0.0.0.0:8080:80 -v data:/data ghcr.io/huggingface/text-generation-inference:2.1.1 --model-id Qwen/Qwen2-7B-Instruct-GPTQ-Int4 --quantize gptq
Output:
Status: Downloaded newer image for ghcr.io/huggingface/text-generation-inference:2.1.1
WARNING: Published ports are discarded when using host network mode
2024-07-08T01:57:47.607929Z INFO text_generation_launcher: Args {
model_id: "Qwen/Qwen2-7B-Instruct-GPTQ-Int4",
revision: None,
validation_workers: 2,
sharded: None,
num_shard: None,
quantize: Some(
Gptq,
),
speculate: None,
dtype: None,
trust_remote_code: false,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: None,
max_input_length: None,
max_total_tokens: None,
waiting_served_ratio: 0.3,
max_batch_prefill_tokens: None,
max_batch_total_tokens: None,
max_waiting_tokens: 20,
max_batch_size: None,
cuda_graphs: None,
hostname: "docker-desktop",
port: 80,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: Some(
"/data",
),
weights_cache_override: None,
disable_custom_kernels: false,
cuda_memory_fraction: 1.0,
rope_scaling: None,
rope_factor: None,
json_output: false,
otlp_endpoint: None,
otlp_service_name: "text-generation-inference.router",
cors_allow_origin: [],
watermark_gamma: None,
watermark_delta: None,
ngrok: false,
ngrok_authtoken: None,
ngrok_edge: None,
tokenizer_config_path: None,
disable_grammar_support: false,
env: false,
max_client_batch_size: 4,
lora_adapters: None,
}
2024-07-08T01:57:47.608122Z INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-07-08T01:57:48.209343Z INFO text_generation_launcher: Model supports up to 32768 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using `--max-batch-prefill-tokens=32818 --max-total-tokens=32768 --max-input-tokens=32767`.
2024-07-08T01:57:48.209424Z INFO text_generation_launcher: Default `max_input_tokens` to 4095
2024-07-08T01:57:48.209435Z INFO text_generation_launcher: Default `max_total_tokens` to 4096
2024-07-08T01:57:48.209442Z INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4145
2024-07-08T01:57:48.209450Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-07-08T01:57:48.209723Z INFO download: text_generation_launcher: Starting check and download process for Qwen/Qwen2-7B-Instruct-GPTQ-Int4
2024-07-08T01:57:50.257291Z INFO text_generation_launcher: Detected system cuda
2024-07-08T01:57:52.437100Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-07-08T01:57:53.114906Z INFO download: text_generation_launcher: Successfully downloaded weights for Qwen/Qwen2-7B-Instruct-GPTQ-Int4
2024-07-08T01:57:53.115273Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-07-08T01:57:54.994230Z INFO text_generation_launcher: Detected system cuda
2024-07-08T01:58:03.134759Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-07-08T01:58:07.971388Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-07-08T01:58:08.041215Z INFO shard-manager: text_generation_launcher: Shard ready in 14.924702177s rank=0
2024-07-08T01:58:08.137082Z INFO text_generation_launcher: Starting Webserver
2024-07-08T01:58:08.262402Z INFO text_generation_router: router/src/main.rs:217: Using the Hugging Face API
2024-07-08T01:58:08.262460Z INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-07-08T01:58:09.230758Z INFO text_generation_router: router/src/main.rs:493: Serving revision a087887257f1d8f5268b0b055474cc4ce4601e6e of model Qwen/Qwen2-7B-Instruct-GPTQ-Int4
2024-07-08T01:58:09.468466Z WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|endoftext|>' was expected to have ID '151643' but was given ID 'None'
2024-07-08T01:58:09.468512Z WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|im_start|>' was expected to have ID '151644' but was given ID 'None'
2024-07-08T01:58:09.468516Z WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|im_end|>' was expected to have ID '151645' but was given ID 'None'
2024-07-08T01:58:09.469758Z INFO text_generation_router: router/src/main.rs:345: Using config Some(Qwen2)
2024-07-08T01:58:09.469797Z WARN text_generation_router: router/src/main.rs:372: Invalid hostname, defaulting to 0.0.0.0
2024-07-08T01:58:09.474463Z INFO text_generation_router::server: router/src/server.rs:1567: Warming up model
2024-07-08T01:58:17.738892Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 106, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
return await response
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 125, in Warmup
max_supported_total_tokens = self.model.warmup(batch)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 985, in warmup
_, batch, _ = self.generate_token(batch)
File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1253, in generate_token
out, speculative_logits = self.forward(batch, adapter_data)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1178, in forward
logits, speculative_logits = self.model.forward(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 373, in forward
hidden_states = self.model(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 314, in forward
hidden_states, residual = layer(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 239, in forward
attn_output = self.self_attn(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 140, in forward
attention(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/attention/cuda.py", line 211, in attention
return flash_attn_2_cuda.varlen_fwd(
RuntimeError: FlashAttention only supports Ampere GPUs or newer.
2024-07-08T01:58:17.911600Z ERROR warmup{max_input_length=4095 max_prefill_tokens=4145 max_total_tokens=4096 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
Error: WebServer(Warmup(Generation("CANCELLED")))
2024-07-08T01:58:17.971924Z ERROR text_generation_launcher: Webserver Crashed
2024-07-08T01:58:17.971992Z INFO text_generation_launcher: Shutting down shards
2024-07-08T01:58:18.057239Z INFO shard-manager: text_generation_launcher: Terminating shard rank=0
2024-07-08T01:58:18.058052Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=0
2024-07-08T01:58:18.258889Z INFO shard-manager: text_generation_launcher: shard terminated rank=0
Error: WebserverFailed
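For context: the RuntimeError above comes from FlashAttention 2, which only supports GPUs of compute capability 8.0 (Ampere) or newer, while the RTX 2080 Ti is Turing (7.5). A quick way to check what a card reports, as a sketch:
    python3 -c "import torch; print(torch.cuda.get_device_capability(0))"  # (8, 0) or higher means Ampere or newer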
@robocanic, is it possible for you to share the entirety of what's happening in the terminal through text? It should be easier for us to help you then.
Here is the full log file: llm-inference-0.log
@LysandreJik I am getting the same error with 2.1.0 but not with 2.0.3 (running on a T4). Is there any way to disable flash attention?
Set the env variable to false:
-e USE_FLASH_ATTENTION=False
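For example, adapting the command above (a sketch; whether TGI then falls back cleanly without flash attention depends on the model and TGI version):
    docker run --gpus all --shm-size 1g -p 8080:80 -v data:/data \
        -e USE_FLASH_ATTENTION=False \
        ghcr.io/huggingface/text-generation-inference:2.1.1 \
        --model-id Qwen/Qwen2-7B-Instruct-GPTQ-Int4 --quantize gptq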
It seems like we should not default to using flash attention in that case; I thought this was already fixed, maybe something to investigate if you have the bandwidth @ErikKaum
@RonanKMcGovern I tried setting the env variable USE_FLASH_ATTENTION=False, but got the error below:
2024-07-16T06:55:05.209523Z INFO text_generation_launcher: Args {
model_id: "/data/qwen/Qwen2-7B-Instruct",
revision: None,
validation_workers: 2,
sharded: None,
num_shard: None,
quantize: None,
speculate: None,
dtype: Some(
Float16,
),
trust_remote_code: false,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: None,
max_input_length: None,
max_total_tokens: Some(
4096,
),
waiting_served_ratio: 0.3,
max_batch_prefill_tokens: None,
max_batch_total_tokens: None,
max_waiting_tokens: 20,
max_batch_size: None,
cuda_graphs: None,
hostname: "llm-inference-0",
port: 8080,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: Some(
"/data",
),
weights_cache_override: None,
disable_custom_kernels: false,
cuda_memory_fraction: 1.0,
rope_scaling: None,
rope_factor: None,
json_output: false,
otlp_endpoint: None,
cors_allow_origin: [],
watermark_gamma: None,
watermark_delta: None,
ngrok: false,
ngrok_authtoken: None,
ngrok_edge: None,
tokenizer_config_path: None,
disable_grammar_support: false,
env: false,
max_client_batch_size: 4,
}
2024-07-16T06:55:05.209694Z INFO text_generation_launcher: Default `max_input_tokens` to 4095
2024-07-16T06:55:05.209706Z INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4145
2024-07-16T06:55:05.209713Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-07-16T06:55:05.210030Z INFO download: text_generation_launcher: Starting download process.
2024-07-16T06:55:07.478729Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-07-16T06:55:07.857357Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-07-16T06:55:07.857850Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-07-16T06:55:10.107172Z WARN text_generation_launcher: Could not import Flash Attention enabled models: `USE_FLASH_ATTENTION` is false.
2024-07-16T06:55:10.762898Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 71, in serve
from text_generation_server import server
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 17, in <module>
from text_generation_server.models.pali_gemma import PaliGemmaBatch
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/pali_gemma.py", line 5, in <module>
from text_generation_server.models.vlm_causal_lm import (
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/vlm_causal_lm.py", line 14, in <module>
from text_generation_server.models.flash_mistral import (
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 18, in <module>
from text_generation_server.models.custom_modeling.flash_mistral_modeling import (
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 30, in <module>
from text_generation_server.utils import paged_attention, flash_attn
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/flash_attn.py", line 13, in <module>
raise ImportError("`USE_FLASH_ATTENTION` is false.")
ImportError: `USE_FLASH_ATTENTION` is false.
rank=0
2024-07-16T06:55:10.860120Z ERROR text_generation_launcher: Shard 0 failed to start
2024-07-16T06:55:10.860129Z INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
Seems like there is a problem. Can you provide a full and simple reproducer for the issue? That will help the HF team address this.
This is also happening with Gemma 2-2b-it when trying to deploy it on Inference Endpoints
Hi @ytjhai 👋
Thanks for bringing this up. Could you specify a bit more what configuration you're using on the Inference Endpoints? E.g. which version, what is the instance type and so on.
If you don't want to disclose the info in public we can also continue the debugging in private 👍
Sure, I'm using the google/gemma-2-2b-it repository with a 16 GB VRAM NVIDIA T4. I was expecting it to be plug and play, but that didn't work. Then I also tried this setup from the Llama repository, and while it worked for Llama 3.1, it didn't work for Gemma 2. I also tried to disable flash attention with different values for the ATTENTION env variable, but that didn't work either. Most of my efforts have been with the v2.2.0 text-generation-inference container.
Edit: It's worth mentioning that after messing around with the environment variables, I was able to deploy small / medium versions of most of the popular open-source models on inference endpoints, including Qwen2 (very much plug-and-play), Yi (also very much plug and play), Phi-3-mini (needed to set TRUST_REMOTE_CODE=true), and Llama 3.1 (following the directions in the above link). But Gemma 2 just doesn't want to deploy and I'd rather not over provision the hardware needed for it.
Gotcha, sorry for the confusion here. I think this is a deeper issue with how Gemma 2 works and unfortunately our recommendations aren't up to date.
Long story short, Gemma 2 doesn't run on a T4 since it requires Flash Attention 2 for the sliding window and softcapping. I also think passing in things like -e USE_FLASH_ATTENTION=False won't work, since the model explicitly requires it.
Would it be possible to try on a different instance? If I remember correctly I ran it on an instance with an A10 and it worked without a problem 👍
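If you're unsure what a given instance reports, recent NVIDIA drivers can print the compute capability directly (a sketch; the compute_cap query field may be missing on older drivers):
    nvidia-smi --query-gpu=name,compute_cap --format=csv
Anything at 8.0 or above (A10, A100, L4, etc.) meets the Flash Attention 2 requirement; the T4 reports 7.5.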
@ErikKaum Ok thanks for the clarification! I didn't realize that Gemma 2 required Flash Attention 2 for inference. I was running a GGUF quantization locally that seemed fine, so I assumed there wasn't additional magic involved.
@ErikKaum, hope you are doing well. I am trying to deploy a starcoder2-3b model (pretrained and fine-tuned) to a T4 instance in AWS (16 GB NVIDIA T4 Tensor Core). While deploying, I am getting the same error as mentioned above. I am doing the deployment through an AWS SageMaker notebook, using the HuggingFaceModel.deploy method. My hypothesis is that this method uses Flash Attention for deployment by default (please correct me if I am wrong).
I would really appreciate it if someone could guide me on whether there is a way to disable Flash Attention.
Thanks, Ashwin
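For what it's worth, environment variables can be passed to the TGI container through the env argument of HuggingFaceModel. A minimal sketch, assuming the variable is honored by the container you pick and that the version tag below is available in your region:
    import sagemaker
    from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

    role = sagemaker.get_execution_role()  # SageMaker notebook execution role

    # Pin a TGI container version; older releases (e.g. 2.0.x) did not hard-require FlashAttention 2
    image_uri = get_huggingface_llm_image_uri("huggingface", version="2.0.2")

    model = HuggingFaceModel(
        role=role,
        image_uri=image_uri,
        env={
            "HF_MODEL_ID": "bigcode/starcoder2-3b",
            "USE_FLASH_ATTENTION": "false",  # assumption: respected for this model/version
        },
    )

    predictor = model.deploy(
        initial_instance_count=1,
        instance_type="ml.g4dn.xlarge",  # 16 GB T4
    )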
System Info
Command Causing Issue:
OS Version: Ubuntu 22.04.4 LTS
Rust Version: cargo 1.78.0 (54d8815d0 2024-03-26)
Model Being Used: microsoft/Phi-3-mini-4k-instruct (as per the link for Phi 3 given in the docs)
Hardware Used:
Current Version: Latest Docker Image
Information
Tasks
Reproduction
Trying to run a Phi 3 model using TGI on my setup. Was running a script named tgi-llm.sh with the following contents. The model downloads smoothly, but at the very end an error is generated as shown below.
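The script contents and the error output are not reproduced here. Purely as a hypothetical illustration, a tgi-llm.sh launching this model typically looks roughly like the following (image tag, port, and volume path are assumptions, not the original script):
    #!/bin/bash
    # Hypothetical reconstruction, not the original tgi-llm.sh
    docker run --gpus all --shm-size 1g -p 8080:80 \
        -v $PWD/data:/data \
        ghcr.io/huggingface/text-generation-inference:latest \
        --model-id microsoft/Phi-3-mini-4k-instruct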
Expected behavior
Expecting smooth functioning of the server, enabling consumption of the API for testing, development, and benchmarking of LLMs. Open to potential solutions or workarounds.