huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

[Volta] [No flash attention] Dependencies missing for running quantized Llama models in docker #2448

Open ladi-pomsar opened 2 months ago

ladi-pomsar commented 2 months ago

System Info

Hi everyone,

I was trying to run quantized Llama models with dockerized TGI and ran into issues. First, I tried AWQ with Llama 3.1 70B:

Args {
llm2.internal  |     model_id: "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
llm2.internal  |     revision: None,
llm2.internal  |     validation_workers: 2,
llm2.internal  |     sharded: Some(
llm2.internal  |         false,
llm2.internal  |     ),
llm2.internal  |     num_shard: None,
llm2.internal  |     quantize: Some(
llm2.internal  |         Awq,
llm2.internal  |     ),
llm2.internal  |     speculate: None,
llm2.internal  |     dtype: None,
llm2.internal  |     trust_remote_code: false,
llm2.internal  |     max_concurrent_requests: 128,
llm2.internal  |     max_best_of: 2,
llm2.internal  |     max_stop_sequences: 4,
llm2.internal  |     max_top_n_tokens: 5,
llm2.internal  |     max_input_tokens: Some(
llm2.internal  |         1500,
llm2.internal  |     ),
llm2.internal  |     max_input_length: None,
llm2.internal  |     max_total_tokens: None,
llm2.internal  |     waiting_served_ratio: 0.3,
llm2.internal  |     max_batch_prefill_tokens: None,
llm2.internal  |     max_batch_total_tokens: None,
llm2.internal  |     max_waiting_tokens: 20,
llm2.internal  |     max_batch_size: None,
llm2.internal  |     cuda_graphs: None,
llm2.internal  |     hostname: "llm2.internal",
llm2.internal  |     port: 80,
llm2.internal  |     shard_uds_path: "/tmp/text-generation-server",
llm2.internal  |     master_addr: "localhost",
llm2.internal  |     master_port: 29500,
llm2.internal  |     huggingface_hub_cache: Some(
llm2.internal  |         "/data",
llm2.internal  |     ),
llm2.internal  |     weights_cache_override: None,
llm2.internal  |     disable_custom_kernels: false,
llm2.internal  |     cuda_memory_fraction: 1.0,
llm2.internal  |     rope_scaling: None,
llm2.internal  |     rope_factor: None,
llm2.internal  |     json_output: false,
llm2.internal  |     otlp_endpoint: None,
llm2.internal  |     otlp_service_name: "text-generation-inference.router",
llm2.internal  |     cors_allow_origin: [],
llm2.internal  |     watermark_gamma: None,
llm2.internal  |     watermark_delta: None,
llm2.internal  |     ngrok: false,
llm2.internal  |     ngrok_authtoken: None,
llm2.internal  |     ngrok_edge: None,
llm2.internal  |     tokenizer_config_path: None,
llm2.internal  |     disable_grammar_support: false,
llm2.internal  |     env: false,
llm2.internal  |     max_client_batch_size: 4,
llm2.internal  |     lora_adapters: None,
llm2.internal  |     disable_usage_stats: false,
llm2.internal  |     disable_crash_reports: false,
llm2.internal  | }
llm2.internal  | 2024-08-22T06:21:48.590446Z  INFO hf_hub: Token file not found "/root/.cache/huggingface/token"    
llm2.internal  | 2024-08-22T06:22:16.734020Z  INFO text_generation_launcher: Default `max_total_tokens` to 4096
llm2.internal  | 2024-08-22T06:22:16.734052Z  INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 1550
llm2.internal  | 2024-08-22T06:22:16.734060Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
llm2.internal  | 2024-08-22T06:22:16.734355Z  INFO download: text_generation_launcher: Starting check and download process for hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4
llm2.internal  | 2024-08-22T06:22:20.671827Z  INFO text_generation_launcher: Download file: model-00001-of-00009.safetensors
llm2.internal  | 2024-08-22T06:24:09.395876Z  INFO text_generation_launcher: Downloaded /data/models--hugging-quants--Meta-Llama-3.1-70B-Instruct-AWQ-INT4/snapshots/2123003760781134cfc31124aa6560a45b491fdf/model-00001-of-00009.safetensors in 0:01:48.
[... remaining download log lines omitted ...]
llm2.internal  | 2024-08-22T06:35:35.073558Z  INFO download: text_generation_launcher: Successfully downloaded weights for hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4
llm2.internal  | 2024-08-22T06:35:35.073886Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
llm2.internal  | 2024-08-22T06:35:38.263459Z  WARN text_generation_launcher: Could not import Flash Attention enabled models: `USE_FLASH_ATTENTION` is false.
llm2.internal  | 2024-08-22T06:35:40.764078Z ERROR text_generation_launcher: Error when initializing model
llm2.internal  | Traceback (most recent call last):
llm2.internal  |   File "/opt/conda/bin/text-generation-server", line 8, in <module>
llm2.internal  |     sys.exit(app())
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
llm2.internal  |     return get_command(self)(*args, **kwargs)
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
llm2.internal  |     return self.main(*args, **kwargs)
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
llm2.internal  |     return _main(
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
llm2.internal  |     rv = self.invoke(ctx)
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
llm2.internal  |     return _process_result(sub_ctx.command.invoke(sub_ctx))
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
llm2.internal  |     return ctx.invoke(self.callback, **ctx.params)
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
llm2.internal  |     return __callback(*args, **kwargs)
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
llm2.internal  |     return callback(**use_params)  # type: ignore
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 118, in serve
llm2.internal  |     server.serve(
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
llm2.internal  |     asyncio.run(
llm2.internal  |   File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
llm2.internal  |     return loop.run_until_complete(main)
llm2.internal  |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
llm2.internal  |     self.run_forever()
llm2.internal  |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
llm2.internal  |     self._run_once()
llm2.internal  |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
llm2.internal  |     handle._run()
llm2.internal  |   File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
llm2.internal  |     self._context.run(self._callback, *self._args)
llm2.internal  | > File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
llm2.internal  |     model = get_model(
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 719, in get_model
llm2.internal  |     return CausalLM.fallback(
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/causal_lm.py", line 612, in fallback
llm2.internal  |     model = AutoModelForCausalLM.from_pretrained(
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
llm2.internal  |     return model_class.from_pretrained(
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3341, in from_pretrained
llm2.internal  |     hf_quantizer.validate_environment(
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/transformers/quantizers/quantizer_awq.py", line 53, in validate_environment
llm2.internal  |     raise ImportError("Loading an AWQ quantized model requires auto-awq library (`pip install autoawq`)")
llm2.internal  | ImportError: Loading an AWQ quantized model requires auto-awq library (`pip install autoawq`)
llm2.internal  | 2024-08-22T06:35:41.483837Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
llm2.internal  | 
llm2.internal  | 2024-08-22 06:35:36.597 | INFO     | text_generation_server.utils.import_utils:<module>:75 - Detected system cuda
llm2.internal  | Traceback (most recent call last):
llm2.internal  | 
llm2.internal  |   File "/opt/conda/bin/text-generation-server", line 8, in <module>
llm2.internal  |     sys.exit(app())
llm2.internal  | 
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 118, in serve
llm2.internal  |     server.serve(
llm2.internal  | 
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
llm2.internal  |     asyncio.run(
llm2.internal  | 
llm2.internal  |   File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
llm2.internal  |     return loop.run_until_complete(main)
llm2.internal  | 
llm2.internal  |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
llm2.internal  |     return future.result()
llm2.internal  | 
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
llm2.internal  |     model = get_model(
llm2.internal  | 
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 719, in get_model
llm2.internal  |     return CausalLM.fallback(
llm2.internal  | 
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/causal_lm.py", line 612, in fallback
llm2.internal  |     model = AutoModelForCausalLM.from_pretrained(
llm2.internal  | 
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
llm2.internal  |     return model_class.from_pretrained(
llm2.internal  | 
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3341, in from_pretrained
llm2.internal  |     hf_quantizer.validate_environment(
llm2.internal  | 
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/transformers/quantizers/quantizer_awq.py", line 53, in validate_environment
llm2.internal  |     raise ImportError("Loading an AWQ quantized model requires auto-awq library (`pip install autoawq`)")
llm2.internal  | 
llm2.internal  | ImportError: Loading an AWQ quantized model requires auto-awq library (`pip install autoawq`)
llm2.internal  |  rank=0
llm2.internal  | 2024-08-22T06:35:41.580962Z ERROR text_generation_launcher: Shard 0 failed to start
llm2.internal  | 2024-08-22T06:35:41.580977Z  INFO text_generation_launcher: Shutting down shards
llm2.internal  | Error: ShardCannotStart
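
With flash attention disabled, the traceback shows TGI going through CausalLM.fallback into transformers' AutoModelForCausalLM, and that path asks for the autoawq package, which is apparently not shipped in the 2.2.0 image. A possible stopgap is to layer the dependency on top of the official image, roughly like the sketch below; whether the autoawq wheel matches the CUDA/PyTorch build inside the image and whether its kernels work on Volta (sm_70) are assumptions I have not verified:

FROM ghcr.io/huggingface/text-generation-inference:2.2.0

# The tracebacks show the server running from the conda env at /opt/conda,
# so install the missing AWQ dependency into that interpreter.
# NOTE: wheel/CUDA/Volta compatibility is assumed, not verified.
RUN /opt/conda/bin/pip install --no-cache-dir autoawq

The compose service below could then point its image: entry at the locally built tag instead of the upstream one.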

Second, I tried Llama 3 70B with GPTQ:

llm2.internal  | 2024-08-22T07:50:05.356461Z  INFO text_generation_launcher: Args {
llm2.internal  |     model_id: "MaziyarPanahi/Meta-Llama-3-70B-Instruct-GPTQ",
llm2.internal  |     revision: None,
llm2.internal  |     validation_workers: 2,
llm2.internal  |     sharded: Some(
llm2.internal  |         false,
llm2.internal  |     ),
llm2.internal  |     num_shard: None,
llm2.internal  |     quantize: Some(
llm2.internal  |         Gptq,
llm2.internal  |     ),
llm2.internal  |     speculate: None,
llm2.internal  |     dtype: None,
llm2.internal  |     trust_remote_code: false,
llm2.internal  |     max_concurrent_requests: 128,
llm2.internal  |     max_best_of: 2,
llm2.internal  |     max_stop_sequences: 4,
llm2.internal  |     max_top_n_tokens: 5,
llm2.internal  |     max_input_tokens: Some(
llm2.internal  |         1500,
llm2.internal  |     ),
llm2.internal  |     max_input_length: None,
llm2.internal  |     max_total_tokens: None,
llm2.internal  |     waiting_served_ratio: 0.3,
llm2.internal  |     max_batch_prefill_tokens: None,
llm2.internal  |     max_batch_total_tokens: None,
llm2.internal  |     max_waiting_tokens: 20,
llm2.internal  |     max_batch_size: None,
llm2.internal  |     cuda_graphs: None,
llm2.internal  |     hostname: "llm2.internal",
llm2.internal  |     port: 80,
llm2.internal  |     shard_uds_path: "/tmp/text-generation-server",
llm2.internal  |     master_addr: "localhost",
llm2.internal  |     master_port: 29500,
llm2.internal  |     huggingface_hub_cache: Some(
llm2.internal  |         "/data",
llm2.internal  |     ),
llm2.internal  |     weights_cache_override: None,
llm2.internal  |     disable_custom_kernels: false,
llm2.internal  |     cuda_memory_fraction: 1.0,
llm2.internal  |     rope_scaling: None,
llm2.internal  |     rope_factor: None,
llm2.internal  |     json_output: false,
llm2.internal  |     otlp_endpoint: None,
llm2.internal  |     otlp_service_name: "text-generation-inference.router",
llm2.internal  |     cors_allow_origin: [],
llm2.internal  |     watermark_gamma: None,
llm2.internal  |     watermark_delta: None,
llm2.internal  |     ngrok: false,
llm2.internal  |     ngrok_authtoken: None,
llm2.internal  |     ngrok_edge: None,
llm2.internal  |     tokenizer_config_path: None,
llm2.internal  |     disable_grammar_support: false,
llm2.internal  |     env: false,
llm2.internal  |     max_client_batch_size: 4,
llm2.internal  |     lora_adapters: None,
llm2.internal  |     disable_usage_stats: false,
llm2.internal  |     disable_crash_reports: false,
llm2.internal  | }
llm2.internal  | 2024-08-22T07:50:05.356599Z  INFO hf_hub: Token file not found "/root/.cache/huggingface/token"    
llm2.internal  | 2024-08-22T07:50:33.497367Z  INFO text_generation_launcher: Default `max_total_tokens` to 4096
llm2.internal  | 2024-08-22T07:50:33.497398Z  INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 1550
llm2.internal  | 2024-08-22T07:50:33.497406Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
llm2.internal  | 2024-08-22T07:50:33.497694Z  INFO download: text_generation_launcher: Starting check and download process for MaziyarPanahi/Meta-Llama-3-70B-Instruct-GPTQ
llm2.internal  | 2024-08-22T07:50:37.443461Z  INFO text_generation_launcher: Download file: model.safetensors
llm2.internal  | 2024-08-22T07:57:59.818601Z  INFO text_generation_launcher: Downloaded /data/models--MaziyarPanahi--Meta-Llama-3-70B-Instruct-GPTQ/snapshots/46c7afccd4f9345a3d43c1468fde1034cf0a0932/model.safetensors in 0:07:22.
llm2.internal  | 2024-08-22T07:57:59.818719Z  INFO text_generation_launcher: Download: [1/1] -- ETA: 0
llm2.internal  | 2024-08-22T07:58:00.452718Z  INFO download: text_generation_launcher: Successfully downloaded weights for MaziyarPanahi/Meta-Llama-3-70B-Instruct-GPTQ
llm2.internal  | 2024-08-22T07:58:00.453007Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
llm2.internal  | 2024-08-22T07:58:03.658769Z  WARN text_generation_launcher: Could not import Flash Attention enabled models: `USE_FLASH_ATTENTION` is false.
llm2.internal  | 2024-08-22T07:58:06.383846Z ERROR text_generation_launcher: Error when initializing model
llm2.internal  | Traceback (most recent call last):
llm2.internal  |   File "/opt/conda/bin/text-generation-server", line 8, in <module>
llm2.internal  |     sys.exit(app())
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
llm2.internal  |     return get_command(self)(*args, **kwargs)
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
llm2.internal  |     return self.main(*args, **kwargs)
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
llm2.internal  |     return _main(
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
llm2.internal  |     rv = self.invoke(ctx)
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
llm2.internal  |     return _process_result(sub_ctx.command.invoke(sub_ctx))
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
llm2.internal  |     return ctx.invoke(self.callback, **ctx.params)
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
llm2.internal  |     return __callback(*args, **kwargs)
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
llm2.internal  |     return callback(**use_params)  # type: ignore
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 118, in serve
llm2.internal  |     server.serve(
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
llm2.internal  |     asyncio.run(
llm2.internal  |   File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
llm2.internal  |     return loop.run_until_complete(main)
llm2.internal  |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
llm2.internal  |     self.run_forever()
llm2.internal  |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
llm2.internal  |     self._run_once()
llm2.internal  |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
llm2.internal  |     handle._run()
llm2.internal  |   File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
llm2.internal  |     self._context.run(self._callback, *self._args)
llm2.internal  | > File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
llm2.internal  |     model = get_model(
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 719, in get_model
llm2.internal  |     return CausalLM.fallback(
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/causal_lm.py", line 612, in fallback
llm2.internal  |     model = AutoModelForCausalLM.from_pretrained(
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
llm2.internal  |     return model_class.from_pretrained(
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3336, in from_pretrained
llm2.internal  |     hf_quantizer = AutoHfQuantizer.from_config(config.quantization_config, pre_quantized=pre_quantized)
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/transformers/quantizers/auto.py", line 136, in from_config
llm2.internal  |     return target_cls(quantization_config, **kwargs)
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/transformers/quantizers/quantizer_gptq.py", line 47, in __init__
llm2.internal  |     from optimum.gptq import GPTQQuantizer
llm2.internal  | ModuleNotFoundError: No module named 'optimum'
llm2.internal  | 2024-08-22T07:58:07.062648Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
llm2.internal  | 
llm2.internal  | 2024-08-22 07:58:01.988 | INFO     | text_generation_server.utils.import_utils:<module>:75 - Detected system cuda
llm2.internal  | Traceback (most recent call last):
llm2.internal  | 
llm2.internal  |   File "/opt/conda/bin/text-generation-server", line 8, in <module>
llm2.internal  |     sys.exit(app())
llm2.internal  | 
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 118, in serve
llm2.internal  |     server.serve(
llm2.internal  | 
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
llm2.internal  |     asyncio.run(
llm2.internal  | 
llm2.internal  |   File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
llm2.internal  |     return loop.run_until_complete(main)
llm2.internal  | 
llm2.internal  |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
llm2.internal  |     return future.result()
llm2.internal  | 
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
llm2.internal  |     model = get_model(
llm2.internal  | 
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 719, in get_model
llm2.internal  |     return CausalLM.fallback(
llm2.internal  | 
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/causal_lm.py", line 612, in fallback
llm2.internal  |     model = AutoModelForCausalLM.from_pretrained(
llm2.internal  | 
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
llm2.internal  |     return model_class.from_pretrained(
llm2.internal  | 
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3336, in from_pretrained
llm2.internal  |     hf_quantizer = AutoHfQuantizer.from_config(config.quantization_config, pre_quantized=pre_quantized)
llm2.internal  | 
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/transformers/quantizers/auto.py", line 136, in from_config
llm2.internal  |     return target_cls(quantization_config, **kwargs)
llm2.internal  | 
llm2.internal  |   File "/opt/conda/lib/python3.10/site-packages/transformers/quantizers/quantizer_gptq.py", line 47, in __init__
llm2.internal  |     from optimum.gptq import GPTQQuantizer
llm2.internal  | 
llm2.internal  | ModuleNotFoundError: No module named 'optimum'
llm2.internal  |  rank=0
llm2.internal  | 2024-08-22T07:58:07.161187Z ERROR text_generation_launcher: Shard 0 failed to start
llm2.internal  | 2024-08-22T07:58:07.161212Z  INFO text_generation_launcher: Shutting down shards
llm2.internal  | Error: ShardCannotStart
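
The GPTQ attempt fails on the same fallback path, just one import earlier: transformers' GPTQ quantizer tries to import optimum (and, as far as I know, it also needs auto-gptq for the actual kernels, though only optimum appears in the traceback). A similarly hedged extension of the image might look like this; package versions and CUDA/Volta compatibility are again assumptions:

FROM ghcr.io/huggingface/text-generation-inference:2.2.0

# Add the packages that transformers' GPTQQuantizer imports on this code path.
# auto-gptq is an assumption on my side; only 'optimum' shows up in the traceback.
RUN /opt/conda/bin/pip install --no-cache-dir optimum auto-gptq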

I am running the Docker container from within Docker Compose with the parameters below; for AWQ I obviously changed the container and command accordingly.

  llm2:
    container_name: llm2.internal
    hostname: llm2.internal
    profiles:
      - common
    image: ghcr.io/huggingface/text-generation-inference:2.2.0
    command: --model-id MaziyarPanahi/Meta-Llama-3-70B-Instruct-GPTQ
    volumes:
      - /home/llm_data:/data
    ports:
      - "2450:80"
    environment:
      - HF_HUB_ENABLE_HF_TRANSFER="false"
      - USE_FLASH_ATTENTION=False
      - MAX_INPUT_TOKENS=1500
      - SHARDED=false
      - HF_TOKEN=
      - QUANTIZE=gptq
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["4","5","6","7"]
              capabilities: [gpu]
    networks:
      - container-network

OS: Ubuntu 22.04.4 LTS
Rust version: N/A
Container version: 2.2.0 - sha256:d39d513f13727ffa9b6a4d0e949f36413b944aabc9a236c0aa2986c929906769
Model being used: Llama 3.1 and 3.0
GPUs: 4x Volta V100 - hence Flash Attention disabled

Information

Tasks

Reproduction

  1. Run the official container with either of the models above

Expected behavior

Quantized Llama models should run in the Docker container.

ladi-pomsar commented 2 months ago

I did double-check: this issue is indeed caused by the lack of flash attention support on V100s. There is no such problem on the Ada generation, but once you turn flash attention off, it starts to happen there as well.