Open ladi-pomsar opened 2 months ago
Hi everyone,
I was trying to run quantized Llama models with dockerized TGI and ran into issues. First, I tried AWQ with Llama 3.1 70B:
Args {
    model_id: "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    revision: None,
    validation_workers: 2,
    sharded: Some(false),
    num_shard: None,
    quantize: Some(Awq),
    speculate: None,
    dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: Some(1500),
    max_input_length: None,
    max_total_tokens: None,
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: None,
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "llm2.internal",
    port: 80,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: Some("/data"),
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
    lora_adapters: None,
    disable_usage_stats: false,
    disable_crash_reports: false,
}
2024-08-22T06:21:48.590446Z  INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-08-22T06:22:16.734020Z  INFO text_generation_launcher: Default `max_total_tokens` to 4096
2024-08-22T06:22:16.734052Z  INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 1550
2024-08-22T06:22:16.734060Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-08-22T06:22:16.734355Z  INFO download: text_generation_launcher: Starting check and download process for hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4
2024-08-22T06:22:20.671827Z  INFO text_generation_launcher: Download file: model-00001-of-00009.safetensors
2024-08-22T06:24:09.395876Z  INFO text_generation_launcher: Downloaded /data/models--hugging-quants--Meta-Llama-3.1-70B-Instruct-AWQ-INT4/snapshots/2123003760781134cfc31124aa6560a45b491fdf/model-00001-of-00009.safetensors in 0:01:48.
//download
2024-08-22T06:35:35.073558Z  INFO download: text_generation_launcher: Successfully downloaded weights for hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4
2024-08-22T06:35:35.073886Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-08-22T06:35:38.263459Z  WARN text_generation_launcher: Could not import Flash Attention enabled models: `USE_FLASH_ATTENTION` is false.
2024-08-22T06:35:40.764078Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 118, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 719, in get_model
    return CausalLM.fallback(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/causal_lm.py", line 612, in fallback
    model = AutoModelForCausalLM.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3341, in from_pretrained
    hf_quantizer.validate_environment(
  File "/opt/conda/lib/python3.10/site-packages/transformers/quantizers/quantizer_awq.py", line 53, in validate_environment
    raise ImportError("Loading an AWQ quantized model requires auto-awq library (`pip install autoawq`)")
ImportError: Loading an AWQ quantized model requires auto-awq library (`pip install autoawq`)
2024-08-22T06:35:41.483837Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
2024-08-22 06:35:36.597 | INFO     | text_generation_server.utils.import_utils:<module>:75 - Detected system cuda
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 118, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 719, in get_model
    return CausalLM.fallback(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/causal_lm.py", line 612, in fallback
    model = AutoModelForCausalLM.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3341, in from_pretrained
    hf_quantizer.validate_environment(
  File "/opt/conda/lib/python3.10/site-packages/transformers/quantizers/quantizer_awq.py", line 53, in validate_environment
    raise ImportError("Loading an AWQ quantized model requires auto-awq library (`pip install autoawq`)")
ImportError: Loading an AWQ quantized model requires auto-awq library (`pip install autoawq`)
 rank=0
2024-08-22T06:35:41.580962Z ERROR text_generation_launcher: Shard 0 failed to start
2024-08-22T06:35:41.580977Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
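For what it's worth, this is roughly how I'd confirm the package really is missing from the 2.2.0 image (my own sketch, not taken from the logs above; as far as I know the autoawq package imports as `awq`):

# Run Python inside the stock image with the entrypoint overridden
docker run --rm --entrypoint python \
  ghcr.io/huggingface/text-generation-inference:2.2.0 \
  -c "import awq"
# I'd expect this to fail with "ModuleNotFoundError: No module named 'awq'",
# matching the auto-awq check in quantizer_awq.py that raises the ImportError above.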
Second, I tried Llama 3 70B with GPTQ:
2024-08-22T07:50:05.356461Z  INFO text_generation_launcher: Args {
    model_id: "MaziyarPanahi/Meta-Llama-3-70B-Instruct-GPTQ",
    revision: None,
    validation_workers: 2,
    sharded: Some(false),
    num_shard: None,
    quantize: Some(Gptq),
    speculate: None,
    dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: Some(1500),
    max_input_length: None,
    max_total_tokens: None,
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: None,
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "llm2.internal",
    port: 80,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: Some("/data"),
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
    lora_adapters: None,
    disable_usage_stats: false,
    disable_crash_reports: false,
}
2024-08-22T07:50:05.356599Z  INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-08-22T07:50:33.497367Z  INFO text_generation_launcher: Default `max_total_tokens` to 4096
2024-08-22T07:50:33.497398Z  INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 1550
2024-08-22T07:50:33.497406Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-08-22T07:50:33.497694Z  INFO download: text_generation_launcher: Starting check and download process for MaziyarPanahi/Meta-Llama-3-70B-Instruct-GPTQ
2024-08-22T07:50:37.443461Z  INFO text_generation_launcher: Download file: model.safetensors
2024-08-22T07:57:59.818601Z  INFO text_generation_launcher: Downloaded /data/models--MaziyarPanahi--Meta-Llama-3-70B-Instruct-GPTQ/snapshots/46c7afccd4f9345a3d43c1468fde1034cf0a0932/model.safetensors in 0:07:22.
2024-08-22T07:57:59.818719Z  INFO text_generation_launcher: Download: [1/1] -- ETA: 0
2024-08-22T07:58:00.452718Z  INFO download: text_generation_launcher: Successfully downloaded weights for MaziyarPanahi/Meta-Llama-3-70B-Instruct-GPTQ
2024-08-22T07:58:00.453007Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-08-22T07:58:03.658769Z  WARN text_generation_launcher: Could not import Flash Attention enabled models: `USE_FLASH_ATTENTION` is false.
2024-08-22T07:58:06.383846Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 118, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 719, in get_model
    return CausalLM.fallback(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/causal_lm.py", line 612, in fallback
    model = AutoModelForCausalLM.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3336, in from_pretrained
    hf_quantizer = AutoHfQuantizer.from_config(config.quantization_config, pre_quantized=pre_quantized)
  File "/opt/conda/lib/python3.10/site-packages/transformers/quantizers/auto.py", line 136, in from_config
    return target_cls(quantization_config, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/quantizers/quantizer_gptq.py", line 47, in __init__
    from optimum.gptq import GPTQQuantizer
ModuleNotFoundError: No module named 'optimum'
2024-08-22T07:58:07.062648Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
2024-08-22 07:58:01.988 | INFO     | text_generation_server.utils.import_utils:<module>:75 - Detected system cuda
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 118, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 719, in get_model
    return CausalLM.fallback(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/causal_lm.py", line 612, in fallback
    model = AutoModelForCausalLM.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3336, in from_pretrained
    hf_quantizer = AutoHfQuantizer.from_config(config.quantization_config, pre_quantized=pre_quantized)
  File "/opt/conda/lib/python3.10/site-packages/transformers/quantizers/auto.py", line 136, in from_config
    return target_cls(quantization_config, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/quantizers/quantizer_gptq.py", line 47, in __init__
    from optimum.gptq import GPTQQuantizer
ModuleNotFoundError: No module named 'optimum'
 rank=0
2024-08-22T07:58:07.161187Z ERROR text_generation_launcher: Shard 0 failed to start
2024-08-22T07:58:07.161212Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
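Same idea for the GPTQ side; here is a quick way one could list which quantization packages are actually present in the image (again my own sketch, not part of the logs):

# List installed packages and filter for the ones the transformers fallback path needs
docker run --rm --entrypoint pip \
  ghcr.io/huggingface/text-generation-inference:2.2.0 \
  list | grep -iE "optimum|auto-gptq|autoawq"
# If nothing matches, the fallback path (CausalLM.fallback ->
# AutoModelForCausalLM.from_pretrained) has no way to load GPTQ weights,
# which would be consistent with the ModuleNotFoundError above.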
I am running the container through Docker Compose with the parameters below; for the AWQ run I changed the container and command accordingly.
llm2:
  container_name: llm2.internal
  hostname: llm2.internal
  profiles:
    - common
  image: ghcr.io/huggingface/text-generation-inference:2.2.0
  command: --model-id MaziyarPanahi/Meta-Llama-3-70B-Instruct-GPTQ
  volumes:
    - /home/llm_data:/data
  ports:
    - "2450:80"
  environment:
    - HF_HUB_ENABLE_HF_TRANSFER="false"
    - USE_FLASH_ATTENTION=False
    - MAX_INPUT_TOKENS=1500
    - SHARDED=false
    - HF_TOKEN=
    - QUANTIZE=gptq
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ["4", "5", "6", "7"]
            capabilities: [gpu]
  networks:
    - container-network
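For completeness, the compose service above should be roughly equivalent to this plain docker run invocation (my reconstruction; HF_TOKEN left blank as in the config, the compose-level network omitted):

docker run --rm --name llm2.internal --hostname llm2.internal \
  --gpus '"device=4,5,6,7"' \
  -v /home/llm_data:/data \
  -p 2450:80 \
  -e HF_HUB_ENABLE_HF_TRANSFER=false \
  -e USE_FLASH_ATTENTION=False \
  -e MAX_INPUT_TOKENS=1500 \
  -e SHARDED=false \
  -e HF_TOKEN= \
  -e QUANTIZE=gptq \
  ghcr.io/huggingface/text-generation-inference:2.2.0 \
  --model-id MaziyarPanahi/Meta-Llama-3-70B-Instruct-GPTQ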
OS: Ubuntu 22.04.4 LTS
Rust version: N/A
Container version: 2.2.0 - sha256:d39d513f13727ffa9b6a4d0e949f36413b944aabc9a236c0aa2986c929906769
Model being used: Llama 3.1 and 3.0
GPUs: 4x Volta V100 - hence disabled Flash Attention
Quantized Llama models should run in the Docker container.
I did double-check: this issue is indeed caused by the lack of Flash Attention support on V100s. There is no such problem on the Ada generation, but once you turn Flash Attention off there, it starts to happen as well.
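In case anyone wants to verify the hardware angle, this is the kind of check I'd run (sketch only, same image as above):

# Print the CUDA compute capability the image sees on one of the V100s
docker run --rm --gpus device=4 --entrypoint python \
  ghcr.io/huggingface/text-generation-inference:2.2.0 \
  -c "import torch; print(torch.cuda.get_device_capability(0))"
# V100 reports (7, 0); to my knowledge the flash-attention v2 kernels need
# compute capability >= 8.0, so on these cards the server ends up on the
# transformers CausalLM.fallback path seen in the tracebacks above.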