huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0
8.35k stars · 944 forks

Cannot load model HuggingFaceM4/idefics2-8b-AWQ #2036

Open jla346 opened 1 month ago

jla346 commented 1 month ago

System Info

2024-06-07T10:44:47.805201Z INFO text_generation_launcher: Args { model_id: "bigscience/bloom-560m", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 0.3, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "62c9636e946d", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some( "/data", ), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: true, max_client_batch_size: 4, }

Information

Tasks

Reproduction

The following error is generated when trying to load the model "HuggingFaceM4/idefics2-8b-AWQ" using the latest Docker image (arguments used: --model-id HuggingFaceM4/idefics2-8b-AWQ):

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/idefics2.py", line 297, in <listcomp>
    Idefics2EncoderLayer(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/idefics2.py", line 255, in __init__
    self.self_attn = Idefics2VisionAttention(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/idefics2.py", line 152, in __init__
    self.qkv = TensorParallelColumnLinear.load_multi(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/tensor_parallel.py", line 162, in load_multi
    weight = weights.get_multi_weights_col(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 271, in get_multi_weights_col
    raise RuntimeError(
RuntimeError: Cannot load awq weight, make sure the model is already quantized rank=0
Error: ShardCannotStart
2024-06-07T10:36:51.919698Z ERROR text_generation_launcher: Shard 0 failed to start
2024-06-07T10:36:51.919731Z INFO text_generation_launcher: Shutting down shards
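The error originates in the weight loader: with --quantize awq it expects each linear layer to be stored as packed AWQ tensors, and raises when it finds only a plain weight. A hedged sketch (not TGI's actual code; the tensor names below follow the common AWQ convention of <prefix>.qweight / <prefix>.qzeros / <prefix>.scales, and the example layer prefixes are hypothetical) of how a checkpoint can be part-quantized:

```python
# Hedged sketch (not TGI code): AWQ checkpoints store each quantized linear
# layer as packed tensors <prefix>.qweight / <prefix>.qzeros / <prefix>.scales.
# A loader told to dequantize a prefix that only carries a plain
# <prefix>.weight (e.g. an unquantized vision tower) must fail, roughly
# like the RuntimeError above.

AWQ_SUFFIXES = ("qweight", "qzeros", "scales")

def awq_status(tensor_names, prefix):
    """Classify one layer prefix as 'awq', 'unquantized', or 'missing'."""
    names = set(tensor_names)
    if all(f"{prefix}.{s}" in names for s in AWQ_SUFFIXES):
        return "awq"
    if f"{prefix}.weight" in names:
        return "unquantized"
    return "missing"

# Synthetic tensor list mimicking a checkpoint whose language model is AWQ
# but whose vision encoder kept full-precision weights (hypothetical names).
tensors = [
    "model.layers.0.self_attn.q_proj.qweight",
    "model.layers.0.self_attn.q_proj.qzeros",
    "model.layers.0.self_attn.q_proj.scales",
    "vision_model.encoder.layers.0.self_attn.q_proj.weight",
]

print(awq_status(tensors, "model.layers.0.self_attn.q_proj"))                  # awq
print(awq_status(tensors, "vision_model.encoder.layers.0.self_attn.q_proj"))  # unquantized
```

Under this reading, the traceback (which goes through Idefics2VisionAttention) is consistent with the vision tower of the AWQ repo being unquantized while the loader applies --quantize awq to every layer.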

Expected behavior

Model loaded correctly.

Nehc commented 3 weeks ago

I have the same problem...

LysandreJik commented 3 weeks ago

Hello! I'd be happy to help you both out, but I think I'm missing the full command you used to run the server. Do you mind sharing it (either the docker command or the text-generation-launcher command)?

It would also help me significantly to know your machine setup. You can show that by running the launcher with --env:

text-generation-launcher --env

Thanks a lot!

Nehc commented 3 weeks ago

Everything is as simple as possible:

docker run --gpus '"device=0"' -p 8080:80 -v /home/huggingface/hub:/data ghcr.io/huggingface/text-generation-inference --model-id HuggingFaceM4/idefics2-8b-AWQ --quantize awq

Connecting to the Docker container and running text-generation-launcher --env there is quite problematic.
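For what it's worth, you don't need an interactive shell inside the container to get the --env report; docker exec can run the launcher directly in an already-running container. A minimal sketch, assuming the container is named "tgi" (substitute the name or ID shown by docker ps):

```shell
# Hedged sketch: run the env report inside a running TGI container.
# "tgi" is an assumed container name; take the real one from `docker ps`.
CONTAINER=${CONTAINER:-tgi}
CMD="docker exec $CONTAINER text-generation-launcher --env"
echo "$CMD"
# Only actually execute when docker is available on this machine.
if command -v docker >/dev/null 2>&1; then
  $CMD
fi
```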

LysandreJik commented 3 weeks ago

Thanks a lot @Nehc, I appreciate it.

@danieldk I think this is quite close to the code you contributed a few weeks back; I verified I indeed get the same failure on the 2.0.4 Docker image. If you have the bandwidth to take a look at it, that would be awesome.

Nehc commented 3 weeks ago

You're welcome! :) @LysandreJik, tell me, did this problem perhaps not exist in a previous release, so that I just need to pull an earlier image instead of latest? If so, can you tell me which one?