Open jla346 opened 1 month ago
I have the same problem...
Hello! I'd be happy to help you both out, but I think I'm missing the full command you used to run the server. Do you mind sharing the command (either the docker command or the text-generation-launcher command)?
It would also help me significantly if I could see your machine setup. You can show that by running the launcher with `--env`:

```
text-generation-launcher --env
```
Thanks a lot!
Everything is as simple as possible:

```
docker run --gpus '"device=0"' -p 8080:80 -v /home/huggingface/hub:/data ghcr.io/huggingface/text-generation-inference --model-id HuggingFaceM4/idefics2-8b-AWQ --quantize awq
```

Connecting to the docker container and running `text-generation-launcher --env` there is quite problematic.
Thanks a lot @Nehc, I appreciate it.
@danieldk I think it's quite close to the code that you contributed a few weeks back; I verified I indeed get the same failure on the 2.0.4 docker. If you have the bandwidth to take a look at it, it would be awesome
you're welcome! )
@LysandreJik, tell me, maybe this problem didn't exist in a previous release and I just need to take an earlier image instead of the latest one? Can you tell me which one?
System Info
```
2024-06-07T10:44:47.805201Z INFO text_generation_launcher: Args { model_id: "bigscience/bloom-560m", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 0.3, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "62c9636e946d", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some( "/data", ), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: true, max_client_batch_size: 4, }
```
Reproduction
The following error is generated when trying to load the model "HuggingFaceM4/idefics2-8b-AWQ" with the latest docker image (arguments used: --model-id HuggingFaceM4/idefics2-8b-AWQ):
```
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/idefics2.py", line 297, in
    Idefics2EncoderLayer(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/idefics2.py", line 255, in __init__
    self.self_attn = Idefics2VisionAttention(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/idefics2.py", line 152, in __init__
    self.qkv = TensorParallelColumnLinear.load_multi(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/tensor_parallel.py", line 162, in load_multi
    weight = weights.get_multi_weights_col(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 271, in get_multi_weights_col
    raise RuntimeError(
RuntimeError: Cannot load `awq` weight, make sure the model is already quantized rank=0
Error: ShardCannotStart
2024-06-07T10:36:51.919698Z ERROR text_generation_launcher: Shard 0 failed to start
2024-06-07T10:36:51.919731Z INFO text_generation_launcher: Shutting down shards
```

Expected behavior
Model loaded correctly.
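For anyone digging into this: the traceback pattern above is consistent with the loader requesting AWQ-packed tensors for the idefics2 vision tower while the checkpoint stores that tower in plain fp16. The sketch below is hypothetical (it is not the actual TGI code; the function body and tensor keys are assumptions based on the error message), but it shows how such a mismatch would produce exactly this RuntimeError:

```python
# Hypothetical sketch of the failure mode, NOT the actual TGI code.
# AWQ-quantized linear layers typically ship packed tensors named
# "<prefix>.qweight" (plus qzeros/scales); layers left in fp16 only
# ship "<prefix>.weight".

def get_multi_weights_col(tensors, prefixes, quantize):
    """Load column-parallel weights for several prefixes at once."""
    if quantize == "awq":
        try:
            # An AWQ checkpoint must provide packed qweight tensors.
            return [tensors[f"{p}.qweight"] for p in prefixes]
        except KeyError:
            # This is the error seen in the traceback above.
            raise RuntimeError(
                "Cannot load `awq` weight, make sure the model is already quantized"
            )
    return [tensors[f"{p}.weight"] for p in prefixes]

# Language-model blocks are quantized, so loading them as AWQ works:
lm_tensors = {"model.layers.0.self_attn.q_proj.qweight": "packed int32"}
print(get_multi_weights_col(lm_tensors, ["model.layers.0.self_attn.q_proj"], "awq"))

# ...but if the vision tower only has fp16 weights, asking for AWQ fails:
vision_tensors = {"vision_model.encoder.layers.0.self_attn.q_proj.weight": "fp16"}
try:
    get_multi_weights_col(
        vision_tensors, ["vision_model.encoder.layers.0.self_attn.q_proj"], "awq"
    )
except RuntimeError as e:
    print(e)
```

If that is the cause, the fix would be for the loader to fall back to unquantized weights for the vision encoder rather than applying `--quantize awq` to every linear layer.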