Closed: LYHGeorge closed this issue 9 months ago.
Seems like you're lacking flash attention and GPTQ kernels:
cd server && make install install-flash-attention-v2-cuda
make install exllama
Should help here. The error is a bit strange though, it's possible you already have something running on the socket. You can check out options here: https://huggingface.co/docs/text-generation-inference/basic_tutorials/launcher#shardudspath
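As a sketch of that suggestion: the launcher's Args dump below shows a shard_uds_path and port field, so both can be moved away from anything already in use. Flag names come from that dump; the path and port values here are illustrative only:

```shell
# Hypothetical invocation: relocate the shard unix sockets and the HTTP
# port. Flag names mirror the Args dump (shard_uds_path, port); the
# values /tmp/tgi-shard and 8080 are placeholders, not recommendations.
text-generation-launcher \
  --model-id /models/sqlcoder-gptq-4bit/ \
  --num-shard 2 \
  --quantize gptq \
  --shard-uds-path /tmp/tgi-shard \
  --port 8080
```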
Thanks for your suggestion.
I tried setting --port 6666, but it didn't work.
netstat -ap
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 ad0f402c2ce7:48022 180.119.146.98:443 TIME_WAIT -
tcp 0 0 ad0f402c2ce7:49780 ubuntu-mirror-3.ps6.:80 TIME_WAIT -
tcp 0 0 ad0f402c2ce7:38122 ubuntu-mirror-2.ps6.:80 TIME_WAIT -
tcp 0 0 ad0f402c2ce7:40170 152.199.39.144:443 TIME_WAIT -
tcp 0 0 ad0f402c2ce7:56338 actiontoad.canonical:80 TIME_WAIT -
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags Type State I-Node PID/Program name Path
It seems nothing is running on port 6666.
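One way to double-check this (a sketch, assuming ss from iproute2 is available; netstat -ltnp gives the same information):

```shell
# Look for TCP listeners on the chosen port. port_status ends up as
# either "free" or "in use" depending on what ss reports.
listeners=$( (ss -ltn 2>/dev/null || true) | grep ':6666' || true)
if [ -z "$listeners" ]; then
  port_status="free"
else
  port_status="in use"
fi
echo "port 6666 is $port_status"
# A stale unix socket left by a previous launcher run can also block
# startup (the default shard_uds_path is /tmp/text-generation-server):
ls /tmp/text-generation-server* 2>/dev/null || echo "no stale TGI sockets"
```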
The problem was solved by using the official Docker image. :)
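For reference, a typical invocation of the official image (a sketch: the image name comes from the warning logs below, while the host port, tag, and volume path are assumptions to adapt to your setup):

```shell
# Run TGI from the official image named in the Flash Attention warning
# (ghcr.io/huggingface/text-generation-inference). The container serves
# on port 80; 8080 on the host and the /models volume are illustrative.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v /models:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /data/sqlcoder-gptq-4bit/ \
  --num-shard 2 \
  --quantize gptq
```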
System Info
model: defog/sqlcoder  quantize: GPTQ 4-bit  GPU: 2 × 4090
Information
Tasks
Reproduction
cli: text-generation-launcher --model-id /models/sqlcoder-gptq-4bit/ --num-shard 2 --quantize gptq
Error log:
2024-01-11T09:06:39.592524Z INFO text_generation_launcher: Args { model_id: "/models/sqlcoder-gptq-4bit/", revision: None, validation_workers: 2, sharded: None, num_shard: Some(2), quantize: Some(Gptq), dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "ad0f402c2ce7", port: 3000, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: None, weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2024-01-11T09:06:39.592613Z INFO text_generation_launcher: Sharding model on 2 processes
2024-01-11T09:06:39.592862Z INFO download: text_generation_launcher: Starting download process.
2024-01-11T09:06:42.853412Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-01-11T09:06:43.198415Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-01-11T09:06:43.198983Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-01-11T09:06:43.199081Z INFO shard-manager: text_generation_launcher: Starting shard rank=1
2024-01-11T09:06:47.039277Z WARN text_generation_launcher: Disabling exllama v2 and using v1 instead because there are issues when sharding
2024-01-11T09:06:47.098756Z INFO text_generation_launcher: Discovered apex.normalization.FusedRMSNorm - will use it instead of T5LayerNorm
2024-01-11T09:06:47.211590Z WARN text_generation_launcher: Disabling exllama v2 and using v1 instead because there are issues when sharding
2024-01-11T09:06:47.272995Z INFO text_generation_launcher: Discovered apex.normalization.FusedRMSNorm - will use it instead of T5LayerNorm
2024-01-11T09:06:48.089794Z WARN text_generation_launcher: Unable to use Flash Attention V2: Flash Attention V2 is not installed. Use the official Docker image (ghcr.io/huggingface/text-generation-inference:latest) or install flash attention v2 with
cd server && make install install-flash-attention-v2-cuda
2024-01-11T09:06:48.119529Z WARN text_generation_launcher: Mixtral: megablocks is not installed
2024-01-11T09:06:48.282723Z WARN text_generation_launcher: Unable to use Flash Attention V2: Flash Attention V2 is not installed. Use the official Docker image (ghcr.io/huggingface/text-generation-inference:latest) or install flash attention v2 with
cd server && make install install-flash-attention-v2-cuda
2024-01-11T09:06:48.314319Z WARN text_generation_launcher: Mixtral: megablocks is not installed
2024-01-11T09:06:50.254960Z WARN text_generation_launcher: Exllama GPTQ cuda kernels (which are faster) could have been used, but are not currently installed, try using BUILD_EXTENSIONS=True
2024-01-11T09:06:50.264358Z WARN text_generation_launcher: Exllama GPTQ cuda kernels (which are faster) could have been used, but are not currently installed, try using BUILD_EXTENSIONS=True
2024-01-11T09:06:53.212506Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-01-11T09:06:53.214921Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-01-11T09:06:54.181535Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-01-11T09:06:54.215925Z INFO shard-manager: text_generation_launcher: Shard ready in 11.015229328s rank=0
2024-01-11T09:06:54.267571Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-1
2024-01-11T09:06:54.313920Z INFO shard-manager: text_generation_launcher: Shard ready in 11.112762893s rank=1
2024-01-11T09:06:54.410728Z INFO text_generation_launcher: Starting Webserver
2024-01-11T09:06:54.511099Z WARN text_generation_router: router/src/main.rs:194: no pipeline tag found for model /models/sqlcoder-gptq-4bit/
2024-01-11T09:06:54.513826Z ERROR service_discovery: text_generation_client: router/client/src/lib.rs:33: Server error: Method not found!
Error: Connection(Generation("Method not found!"))
2024-01-11T09:06:54.612279Z ERROR text_generation_launcher: Webserver Crashed
2024-01-11T09:06:54.612302Z INFO text_generation_launcher: Shutting down shards
2024-01-11T09:06:54.846323Z INFO shard-manager: text_generation_launcher: Shard terminated rank=1
2024-01-11T09:06:54.921328Z INFO shard-manager: text_generation_launcher: Shard terminated rank=0
Error: WebserverFailed
Expected behavior
Please tell me how to deal with this problem.