mhou7712 opened 6 days ago
I tried the LoRA example from the web documentation:
text-generation-launcher --hostname 0.0.0.0 -p 5029 -e --lora-adapters "predibase/customer_support" --model-id "/var/spool/llm_models/Mistral-7B-v0.1_032124" --cuda-memory-fraction 0.90 --max-total-tokens 5000 --max-input-length 4096
It worked fine.
So the adapter can load model files from a Hub repo but not from a local path.
+1, the same problem.
Are you using a Docker environment?
Yes, and I downloaded the image from "ghcr.io/huggingface/text-generation-inference:2.1.0".
Question: can the adapter load model files from a local path instead of a repo? Thanks.
Maybe you didn't mount the folder containing the weights as a volume into the container?
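For reference, a minimal sketch of what I mean, reusing the paths from your commands (the mount target being identical to the host path is an illustrative choice, not a requirement):

docker run --gpus all --shm-size 1g -p 5029:5029 \
  -v /var/spool/llm_models:/var/spool/llm_models \
  ghcr.io/huggingface/text-generation-inference:2.1.0 \
  --hostname 0.0.0.0 --port 5029 \
  --model-id "/var/spool/llm_models/Mistral-7B-v0.1_032124" \
  --lora-adapters "/var/spool/llm_models/checkpoint-576"

Without the -v mount, both paths would only exist on the host and neither the base model nor the adapter would be visible inside the container.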
Please compare the following two command lines:
LoRA with repo:
text-generation-launcher --hostname 0.0.0.0 -p 5029 -e --lora-adapters "predibase/customer_support" --model-id "/var/spool/llm_models/Mistral-7B-v0.1_032124" --cuda-memory-fraction 0.90 --max-total-tokens 5000 --max-input-length 4096
LoRA with local:
text-generation-launcher --hostname 0.0.0.0 -p 5029 -e --lora-adapters "/var/spool/llm_models/checkpoint-576" --model-id "/var/spool/llm_models/Mistral-7B-v0.1_032124" --cuda-memory-fraction 0.90 --max-total-tokens 5000 --max-input-length 4096
"LoRA with repo" works with me and "/var/spool/llm_models/Mistral-7B-v0.1_032124" is visible inside the container for the base model, and the same volume "/var/spool/llm_models/" is visible for "LoRA with local". Yes ,"checkpoint-576" is accessible under "/var/spool/llm_models" inside the container.
Thanks for asking; I did check that all model files are available inside the container.
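For completeness, the check was along these lines (assuming the usual PEFT checkpoint layout of adapter_config.json plus adapter_model.safetensors or adapter_model.bin; <container> is a placeholder for the container name):

docker exec -it <container> ls /var/spool/llm_models/checkpoint-576
# expected output includes: adapter_config.json  adapter_model.safetensors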
+1 same issue
Same issue here. The whole feature is not working in the Docker environment. Even when passing a random string as adapter_id, the inference client still accepts it. LoRA is not enabled at all!
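For anyone who wants to reproduce that, a sketch of the kind of request I mean (the adapter_id value is deliberately bogus, and whether the server actually honors this field is exactly what is in question):

curl http://localhost:5029/generate \
  -X POST -H "Content-Type: application/json" \
  -d '{"inputs": "Hello", "parameters": {"adapter_id": "some-random-string", "max_new_tokens": 20}}'

If LoRA were actually enabled, a nonexistent adapter_id should be rejected rather than silently accepted.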
@newsbreakDuadua9 A quick question: if adapter_id is not a repo, then the adapter loader assumes it is something else (I mean, either a local filesystem directory or something that does not exist), right?
I have not tested passing a random string with --model-id. Yeah, it is a good test.
Thanks.
@flozi00 I am wondering whether this issue can be assigned to the adapter expert so they can help us look into it. Thanks.
System Info
text-generation-launcher 2.1.0
Reproduction
Execute the following command inside ghcr.io/huggingface/text-generation-inference:2.1.0:
text-generation-launcher --hostname 0.0.0.0 -p 5029 -e --lora-adapters "/var/spool/llm_models/checkpoint-576" --model-id "/var/spool/llm_models/Mistral-7B-v0.1_032124" --cuda-memory-fraction 0.90 --max-total-tokens 5000 --max-input-length 4096
where "/var/spool/llm_models/Mistral-7B-v0.1_032124" and "/var/spool/llm_models/checkpoint-576" are the local filesystem.
The log and error message are shown below:
2024-06-28T22:38:37.606983Z INFO text_generation_launcher: Args { model_id: "/var/spool/llm_models/Mistral-7B-v0.1_032124", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: Some(4096), max_total_tokens: Some(5000), waiting_served_ratio: 0.3, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "0.0.0.0", port: 5029, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 0.9, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-generation-inference.router", cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: true, max_client_batch_size: 4, lora_adapters: Some("/var/spool/llm_models/checkpoint-576") }
2024-06-28T22:38:37.607480Z INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4146
2024-06-28T22:38:37.607491Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-06-28T22:38:37.607705Z INFO download: text_generation_launcher: Starting download process.
2024-06-28T22:38:40.422554Z INFO text_generation_launcher: Detected system cuda
2024-06-28T22:38:42.803387Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-06-28T22:38:43.815171Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-06-28T22:38:43.815615Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-06-28T22:38:46.949677Z INFO text_generation_launcher: Detected system cuda
2024-06-28T22:38:49.127548Z WARN text_generation_launcher: LoRA adapters are enabled. This is an experimental feature and may not work as expected.
2024-06-28T22:38:53.826828Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-28T22:39:03.854669Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-28T22:39:13.862108Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-28T22:39:23.871495Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-28T22:39:33.955301Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-28T22:39:43.964370Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-28T22:39:44.362823Z INFO text_generation_launcher: Loading adapter weights into model: /var/spool/llm_models/checkpoint-576
2024-06-28T22:39:44.607361Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in

It seems that the adapter loader does not recognize that "/var/spool/llm_models/checkpoint-576" is a local filesystem path.
If I execute the following command without the adapter:
text-generation-launcher --hostname 0.0.0.0 -p 5029 -e --model-id "/var/spool/llm_models/Mistral-7B-v0.1_032124" --cuda-memory-fraction 0.90 --max-total-tokens 5000 --max-input-length 4096
The base model is loaded without an issue.
Expected behavior
Can we have the adapter load its files from the local filesystem, like "--model-id" does?
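A sketch of the behavior being asked for, assuming --model-id simply treats an existing directory as a local model and --lora-adapters could apply the same check (run inside the container):

# both paths are plain directories, so the same "local path vs. Hub repo"
# resolution used for the base model should work for the adapter
test -d /var/spool/llm_models/Mistral-7B-v0.1_032124 && echo "base model: local directory"
test -d /var/spool/llm_models/checkpoint-576 && echo "adapter: local directory"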