huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Unable to load the local model file into LoRA adaptors #2143

Open mhou7712 opened 6 days ago

mhou7712 commented 6 days ago

System Info

text-generation-launcher 2.1.0

Information

Tasks

Reproduction

Run the following command inside the ghcr.io/huggingface/text-generation-inference:2.1.0 container:

text-generation-launcher --hostname 0.0.0.0 -p 5029 -e --lora-adapters "/var/spool/llm_models/checkpoint-576" --model-id "/var/spool/llm_models/Mistral-7B-v0.1_032124" --cuda-memory-fraction 0.90 --max-total-tokens 5000 --max-input-length 4096

where "/var/spool/llm_models/Mistral-7B-v0.1_032124" and "/var/spool/llm_models/checkpoint-576" are the local filesystem.

The log and error message are shown below:

2024-06-28T22:38:37.606983Z INFO text_generation_launcher: Args { model_id: "/var/spool/llm_models/Mistral-7B-v0.1_032124", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: Some(4096), max_total_tokens: Some(5000), waiting_served_ratio: 0.3, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "0.0.0.0", port: 5029, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 0.9, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-generation-inference.router", cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: true, max_client_batch_size: 4, lora_adapters: Some("/var/spool/llm_models/checkpoint-576") }
2024-06-28T22:38:37.607480Z INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4146
2024-06-28T22:38:37.607491Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-06-28T22:38:37.607705Z INFO download: text_generation_launcher: Starting download process.
2024-06-28T22:38:40.422554Z INFO text_generation_launcher: Detected system cuda
2024-06-28T22:38:42.803387Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-06-28T22:38:43.815171Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-06-28T22:38:43.815615Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-06-28T22:38:46.949677Z INFO text_generation_launcher: Detected system cuda
2024-06-28T22:38:49.127548Z WARN text_generation_launcher: LoRA adapters are enabled. This is an experimental feature and may not work as expected.
2024-06-28T22:38:53.826828Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-28T22:39:03.854669Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-28T22:39:13.862108Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-28T22:39:23.871495Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-28T22:39:33.955301Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-28T22:39:43.964370Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-28T22:39:44.362823Z INFO text_generation_launcher: Loading adapter weights into model: /var/spool/llm_models/checkpoint-576
2024-06-28T22:39:44.607361Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 106, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 256, in serve_inner
    model.load_adapter(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/model.py", line 214, in load_adapter
    ) = load_and_merge_adapters(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/adapter.py", line 53, in load_and_merge_adapters
    return load_module_map(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/adapter.py", line 173, in load_module_map
    adapter_filenames = hub._cached_adapter_weight_files(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/hub.py", line 25, in _cached_adapter_weight_files
    d = _get_cached_revision_directory(adapter_id, revision)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/hub.py", line 108, in _get_cached_revision_directory
    file_download.repo_folder_name(repo_id=model_id, repo_type="model")
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
    validate_repo_id(arg_value)
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
    raise HFValidationError(
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/var/spool/llm_models/checkpoint-576'. Use `repo_type` argument if needed.

It seems that the adapter loader does not recognize "/var/spool/llm_models/checkpoint-576" as a local filesystem path.
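
The failing check can be reproduced outside TGI with huggingface_hub alone. Here is a minimal sketch based on the traceback above (the import paths are taken from the traceback; the adapter path is just from my setup):

# Reproduce the validation step that fails in the traceback above:
# huggingface_hub only accepts "repo_name" or "namespace/repo_name",
# so a local filesystem path can never pass this check.
from huggingface_hub.errors import HFValidationError
from huggingface_hub.utils._validators import validate_repo_id

try:
    validate_repo_id("/var/spool/llm_models/checkpoint-576")
except HFValidationError as err:
    print(err)  # Repo id must be in the form 'repo_name' or 'namespace/repo_name': ...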

If I run the following command without the adapter:

text-generation-launcher --hostname 0.0.0.0 -p 5029 -e --model-id "/var/spool/llm_models/Mistral-7B-v0.1_032124" --cuda-memory-fraction 0.90 --max-total-tokens 5000 --max-input-length 4096

The base model is loaded without an issue.

Expected behavior

Can we have the adapter load its files from the local filesystem, the same way "--model-id" does?
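
Something along these lines is what I have in mind. This is purely an illustrative sketch of the desired behavior, not TGI's actual code, and cached_adapter_weight_files is a made-up stand-in for the existing Hub lookup:

import os

def cached_adapter_weight_files(adapter_id: str) -> list[str]:
    # Stand-in for TGI's existing Hub-cache lookup (placeholder only).
    raise NotImplementedError

def resolve_adapter_files(adapter_id: str) -> list[str]:
    # If the adapter id is an existing directory, treat it as a local
    # checkpoint instead of a Hub repo id, the same way --model-id
    # already accepts a local path.
    if os.path.isdir(adapter_id):
        return [
            os.path.join(adapter_id, name)
            for name in os.listdir(adapter_id)
            if name.endswith((".safetensors", ".bin"))
        ]
    # Otherwise fall back to the Hub-cache lookup that exists today.
    return cached_adapter_weight_files(adapter_id)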

mhou7712 commented 6 days ago

I tried the LoRA example from the documentation:

text-generation-launcher --hostname 0.0.0.0 -p 5029 -e --lora-adapters "predibase/customer_support" --model-id "/var/spool/llm_models/Mistral-7B-v0.1_032124" --cuda-memory-fraction 0.90 --max-total-tokens 5000 --max-input-length 4096

It worked fine.

So the adapter can load model files from a Hub repo but not from a local path.

Egelvein commented 4 days ago

+1, same problem here.

flozi00 commented 3 days ago

Are you using a Docker environment?

mhou7712 commented 3 days ago

Yes, using the image pulled from "ghcr.io/huggingface/text-generation-inference:2.1.0".

Question: can the adapter load model files from a local path instead of a repo? Thanks.

flozi00 commented 3 days ago

Maybe you didn't mount the folder containing the weights as a volume into the container?

mhou7712 commented 3 days ago

Please compare the following two command lines:

LoRA with repo: text-generation-launcher --hostname 0.0.0.0 -p 5029 -e --lora-adapters "predibase/customer_support" --model-id "/var/spool/llm_models/Mistral-7B-v0.1_032124" --cuda-memory-fraction 0.90 --max-total-tokens 5000 --max-input-length 4096

LoRA with local: text-generation-launcher --hostname 0.0.0.0 -p 5029 -e --lora-adapters "/var/spool/llm_models/checkpoint-576" --model-id "/var/spool/llm_models/Mistral-7B-v0.1_032124" --cuda-memory-fraction 0.90 --max-total-tokens 5000 --max-input-length 4096

"LoRA with repo" works with me and "/var/spool/llm_models/Mistral-7B-v0.1_032124" is visible inside the container for the base model, and the same volume "/var/spool/llm_models/" is visible for "LoRA with local". Yes ,"checkpoint-576" is accessible under "/var/spool/llm_models" inside the container.

Thanks for asking; I did check that all model files are available inside the container.
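
For the record, a check along these lines from inside the container shows the checkpoint files are there (exact file names depend on how the checkpoint was saved; a PEFT-style checkpoint usually has adapter_config.json plus the adapter weights):

import os

adapter_dir = "/var/spool/llm_models/checkpoint-576"
print(os.path.isdir(adapter_dir))            # True inside the container
for name in sorted(os.listdir(adapter_dir)):
    print(name)                              # expect adapter_config.json and adapter weight files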

bwhartlove commented 2 days ago

+1 same issue

newsbreakDuadua9 commented 2 days ago

Same issue here. The whole feature is not working in a Docker environment. Even when passing a random string as adapter_id, the inference client still accepts it. LoRA is not enabled at all!
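
For reference, I tested it roughly like this. Going by the multi-LoRA documentation, the adapter is selected per request with an adapter_id parameter; the host/port here are just taken from the commands above:

import requests

url = "http://localhost:5029/generate"
prompt = "What is your refund policy?"

# One request targeting the adapter, one against the base model only.
with_adapter = {
    "inputs": prompt,
    "parameters": {"max_new_tokens": 64, "adapter_id": "predibase/customer_support"},
}
base_only = {"inputs": prompt, "parameters": {"max_new_tokens": 64}}

# If a bogus adapter_id gives the exact same output as the base model,
# the adapter is most likely being ignored.
print(requests.post(url, json=with_adapter, timeout=60).json())
print(requests.post(url, json=base_only, timeout=60).json())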

mhou7712 commented 1 day ago

@newsbreakDuadua9 A quick question: if adapter_id is not a repo, then the adapter loader assumes it is something else (either a local filesystem directory or something that does not exist), right?

I have not tested passing a random string to --model-id. Yeah, that is a good test.

Thanks.

mhou7712 commented 1 day ago

@flozi00 I am wondering whether this issue can be assigned to an adapter expert so they can help us look into it. Thanks.