Feature request

I have download the model, so I want to run it use local model, eht sample is: docker run --gpus all --shm-size 1g -p 8080:80 -v /data/model/:/data/ \ ghcr.io/predibase/lorax:latest --model-id /data/model/Qwen-14B-Chat

Motivation

I want to use the local model. Our computes don't allow to visit huggingface.co.

Your contribution

No.

I use the below code to save model: from transformers import AutoModelForCausalLM, AutoTokenizer from transformers.generation import GenerationConfig

model_dir = '/data/model/Qwen-14B-Chat-hf-save'

name = 'Qwen/Qwen-14B-Chat' tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True) model = AutoModelForCausalLM.from_pretrained(name, device_map="auto", trust_remote_code=True, bf16=True).eval() tokenizer.save_pretrained(model_dir) model.save_pretrained(model_dir)

And run: docker run --gpus all --shm-size 1g -p 8080:80 -v /data/model/:/data/ \ ghcr.io/predibase/lorax:latest --model-id /data/Qwen-14B-Chat-hf-save

The error is: 2023-12-26T02:25:38.134822Z INFO lorax_launcher: Args { model_id: "/data/Qwen-14B-Chat-hf-save", adapter_id: "", source: "hub", adapter_source: "hub", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_active_adapters: 128, adapter_cycle_time_s: 2, hostname: "dd28579e51fc", port: 80, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false, download_only: false } 2023-12-26T02:25:38.134920Z INFO download: lorax_launcher: Starting download process. 2023-12-26T02:25:41.114273Z INFO lorax_launcher: cli.py:103 Files are already present on the host. Skipping download.

2023-12-26T02:25:41.537340Z INFO download: lorax_launcher: Successfully downloaded weights. 2023-12-26T02:25:41.537502Z INFO shard-manager: lorax_launcher: Starting shard rank=0 2023-12-26T02:25:44.896833Z ERROR lorax_launcher: server.py:233 Error when initializing model Traceback (most recent call last): File "/opt/conda/bin/lorax-server", line 8, in sys.exit(app()) File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in call return get_command(self)(*args, kwargs) File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in call return self.main(args, kwargs) File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main return _main( File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main rv = self.invoke(ctx) File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke return ctx.invoke(self.callback, ctx.params) File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke return __callback(args, kwargs) File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper return callback(*use_params) # type: ignore File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 84, in serve server.serve( File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 277, in serve asyncio.run( File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run return loop.run_until_complete(main) File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete self.run_forever() File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever self._run_once() File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once handle._run() File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run self._context.run(self._callback, self._args)

File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 229, in serve_inner model = get_model( File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/init.py", line 321, in get_model return FlashQwen( File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_qwen.py", line 54, in init tokenizer = AutoTokenizer.from_pretrained( File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 784, in from_pretrained raise ValueError( ValueError: Tokenizer class QWenTokenizer does not exist or is not currently imported.

2023-12-26T02:25:45.540361Z ERROR shard-manager: lorax_launcher: Shard complete standard error output:

Traceback (most recent call last):

File "/opt/conda/bin/lorax-server", line 8, in sys.exit(app())

File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 84, in serve server.serve(

File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 277, in serve asyncio.run(

File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run return loop.run_until_complete(main)

File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete return future.result()

File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 229, in serve_inner model = get_model(

File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/init.py", line 321, in get_model return FlashQwen(

File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_qwen.py", line 54, in init tokenizer = AutoTokenizer.from_pretrained(

File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 784, in from_pretrained raise ValueError(

ValueError: Tokenizer class QWenTokenizer does not exist or is not currently imported. rank=0 2023-12-26T02:25:45.639763Z ERROR lorax_launcher: Shard 0 failed to start 2023-12-26T02:25:45.639783Z INFO lorax_launcher: Shutting down shards Error: ShardCannotStart

The dir /data/model/Qwen-14B-Chat-hf-save: total 27673116 drwxr-xr-x 2 root root 4096 Dec 25 19:14 ./ drwxr-xr-x 3 root root 4096 Dec 26 10:25 ../ -rw-r--r-- 1 root root 1110 Dec 25 19:13 config.json -rw-r--r-- 1 root root 250 Dec 25 19:13 generation_config.json -rw-r--r-- 1 root root 4919444336 Dec 25 19:13 model-00001-of-00006.safetensors -rw-r--r-- 1 root root 4991627864 Dec 25 19:13 model-00002-of-00006.safetensors -rw-r--r-- 1 root root 4886749824 Dec 25 19:13 model-00003-of-00006.safetensors -rw-r--r-- 1 root root 4903809664 Dec 25 19:14 model-00004-of-00006.safetensors -rw-r--r-- 1 root root 4903820016 Dec 25 19:14 model-00005-of-00006.safetensors -rw-r--r-- 1 root root 3729165312 Dec 25 19:14 model-00006-of-00006.safetensors -rw-r--r-- 1 root root 24387 Dec 25 19:14 model.safetensors.index.json -rw-r--r-- 1 root root 2561218 Dec 25 19:13 qwen.tiktoken -rw-r--r-- 1 root root 3 Dec 25 19:13 special_tokens_map.json -rw-r--r-- 1 root root 261 Dec 25 19:13 tokenizer_config.json

Or support model from www.modelscope.cn like vllm？

Hey @yinjiaoyuan, apologies for the delayed response, I've been out on holiday this week.

To answer your first question, it looks like the current issue you're seeing is due to the model using a third party tokenizer (not natively part of HF):

ValueError: Tokenizer class QWenTokenizer does not exist or is not currently imported.

You can resolve this issue by running LoRAX with --trust-remote-code so it downloads the tokenizer from the model repo.

Let me know if that resolves your issue.

I'd be happy to add support for ModelScope, though I would need to explore it to understand the API a bit better (sounds like vllm would be a good reference for that). Contributions also welcome if you'd be interested in adding it.

But my linux server can't visit the site "https://huggingface.co/" in China.

Hey @yinjiaoyuan, you should be able to save the HF tokenizer to the same directory you saved the model weights to, which will allow LoRAX to read from the cache instead of going over the internet. For example, here are the contents of my local directory cache:

ls /data/models--Qwen--Qwen-7B/snapshots/ffe04dd57f85293043ba999a2c0daa788d6182e9/
config.json                       model-00003-of-00008.safetensors  model-00006-of-00008.safetensors  qwen.tiktoken
model-00001-of-00008.safetensors  model-00004-of-00008.safetensors  model-00007-of-00008.safetensors  tokenization_qwen.py
model-00002-of-00008.safetensors  model-00005-of-00008.safetensors  model-00008-of-00008.safetensors  tokenizer_config.json

You can see there's a file called tokenization_qwen.py that contains the tokenizer implementation. If you can write out this file locally, you should be able to avoid hitting the HF servers.

Hey @yinjiaoyuan, you should be able to save the HF tokenizer to the same directory you saved the model weights to, which will allow LoRAX to read from the cache instead of going over the internet. For example, here are the contents of my local directory cache:
ls /data/models--Qwen--Qwen-7B/snapshots/ffe04dd57f85293043ba999a2c0daa788d6182e9/
config.json                       model-00003-of-00008.safetensors  model-00006-of-00008.safetensors  qwen.tiktoken
model-00001-of-00008.safetensors  model-00004-of-00008.safetensors  model-00007-of-00008.safetensors  tokenization_qwen.py
model-00002-of-00008.safetensors  model-00005-of-00008.safetensors  model-00008-of-00008.safetensors  tokenizer_config.json
You can see there's a file called tokenization_qwen.py that contains the tokenizer implementation. If you can write out this file locally, you should be able to avoid hitting the HF servers.

Yes, I dowload the hf model on the foreign host and scp the model files to the host in the China, but I run lorax and input --model-id /xxx/yyy/data/models--Qwen--Qwen-7B-Chat/snapshots/218aa3240fd5a5d1e80bb6c47d5d774361913706/, /data/models--Qwen--Qwen-7B-Chat/snapshots/218aa3240fd5a5d1e80bb6c47d5d774361913706/ is the dir of the model fies, but NOT ok, the error info is: 2024-01-08T03:11:04.139756Z INFO lorax_launcher: Args { model_id: "/data/models--Qwen--Qwen-7B-Chat/snapshots/218aa3240fd5a5d1e80bb6c47d5d774361913706/", adapter_id: "", source: "hub", adapter_source: "hub", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_active_adapters: 128, adapter_cycle_time_s: 2, hostname: "f06b11000cff", port: 80, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false, download_only: false } 2024-01-08T03:11:04.139862Z INFO download: lorax_launcher: Starting download process. 2024-01-08T03:11:07.000076Z INFO lorax_launcher: cli.py:103 Files are already present on the host. Skipping download.

2024-01-08T03:11:07.442269Z INFO download: lorax_launcher: Successfully downloaded weights. 2024-01-08T03:11:07.442447Z INFO shard-manager: lorax_launcher: Starting shard rank=0 2024-01-08T03:11:11.104509Z ERROR lorax_launcher: server.py:233 Error when initializing model Traceback (most recent call last): File "/opt/conda/bin/lorax-server", line 8, in sys.exit(app()) File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in call return get_command(self)(*args, kwargs) File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in call return self.main(args, kwargs) File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main return _main( File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main rv = self.invoke(ctx) File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke return ctx.invoke(self.callback, ctx.params) File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke return __callback(args, kwargs) File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper return callback(*use_params) # type: ignore File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 84, in serve server.serve( File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 277, in serve asyncio.run( File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run return loop.run_until_complete(main) File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete self.run_forever() File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever self._run_once() File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once handle._run() File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run self._context.run(self._callback, self._args)

File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 229, in serve_inner model = get_model( File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/init.py", line 321, in get_model return FlashQwen( File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_qwen.py", line 54, in init tokenizer = AutoTokenizer.from_pretrained( File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 784, in from_pretrained raise ValueError( ValueError: Tokenizer class QWenTokenizer does not exist or is not currently imported.

2024-01-08T03:11:11.645631Z ERROR shard-manager: lorax_launcher: Shard complete standard error output: