Specify the local folder you have the model in instead of a HF model ID. If you have all the necessary files and the model is using a supported architecture, then it will work.
To serve vLLM API:
#!/bin/bash
set -e  # exit immediately if any check or command fails

MODEL_NAME="$1"
test -n "$MODEL_NAME"   # a model name must be given as the first argument
MODEL_DIR="$HOME/models/$MODEL_NAME"
test -d "$MODEL_DIR"    # the model directory must exist

python -O -u -m vllm.entrypoints.api_server \
    --host=127.0.0.1 \
    --port=8000 \
    --model="$MODEL_DIR" \
    --tokenizer=hf-internal-testing/llama-tokenizer
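Once the server is running you can query it over plain HTTP. A minimal sketch, assuming the host/port above and the /generate endpoint of this demo API server (the prompt and sampling values are placeholders):
import requests

payload = {
    "prompt": "San Francisco is a",   # placeholder prompt
    "max_tokens": 64,                 # forwarded to the sampling parameters
    "temperature": 0.8,
}
response = requests.post("http://127.0.0.1:8000/generate", json=payload)
response.raise_for_status()
print(response.json()["text"])        # list of generated completions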
Serve OpenAI API:
#!/bin/bash
set -e  # exit immediately if any check or command fails

MODEL_NAME="$1"
test -n "$MODEL_NAME"   # a model name must be given as the first argument
MODEL_DIR="$HOME/models/$MODEL_NAME"
test -d "$MODEL_DIR"    # the model directory must exist

python -O -u -m vllm.entrypoints.openai.api_server \
    --host=127.0.0.1 \
    --port=8000 \
    --model="$MODEL_DIR" \
    --tokenizer=hf-internal-testing/llama-tokenizer
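The OpenAI-compatible server speaks the usual completions API. A similar sketch, where the model field must match what was passed to --model (an assumed local path here):
import requests

payload = {
    "model": "/home/youruser/models/llama-2-70b-chat",  # assumed path, must match --model
    "prompt": "San Francisco is a",                     # placeholder prompt
    "max_tokens": 64,
}
response = requests.post("http://127.0.0.1:8000/v1/completions", json=payload)
response.raise_for_status()
print(response.json()["choices"][0]["text"])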
To run on multiple GPUs, add --tensor-parallel-size=N, where N is the number of GPUs.
The tokenizer above (--tokenizer=hf-internal-testing/llama-tokenizer) works at least for Llama 2 based models and results in faster startup time.
Additional parameters you may want to tune: --block-size and --swap-space.
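If you use the offline Python API instead of the server, the same knobs are available as keyword arguments of LLM. A minimal sketch, with placeholder values and an assumed model path:
from vllm import LLM

llm = LLM(
    model="/path/to/local/model",                     # local model directory (assumed path)
    tokenizer="hf-internal-testing/llama-tokenizer",  # optional, see the note above
    tensor_parallel_size=2,                           # number of GPUs
    block_size=16,                                    # KV cache block size
    swap_space=4,                                     # CPU swap space per GPU, in GiB
)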
How can I load locally saved models from within Python code? What parameter should I use to specify the local model here: LLM(model='model_name')?
Pass the absolute path of your model directory in the model parameter; that should work. I use it that way all the time. Yeah, the documentation is not too clear about it.
If you want to use a path relative to your home directory, then you can do this:
import os
from vllm import LLM
model_dir = os.path.expanduser('~/models/Some/Model')
llm = LLM(model=model_dir, ...)
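For completeness, a minimal sketch of generating from the constructed llm object (the prompt and sampling values are placeholders):
from vllm import SamplingParams

sampling_params = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(["Hello, my name is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)   # first completion for each prompt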
Is it possible to provide a "model dir" which contains a lot of pre-trained models, so that I can specify a model name to load from that "model dir"? vLLM's openai.api_server uses the model parameter as the model name.
No. You need to provide the path to the model directory with the actual model files (config.json, etc.) in it. It would be possible to provide a --model-base-dir option or something like that, but all vLLM would do is join the base path with the model ID, so it is not much value for the added complexity.
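If you really want name-based loading from a common directory, you can do that join yourself before handing the path to vLLM. A minimal sketch, assuming a hypothetical ~/models base directory:
import os
from vllm import LLM

MODEL_BASE_DIR = os.path.expanduser("~/models")  # hypothetical base directory

def load_local_model(model_name: str) -> LLM:
    # Join the base directory with the model name and check that it exists.
    model_dir = os.path.join(MODEL_BASE_DIR, model_name)
    if not os.path.isdir(model_dir):
        raise FileNotFoundError(f"No model directory at {model_dir}")
    return LLM(model=model_dir)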
@viktor-ferenczi We have a downloaded version of llama-2-70b.
I see an error: llama-2-70b-chat does not appear to have a file named config.json
(2022-07-01/vllm_conda_env) atanikanti@thetagpu02:/lus/grand/projects/datascience/atanikanti/vllm_service/vllm_serve$ ./serve.sh llama-2-70b-chat /eagle/datascience/venkatv/datasets/llama
Traceback (most recent call last):
File "/lus/grand/projects/datascience/atanikanti/envs/vllm_conda_env/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/lus/grand/projects/datascience/atanikanti/envs/vllm_conda_env/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/lus/grand/projects/datascience/atanikanti/envs/vllm_conda_env/lib/python3.9/site-packages/vllm/entrypoints/api_server.py", line 80, in <module>
engine = AsyncLLMEngine.from_engine_args(engine_args)
File "/lus/grand/projects/datascience/atanikanti/envs/vllm_conda_env/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 436, in from_engine_args
engine_configs = engine_args.create_engine_configs()
File "/lus/grand/projects/datascience/atanikanti/envs/vllm_conda_env/lib/python3.9/site-packages/vllm/engine/arg_utils.py", line 153, in create_engine_configs
model_config = ModelConfig(self.model, self.tokenizer,
File "/lus/grand/projects/datascience/atanikanti/envs/vllm_conda_env/lib/python3.9/site-packages/vllm/config.py", line 62, in __init__
self.hf_config = get_config(model, trust_remote_code)
File "/lus/grand/projects/datascience/atanikanti/envs/vllm_conda_env/lib/python3.9/site-packages/vllm/transformers_utils/config.py", line 17, in get_config
config = AutoConfig.from_pretrained(
File "/lus/grand/projects/datascience/atanikanti/envs/vllm_conda_env/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 1023, in from_pretrained
config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/lus/grand/projects/datascience/atanikanti/envs/vllm_conda_env/lib/python3.9/site-packages/transformers/configuration_utils.py", line 620, in get_config_dict
config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/lus/grand/projects/datascience/atanikanti/envs/vllm_conda_env/lib/python3.9/site-packages/transformers/configuration_utils.py", line 675, in _get_config_dict
resolved_config_file = cached_file(
File "/lus/grand/projects/datascience/atanikanti/envs/vllm_conda_env/lib/python3.9/site-packages/transformers/utils/hub.py", line 400, in cached_file
raise EnvironmentError(
OSError: /eagle/datascience/venkatv/datasets/llama/llama-2-70b-chat does not appear to have a file named config.json. Checkout 'https://huggingface.co//eagle/datascience/venkatv/datasets/llama/llama-2-70b-chat/None' for available files.
Does it have to be a Hugging Face model?
Yes, currently vLLM requires it to be a Hugging Face format model. Related code: https://github.com/vllm-project/vllm/blob/6b5296aa3ae632b8f2dcbc78579eb41b28e41068/vllm/transformers_utils/config.py#L30
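In practice the local directory has to look like a Hugging Face checkout: config.json, the tokenizer files and the weight shards. If the files came from Meta's original Llama 2 download (consolidated .pth checkpoints), there is no config.json, which matches the error above. A minimal sketch of a sanity check before pointing --model at a directory (the file names are the usual HF ones, nothing vLLM-specific):
import os

def looks_like_hf_model(model_dir: str) -> bool:
    files = os.listdir(model_dir)
    has_config = "config.json" in files
    # Weights may be sharded .bin or .safetensors files.
    has_weights = any(name.endswith((".bin", ".safetensors")) for name in files)
    return has_config and has_weights

model_dir = "/eagle/datascience/venkatv/datasets/llama/llama-2-70b-chat"
print(looks_like_hf_model(model_dir))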
Is it possible to load models saved locally (in the same format as the supported vLLM/HF model types)?
Specify the local folder you have the model in instead of a HF model ID. If you have all the necessary files and the model is using a supported architecture, then it will work.
I used a model (meta-llama) from huggingface_hub along with vLLM, e.g. llm = vllm.LLM(model=model_id, gpu_memory_utilization=0.25), and I am able to load it. But when I try to load the fine-tuned meta-llama model from my local repo, I always get a wrong_path error, even though the path is correct. How do I use vllm.LLM() with my locally available model?
I'm new to vLLM. It would be really helpful if someone could give a guide on how to use vLLM with a fine-tuned, locally stored model, since there is no guide for this. Thank you.
I see a prerequisite of uploading a trained transformer model to Hugging Face. Can we instead serve our pre-trained transformer models saved locally in a directory?