vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

vLLM to add a locally trained model #1131

Closed atanikan closed 8 months ago

atanikan commented 1 year ago

I see that a prerequisite is uploading a trained transformer model to Hugging Face. Can we instead serve our pre-trained transformer models saved locally in a directory?

viktor-ferenczi commented 1 year ago

Specify the local folder containing the model instead of an HF model ID. If all the necessary files are present and the model uses a supported architecture, it will work.

To serve the vLLM API:

#!/bin/bash
set -e  # abort if any of the checks below fails

MODEL_NAME="$1"
test -n "$MODEL_NAME"  # a model name must be passed as the first argument

MODEL_DIR="$HOME/models/$MODEL_NAME"
test -d "$MODEL_DIR"  # the model directory must exist

python -O -u -m vllm.entrypoints.api_server \
    --host=127.0.0.1 \
    --port=8000 \
    --model="$MODEL_DIR" \
    --tokenizer=hf-internal-testing/llama-tokenizer

To serve the OpenAI-compatible API:

#!/bin/bash
set -e  # abort if any of the checks below fails

MODEL_NAME="$1"
test -n "$MODEL_NAME"  # a model name must be passed as the first argument

MODEL_DIR="$HOME/models/$MODEL_NAME"
test -d "$MODEL_DIR"  # the model directory must exist

python -O -u -m vllm.entrypoints.openai.api_server \
    --host=127.0.0.1 \
    --port=8000 \
    --model="$MODEL_DIR" \
    --tokenizer=hf-internal-testing/llama-tokenizer

To run on multiple GPUs, add --tensor-parallel-size=N, where N is the number of GPUs.

The --tokenizer=hf-internal-testing/llama-tokenizer option above works at least for Llama 2 based models and results in a faster startup time.

Additional parameters you may want to tune: --block-size and --swap-space. A combined example is sketched below.
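The following is only an illustrative sketch that combines the options above; the model path, GPU count, and cache values are placeholders rather than recommended settings:

#!/bin/bash
# Sketch only: serve a local model on 4 GPUs with explicit cache settings.
# The model path and the numbers are placeholders; adjust them to your setup.
python -O -u -m vllm.entrypoints.openai.api_server \
    --host=127.0.0.1 \
    --port=8000 \
    --model="$HOME/models/llama-2-70b-chat-hf" \
    --tensor-parallel-size=4 \
    --block-size=16 \
    --swap-space=8 \
    --tokenizer=hf-internal-testing/llama-tokenizer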

blackhawk-17 commented 1 year ago

How can I add locally saved models from within Python code? What parameter should I use to specify the local model here: LLM(model='model_name')?

viktor-ferenczi commented 1 year ago

Pass the absolute path of your model directory in the model parameter; that should work. I use it that way all the time. Yeah, the documentation is not too clear about it.

If you want to use a path relative to your home directory, then you can do this:

import os
from vllm import LLM

model_dir = os.path.expanduser('~/models/Some/Model')
llm = LLM(model=model_dir, ...)
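For completeness, here is a minimal offline-inference sketch built around that; the model directory is a placeholder, and any local HF-format checkpoint of a supported architecture should work:

import os

from vllm import LLM, SamplingParams

# Placeholder: a local directory containing an HF-format checkpoint.
model_dir = os.path.expanduser('~/models/Some/Model')

llm = LLM(model=model_dir)
sampling_params = SamplingParams(temperature=0.8, max_tokens=128)

# Generate completions for a batch of prompts and print the first candidate.
outputs = llm.generate(["Hello, my name is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)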

Imccccc commented 1 year ago

Is it possible to provide a "model dir" which contains many pre-trained models, and then specify a model name to load from that "model dir"? The vLLM openai.api_server uses the model parameter as the model name.

viktor-ferenczi commented 1 year ago

No. You need to provide the path to the model directory with the actual model files (config.json, etc.) in it. It would be possible to add a --model-base-dir option or something like that, but all vLLM would do is join the base path with the model ID, so it is not much value for the added complexity.
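To make that concrete, the join is a one-liner on the caller's side anyway (a sketch; the base directory and model name are placeholders):

import os

from vllm import LLM

# Hypothetical layout: all local checkpoints live under one base directory.
base_dir = os.path.expanduser('~/models')
model_name = 'Some/Model'

# This join is essentially all a --model-base-dir option would do internally.
llm = LLM(model=os.path.join(base_dir, model_name))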

atanikan commented 1 year ago

@viktor-ferenczi We have a downloaded version of llama-2-70b.

I see an error: llama-2-70b-chat does not appear to have a file named config.json

(2022-07-01/vllm_conda_env) atanikanti@thetagpu02:/lus/grand/projects/datascience/atanikanti/vllm_service/vllm_serve$ ./serve.sh llama-2-70b-chat /eagle/datascience/venkatv/datasets/llama
Traceback (most recent call last):
  File "/lus/grand/projects/datascience/atanikanti/envs/vllm_conda_env/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/lus/grand/projects/datascience/atanikanti/envs/vllm_conda_env/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/lus/grand/projects/datascience/atanikanti/envs/vllm_conda_env/lib/python3.9/site-packages/vllm/entrypoints/api_server.py", line 80, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/lus/grand/projects/datascience/atanikanti/envs/vllm_conda_env/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 436, in from_engine_args
    engine_configs = engine_args.create_engine_configs()
  File "/lus/grand/projects/datascience/atanikanti/envs/vllm_conda_env/lib/python3.9/site-packages/vllm/engine/arg_utils.py", line 153, in create_engine_configs
    model_config = ModelConfig(self.model, self.tokenizer,
  File "/lus/grand/projects/datascience/atanikanti/envs/vllm_conda_env/lib/python3.9/site-packages/vllm/config.py", line 62, in __init__
    self.hf_config = get_config(model, trust_remote_code)
  File "/lus/grand/projects/datascience/atanikanti/envs/vllm_conda_env/lib/python3.9/site-packages/vllm/transformers_utils/config.py", line 17, in get_config
    config = AutoConfig.from_pretrained(
  File "/lus/grand/projects/datascience/atanikanti/envs/vllm_conda_env/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 1023, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/lus/grand/projects/datascience/atanikanti/envs/vllm_conda_env/lib/python3.9/site-packages/transformers/configuration_utils.py", line 620, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/lus/grand/projects/datascience/atanikanti/envs/vllm_conda_env/lib/python3.9/site-packages/transformers/configuration_utils.py", line 675, in _get_config_dict
    resolved_config_file = cached_file(
  File "/lus/grand/projects/datascience/atanikanti/envs/vllm_conda_env/lib/python3.9/site-packages/transformers/utils/hub.py", line 400, in cached_file
    raise EnvironmentError(
OSError: /eagle/datascience/venkatv/datasets/llama/llama-2-70b-chat does not appear to have a file named config.json. Checkout 'https://huggingface.co//eagle/datascience/venkatv/datasets/llama/llama-2-70b-chat/None' for available files.

Does it have to be a Hugging Face model?

LiuXiaoxuanPKU commented 1 year ago

@viktor-ferenczi We have a downloaded version of llama-2-70b. [...] Does it have to be a Hugging Face model?

Yes, currently vLLM requires the model to be in the HF format. Related code: https://github.com/vllm-project/vllm/blob/6b5296aa3ae632b8f2dcbc78579eb41b28e41068/vllm/transformers_utils/config.py#L30
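In practice that means the path passed to --model (or LLM(model=...)) must point at a directory laid out like a Hugging Face checkpoint, i.e. containing config.json alongside the tokenizer and weight files. A quick sanity check along those lines can catch the problem before hitting the error above (a sketch; the path is a placeholder):

import os

model_dir = '/path/to/llama-2-70b-chat'  # placeholder path

# An HF-format checkpoint directory contains config.json plus tokenizer and
# weight files; the original Meta llama-2 release (consolidated.*.pth) does
# not, and has to be converted to the HF format before vLLM can load it.
if not os.path.isfile(os.path.join(model_dir, 'config.json')):
    raise FileNotFoundError(
        f"{model_dir} does not look like an HF-format checkpoint "
        "(missing config.json)")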

Suvralipi commented 5 months ago

Is it possible to load models saved locally (in the same format as the supported vLLM/HF model types)?

preethiisenthil commented 1 month ago

Specify the local folder you have the model in instead of a HF model ID. If you have all the necessary files and the model is using a supported architecture, then it will work. [...]

I used a meta-llama model from huggingface_hub with vLLM, e.g. llm = vllm.LLM(model=model_id, gpu_memory_utilization=0.25), and I am able to load it. But when I try to load the fine-tuned meta-llama model from my local repo, I always get a wrong-path error even though the path is correct. How do I load vllm.LLM() for my locally available model?

preethiisenthil commented 1 month ago

I'm new to vLLM; it would be really helpful if someone could provide a guide on how to use vLLM with a fine-tuned, locally stored model, since there is no guide for this. Thank you.
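For reference, a minimal sketch of the usual approach, assuming the fine-tuned checkpoint was saved in HF format (for example via save_pretrained) into a local directory; the path and settings are placeholders:

import os

from vllm import LLM, SamplingParams

# Placeholder: the directory produced by save_pretrained(), containing
# config.json, the tokenizer files, and the model weights.
model_dir = os.path.abspath(os.path.expanduser('~/models/my-finetuned-llama'))

# Passing an absolute path avoids most "wrong path" surprises.
assert os.path.isfile(os.path.join(model_dir, 'config.json')), (
    f"{model_dir} is missing config.json; vLLM expects an HF-format checkpoint")

llm = LLM(model=model_dir, gpu_memory_utilization=0.25)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)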