v1.7.0 with vLLM 0.6.3 now available under stable tags

Update v1.7.0 is now available; use the image tag runpod/worker-v1-vllm:v1.7.0stable-cuda12.1.0.
Deploy your own OpenAI-compatible Serverless Endpoint on RunPod with multiple embedding models and fast inference for RAG and more!
Worker vLLM is now cached on all RunPod machines, resulting in near-instant deployment! Previously, downloading and extracting the image took 3-5 minutes on average.
[!NOTE] You can now deploy from the dedicated UI on the RunPod console with all of the settings and choices listed. Try it now from the Explore or Serverless pages on the RunPod console!
We now offer a pre-built Docker Image for the vLLM Worker that you can configure entirely with Environment Variables when creating the RunPod Serverless Endpoint:
Below is a summary of the available RunPod Worker images, categorized by image stability and CUDA version compatibility.
| CUDA Version | Stable Image Tag | Development Image Tag | Note |
|---|---|---|---|
| 12.1.0 | runpod/worker-v1-vllm:v1.6.0stable-cuda12.1.0 | runpod/worker-v1-vllm:v1.6.0dev-cuda12.1.0 | When creating an Endpoint, select CUDA Versions 12.3, 12.2, and 12.1 in the filter. |
Note: 0 is equivalent to False and 1 is equivalent to True for boolean-as-int values.
| Name | Default | Type/Choices | Description |
|---|---|---|---|
| MODEL_NAME | 'facebook/opt-125m' | str | Name or path of the Hugging Face model to use. |
| TOKENIZER | None | str | Name or path of the Hugging Face tokenizer to use. |
| SKIP_TOKENIZER_INIT | False | bool | Skip initialization of tokenizer and detokenizer. |
| TOKENIZER_MODE | 'auto' | ['auto', 'slow'] | The tokenizer mode. |
| TRUST_REMOTE_CODE | False | bool | Trust remote code from Hugging Face. |
| DOWNLOAD_DIR | None | str | Directory to download and load the weights. |
| LOAD_FORMAT | 'auto' | str | The format of the model weights to load. |
| HF_TOKEN | - | str | Hugging Face token for private and gated models. |
| DTYPE | 'auto' | ['auto', 'half', 'float16', 'bfloat16', 'float', 'float32'] | Data type for model weights and activations. |
| KV_CACHE_DTYPE | 'auto' | ['auto', 'fp8'] | Data type for KV cache storage. |
| QUANTIZATION_PARAM_PATH | None | str | Path to the JSON file containing the KV cache scaling factors. |
| MAX_MODEL_LEN | None | int | Model context length. |
| GUIDED_DECODING_BACKEND | 'outlines' | ['outlines', 'lm-format-enforcer'] | Which engine will be used for guided decoding by default. |
| DISTRIBUTED_EXECUTOR_BACKEND | None | ['ray', 'mp'] | Backend to use for distributed serving. |
| WORKER_USE_RAY | False | bool | Deprecated, use --distributed-executor-backend=ray. |
| PIPELINE_PARALLEL_SIZE | 1 | int | Number of pipeline stages. |
| TENSOR_PARALLEL_SIZE | 1 | int | Number of tensor parallel replicas. |
| MAX_PARALLEL_LOADING_WORKERS | None | int | Load model sequentially in multiple batches. |
| RAY_WORKERS_USE_NSIGHT | False | bool | If specified, use nsight to profile Ray workers. |
| ENABLE_PREFIX_CACHING | False | bool | Enables automatic prefix caching. |
| DISABLE_SLIDING_WINDOW | False | bool | Disables sliding window, capping to sliding window size. |
| USE_V2_BLOCK_MANAGER | False | bool | Use BlockSpaceManagerV2. |
| NUM_LOOKAHEAD_SLOTS | 0 | int | Experimental scheduling config necessary for speculative decoding. |
| SEED | 0 | int | Random seed for operations. |
| NUM_GPU_BLOCKS_OVERRIDE | None | int | If specified, ignore GPU profiling result and use this number of GPU blocks. |
| MAX_NUM_BATCHED_TOKENS | None | int | Maximum number of batched tokens per iteration. |
| MAX_NUM_SEQS | 256 | int | Maximum number of sequences per iteration. |
| MAX_LOGPROBS | 20 | int | Max number of log probs to return when logprobs is specified in SamplingParams. |
| DISABLE_LOG_STATS | False | bool | Disable logging statistics. |
| QUANTIZATION | None | ['awq', 'squeezellm', 'gptq'] | Method used to quantize the weights. |
| ROPE_SCALING | None | dict | RoPE scaling configuration in JSON format. |
| ROPE_THETA | None | float | RoPE theta. Use with rope_scaling. |
| TOKENIZER_POOL_SIZE | 0 | int | Size of tokenizer pool to use for asynchronous tokenization. |
| TOKENIZER_POOL_TYPE | 'ray' | str | Type of tokenizer pool to use for asynchronous tokenization. |
| TOKENIZER_POOL_EXTRA_CONFIG | None | dict | Extra config for tokenizer pool. |
| ENABLE_LORA | False | bool | If True, enable handling of LoRA adapters. |
| MAX_LORAS | 1 | int | Max number of LoRAs in a single batch. |
| MAX_LORA_RANK | 16 | int | Max LoRA rank. |
| LORA_EXTRA_VOCAB_SIZE | 256 | int | Maximum size of extra vocabulary for LoRA adapters. |
| LORA_DTYPE | 'auto' | ['auto', 'float16', 'bfloat16', 'float32'] | Data type for LoRA. |
| LONG_LORA_SCALING_FACTORS | None | tuple | Specify multiple scaling factors for LoRA adapters. |
| MAX_CPU_LORAS | None | int | Maximum number of LoRAs to store in CPU memory. |
| FULLY_SHARDED_LORAS | False | bool | Enable fully sharded LoRA layers. |
| SCHEDULER_DELAY_FACTOR | 0.0 | float | Apply a delay before scheduling next prompt. |
| ENABLE_CHUNKED_PREFILL | False | bool | Enable chunked prefill requests. |
| SPECULATIVE_MODEL | None | str | The name of the draft model to be used in speculative decoding. |
| NUM_SPECULATIVE_TOKENS | None | int | The number of speculative tokens to sample from the draft model. |
| SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE | None | int | Number of tensor parallel replicas for the draft model. |
| SPECULATIVE_MAX_MODEL_LEN | None | int | The maximum sequence length supported by the draft model. |
| SPECULATIVE_DISABLE_BY_BATCH_SIZE | None | int | Disable speculative decoding if the number of enqueued requests is larger than this value. |
| NGRAM_PROMPT_LOOKUP_MAX | None | int | Max size of window for ngram prompt lookup in speculative decoding. |
| NGRAM_PROMPT_LOOKUP_MIN | None | int | Min size of window for ngram prompt lookup in speculative decoding. |
| SPEC_DECODING_ACCEPTANCE_METHOD | 'rejection_sampler' | ['rejection_sampler', 'typical_acceptance_sampler'] | Specify the acceptance method for draft token verification in speculative decoding. |
| TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_THRESHOLD | None | float | Set the lower bound threshold for the posterior probability of a token to be accepted. |
| TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_ALPHA | None | float | A scaling factor for the entropy-based threshold for token acceptance. |
| MODEL_LOADER_EXTRA_CONFIG | None | dict | Extra config for model loader. |
| PREEMPTION_MODE | None | str | If 'recompute', the engine performs preemption-aware recomputation. If 'save', the engine saves activations into CPU memory as preemption happens. |
| PREEMPTION_CHECK_PERIOD | 1.0 | float | How frequently the engine checks if a preemption happens. |
| PREEMPTION_CPU_CAPACITY | 2 | float | The percentage of CPU memory used for the saved activations. |
| DISABLE_LOGGING_REQUEST | False | bool | Disable logging requests. |
| MAX_LOG_LEN | None | int | Max number of prompt characters or prompt ID numbers being printed in log. |
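To illustrate the boolean-as-int convention noted above, here is a minimal, hypothetical sketch (not the worker's actual code) of how such environment variables could be coerced into typed values:

```python
# Illustrative only: coerce a few of the environment variables above into
# typed Python values, treating "0"/"false" and "1"/"true" as booleans.
import os

def env_bool(name: str, default: bool = False) -> bool:
    return os.environ.get(name, str(int(default))).lower() in ("1", "true")

def env_int(name: str, default=None):
    value = os.environ.get(name)
    return int(value) if value is not None else default

trust_remote_code = env_bool("TRUST_REMOTE_CODE")    # False unless set to 1/true
max_model_len = env_int("MAX_MODEL_LEN")             # None unless set
max_num_seqs = env_int("MAX_NUM_SEQS", default=256)  # 256 by default
```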
Tokenizer Settings
| Name | Default | Type/Choices | Description |
|---|---|---|---|
| TOKENIZER_NAME | None | str | Tokenizer repository to use a different tokenizer than the model's default. |
| TOKENIZER_REVISION | None | str | Tokenizer revision to load. |
| CUSTOM_CHAT_TEMPLATE | None | str of single-line jinja template | Custom chat jinja template. More Info |
System, GPU, and Tensor Parallelism (Multi-GPU) Settings
| Name | Default | Type/Choices | Description |
|---|---|---|---|
| GPU_MEMORY_UTILIZATION | 0.95 | float | Sets GPU VRAM utilization. |
| MAX_PARALLEL_LOADING_WORKERS | None | int | Load model sequentially in multiple batches, to avoid RAM OOM when using tensor parallelism with large models. |
| BLOCK_SIZE | 16 | 8, 16, 32 | Token block size for contiguous chunks of tokens. |
| SWAP_SPACE | 4 | int | CPU swap space size (GiB) per GPU. |
| ENFORCE_EAGER | False | bool | Always use eager-mode PyTorch. If False (0), uses eager mode and CUDA graphs in hybrid for maximal performance and flexibility. |
| MAX_SEQ_LEN_TO_CAPTURE | 8192 | int | Maximum context length covered by CUDA graphs. When a sequence has a context length larger than this, we fall back to eager mode. |
| DISABLE_CUSTOM_ALL_REDUCE | 0 | int | Enables or disables custom all-reduce. |
Streaming Batch Size Settings
| Name | Default | Type/Choices | Description |
|---|---|---|---|
| DEFAULT_BATCH_SIZE | 50 | int | Default and maximum batch size for token streaming to reduce HTTP calls. |
| DEFAULT_MIN_BATCH_SIZE | 1 | int | Batch size for the first request, which will be multiplied by the growth factor every subsequent request. |
| DEFAULT_BATCH_SIZE_GROWTH_FACTOR | 3 | float | Growth factor for dynamic batch size. |
The first request will have a batch size of DEFAULT_MIN_BATCH_SIZE, and each subsequent request will have a batch size of previous_batch_size * DEFAULT_BATCH_SIZE_GROWTH_FACTOR. This continues until the batch size reaches DEFAULT_BATCH_SIZE. For example, with the default values, the batch sizes will be 1, 3, 9, 27, 50, 50, 50, and so on. You can also specify this per request with the inputs max_batch_size, min_batch_size, and batch_size_growth_factor. This has nothing to do with vLLM's internal batching; it only controls the number of tokens sent in each HTTP request from the worker.
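As a quick illustration of the growth schedule described above (a sketch, not the worker's actual implementation):

```python
# Reproduce the streaming batch size schedule: start at DEFAULT_MIN_BATCH_SIZE,
# multiply by DEFAULT_BATCH_SIZE_GROWTH_FACTOR each step, cap at DEFAULT_BATCH_SIZE.
def batch_sizes(min_batch_size=1, growth_factor=3.0, max_batch_size=50, steps=8):
    size = float(min_batch_size)
    schedule = []
    for _ in range(steps):
        schedule.append(min(int(size), max_batch_size))
        size *= growth_factor
    return schedule

print(batch_sizes())  # [1, 3, 9, 27, 50, 50, 50, 50]
```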
OpenAI Settings
| Name | Default | Type/Choices | Description |
|---|---|---|---|
| RAW_OPENAI_OUTPUT | 1 | boolean as int | Enables raw OpenAI SSE format string output when streaming. Required to be enabled (which it is by default) for OpenAI compatibility. |
| OPENAI_SERVED_MODEL_NAME_OVERRIDE | None | str | Overrides the name of the served model from the model repo/path to the specified name, which you can then use as the value of the model parameter when making OpenAI requests. |
| OPENAI_RESPONSE_ROLE | assistant | str | Role of the LLM's response in OpenAI Chat Completions. |
Serverless Settings
| Name | Default | Type/Choices | Description |
|---|---|---|---|
| MAX_CONCURRENCY | 300 | int | Max concurrent requests per worker. vLLM has an internal queue, so you don't have to worry about limiting by VRAM; this setting is for improving scaling/load-balancing efficiency. |
| DISABLE_LOG_STATS | False | bool | Enables or disables vLLM stats logging. |
| DISABLE_LOG_REQUESTS | False | bool | Enables or disables vLLM request logging. |
[!TIP] If you are facing issues when using Mixtral 8x7B, quantized models, or handling unusual models/architectures, try setting TRUST_REMOTE_CODE to 1.
To build an image with the model baked in, you must specify the following docker arguments when building the image:

- MODEL_NAME
- MODEL_REVISION: Model revision to load (default: main).
- BASE_PATH: Storage directory where the Hugging Face cache and model will be located (default: /runpod-volume, which will utilize network storage if you attach it, or create a local directory within the image if you don't). If your intention is to bake the model into the image, you should set this to something like /models to make sure there are no issues if you were to accidentally attach network storage.
- QUANTIZATION
- WORKER_CUDA_VERSION: 12.1.0 (12.1.0 is recommended for optimal performance).
- TOKENIZER_NAME: Tokenizer repository if you would like to use a different tokenizer than the one that comes with the model (default: None, which uses the model's tokenizer).
- TOKENIZER_REVISION: Tokenizer revision to load (default: main).

For the remaining settings, you may apply them as environment variables when running the container. Supported environment variables are listed in the Environment Variables section.

```bash
sudo docker build -t username/image:tag --build-arg MODEL_NAME="openchat/openchat_3.5" --build-arg BASE_PATH="/models" .
```
If the model you would like to deploy is private or gated, you will need to include it during build time as a Docker secret, which will protect it from being exposed in the image and on DockerHub.
```bash
export DOCKER_BUILDKIT=1
export HF_TOKEN="your_token_here"
docker build -t username/image:tag --secret id=HF_TOKEN --build-arg MODEL_NAME="openchat/openchat_3.5" .
```
Below are all supported model architectures (and examples of each) that you can deploy using the vLLM Worker. You can deploy any model on HuggingFace, as long as its base architecture is one of the following:
- Aquila and Aquila2 (BAAI/AquilaChat2-7B, BAAI/AquilaChat2-34B, BAAI/Aquila-7B, BAAI/AquilaChat-7B, etc.)
- Baichuan and Baichuan2 (baichuan-inc/Baichuan2-13B-Chat, baichuan-inc/Baichuan-7B, etc.)
- BLOOM (bigscience/bloom, bigscience/bloomz, etc.)
- ChatGLM (THUDM/chatglm2-6b, THUDM/chatglm3-6b, etc.)
- Command-R (CohereForAI/c4ai-command-r-v01, etc.)
- DBRX (databricks/dbrx-base, databricks/dbrx-instruct, etc.)
- DeciLM (Deci/DeciLM-7B, Deci/DeciLM-7B-instruct, etc.)
- Falcon (tiiuae/falcon-7b, tiiuae/falcon-40b, tiiuae/falcon-rw-7b, etc.)
- Gemma (google/gemma-2b, google/gemma-7b, etc.)
- GPT-2 (gpt2, gpt2-xl, etc.)
- GPT BigCode (bigcode/starcoder, bigcode/gpt_bigcode-santacoder, etc.)
- GPT-J (EleutherAI/gpt-j-6b, nomic-ai/gpt4all-j, etc.)
- GPT-NeoX (EleutherAI/gpt-neox-20b, databricks/dolly-v2-12b, stabilityai/stablelm-tuned-alpha-7b, etc.)
- InternLM (internlm/internlm-7b, internlm/internlm-chat-7b, etc.)
- InternLM2 (internlm/internlm2-7b, internlm/internlm2-chat-7b, etc.)
- Jais (core42/jais-13b, core42/jais-13b-chat, core42/jais-30b-v3, core42/jais-30b-chat-v3, etc.)
- LLaMA, Llama 2, and Llama 3 (meta-llama/Meta-Llama-3-8B-Instruct, meta-llama/Meta-Llama-3-70B-Instruct, meta-llama/Llama-2-70b-hf, lmsys/vicuna-13b-v1.3, young-geng/koala, openlm-research/open_llama_13b, etc.)
- MiniCPM (openbmb/MiniCPM-2B-sft-bf16, openbmb/MiniCPM-2B-dpo-bf16, etc.)
- Mistral (mistralai/Mistral-7B-v0.1, mistralai/Mistral-7B-Instruct-v0.1, etc.)
- Mixtral (mistralai/Mixtral-8x7B-v0.1, mistralai/Mixtral-8x7B-Instruct-v0.1, mistral-community/Mixtral-8x22B-v0.1, etc.)
- MPT (mosaicml/mpt-7b, mosaicml/mpt-30b, etc.)
- OLMo (allenai/OLMo-1B-hf, allenai/OLMo-7B-hf, etc.)
- OPT (facebook/opt-66b, facebook/opt-iml-max-30b, etc.)
- Orion (OrionStarAI/Orion-14B-Base, OrionStarAI/Orion-14B-Chat, etc.)
- Phi (microsoft/phi-1_5, microsoft/phi-2, etc.)
- Phi-3 (microsoft/Phi-3-mini-4k-instruct, microsoft/Phi-3-mini-128k-instruct, etc.)
- Qwen (Qwen/Qwen-7B, Qwen/Qwen-7B-Chat, etc.)
- Qwen1.5 (Qwen/Qwen1.5-7B, Qwen/Qwen1.5-7B-Chat, etc.)
- Qwen1.5-MoE (Qwen/Qwen1.5-MoE-A2.7B, Qwen/Qwen1.5-MoE-A2.7B-Chat, etc.)
- StableLM (stabilityai/stablelm-3b-4e1t, stabilityai/stablelm-base-alpha-7b-v2, etc.)
- Starcoder2 (bigcode/starcoder2-3b, bigcode/starcoder2-7b, bigcode/starcoder2-15b, etc.)
- XVERSE (xverse/XVERSE-7B-Chat, xverse/XVERSE-13B-Chat, xverse/XVERSE-65B-Chat, etc.)
- Yi (01-ai/Yi-6B, 01-ai/Yi-34B, etc.)

The vLLM Worker is fully compatible with OpenAI's API, and you can use it with any OpenAI codebase by changing only 3 lines in total. The supported routes are Chat Completions and Models, with both streaming and non-streaming.
Python (similar to Node.js, etc.):
When initializing the OpenAI Client in your code, change the api_key to your RunPod API Key and the base_url to your RunPod Serverless Endpoint URL in the following format: https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1, filling in your deployed endpoint ID. For example, if your Endpoint ID is abc1234, the URL would be https://api.runpod.ai/v2/abc1234/openai/v1.
- Before:
```python
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
```
- After:
```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("RUNPOD_API_KEY"),
    base_url="https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1",
)
```
Then, change the model parameter to your deployed model's name whenever using Completions or Chat Completions:
- Before:
```python
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Why is RunPod the best platform?"}],
    temperature=0,
    max_tokens=100,
)
```
- After:
```python
response = client.chat.completions.create(
    model="<YOUR DEPLOYED MODEL REPO/NAME>",
    messages=[{"role": "user", "content": "Why is RunPod the best platform?"}],
    temperature=0,
    max_tokens=100,
)
```
Using HTTP requests:
Change the Authorization header to your RunPod API Key and the URL to your RunPod Serverless Endpoint URL in the following format: https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1
- Before:
```bash
curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-4",
    "messages": [
      {
        "role": "user",
        "content": "Why is RunPod the best platform?"
      }
    ],
    "temperature": 0,
    "max_tokens": 100
  }'
```
- After:
```bash
curl https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <YOUR RUNPOD API KEY>" \
  -d '{
    "model": "<YOUR DEPLOYED MODEL REPO/NAME>",
    "messages": [
      {
        "role": "user",
        "content": "Why is RunPod the best platform?"
      }
    ],
    "temperature": 0,
    "max_tokens": 100
  }'
```
When using the chat completion feature of the vLLM Serverless Endpoint Worker, you can customize your requests with the following parameters:
First, initialize the OpenAI Client with your RunPod API Key and Endpoint URL:
```python
import os

from openai import OpenAI

# Initialize the OpenAI Client with your RunPod API Key and Endpoint URL
client = OpenAI(
    api_key=os.environ.get("RUNPOD_API_KEY"),
    base_url="https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1",
)
```
This is the format used by GPT-4, focused on instruction-following and chat. Examples of open-source chat/instruct models include meta-llama/Llama-2-7b-chat-hf, mistralai/Mixtral-8x7B-Instruct-v0.1, openchat/openchat-3.5-0106, NousResearch/Nous-Hermes-2-Mistral-7B-DPO, and more. However, if your model is a completion-style model with no chat/instruct fine-tune and/or does not have a chat template, you can still use this route if you provide a chat template with the environment variable CUSTOM_CHAT_TEMPLATE.
```python
# Create a chat completion stream
response_stream = client.chat.completions.create(
    model="<YOUR DEPLOYED MODEL REPO/NAME>",
    messages=[{"role": "user", "content": "Why is RunPod the best platform?"}],
    temperature=0,
    max_tokens=100,
    stream=True,
)
# Stream the response
for chunk in response_stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```
```python
# Create a chat completion
response = client.chat.completions.create(
    model="<YOUR DEPLOYED MODEL REPO/NAME>",
    messages=[{"role": "user", "content": "Why is RunPod the best platform?"}],
    temperature=0,
    max_tokens=100,
)
# Print the response
print(response.choices[0].message.content)
```
In the case of baking the model into the image, sometimes the repo may not be accepted as the model in the request. In this case, you can list the available models as shown below and use that name.

```python
models_response = client.models.list()
list_of_models = [model.id for model in models_response]
print(list_of_models)
```
Below are all available sampling parameters that you can specify in the sampling_params dictionary. If you do not specify any of these parameters, the default values will be used.
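For illustration, here is a hedged sketch of passing a sampling_params dictionary in a native (non-OpenAI) request to the endpoint, assuming the worker accepts an input payload containing a prompt and sampling_params; adjust the field names if your deployment differs:

```python
# Hypothetical example of a native RunPod request with sampling parameters.
import os
import requests

endpoint_id = "<YOUR ENDPOINT ID>"
url = f"https://api.runpod.ai/v2/{endpoint_id}/runsync"

payload = {
    "input": {
        "prompt": "Why is RunPod the best platform?",
        "sampling_params": {  # keys assumed to mirror vLLM's SamplingParams
            "temperature": 0,
            "max_tokens": 100,
        },
    }
}

response = requests.post(
    url,
    json=payload,
    headers={"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"},
    timeout=300,
)
print(response.json())
```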
The worker config is a JSON file that is used to build the form that helps users configure their serverless endpoint on the RunPod Web Interface.
Note: This is a new feature and only works for workers that use one model.
The JSON consists of two main parts: schema and versions.
- schema: Here you specify the form fields that will be displayed to the user.
  - env_var_name: The name of the environment variable that is being set using the form field.
  - value: This is the default value of the form field. It will be shown in the UI as such unless the user changes it.
  - title: This is the title of the form field in the UI.
  - description: This is the description of the form field in the UI.
  - required: This is a boolean that specifies if the form field is required.
  - type: This is the type of the form field. Options are:
    - text: Environment variable is a string, so the user inputs text in the form field.
    - select: User selects one option from the dropdown. You must provide the options key-value pair after type if using this.
    - toggle: User toggles between true and false.
    - number: User inputs a number in the form field.
  - options: Specify the options the user can select from if the type is select. DO NOT include this unless the type is select.
- versions: This is where you call the form fields specified in schema and organize them into categories.
  - imageName: This is the name of the Docker image that will be used to run the serverless endpoint.
  - minimumCudaVersion: This is the minimum CUDA version that is required to run the serverless endpoint.
  - categories: This is where you call the keys of the form fields specified in schema and organize them into categories. Each category is a toggle list of forms on the Web UI.
    - title: This is the title of the category in the UI.
    - settings: This is the array of settings schemas specified in schema associated with the category.
"schema": {
"TOKENIZER": {
"env_var_name": "TOKENIZER",
"value": "",
"title": "Tokenizer",
"description": "Name or path of the Hugging Face tokenizer to use.",
"required": false,
"type": "text"
},
"TOKENIZER_MODE": {
"env_var_name": "TOKENIZER_MODE",
"value": "auto",
"title": "Tokenizer Mode",
"description": "The tokenizer mode.",
"required": false,
"type": "select",
"options": [
{ "value": "auto", "label": "auto" },
{ "value": "slow", "label": "slow" }
]
},
...
}
}
```json
{
  "versions": {
    "0.5.4": {
      "imageName": "runpod/worker-v1-vllm:v1.2.0stable-cuda12.1.0",
      "minimumCudaVersion": "12.1",
      "categories": [
        {
          "title": "LLM Settings",
          "settings": [
            "TOKENIZER", "TOKENIZER_MODE", "OTHER_SETTINGS_SCHEMA_KEYS_YOU_HAVE_SPECIFIED_0", ...
          ]
        },
        {
          "title": "Tokenizer Settings",
          "settings": [
            "OTHER_SETTINGS_SCHEMA_KEYS_0", "OTHER_SETTINGS_SCHEMA_KEYS_1", ...
          ]
        },
        ...
      ]
    }
  }
}
```
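As a hedged sanity check (not part of the worker), the sketch below cross-references the two parts of the config: every key listed under versions -> categories -> settings should exist in the top-level schema section, assuming both parts live in a single worker config JSON file as described above:

```python
# Hypothetical helper: verify that every setting referenced in "versions"
# is defined in the "schema" section of the worker config.
import json

def check_worker_config(path: str) -> None:
    with open(path) as f:
        config = json.load(f)

    schema_keys = set(config.get("schema", {}))
    for version, spec in config.get("versions", {}).items():
        for category in spec.get("categories", []):
            for setting in category.get("settings", []):
                if setting not in schema_keys:
                    print(f"{version} / {category['title']}: unknown setting {setting!r}")

check_worker_config("worker-config.json")  # assumed file name
```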