vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Allow passing hf config args with openai server #2547

Open Aakash-kaushik opened 8 months ago

Aakash-kaushik commented 8 months ago

Hi,

Is there a specific reason why we can't allow passing args from the OpenAI server to the HF config class? There are very reasonable use cases where I would want to override the existing args in a config while running the model dynamically through the server.

reference line

Simply allowing *args in the OpenAI server that get passed through while loading the model should be enough; I believe there are internal checks that fail if anything is configured incorrectly anyway.

Supporting documentation from the transformers library:

>>> # Change some config attributes when loading a pretrained config.
>>> config = AutoConfig.from_pretrained("bert-base-uncased", output_attentions=True, foo=False)
>>> config.output_attentions
True
simon-mo commented 8 months ago

I believe there's no fundamental reason for this. Contributions welcome! I would say you can add this to the ModelConfig class and pass it through EngineArgs.
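
For anyone picking this up, a rough sketch of what that plumbing could look like. This is only a sketch under assumptions: the hf_config_overrides field and the method name are hypothetical, not the actual vLLM API.

from dataclasses import dataclass
from typing import Any, Dict, Optional

from transformers import AutoConfig

@dataclass
class ModelConfig:
    model: str
    # Hypothetical field: extra kwargs forwarded to AutoConfig.from_pretrained
    hf_config_overrides: Optional[Dict[str, Any]] = None

    def load_hf_config(self):
        overrides = self.hf_config_overrides or {}
        # transformers validates known attributes and attaches unknown ones
        # to the config object, so misconfigurations surface at load time
        return AutoConfig.from_pretrained(self.model, **overrides)

EngineArgs would then expose this as a CLI flag (e.g. a JSON string) and forward it into ModelConfig.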

KrishnaM251 commented 7 months ago

I will take a look at this

mrPsycox commented 7 months ago

Does anyone have news about this? I want to use --dtype, but it doesn't work.

Aakash-kaushik commented 7 months ago

@mrPsycox --dtype is supported in vLLM; please take a look at the engine args in the vLLM docs.

mrPsycox commented 7 months ago

Thanks @Aakash-kaushik, I found the issue. --dtype needs to be passed among the first args of the command, not at the end.

This works for me:

 run: |
   conda activate vllm
   python -m vllm.entrypoints.openai.api_server \
     --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
     --dtype half \
     --host 0.0.0.0 --port 8080 \
     --model <model_name>
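
For reference, a minimal sketch of the same setup using the LLM class in-process (the model name is a placeholder):

from vllm import LLM

# Same engine options as the CLI flags above, passed as keyword arguments.
llm = LLM(model="<model_name>", dtype="half", tensor_parallel_size=1)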
timbmg commented 4 months ago

Just as a workaround, I am currently doing something like this:

import os
from contextlib import contextmanager

from vllm import LLM

@contextmanager
def swap_files(file1, file2):
    try:
        temp_file1 = file1 + '.temp'
        temp_file2 = file2 + '.temp'

        # Swap the two files by way of a temporary name.
        print("Renaming Files.")
        os.rename(file1, temp_file1)
        os.rename(file2, file1)
        os.rename(temp_file1, file2)

        yield

    finally:
        # Swap back so the original config ends up where it started.
        print("Restoring Files.")
        os.rename(file2, temp_file2)
        os.rename(file1, file2)
        os.rename(temp_file2, file1)

file1 = '/path/to/original/config.json'
file2 = '/path/to/modified/config.json'

# While inside the context, file1's path holds the modified config,
# so vLLM loads the overridden attributes.
with swap_files(file1, file2):
    llm = LLM(...)
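
For completeness, one way to produce the modified config.json used above (a minimal sketch; the overridden attribute is only an example):

import json

# Load the original HF config, override an attribute, and write the
# modified copy that swap_files() will temporarily put in place.
with open('/path/to/original/config.json') as f:
    config = json.load(f)

config['max_position_embeddings'] = 8192  # example override

with open('/path/to/modified/config.json', 'w') as f:
    json.dump(config, f, indent=2)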
K-Mistele commented 4 months ago

I would love to see this as well

KrishnaM251 commented 2 months ago

@Aakash-kaushik @mrPsycox @timbmg @K-Mistele

Please take a look at my PR and let me know if it serves your purpose.

As @DarkLight1337 noted in my PR (#5836), what exactly do you want to accomplish with this feature that cannot otherwise be done via vLLM args? (If there is no situation that results in different vLLM output, what is the point of enabling this?)

Once you get back to me, I'll write a test that covers that case.