vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: running gated models offline #9255

Open SamuelBG13 opened 2 weeks ago

SamuelBG13 commented 2 weeks ago

Your current environment

I use vllm==0.6.2 installed via pip :-)

How would you like to use vllm

Hello!

First of all, thanks for your great service to the community! I appreciate the work you put into this package.

I am currently running models with the vLLM server. I am particularly interested in a gated model I have access to, so I followed the Hugging Face Hub instructions for setting a token, downloaded the weights, and ran the model successfully. I used:

vllm serve {model_name} --someotherargs --download-dir /some_local_directory
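
For completeness, the token was set beforehand roughly along these lines (the value here is just a placeholder):

export HF_TOKEN=hf_xxxxxxxxxxxx   # placeholder token
# or, equivalently:
huggingface-cli login --token hf_xxxxxxxxxxxx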

So far, so good. However, if I want to serve the model without an HF Hub connection (e.g. with no internet, or in a fresh session with no HF_TOKEN set), I cannot serve it, even though the model is already downloaded locally:

Cannot access gated repo for url https://huggingface.co/ blablabla
Access to model {model_id} is restricted. You must have access to it and be authenticated to access it. Please log in.

Of course, setting HF_TOKEN again lets me serve the model (it does not download the weights again). But this is a bit of a bummer, as I would like to use the server in local applications regardless of the internet connection. Imagine you have an important event and the connection is bad, or the HF servers happen to be down that very day. Am I misunderstanding the usage, or is this a bug?

Things I tried: setting HF_HOME and HF_HUB_CACHE to the local directory with the model does not work either.
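
Roughly, the offline attempt looked like this (paths and model name are placeholders, as above):

unset HF_TOKEN
export HF_HOME=/some_local_directory
export HF_HUB_CACHE=/some_local_directory
vllm serve {model_name} --someotherargs --download-dir /some_local_directory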

DarkLight1337 commented 2 weeks ago

Have you tried setting the HF_HUB_OFFLINE environment variable?
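
For example, something like this before launching the server (same placeholders as in your command):

export HF_HUB_OFFLINE=1
vllm serve {model_name} --someotherargs --download-dir /some_local_directory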

SamuelBG13 commented 2 weeks ago

Yes, I had tried that too. Then I get:

huggingface_hub.errors.OfflineModeIsEnabled: Cannot reach https://huggingface.co/api/models/{MODEL_REPO}: offline mode is enabled. To disable it, please unset the `HF_HUB_OFFLINE` environment variable.

robertgshaw2-neuralmagic commented 2 weeks ago

Can you share the stack trace so we can see which LOC throws the error?

SamuelBG13 commented 2 weeks ago

Hello! Apologies, I thought this was perhaps the expected behavior, hence I didn't file a proper bug report.

Here is the traceback when I unset my HF_TOKEN:

INFO api_server.py:177] Started engine process with PID (...)
Traceback (most recent call last):
  File "/packages/huggingface_hub/utils/_errors.py", line 304, in hf_raise_for_status
    response.raise_for_status()
  File "/packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/mistralai/Pixtral-12B-2409/resolve/main/config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/packages/(...)/bin/vllm", line 8, in <module>
    sys.exit(main())
  File "/packages/vllm/scripts.py", line 37, in serve
    uvloop.run(run_server(args))
  File "/packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "/packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/packages/vllm/entrypoints/openai/api_server.py", line 538, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/packages/vllm/entrypoints/openai/api_server.py", line 105, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/packages/vllm/entrypoints/openai/api_server.py", line 182, in build_async_engine_client_from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/packages/vllm/engine/arg_utils.py", line 874, in create_engine_config
    model_config = self.create_model_config()
  File "/packages/vllm/engine/arg_utils.py", line 811, in create_model_config
    return ModelConfig(
  File "/packages/vllm/config.py", line 183, in __init__
    self.hf_config = get_config(self.model, trust_remote_code, revision,
  File "/packages/vllm/transformers_utils/config.py", line 121, in get_config
    if is_gguf or file_or_path_exists(model,
  File "/packages/vllm/transformers_utils/config.py", line 96, in file_or_path_exists
    return file_exists(model, config_name, revision=revision, token=token)
  File "/packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/packages/huggingface_hub/hf_api.py", line 2641, in file_exists
    get_hf_file_metadata(url, token=token)
  File "/packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/packages/huggingface_hub/file_download.py", line 1645, in get_hf_file_metadata
    r = _request_wrapper(
  File "/packages/huggingface_hub/file_download.py", line 372, in _request_wrapper
    response = _request_wrapper(
  File "/packages/huggingface_hub/file_download.py", line 396, in _request_wrapper
    hf_raise_for_status(response)
  File "/packages/huggingface_hub/utils/_errors.py", line 321, in hf_raise_for_status
    raise GatedRepoError(message, response) from e
huggingface_hub.utils._errors.GatedRepoError: 401 Client Error.

And as mentioned previously, mistralai/Pixtral-12B-2409 is already downloaded 😄

Using:

huggingface-hub==0.23.3
vllm==0.6.2 
transformers==4.45.2
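
As a side note, if the weights landed in the usual Hub cache layout under the download directory, pointing vllm serve at the snapshot folder directly might be a workaround (the path below just illustrates that layout; the actual commit hash differs):

vllm serve /some_local_directory/models--mistralai--Pixtral-12B-2409/snapshots/<commit_hash> --someotherargs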

ycool commented 2 weeks ago

@SamuelBG13 Can you add the output of python collect_env.py and env > env.txt?

After adding export HF_HUB_OFFLINE=1 and manually disconnecting from the internet, it works in my env.
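
Roughly what I did (model name and path are placeholders):

export HF_HUB_OFFLINE=1
# disable the network connection manually, then:
vllm serve {model_name} --download-dir /some_local_directory
# in another shell, sanity-check that the server still answers locally:
curl http://localhost:8000/v1/models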