mudler / LocalAI

:robot: The free, Open Source alternative to OpenAI, Claude and others. Self-hosted and local-first. Drop-in replacement for OpenAI, running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many more model architectures. Features: Generate Text, Audio, Video, Images, Voice Cloning, Distributed inference
https://localai.io
MIT License

AutoGPTQ backend can't load local model files #1812

Closed · thiner closed this issue 7 months ago

thiner commented 8 months ago

LocalAI version: Docker image `localai/localai:v2.9.0-cublas-cuda12-core` with the extra `autogptq` backend

Environment, CPU architecture, OS, and Version:

```
# nvidia-smi
Fri Mar  8 05:21:56 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10                     On  | 00000000:F0:00.0 Off |                    0 |
|  0%   29C    P8              15W / 150W |      2MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A10                     On  | 00000000:F1:00.0 Off |                    0 |
|  0%   29C    P8              15W / 150W |      2MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
```

Describe the bug: Starting the Qwen-VL-Chat-Int4 model fails because AutoGPTQ cannot find `config.json` in the model folder.

To Reproduce

1. Build a Docker image with this Dockerfile:

```dockerfile
FROM localai/localai:v2.9.0-cublas-cuda12-core

RUN apt-get update -y && apt-get install -y curl gcc libxml2 libxml2-dev
RUN apt install -y wget git && \
    apt clean && \
    rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

ENV PATH="/root/miniconda3/bin:${PATH}"
ARG PATH="/root/miniconda3/bin:${PATH}"

RUN wget \
        https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh \
    && mkdir .conda \
    && bash Miniconda3-latest-Linux-x86_64.sh -b \
    && rm -f Miniconda3-latest-Linux-x86_64.sh
RUN conda init bash

RUN PATH=$PATH:/opt/conda/bin make -C backend/python/autogptq
ENV EXTERNAL_GRPC_BACKENDS="autogptq:/build/backend/python/autogptq/run.sh"
ENV BUILD_TYPE="cublas"
```
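For reference, the image can then be built with a command along these lines (the `localai:v2.9.0-autogptq` tag is an assumption, chosen to match the `docker run` command in step 4):

```bash
# Tag is assumed; any name works, but keep it consistent with the run command.
docker build -t localai:v2.9.0-autogptq .
```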

2. Download the model files to a local drive:
`huggingface-cli download --resume-download Qwen/Qwen-VL-Chat-Int4 --local-dir qwen-vl-chat-int4 --local-dir-use-symlinks False`
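As a sanity check (not part of the original report), you can confirm the download actually produced a `config.json` in the target directory, since that is the file the backend later fails to find:

```bash
# The local directory should contain config.json plus the quantized weights
# and Qwen's custom modeling code.
ls qwen-vl-chat-int4/
```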

3. Create a `qwen-vl.yaml` file:
```yaml
  # Model name.
  # The model name is used to identify the model in the API calls.
- name: gpt-4-vision-preview
  # Default model parameters.
  # These options can also be specified in the API calls
  parameters:
    model: qwen-vl-chat-int4
    temperature: 0.7
    top_k: 85
    top_p: 0.7

  # Default context size
  context_size: 4096
  # Default number of threads
  threads: 16
  backend: autogptq

  # define chat roles
  roles:
    user: "user:"
    assistant: "assistant:"
    system: "system:"
  template:
    chat: &template |
      Instruct: {{.Input}}
      Output:
    # Modify the prompt template here ^^^ as per your requirements
    completion: *template 
  # Enable F16 if backend supports it
  f16: true
  embeddings: false
  # Enable debugging
  debug: true

  # GPU Layers (only used when built with cublas)
  gpu_layers: -1

  # Diffusers/transformers
  cuda: true
```

4. Run the model:

```bash
docker run -p 8080:8080 -v $PWD/models:/opt/models -e MODELS_PATH=/opt/models localai:v2.9.0-autogptq --config-file /opt/models/qwen-vl.yaml
```

5. Call the API:

```bash
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "gpt-4-vision-preview",
  "messages": [{"role": "user", "content": [{"type": "text", "text": "What is in the image?"}, {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}}]}],
  "temperature": 0.9
}'
```

Expected behavior: The model loads from the local directory and the API responds with an answer describing the image.

Logs

```json
{
  "error": {
    "code": 500,
    "message": "could not load model (no success): Unexpected err=OSError(\"We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like qwen-vl-chat-int4 is not the path to a directory containing a file named config.json.\\nCheckout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.\"), type(err)=<class 'OSError'>",
    "type": ""
  }
}
```
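The message indicates that transformers is resolving `qwen-vl-chat-int4` as a Hugging Face Hub repo id rather than as a local directory. A minimal sketch of how the same OSError arises when a bare name does not resolve to a directory from the backend's working directory (an assumption about the failure mode, reproduced outside LocalAI):

```python
from transformers import AutoConfig

# If "qwen-vl-chat-int4" is not a directory relative to the current working
# directory, transformers falls back to treating it as a Hub repo id and
# tries to download it, raising the OSError above when offline.
AutoConfig.from_pretrained("qwen-vl-chat-int4", trust_remote_code=True)
```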
thiner commented 8 months ago

After tons of googling, this may be the root cause: https://github.com/QwenLM/Qwen-VL/issues/106#issuecomment-1751955889. Can we expose the `trust_remote_code` parameter as a command-line option in `backend/python/autogptq/autogptq.py`?
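A minimal sketch of what that could look like, assuming the backend loads via auto-gptq's `AutoGPTQForCausalLM.from_quantized` (which accepts a `trust_remote_code` keyword); this illustrates the proposal and is not LocalAI's actual code:

```python
from auto_gptq import AutoGPTQForCausalLM

# Sketch: load the quantized model from an absolute local path, with
# trust_remote_code enabled as Qwen-VL's custom modeling code requires.
model = AutoGPTQForCausalLM.from_quantized(
    "/opt/models/qwen-vl-chat-int4",  # local directory, not a bare Hub id
    device="cuda:0",
    use_safetensors=True,
    trust_remote_code=True,
)
```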