mudler / LocalAI

:robot: The free, Open Source alternative to OpenAI, Claude and others. Self-hosted and local-first. Drop-in replacement for OpenAI, running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many more model architectures. Features: Generate Text, Audio, Video, Images, Voice Cloning, Distributed inference
https://localai.io
MIT License

bug: autogptq doesn't work (can't download model) #941

Open racerxdl opened 1 year ago

racerxdl commented 1 year ago

LocalAI version:

Docker Image: quay.io/go-skynet/local-ai:master-cublas-cuda12-ffmpeg

Environment, CPU architecture, OS, and Version: Running in TrueNAS Scale Kubernetes (k3s) with a NVidia Tesla P40 in the container.

# uname -a
Linux localai-ix-chart-f8bbbb7c7-x6xx9 6.1.42-production+truenas #2 SMP PREEMPT_DYNAMIC Mon Aug 14 23:21:26 UTC 2023 x86_64 GNU/Linux
# nvidia-smi
Tue Aug 22 16:36:27 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P40                      Off | 00000000:23:00.0 Off |                    0 |
| N/A   28C    P8              10W / 250W |      2MiB / 23040MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
# cat /proc/cpuinfo |grep "model name" | nl
     1  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
     2  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
     3  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
     4  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
     5  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
     6  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
     7  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
     8  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
     9  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
    10  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
    11  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
    12  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
    13  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
    14  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
    15  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
    16  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
    17  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
    18  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
    19  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
    20  model name      : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
# cat /proc/meminfo  | grep Mem
MemTotal:       32701568 kB
MemFree:        18305148 kB
MemAvailable:   18767368 kB

Describe the bug

AutoGPTQ (added in #871) doesn't work in the upstream container. I also tried exllama, which gives a linker error for CudaSetDevice.

To Reproduce

curl $LOCALAI/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "TheBloke/orca_mini_v2_13b-GPTQ",
     "messages": [{"role": "user", "content": "### System:\nYou are an AI assistant that follows instruction extremely well. Help as much as you can.\n \n### User: \ntell me about AI \n### Response:"}],
     "backend": "autogptq", "model_base_name": "orca_mini_v2_13b-GPTQ-4bit-128g.no-act.order"
}'
{"error":{"code":500,"message":"could not load model (no success): Unexpected err=FileNotFoundError('Could not find model in TheBloke/orca_mini_v2_13b-GPTQ'), type(err)=\u003cclass 'FileNotFoundError'\u003e","type":""}}
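One way to narrow this down would be to load the same repo directly with AutoGPTQ inside the container. The sketch below is hypothetical - it assumes the autogptq backend ultimately calls AutoGPTQForCausalLM.from_quantized with the repo id and model_base_name from the request above; LocalAI's actual backend code may differ:

# Hypothetical standalone check inside the container, not LocalAI backend code.
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

repo = "TheBloke/orca_mini_v2_13b-GPTQ"
basename = "orca_mini_v2_13b-GPTQ-4bit-128g.no-act.order"

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoGPTQForCausalLM.from_quantized(
    repo,
    model_basename=basename,
    use_safetensors=True,  # TheBloke GPTQ repos ship .safetensors weights
    device="cuda:0",
)
inputs = tokenizer("tell me about AI", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))

If this reproduces the same FileNotFoundError, the problem is in AutoGPTQ's own file lookup; if it succeeds, the issue is more likely in how LocalAI passes the model path (the load options in the log below contain both Model:TheBloke/orca_mini_v2_13b-GPTQ and ModelFile:/models/TheBloke/orca_mini_v2_13b-GPTQ).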

Also tried with a local model:

name: deadbeef
backend: autogptq
parameters:
  model: wizardlm-13b-v1.1-GPTQ-4bit-128g.no-act.order.safetensors

curl $LOCALAI/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "deadbeef",
     "messages": [{"role": "user", "content": "Give me a HTTP REST server made in rust that uses sqlite."}],
     "temperature": 0.9
   }' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   655  100   489  100   166    115     39  0:00:04  0:00:04 --:--:--   154
{
  "error": {
    "code": 500,
    "message": "could not load model (no success): Unexpected err=OSError(\"wizardlm-13b-v1.1-GPTQ-4bit-128g.no-act.order.safetensors is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'\\nIf this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.\"), type(err)=<class 'OSError'>",
    "type": ""
  }
}
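
The OSError above suggests the bare .safetensors filename from the YAML ends up being resolved as a Hugging Face Hub repo id rather than as a file under the models directory. As a rough illustration of that failure mode (hypothetical snippet, not LocalAI code), passing a bare weights filename to a from_pretrained-style loader produces essentially the same message:

# Hypothetical illustration: a bare weights filename is neither a local folder
# nor a valid Hub repo id, so transformers raises the OSError quoted above.
from transformers import AutoConfig

AutoConfig.from_pretrained("wizardlm-13b-v1.1-GPTQ-4bit-128g.no-act.order.safetensors")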

Expected behavior

Logs

2023-08-22 09:32:17.702437-07:004:32PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:TheBloke/orca_mini_v2_13b-GPTQ ContextSize:512 Seed:0 NBatch:512 F16Memory:false MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:0 MainGPU: TensorSplit: Threads:8 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/TheBloke/orca_mini_v2_13b-GPTQ Device: UseTriton:false ModelBaseName:orca_mini_v2_13b-GPTQ-4bit-128g.no-act.order UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0}
2023-08-22 09:32:18.184608-07:004:32PM DBG GRPC(TheBloke/orca_mini_v2_13b-GPTQ-127.0.0.1:35307): stderr 
Downloading (…)okenizer_config.json:   0%|          | 0.00/727 [00:00<?, ?B/s]
Downloading (…)okenizer_config.json: 100%|██████████| 727/727 [00:00<00:00, 126kB/s]
2023-08-22 09:32:20.337018-07:004:32PM DBG GRPC(TheBloke/orca_mini_v2_13b-GPTQ-127.0.0.1:35307): stderr 
Downloading tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]
Downloading tokenizer.model: 100%|██████████| 500k/500k [00:01<00:00, 400kB/s]
Downloading tokenizer.model: 100%|██████████| 500k/500k [00:01<00:00, 400kB/s]
2023-08-22 09:32:20.789331-07:004:32PM DBG GRPC(TheBloke/orca_mini_v2_13b-GPTQ-127.0.0.1:35307): stderr 
Downloading (…)cial_tokens_map.json:   0%|          | 0.00/435 [00:00<?, ?B/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 435/435 [00:00<00:00, 202kB/s]
2023-08-22 09:32:20.792331-07:004:32PM DBG GRPC(TheBloke/orca_mini_v2_13b-GPTQ-127.0.0.1:35307): stderr You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
2023-08-22 09:32:21.126540-07:004:32PM DBG GRPC(TheBloke/orca_mini_v2_13b-GPTQ-127.0.0.1:35307): stderr 
Downloading (…)lve/main/config.json:   0%|          | 0.00/812 [00:00<?, ?B/s]
Downloading (…)lve/main/config.json: 100%|██████████| 812/812 [00:00<00:00, 151kB/s]
2023-08-22 09:32:21.646001-07:004:32PM DBG GRPC(TheBloke/orca_mini_v2_13b-GPTQ-127.0.0.1:35307): stderr 
Downloading (…)quantize_config.json:   0%|          | 0.00/158 [00:00<?, ?B/s]
Downloading (…)quantize_config.json: 100%|██████████| 158/158 [00:00<00:00, 74.0kB/s]
2023-08-22 09:32:21.818950-07:00[10.10.5.174]:36522  500  -  POST     /v1/chat/completions
mudler commented 1 year ago

Did you try with:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "TheBloke/orca_mini_v2_13b-GPTQ",
     "messages": [{"role": "user", "content": "### System:\nYou are an AI assistant that follows instruction extremely well. Help as much as you can.\n \n### User: \ntell me about AI \n### Response:"}],
     "backend": "autogptq", "model_base_name": "orca_mini_v2_13b-GPTQ-4bit-128g.no-act.order"
}'

?

mudler commented 1 year ago

Ah, just saw you tried, my bad - it looks like a downloading issue to me. For local files, only exllama currently works with local folders - what error do you get there? Also, did you try with another model?

racerxdl commented 1 year ago

Ah, just saw you tried, my bad - it looks like a downloading issue to me. For local files, only exllama currently works with local folders - what error do you get there? Also, did you try with another model?

For exllama, the error seems like an incompatible CUDA version in the container:

ImportError: /usr/local/lib/python3.9/dist-packages/exllama_ext.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi
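
That undefined symbol demangles to c10::cuda::SetDevice(int), a libtorch symbol, which usually means the prebuilt exllama extension was compiled against a different torch/CUDA build than the one installed in the image. A quick way to check the installed build (hypothetical diagnostic, not an official troubleshooting step):

# Compare the torch build inside the container against what the exllama .so expects;
# a mismatch shows up as undefined C++ symbols like the one above.
import torch
print(torch.__version__)   # e.g. "2.0.1+cu118"
print(torch.version.cuda)  # CUDA version torch was built with, vs. the CUDA 12 image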

I also tried downloading Vicuna and other TheBloke models directly; they give the same not-found errors. But standard llama-cpp downloads just fine (I tested the same models in GGML versions over llama-cpp and they work fine).