mudler / LocalAI

:robot: The free, Open Source alternative to OpenAI, Claude and others. Self-hosted and local-first. Drop-in replacement for OpenAI, running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many more model architectures. Features: Generate Text, Audio, Video, Images, Voice Cloning, Distributed inference
https://localai.io
MIT License

only 4 threads are used #498

Closed: badsmoke closed this issue 1 year ago

badsmoke commented 1 year ago

LocalAI version: latest Docker image

Environment, CPU architecture, OS, and Version: Ryzen 9 3900X (12 cores / 24 threads) -> Windows 10 -> WSL2 (5.15.90.1-microsoft-standard-WSL2) -> Docker

Describe the bug: I have the model ggml-gpt4all-l13b-snoozy.bin, but only a maximum of 4 threads is used.
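One way to confirm how many threads the server process actually spawns is to read its /proc status inside the container. This is only a sketch: the service name localai comes from the compose file below, and the binary name local-ai plus the availability of pgrep inside the image are assumptions.

# count the threads of the LocalAI process (assumes the binary is named local-ai and procps/pgrep is present in the image)
docker compose exec localai sh -c 'grep Threads /proc/$(pgrep local-ai | head -n1)/status'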

docker-compose.yml

version: "3.9"
services:
  localai:
    image: quay.io/go-skynet/local-ai:latest
    volumes:
      - ./models:/models
      - /etc/localtime:/etc/localtime:ro
    ports:
      - "8080:8080"
    command:
       --models-path /models
       --context-size 1024
       --threads 23
       --debug

model file models/gpt-3.5-turbo.yaml

name: gpt-3.5-turbo
# Default model parameters
parameters:
  # Relative to the models path
  model: ggml-gpt4all-l13b-snoozy.bin
  # temperature
  temperature: 0.3
  # all the OpenAI request options here..

# Default context size
context_size: 512
threads: 23
# Define a backend (optional). By default it will try to guess the backend the first time the model is interacted with.
#backend: gptj # available: llama, stablelm, gpt2, gptj, rwkv
# stopwords (if supported by the backend)
stopwords:
- "HUMAN:"
- "### Response:"
# define chat roles
roles:
  user: "HUMAN:"
  system: "GPT:"
template:
  # template file ".tmpl" with the prompt template to use by default on the endpoint call. Note there is no extension in the files
  completion: completion
  chat: ggml-gpt4all-l13b-snoozy

template

The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate response.
### Prompt:
{{.Input}}
### Response:

Example command

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "ggml-gpt4all-l13b-snoozy.bin",
     "messages": [{"role": "user", "content": "What is Kubernetes?"}],
     "temperature": 0.7
   }'

thanks

mudler commented 1 year ago

Good catch, thanks for filing an issue!

That looks like a regression; somehow the patch got lost when upstreaming the binding. I've opened a PR upstream too: https://github.com/nomic-ai/gpt4all/pull/836

badsmoke commented 1 year ago

Is your branch itself supposed to be runnable?

I have built a new version with it, but unfortunately I get an error that it cannot load the model:

localai_1  | 9:00AM DBG Request received: {"model":"gpt-3.5-turbo","file":"","language":"","response_format":"","size":"","prompt":null,"instruction":"","input":null,"stop":null,"messages":[{"role":"system","content":"You are ChatGPT, a large language model trained by OpenAI. Follow the user's instructions carefully. Respond using markdown."},{"role":"user","content":"hey"}],"stream":true,"echo":false,"top_p":0,"top_k":0,"temperature":0.5,"max_tokens":1000,"n":0,"batch":0,"f16":false,"ignore_eos":false,"repeat_penalty":0,"n_keep":0,"mirostat_eta":0,"mirostat_tau":0,"mirostat":0,"seed":0,"mode":0,"step":0}
localai_1  | 9:00AM DBG Parameter Config: &{OpenAIRequest:{Model:ggml-gpt4all-l13b-snoozy.bin File: Language: ResponseFormat: Size: Prompt:<nil> Instruction: Input:<nil> Stop:<nil> Messages:[] Stream:false Echo:false TopP:0 TopK:0 Temperature:0.5 Maxtokens:1000 N:0 Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 Seed:0 Mode:0 Step:0} Name:gpt-3.5-turbo StopWords:[HUMAN: ### Response:] Cutstrings:[] TrimSpace:[] ContextSize:1024 F16:false Threads:20 Debug:true Roles:map[system:GPT: user:HUMAN:] Embeddings:false Backend: TemplateConfig:{Completion:completion Chat:ggml-gpt4all-l13b-snoozy Edit:} MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 ImageGenerationAssets: PromptCachePath: PromptCacheAll:false PromptStrings:[] InputStrings:[] InputToken:[]}
localai_1  | 9:00AM DBG Stream request received
localai_1  | 9:00AM DBG Template found, input modified to: The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate response.
localai_1  | ### Prompt:
localai_1  | GPT: You are ChatGPT, a large language model trained by OpenAI. Follow the user's instructions carefully. Respond using markdown.
localai_1  | HUMAN: hey
localai_1  | ### Response:
localai_1  | 
localai_1  | [172.31.48.1]:62236  200  -  POST     /v1/chat/completions
localai_1  | 9:00AM DBG Loading model 'ggml-gpt4all-l13b-snoozy.bin' greedly
localai_1  | 9:00AM DBG [llama] Attempting to load
localai_1  | 9:00AM DBG Loading model llama from ggml-gpt4all-l13b-snoozy.bin
localai_1  | 9:00AM DBG Loading model in memory from file: /models/ggml-gpt4all-l13b-snoozy.bin
localai_1  | llama.cpp: loading model from /models/ggml-gpt4all-l13b-snoozy.bin
localai_1  | 9:00AM DBG Sending chunk: {"object":"chat.completion.chunk","model":"gpt-3.5-turbo","choices":[{"delta":{"role":"assistant"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
localai_1  | 
localai_1  | error loading model: llama.cpp: tensor '�~��T��zuƝW�z���$�;�GY�������>\��uƷm�;̈́ȅhLc������m��I��;�]V��zWkxȅzT\�m
                                                                                                                           <���wn��wkZ֨˺�K۶<�ukz��ww����wvV��m
                                                                                                                                                             <���騳V�E�����Y�۶�;��K�s���V�gV���m<wih���x�vL�X��J��m�;|�?Y_��Ew���Df(ܒ$�;����l��X�yQ�w�)' should not be 1003786825-dimensional
localai_1  | llama_init_from_file: failed to load model
localai_1  | 9:00AM DBG [llama] Fails: failed loading model
localai_1  | 9:00AM DBG [gpt4all] Attempting to load
localai_1  | 9:00AM DBG Loading model gpt4all from ggml-gpt4all-l13b-snoozy.bin
localai_1  | 9:00AM DBG Loading model in memory from file: /models/ggml-gpt4all-l13b-snoozy.bin
localai_1  | llama.cpp: loading model from /models/ggml-gpt4all-l13b-snoozy.bin
localai_1  | error loading model: llama.cpp: tensor '�~��T��zuƝW�z���$�;�GY�������>\��uƷm�;̈́ȅhLc������m��I��;�]V��zWkxȅzT\�m
                                                                                                                           <���wn��wkZ֨˺�K۶<�ukz��ww����wvV��m
                                                                                                                                                             <���騳V�E�����Y�۶�;��K�s���V�gV���m<wih���x�vL�X��J��m�;|�?Y_��Ew���Df(ܒ$�;����l��X�yQ�w�)' should not be 1003786825-dimensional
localai_1  | llama_init_from_file: failed to load model
localai_1  | LLAMA ERROR: failed to load model from /models/ggml-gpt4all-l13b-snoozy.bin
localai_1  | 9:00AM DBG [gpt4all] Fails: failed loading model
localai_1  | 9:00AM DBG [gptneox] Attempting to load
localai_1  | 9:00AM DBG Loading model gptneox from ggml-gpt4all-l13b-snoozy.bin
localai_1  | 9:00AM DBG Loading model in memory from file: /models/ggml-gpt4all-l13b-snoozy.bin
localai_1  | gpt_neox_model_load: invalid model file '/models/ggml-gpt4all-l13b-snoozy.bin' (bad magic)
localai_1  | gpt_neox_bootstrap: failed to load model from '/models/ggml-gpt4all-l13b-snoozy.bin
mudler commented 1 year ago

> Is your branch itself supposed to be runnable?
>
> I have built a new version with it, but unfortunately I get an error that it cannot load the model


Thanks for the heads up; I've included a fix for this issue in https://github.com/go-skynet/LocalAI/pull/507

badsmoke commented 1 year ago

Thank you, the changes work for me.

But unfortunately the output is still extremely slow.

If I run this on the same device with the same model using the GPT4All software, I get a response almost instantly (about 3 seconds); with LocalAI it takes about 1 minute, which is a huge difference.

mudler commented 1 year ago

You shouldn't overbook threads; rather, match the number of physical cores. Here I get about 120 ms per token, but my hardware isn't very capable either.
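To pick a sensible value for the threads setting, you can read the physical core count from the system. This is a sketch using standard Linux tools; note that under WSL2 the CPUs visible to the guest may additionally be capped by a processors= entry in .wslconfig, which is an assumption to verify on your own setup.

# physical topology: sockets, cores per socket, and threads per core
lscpu | grep -E 'Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core'
# logical CPUs visible to the system (includes SMT/hyper-threads)
nproc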

badsmoke commented 1 year ago

I have also tried it with 12 threads (the actual number of physical cores), but the result is the same: about 1 minute until even the first word arrives.
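A rough way to quantify that delay from the command line is curl's write-out timings; with a streaming request, time_starttransfer approximates the time until the first token arrives. This is only a sketch: the model name gpt-3.5-turbo comes from the YAML config above, everything else is standard curl.

curl -s -o /dev/null -w 'time to first token: %{time_starttransfer}s (total %{time_total}s)\n' http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "gpt-3.5-turbo",
     "messages": [{"role": "user", "content": "How are you?"}],
     "stream": true
   }'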

mudler commented 1 year ago

Here with 8 threads, ggml-gpt4all-l13b-snoozy.bin takes about 15 s to reply to "How are you?" (note that my CPU was busy with other things at the same time):

(screen recording attached: Peek 2023-06-05 13-45)

Also note that the model is loaded into memory on first use, so the first call will be slightly slower. Where are you running LocalAI?
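If that first-call load time matters, one workaround is a throwaway warm-up request right after the server starts, so the model is already resident when real traffic arrives. A sketch reusing the endpoint from this thread; the prompt content and the max_tokens value are arbitrary.

# warm-up call: makes LocalAI load the model before real requests arrive
curl -s http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "gpt-3.5-turbo",
     "messages": [{"role": "user", "content": "ping"}],
     "max_tokens": 1
   }' > /dev/null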

badsmoke commented 1 year ago

I run it via Docker. Something seems to go wrong: it doesn't look like anything is loaded into memory.

When I make a request, nothing happens for about 1 minute, and then the first tokens arrive.

localai_1    |  ┌───────────────────────────────────────────────────┐ 
localai_1    |  │                   Fiber v2.46.0                   │ 
localai_1    |  │               http://127.0.0.1:8080               │ 
localai_1    |  │       (bound on host 0.0.0.0 and port 8080)       │ 
localai_1    |  │                                                   │ 
localai_1    |  │ Handlers ............ 23  Processes ........... 1 │ 
localai_1    |  │ Prefork ....... Disabled  PID .............. 3890 │ 
localai_1    |  └───────────────────────────────────────────────────┘ 
localai_1    | 
localai_1    | llama.cpp: loading model from /models/ggml-gpt4all-l13b-snoozy.bin
localai_1    | error loading model: llama.cpp: tensor '�~��T��zuƝW�z���$�;�GY�������>\��uƷm�;̈́ȅhLc������m��I��;�]V��zWkxȅzT\�m
                                                                                                                             <���wn��wkZ֨˺�K۶<�ukz��ww����wvV��m
                                                                                                                                                               <���騳V�E�����Y�۶�;��K�s���V�gV���m<wih���x�vL�X��J��m�;|�?Y_��Ew���Df(ܒ$�;����l��X�yQ�w�)' should not be 1003786825-dimensional
localai_1    | llama_init_from_file: failed to load model
localai_1    | llama.cpp: loading model from /models/ggml-gpt4all-l13b-snoozy.bin
localai_1    | gptjllama_model_load_internal: format     = ggjt v1 (latest)
localai_1    | gptjllama_model_load_internal: n_vocab    = 32000
localai_1    | gptjllama_model_load_internal: n_ctx      = 2048
localai_1    | gptjllama_model_load_internal: n_embd     = 5120
localai_1    | gptjllama_model_load_internal: n_mult     = 256
localai_1    | gptjllama_model_load_internal: n_head     = 40
localai_1    | gptjllama_model_load_internal: n_layer    = 40
localai_1    | gptjllama_model_load_internal: n_rot      = 128
localai_1    | gptjllama_model_load_internal: ftype      = 2 (mostly Q4_0)
localai_1    | gptjllama_model_load_internal: n_ff       = 13824
localai_1    | gptjllama_model_load_internal: n_parts    = 1
localai_1    | gptjllama_model_load_internal: model size = 13B
localai_1    | gptjllama_model_load_internal: ggml ctx size =  73.73 KB
localai_1    | gptjllama_model_load_internal: mem required  = 9807.47 MB (+ 1608.00 MB per state)
localai_1    | gptjllama_init_from_file: kv self size  = 1600.00 MB
boixu commented 1 year ago

I get the same tensor error.