rpc error: code = Unknown desc = unimplemented #800

allenhaozi opened 1 year ago

allenhaozi commented 1 year ago

What went wrong? Settings?

Request:

    {
        "model": "llama-7b-hf",
        "messages": [
            {
                "role": "user",
                "content": "Hello! What is your name?"
            }
        ]
    }

Response:

    {
        "error": {
            "code": 500,
            "message": "rpc error: code = Unknown desc = unimplemented",
            "type": ""
        }
    }

Skipping rebuild
If you are experiencing issues with the pre-compiled builds, try setting REBUILD=true
If you are still experiencing issues with the build, try setting CMAKE_ARGS and disable the instructions set as needed:
see the documentation at:
Note: See also
CPU info:
model name      : Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts
CPU:    AVX    found OK
CPU:    AVX2   found OK
CPU: no AVX512 found
5:38AM DBG no galleries to load
5:38AM INF Starting LocalAI using 4 threads, with models path: /llm-model-volume
5:38AM INF LocalAI version: 12fe093 (12fe0932c41246914e455c4175269a431fb8cf60)
5:38AM DBG Extracting backend assets files to /tmp/localai/backend_data

 │                   Fiber v2.48.0                   │ 
 │                    │ 
 │       (bound on host and port 8080)       │ 
 │                                                   │ 
 │ Handlers ............ 32  Processes ........... 1 │ 
 │ Prefork ....... Disabled  PID ................ 14 │ 

6:21AM DBG Request received: 
6:21AM DBG Configuration read: &{PredictionOptions:{Model:llama-7b-hf Language: N:0 TopP:0.7 TopK:80 Temperature:0.9 Maxtokens:512 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0} Name: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:512 F16:false NUMA:false Threads:4 Debug:true Roles:map[] Embeddings:false Backend: TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false TensorSplit: MainGPU: ImageGenerationAssets: PromptCachePath: PromptCacheAll:false PromptCacheRO:false Grammar: PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} SystemPrompt:}
6:21AM DBG Parameters: &{PredictionOptions:{Model:llama-7b-hf Language: N:0 TopP:0.7 TopK:80 Temperature:0.9 Maxtokens:512 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0} Name: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:512 F16:false NUMA:false Threads:4 Debug:true Roles:map[] Embeddings:false Backend: TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false TensorSplit: MainGPU: ImageGenerationAssets: PromptCachePath: PromptCacheAll:false PromptCacheRO:false Grammar: PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} SystemPrompt:}
6:21AM DBG Prompt (before templating): Hello! What is your name?
6:21AM DBG Template failed loading: failed loading a template for llama-7b-hf
6:21AM DBG Prompt (after templating): Hello! What is your name?
6:21AM DBG Model already loaded in memory: llama-7b-hf
6:21AM DBG Model 'llama-7b-hf' already loaded
[]:43283  500  -  POST     /v1/chat/completions
allenhaozi commented 1 year ago

This is deployed on Kubernetes. A GPU has been configured, but it is probably not taking effect.

        cpu: "1"
        memory: 50Gi "4"
        cpu: "1"
        memory: 50Gi "4"
rozek commented 1 year ago

I have the same problem when running LocalAI in a Docker container. The logs contain numerous lines of the form:

rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp connect: connection refused"

with varying port numbers

rozek commented 1 year ago

FYI: the problem occurs both in local Docker builds and the ":latest" image from go-skynet

nabbl commented 1 year ago

Yes, same here. I used the latest version with a GPT4All model and it just gives errors. Same on Kubernetes and locally.

Mer0me commented 1 year ago

In case it helps, here is my (very similar) error message:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "ggml-gpt4all-j",
     "messages": [{"role": "user", "content": "How are you?"}],
     "temperature": 0.9
   }'

{"error":{"code":500,"message":"rpc error: code = Unknown desc = unimplemented","type":""}}
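For anyone who wants to reproduce this from a script rather than curl, here is a minimal Python sketch of the same request. It only assumes the endpoint and JSON shapes shown in the curl command and error response above; the helper names (`build_chat_request`, `extract_error`, `post_chat`) are made up for illustration and are not part of LocalAI.

```python
import json
import urllib.error
import urllib.request


def build_chat_request(model, prompt, temperature=0.9):
    """Build the same JSON body that the curl command above sends."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }


def extract_error(body_text):
    """Return (code, message) from an error body, or None if there is none."""
    try:
        data = json.loads(body_text)
    except json.JSONDecodeError:
        return None
    err = data.get("error") if isinstance(data, dict) else None
    if not isinstance(err, dict):
        return None
    return err.get("code"), err.get("message")


def post_chat(base_url, payload, timeout=60):
    """POST to /v1/chat/completions and return (HTTP status, body text)."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8")
    except urllib.error.HTTPError as exc:
        # A 500 response still carries the JSON error body.
        return exc.code, exc.read().decode("utf-8")
```

Calling `post_chat("http://localhost:8080", build_chat_request("ggml-gpt4all-j", "How are you?"))` and passing the body to `extract_error` gives the code and message in a form that is easy to paste into a report.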
rozek commented 1 year ago

I am currently trying to compile a previous release to find the last version in which LocalAI worked without this problem.

Unfortunately, the Docker build command seems to expect the source to have been checked-out as a Git project and refuses to build from an unpacked ZIP archive...

Thus, I directly checked out v1.21.0, built a Docker image locally, ran it...and had the same problem as before.

For the record, here is what I did:

git clone --branch v1.21.0
cd LocalAI
docker build -t localai .
docker run --rm --name localai \
  -v "/path/to/your/local/models/folder":/build/models \
  -p \

I also tried v1.20.1 and v1.20.0 - but these builds failed with "ggml.c:(.text+0x2e860): multiple definition of `clear_numa_thread_affinity'; /build/go-llama/libbinding.a(ggml.o):ggml.c:(.text+0x2e860): first defined here"

Building v1.19.2 succeeded - but it could not load my model (LLaMA 2) which makes it useless for me...

I don't have the time to check every previous version, but perhaps somebody else has...

mudler commented 1 year ago

Did you try running with REBUILD=true? Also, please attach full logs with DEBUG=true.

rozek commented 1 year ago

Ok, so I

with the same result as before. Here are the logs (mind the "skipping rebuild")

Skipping rebuild
If you are experiencing issues with the pre-compiled builds, try setting REBUILD=true
If you are still experiencing issues with the build, try setting CMAKE_ARGS and disable the instructions set as needed:
see the documentation at:
Note: See also
CPU info:
CPU: no AVX    found
CPU: no AVX2   found
CPU: no AVX512 found
3:31AM DBG no galleries to load
3:31AM INF Starting LocalAI using 4 threads, with models path: /build/models
3:31AM INF LocalAI version: v1.22.0-19-gdde12b4 (dde12b492b2da4f14d66047a42b66bff80e223af)

 │                   Fiber v2.48.0                   │ 
 │                    │ 
 │       (bound on host and port 8080)       │ 
 │                                                   │ 
 │ Handlers ............ 31  Processes ........... 1 │ 
 │ Prefork ....... Disabled  PID ................ 14 │ 
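The startup log above reports no AVX, AVX2 or AVX512, and the banner itself suggests disabling instruction sets via CMAKE_ARGS when the pre-compiled build misbehaves. As a rough cross-check, here is a small Python sketch that derives the same three answers from /proc/cpuinfo. It assumes a Linux x86 host where the file has a "flags" line; the function names are illustrative, and whether missing AVX is related to the error in this thread is not established here.

```python
def parse_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line of /proc/cpuinfo."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def simd_report(flags):
    """Map each instruction set family to whether a matching flag is present."""
    return {
        "AVX": "avx" in flags,
        "AVX2": "avx2" in flags,
        # AVX-512 shows up as several flags (avx512f, avx512bw, ...).
        "AVX512": any(flag.startswith("avx512") for flag in flags),
    }


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the cpuinfo file; returns an empty string if it is unavailable."""
    try:
        with open(path) as fh:
            return fh.read()
    except OSError:
        return ""
```

Running `simd_report(parse_flags(read_cpuinfo()))` inside the container shows what the container actually sees, which can differ from the host when emulation is involved.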

here is my .env

## Set number of threads.
## Note: prefer the number of physical cores. Overbooking the CPU degrades performance notably.

## Specify a different bind address (defaults to ":8080")

## Default models context size

## Define galleries.
## models to install will be visible in `/models/available`
#GALLERIES=[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}, {"name":"huggingface","url":"github:go-skynet/model-gallery/huggingface.yaml"}]

## CORS settings

## Default path for models
# MODELS_PATH=/models

## Enable debug mode

## Specify a build type. Available: cublas, openblas, clblas.
# BUILD_TYPE=metal

## Uncomment and set to true to enable rebuilding from source

## Enable go tags, available: stablediffusion, tts
## stablediffusion: image generation with stablediffusion
## tts: enables text-to-speech with go-piper 
## (requires REBUILD=true)
# GO_TAGS=stablediffusion

## Path where to store generated images

## Specify a default upload limit in MB (whisper)
rozek commented 1 year ago

Here is the environment of the running container as reported by Docker (mind the "REBUILD=false"):
Sean-McAuliffe commented 1 year ago

Adding to this: same issues here, both in local Docker and on EKS via AL2 amd64.

I can reach /v1/models fine, but anything that uses a model ends in a timeout and various forms of:

rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp [::1]:33427: connect: connection refused"
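"Connection refused" on a changing local port suggests that nothing is listening there at the moment of the dial. A sketch like the following can help tell "not listening yet" apart from "never comes up". It is plain Python sockets, not LocalAI code; the defaults of 20 attempts and a 2 second delay simply mirror the grpcAttempts:20 and grpcAttemptsDelay:2 values that appear in a debug log later in this thread, and the `connect` parameter exists only so the function can be exercised without a network.

```python
import socket
import time


def wait_for_port(host, port, attempts=20, delay=2.0, timeout=1.0,
                  connect=socket.create_connection):
    """Try to open a TCP connection up to `attempts` times.

    Returns the 1-based number of the attempt that succeeded, or 0 if
    every attempt failed (for example with "connection refused").
    """
    for attempt in range(1, attempts + 1):
        try:
            conn = connect((host, port), timeout=timeout)
        except OSError:
            # Covers ConnectionRefusedError and timeouts alike.
            if attempt < attempts:
                time.sleep(delay)
            continue
        conn.close()
        return attempt
    return 0
```

For the log line above one would call `wait_for_port("::1", 33427)`. A result of 0 means the port never accepted a connection within the retry window.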

Sean-McAuliffe commented 1 year ago

It seems it might just be an issue with the LocalAI image. Building from scratch in a container works OK:


RUN yum install git -y
RUN yum install golang -y
RUN yum group install "Development Tools" -y
RUN yum install cmake -y

RUN git clone


RUN make build

COPY . .


ENTRYPOINT [ "./local-ai", "--debug", "--models-path", "./models", "" ]

swoh816 commented 1 year ago

@rozek @nabbl @Mer0me I had precisely the same error message as you, so our problems may be the same. I inspected the hardware resources used by the Docker containers, and at least in my case it was a memory limit issue. Docker Desktop (on Ubuntu 22.04) ships with a default memory limit smaller than the size of the LLM (gpt4all in my case). So I set the memory limit to 10 GB, large enough to hold gpt4all, and then it worked.

It was difficult to figure out that the memory limit was the issue, because the error message does not say so directly. Also, I don't know much about Docker or about LLMs, so it took me some time to find the source of the problem on my machine. I think it would definitely help to include a note in the getting started page about raising Docker's memory limit enough to hold the LLM in memory:

Note that I also uncommented REBUILD=true in the .env file. Also, increasing the container's memory by passing --memory when running the container did not help. At least on my machine, I needed to increase it in the Docker Desktop application, and it seems like a common confusion (see
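To check the memory situation from inside the container before loading a model, something like the following Python sketch may help. It compares a model's size against the smaller of the cgroup limit and the total memory the container can see (the latter is what a Docker Desktop VM limit would affect). The 1.2 headroom factor is an arbitrary assumption, not a LocalAI requirement, and all function names are illustrative.

```python
CGROUP_LIMIT_FILES = (
    "/sys/fs/cgroup/memory.max",                    # cgroup v2
    "/sys/fs/cgroup/memory/memory.limit_in_bytes",  # cgroup v1
)


def parse_limit(text):
    """Parse a cgroup memory limit; 'max' (cgroup v2) means no limit -> None."""
    text = text.strip()
    return None if text == "max" else int(text)


def parse_mem_total(meminfo_text):
    """Return MemTotal from /proc/meminfo content in bytes, or None."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) * 1024  # the file reports kB
    return None


def read_text(path):
    try:
        with open(path) as fh:
            return fh.read()
    except OSError:
        return None


def effective_limit():
    """Smallest known memory bound in bytes, or None if nothing is readable."""
    candidates = []
    for path in CGROUP_LIMIT_FILES:
        text = read_text(path)
        if text is None:
            continue
        try:
            limit = parse_limit(text)
        except ValueError:
            limit = None
        if limit is not None:
            candidates.append(limit)
    meminfo = read_text("/proc/meminfo")
    total = parse_mem_total(meminfo) if meminfo is not None else None
    if total is not None:
        candidates.append(total)
    return min(candidates) if candidates else None


def model_fits(model_bytes, limit_bytes, headroom=1.2):
    """True if the limit covers the model size times a safety factor."""
    if limit_bytes is None:
        return True  # nothing known to be in the way
    return limit_bytes >= model_bytes * headroom
```

A typical call is `model_fits(os.path.getsize(model_path), effective_limit())` after `import os`, where `model_path` is whatever file the server is asked to load.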

swoh816 commented 1 year ago

@allenhaozi Also given that your debug log says it failed to load the template, I wonder if it is the issue of (1) the wrong path set to find model template, or (2) not enough memory to load template.

allenhaozi commented 1 year ago

@allenhaozi Also given that your debug log says it failed to load the template, I wonder if it is the issue of (1) the wrong path set to find model template, or (2) not enough memory to load template.

@swoh816, using the image, I got the following error. Request:

    {
        "model": "chatglm2-6b",
        "messages": [
            {
                "role": "user",
                "content": "How are you?"
            }
        ],
        "temperature": 0.9
    }

Response:

    {
        "error": {
            "code": 500,
            "message": "rpc error: code = Unknown desc = unimplemented",
            "type": ""
        }
    }

4:07AM DBG Request received: 
4:07AM DBG Configuration read: &{PredictionOptions:{Model:chatglm2-6b Language: N:0 TopP:0.7 TopK:80 Temperature:0.9 Maxtokens:512 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0} Name: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:512 F16:false NUMA:false Threads:1 Debug:true Roles:map[] Embeddings:false Backend: TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false TensorSplit: MainGPU: ImageGenerationAssets: PromptCachePath: PromptCacheAll:false PromptCacheRO:false Grammar: PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} SystemPrompt: RMSNormEps:0 NGQA:0}
4:07AM DBG Parameters: &{PredictionOptions:{Model:chatglm2-6b Language: N:0 TopP:0.7 TopK:80 Temperature:0.9 Maxtokens:512 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0} Name: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:512 F16:false NUMA:false Threads:1 Debug:true Roles:map[] Embeddings:false Backend: TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false TensorSplit: MainGPU: ImageGenerationAssets: PromptCachePath: PromptCacheAll:false PromptCacheRO:false Grammar: PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} SystemPrompt: RMSNormEps:0 NGQA:0}
4:07AM DBG Prompt (before templating): How are you?
4:07AM DBG Template failed loading: failed loading a template for chatglm2-6b
4:07AM DBG Prompt (after templating): How are you?
4:07AM DBG Model already loaded in memory: chatglm2-6b
4:07AM DBG Model 'chatglm2-6b' already loaded
[]:50320  500  -  POST     /v1/chat/completions
mokkin commented 1 year ago

I just followed the example and have the same issue here with docker-compose version 1.29.2, build unknown

chriswells0 commented 1 year ago

Increasing the memory as described by @swoh816 is what resolved this error for me.

Additionally, once that was fixed, text generation was extremely slow. The fix for that was to set threads equal to the number of CPUs on the Kubernetes node.
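A quick way to find that number from inside the pod is sketched below in Python. Two assumptions to flag: the `THREADS` variable name is taken to be the setting behind "Set number of threads" in the .env file quoted earlier, and the CPU count comes from the process affinity mask, which does not reflect a Kubernetes CPU quota (a pod limited to cpu: "1" on a 16-core node would still report 16).

```python
import os


def suggested_threads(cpu_count=None):
    """Return a thread count equal to the CPUs visible to this process."""
    if cpu_count is None:
        try:
            # Linux only: honours CPU affinity and cpusets.
            cpu_count = len(os.sched_getaffinity(0))
        except AttributeError:
            cpu_count = os.cpu_count() or 1
    return max(1, cpu_count)


def env_line(threads):
    """Format the value as a line for an env file (variable name assumed)."""
    return f"THREADS={threads}"
```

Printing `env_line(suggested_threads())` gives a line that can be compared with what the server logs at startup ("Starting LocalAI using N threads").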

Mathematinho commented 11 months ago

I increased the memory limit to 64 GB and still get the same message. I am using the example from "Getting started".

When I uncommented REBUILD=true in the .env file, I got the following error:

curl: (56) Recv failure: Connection reset by peer

Is there anything else I can try?

shankara-n commented 11 months ago

Could someone share a hardware/system configuration on which this builds and runs successfully?

kkkkkkjd commented 9 months ago

I followed the example and ran into the same problem here

TheRealAlexV commented 9 months ago

Also getting a similar issue here.


GALLERIES=[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}, {"url": "github:go-skynet/model-gallery/huggingface.yaml","name":"huggingface"}]


version: '3.6'
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    tty: true # enable colorized logs
    restart: always # should this be on-failure ?
      - 8080:8080
      - .env
      - ./models:/models
      - ./images/:/tmp/generated/images/
    command: ["/usr/bin/local-ai" ]

Request & Error

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
                    "model": "llama2-7b-chat-gguf",
                    "messages": [{"role": "user", "content": "How are you?"}],
                    "temperature": 0.9
                }'
{"error":{"code":500,"message":"could not load model: rpc error: code = Unknown desc = failed loading model","type":""}}

Container Logs

2023-12-03 19:58:44 12:58AM ERR error processing message {SystemPrompt:You are a helpful assistant, below is a conversation, please respond with the next message and do not ask follow-up questions Role:User: RoleName:user Content:How are you? MessageIndex:0} using template "llama2-7b-chat-gguf-chat": template: prompt:3:5: executing "prompt" at <.Input>: can't evaluate field Input in type model.ChatMessageTemplateData. Skipping!
2023-12-03 19:58:44 rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp connect: connection refused"


2023-12-03 20:06:12 []:39930 200 - GET /readyz
2023-12-03 20:06:52 1:06AM DBG Request received: 
2023-12-03 20:06:52 1:06AM DBG Configuration read: &{PredictionOptions:{Model: Language: N:0 TopP:0.7 TopK:80 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:llama2-7b-chat-gguf F16:false Threads:8 Debug:true Roles:map[assistant:Assitant: assistant_function_call:Function Call: function:Function Result: system:System: user:User:] Embeddings:false Backend:llama TemplateConfig:{Chat: ChatMessage:llama2-7b-chat-gguf-chat Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt:You are a helpful assistant, below is a conversation, please respond with the next message and do not ask follow-up questions TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:4096 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
2023-12-03 20:06:52 1:06AM DBG Parameters: &{PredictionOptions:{Model: Language: N:0 TopP:0.7 TopK:80 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:llama2-7b-chat-gguf F16:false Threads:8 Debug:true Roles:map[assistant:Assitant: assistant_function_call:Function Call: function:Function Result: system:System: user:User:] Embeddings:false Backend:llama TemplateConfig:{Chat: ChatMessage:llama2-7b-chat-gguf-chat Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt:You are a helpful assistant, below is a conversation, please respond with the next message and do not ask follow-up questions TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:4096 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
2023-12-03 20:06:52 1:06AM ERR error processing message {SystemPrompt:You are a helpful assistant, below is a conversation, please respond with the next message and do not ask follow-up questions Role:User: RoleName:user Content:How are you? MessageIndex:0} using template "llama2-7b-chat-gguf-chat": template: prompt:3:5: executing "prompt" at <.Input>: can't evaluate field Input in type model.ChatMessageTemplateData. Skipping!
2023-12-03 20:06:52 1:06AM DBG Prompt (before templating): User:How are you?
2023-12-03 20:06:52 1:06AM DBG Template failed loading: failed loading a template for 
2023-12-03 20:06:52 1:06AM DBG Prompt (after templating): User:How are you?
2023-12-03 20:06:52 1:06AM DBG Loading model llama from 
2023-12-03 20:06:52 1:06AM DBG Stopping all backends except ''
2023-12-03 20:06:52 1:06AM DBG Loading model in memory from file: /models
2023-12-03 20:06:52 1:06AM DBG Loading Model  with gRPC (file: /models) (backend: llama): {backendString:llama model: threads:8 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc0001da5a0 externalBackends:map[autogptq:/build/backend/python/autogptq/ bark:/build/backend/python/bark/ diffusers:/build/backend/python/diffusers/ exllama:/build/backend/python/exllama/ huggingface-embeddings:/build/backend/python/sentencetransformers/ petals:/build/backend/python/petals/ sentencetransformers:/build/backend/python/sentencetransformers/ transformers:/build/backend/python/transformers/ vall-e-x:/build/backend/python/vall-e-x/ vllm:/build/backend/python/vllm/] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:true parallelRequests:false}
2023-12-03 20:06:52 1:06AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama
2023-12-03 20:06:52 1:06AM DBG GRPC Service for  will be running at: ''
2023-12-03 20:06:52 1:06AM DBG GRPC Service state dir: /tmp/go-processmanager3341423294
2023-12-03 20:06:52 1:06AM DBG GRPC Service Started
2023-12-03 20:06:53 rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp connect: connection refused"
2023-12-03 20:06:53 1:06AM DBG GRPC(- stderr 2023/12/04 01:06:53 gRPC Server listening at
2023-12-03 20:06:55 1:06AM DBG GRPC Service Ready
2023-12-03 20:06:55 1:06AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model: ContextSize:4096 Seed:0 NBatch:512 F16Memory:false MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:0 MainGPU: TensorSplit: Threads:8 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0}
2023-12-03 20:06:55 1:06AM DBG GRPC(- stderr create_gpt_params_cuda: loading model /models
2023-12-03 20:06:55 1:06AM DBG GRPC(- stderr ggml_init_cublas: found 1 CUDA devices:
2023-12-03 20:06:55 1:06AM DBG GRPC(- stderr   Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9
2023-12-03 20:06:55 1:06AM DBG GRPC(- stderr gguf_init_from_file: invalid magic number 00000000
2023-12-03 20:06:55 1:06AM DBG GRPC(- stderr error loading model: llama_model_loader: failed to load model from /models
2023-12-03 20:06:55 1:06AM DBG GRPC(- stderr 
2023-12-03 20:06:55 1:06AM DBG GRPC(- stderr llama_load_model_from_file: failed to load model
2023-12-03 20:06:55 1:06AM DBG GRPC(- stderr llama_init_from_gpt_params: error: failed to load model '/models'
2023-12-03 20:06:55 1:06AM DBG GRPC(- stderr load_binding_model: error: unable to load model
2023-12-03 20:06:55 []:54898 500 - POST /v1/chat/completions
2023-12-03 20:07:12 []:37870 200 - GET /readyz
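In the log above the model name is empty, the backend is handed ModelFile:/models, and the loader then reports "invalid magic number 00000000". That pattern would be consistent with the path pointing at the models directory rather than at a model file, though the log alone does not prove it. A GGUF file starts with the four ASCII bytes "GGUF", so a small Python check like this one (the function name is illustrative) can rule the file itself in or out:

```python
import os

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def check_model_file(path):
    """Return a short diagnosis string for a candidate GGUF model path."""
    if os.path.isdir(path):
        return "is a directory, not a model file"
    try:
        with open(path, "rb") as fh:
            magic = fh.read(4)
    except OSError as exc:
        return f"cannot be read: {exc.strerror}"
    if magic != GGUF_MAGIC:
        return f"has magic {magic!r}, expected {GGUF_MAGIC!r}"
    return "looks like a GGUF file"
```

Running it against both `/models` and the full path of the downloaded file shows quickly whether the configuration or the file is the problem.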

If I change to the lunademo model from the model-gallery (also used in the model setup how-to), I get many more errors in debug:

2023-12-03 20:12:13 []:51240 200 - GET /readyz
2023-12-03 20:12:28 1:12AM DBG Request received: 
2023-12-03 20:12:28 1:12AM DBG Configuration read: &{PredictionOptions:{Model:luna-ai-llama2-uncensored.Q4_K_M.gguf Language: N:0 TopP:0 TopK:0 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:lunademo F16:false Threads:10 Debug:true Roles:map[] Embeddings:false Backend:llama TemplateConfig:{Chat:luna-chat-message ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:4096 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
2023-12-03 20:12:28 1:12AM DBG Parameters: &{PredictionOptions:{Model:luna-ai-llama2-uncensored.Q4_K_M.gguf Language: N:0 TopP:0 TopK:0 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:lunademo F16:false Threads:10 Debug:true Roles:map[] Embeddings:false Backend:llama TemplateConfig:{Chat:luna-chat-message ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:4096 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
2023-12-03 20:12:28 1:12AM DBG Prompt (before templating): How are you?
2023-12-03 20:12:28 1:12AM DBG Template found, input modified to: How are you?
2023-12-03 20:12:28 
2023-12-03 20:12:28 ASSISTANT:
2023-12-03 20:12:28 
2023-12-03 20:12:28 1:12AM DBG Prompt (after templating): How are you?
2023-12-03 20:12:28 
2023-12-03 20:12:28 ASSISTANT:
2023-12-03 20:12:28 
2023-12-03 20:12:28 1:12AM DBG Loading model llama from luna-ai-llama2-uncensored.Q4_K_M.gguf
2023-12-03 20:12:28 1:12AM DBG Stopping all backends except 'luna-ai-llama2-uncensored.Q4_K_M.gguf'
2023-12-03 20:12:28 1:12AM DBG [single-backend] Stopping 
2023-12-03 20:12:28 1:12AM DBG Loading model in memory from file: /models/luna-ai-llama2-uncensored.Q4_K_M.gguf
2023-12-03 20:12:28 1:12AM DBG Loading Model luna-ai-llama2-uncensored.Q4_K_M.gguf with gRPC (file: /models/luna-ai-llama2-uncensored.Q4_K_M.gguf) (backend: llama): {backendString:llama model:luna-ai-llama2-uncensored.Q4_K_M.gguf threads:10 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc0001da5a0 externalBackends:map[autogptq:/build/backend/python/autogptq/ bark:/build/backend/python/bark/ diffusers:/build/backend/python/diffusers/ exllama:/build/backend/python/exllama/ huggingface-embeddings:/build/backend/python/sentencetransformers/ petals:/build/backend/python/petals/ sentencetransformers:/build/backend/python/sentencetransformers/ transformers:/build/backend/python/transformers/ vall-e-x:/build/backend/python/vall-e-x/ vllm:/build/backend/python/vllm/] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:true parallelRequests:false}
2023-12-03 20:12:28 1:12AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama
2023-12-03 20:12:28 1:12AM DBG GRPC Service for luna-ai-llama2-uncensored.Q4_K_M.gguf will be running at: ''
2023-12-03 20:12:28 1:12AM DBG GRPC Service state dir: /tmp/go-processmanager3150385545
2023-12-03 20:12:28 1:12AM DBG GRPC Service Started
2023-12-03 20:12:28 rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp connect: connection refused"
2023-12-03 20:12:28 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf- stderr 2023/12/04 01:12:28 gRPC Server listening at
2023-12-03 20:12:30 1:12AM DBG GRPC Service Ready
2023-12-03 20:12:30 1:12AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:luna-ai-llama2-uncensored.Q4_K_M.gguf ContextSize:4096 Seed:0 NBatch:512 F16Memory:false MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:0 MainGPU: TensorSplit: Threads:10 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/luna-ai-llama2-uncensored.Q4_K_M.gguf Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0}
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf- stderr create_gpt_params_cuda: loading model /models/luna-ai-llama2-uncensored.Q4_K_M.gguf
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf- stderr ggml_init_cublas: found 1 CUDA devices:
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf- stderr   Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9
2023-12-03 20:13:13 []:34806 200 - GET /readyz
benm5678 commented 9 months ago

Is there any solution/workaround? I get the same error with various models when deployed to EKS (locally I can run it fine on minikube).

chris-hatton commented 9 months ago

Another frustrated user here; I can't get anything to work, including the 'Getting started' instructions. I am trying a cublas build with Docker. I get the feeling the LocalAI architecture is failing to surface errors from the back-end that would reveal the problem. Requested #1416

FarhanSajid1 commented 9 months ago

same issue

benm5678 commented 7 months ago

In case it helps, I was facing similar errors trying to host a llama2 model on AWS EKS with an A10 GPU. First, we upgraded the NVIDIA driver to the latest 5.* version. Second, I also needed to deploy a model YAML file to set f16/gpu_layers (it wasn't enough to have those as env params, which is what the helm chart pushes). The LocalAI API methods below let you do it all easily: you can search for a model in the gallery and push it with the settings you want (here you can also specify the backend, so it doesn't guess):

Get available from gallery

curl http://localhost:8000/models/available | jq '.[] | select(.name | contains("llama2"))'
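If jq is not available, the same lookup can be sketched in Python. This assumes only what the command above shows: that /models/available returns a JSON list of objects with a "name" field. The function names are illustrative.

```python
import json
import urllib.request


def filter_models(models, needle):
    """Keep gallery entries whose "name" contains `needle`, like jq's contains()."""
    return [m for m in models if needle in (m.get("name") or "")]


def fetch_available(base_url, timeout=30):
    """GET /models/available and decode the JSON list."""
    url = base_url.rstrip("/") + "/models/available"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

With the server from this comment, `filter_models(fetch_available("http://localhost:8000"), "llama2")` yields the matching gallery entries.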

Install from gallery

curl http://localhost:8000/models/apply -H "Content-Type: application/json" -d '{
     "id": "thebloke__llama2-chat-ayt-13b-gguf__llama2-chat-ayt-13b.q5_k_s.gguf",
     "overrides": {
        "backend": "llama",
        "f16": true,
        "gpu_layers": 43
     }
   }'
jeryaiwei commented 5 months ago

@mudler

1. The grpc::Status Embedding(ServerContext context, const backend::PredictOptions request, backend::EmbeddingResult* reply) method is not implemented. If backend = llama-cpp, the response is:

    { "error": { "code": 500, "message": "rpc error: code = Unknown desc = unimplemented", "type": "" } }

   Bert.cpp has been integrated into llama.cpp! See and the discussions. Updated forks: iamlemec/bert.cpp, xyzhang626/embeddings.cpp

2. backend/go/llm/llama is not used.