mudler / LocalAI

"error":{"code":500,"message":"rpc error: code = Unknown desc = unimplemented","type":""}} #1909

Open · rohan902 opened 5 months ago

rohan902 commented 5 months ago

LocalAI version: Latest

Environment, CPU architecture, OS, and Version: AWS EC2

Describe the bug Getting a gRPC connection error when running with the cuda12 image, but when running with the vanilla/CPU image it works fine. Using docker-compose to start the server.

To Reproduce

curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{"model": "luna-ai-llama2", "prompt": "A long time ago in a galaxy far, far away","temperature": 0.7}'

Expected behavior I need to run the LLM on a GPU for inference. I tried all the available images, but the same error persists.

Logs

12:08PM INF Trying to load the model 'luna-ai-llama2' with all the available backends: llama-cpp, llama-ggml, gpt4all, bert-embeddings, rwkv, whisper, stablediffusion, tinydream, piper, /build/backend/python/diffusers/run.sh, /build/backend/python/autogptq/run.sh, /build/backend/python/mamba/run.sh, /build/backend/python/vllm/run.sh, /build/backend/python/petals/run.sh, /build/backend/python/transformers/run.sh, /build/backend/python/exllama/run.sh, /build/backend/python/transformers-musicgen/run.sh, /build/backend/python/sentencetransformers/run.sh, /build/backend/python/coqui/run.sh, /build/backend/python/sentencetransformers/run.sh, /build/backend/python/exllama2/run.sh, /build/backend/python/bark/run.sh, /build/backend/python/vall-e-x/run.sh
12:08PM INF [llama-cpp] Attempting to load
12:08PM INF Loading model 'luna-ai-llama2' with backend llama-cpp
12:09PM ERR Failed starting/connecting to the gRPC service: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:37313: connect: connection refused"
12:09PM INF [llama-cpp] Fails: grpc service not ready
12:09PM INF [llama-ggml] Attempting to load
12:09PM INF Loading model 'luna-ai-llama2' with backend llama-ggml
12:09PM INF [llama-ggml] Fails: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
12:09PM INF [gpt4all] Attempting to load
12:09PM INF Loading model 'luna-ai-llama2' with backend gpt4all
12:09PM INF [gpt4all] Fails: could not load model: rpc error: code = Unknown desc = failed loading model
12:09PM INF [bert-embeddings] Attempting to load
12:09PM INF Loading model 'luna-ai-llama2' with backend bert-embeddings
12:09PM INF [bert-embeddings] Fails: could not load model: rpc error: code = Unknown desc = failed loading model
12:09PM INF [rwkv] Attempting to load
12:09PM INF Loading model 'luna-ai-llama2' with backend rwkv
12:09PM INF [rwkv] Fails: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
12:09PM INF [whisper] Attempting to load
12:09PM INF Loading model 'luna-ai-llama2' with backend whisper
12:09PM ERR Failed starting/connecting to the gRPC service: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:35143: connect: connection refused"
12:09PM INF [whisper] Fails: grpc service not ready
12:09PM INF [stablediffusion] Attempting to load
12:09PM INF Loading model 'luna-ai-llama2' with backend

Additional context I think people have faced a similar problem earlier as well, but I couldn't find any solution. Kindly let me know if anyone has a workaround!
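
For reference, a minimal docker-compose sketch for this kind of cuda12 setup; the service name, image tag, and volume path here are illustrative assumptions, not taken from the report. The deploy.resources block is the standard Compose way to pass an NVIDIA GPU into the container, which the GPU images require:

services:
  api:
    image: quay.io/go-skynet/local-ai:v2.11.0-cublas-cuda12-ffmpeg  # assumed tag, matching the v2.11 images mentioned below
    ports:
      - "8080:8080"
    volumes:
      - ./models:/models  # assumed host path for model files
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]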

Anto79-ops commented 5 months ago

Hi, I can confirm I'm getting the same issue on master (pulled after the v2.11 cublas-cuda12-ffmpeg images became available).

2:46PM DBG Model already loaded in memory: laser-dolphin-mixtral-2x7b-dpo.Q6_K.gguf
2:46PM WRN GRPC Model not responding: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:38737: connect: connection refused"
2:46PM WRN Deleting the process in order to recreate it
2:46PM DBG GRPC Process is not responding: laser-dolphin-mixtral-2x7b-dpo.Q6_K.gguf
2:46PM DBG Stopping all backends except 'laser-dolphin-mixtral-2x7b-dpo.Q6_K.gguf'
2:46PM INF Trying to load the model 'laser-dolphin-mixtral-2x7b-dpo.Q6_K.gguf' with all the available backends: llama-cpp, llama-ggml, gpt4all, bert-embeddings, rwkv, whisper, stablediffusion, tinydream, piper, /build/backend/python/exllama2/run.sh, /build/backend/python/transformers-musicgen/run.sh, /build/backend/python/petals/run.sh, /build/backend/python/coqui/run.sh, /build/backend/python/exllama/run.sh, /build/backend/python/mamba/run.sh, /build/backend/python/vllm/run.sh, /build/backend/python/sentencetransformers/run.sh, /build/backend/python/transformers/run.sh, /build/backend/python/sentencetransformers/run.sh, /build/backend/python/vall-e-x/run.sh, /build/backend/python/autogptq/run.sh, /build/backend/python/bark/run.sh, /build/backend/python/diffusers/run.sh
2:46PM INF [llama-cpp] Attempting to load
2:46PM INF Loading model 'laser-dolphin-mixtral-2x7b-dpo.Q6_K.gguf' with backend llama-cpp
2:46PM DBG Model already loaded in memory: laser-dolphin-mixtral-2x7b-dpo.Q6_K.gguf
2:46PM WRN GRPC Model not responding: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:38737: connect: connection refused"
2:46PM WRN Deleting the process in order to recreate it
2:46PM DBG GRPC Process is not responding: laser-dolphin-mixtral-2x7b-dpo.Q6_K.gguf
mr-v-v-v commented 5 months ago

I confirm the same issue. It's critical.

mudler commented 5 months ago

Can you please share the logs with DEBUG=true? Also, how are you using the image? With a GPU, I suppose?
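
For anyone following along: with the Docker images, debug logging is enabled through the DEBUG environment variable; a minimal docker-compose fragment (service name assumed for illustration):

services:
  api:
    environment:
      - DEBUG=true  # turns on the DBG log lines shown in this thread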

Anto79-ops commented 5 months ago

Hello @mudler, I posted some of the logs above. Would you like to see more?

mudler commented 5 months ago

@Anto79-ops your log looks incomplete; it seems something failed initially in a way that made the later calls fail. Can you share the full log from the beginning of the session?

Anto79-ops commented 5 months ago

@mudler is it OK if I email/DM you a text file of the logs?

Anto79-ops commented 5 months ago

I just pulled the latest master image and the problem is solved (for me, at least).

Thank you!

JackBekket commented 5 months ago

https://github.com/mudler/LocalAI/issues/1981 is related

You get this error because the llama-cpp backend tries to offload the whole model to the GPU and fails because you don't have enough VRAM.

A workaround might be to offload only part of your model's layers to the GPU.

You need to create a .yaml config file for your model, like this:

name: wizard-uncensored-13b
f16: false # set to true for GPU acceleration
cuda: false # set to true for GPU acceleration
gpu_layers: 10 # this model has max 40 layers; 15-20 is recommended for a half-load on an NVIDIA 4060 Ti (more layers -- more VRAM required); I guess 0 is no GPU
parameters:
  model: wizard-uncensored-13b.gguf
#backend: diffusers
template:

  chat: &template |
    Instruct: {{.Input}}
    Output:
  # Modify the prompt template here ^^^ as per your requirements
  completion: *template

You should play around with gpu_layers here and check nvidia-smi; a GPU-enabled variant of the config above is sketched below.
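
A GPU-enabled variant might look like this (the layer count is a guess you tune against nvidia-smi; setting it too high reproduces the out-of-VRAM failure described above):

f16: true # enable GPU acceleration
cuda: true # enable GPU acceleration
gpu_layers: 20 # assumed starting point; raise it gradually while watching VRAM usage in nvidia-smi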

DavidGOrtega commented 5 months ago

I have this error with a custom model, NeuralHermes. I have asked for help at https://github.com/mudler/LocalAI/discussions/1992

JackBekket commented 5 months ago

> I have this error with a custom model NeuralHermes. I have asked for help #1992

Have you checked that your VRAM is enough to offload all the layers? You can try to split them.

DavidGOrtega commented 5 months ago

@JackBekket it is running on my preprod server:

NVIDIA L4, 32 cores, 90 GB

The models that come with the distro are running perfectly.

DavidGOrtega commented 5 months ago

@mudler I have the answer: I downloaded the raw link file, which is just plain text 🤦 Thanks for your help!

localai-bot commented 4 months ago

You're welcome! I'm glad you found the issue and managed to resolve it. If you need any further assistance, don't hesitate to reach out. Have a great day!

ytjhai commented 1 month ago

I'm having a similar issue. Here is the log:

api-1  | 9:50PM DBG Extracting backend assets files to /tmp/localai/backend_data
api-1  | 9:50PM DBG processing api keys runtime update
api-1  | 9:50PM DBG processing external_backends.json
api-1  | 9:50PM DBG external backends loaded from external_backends.json
api-1  | 9:50PM INF core/startup process completed!
api-1  | 9:50PM DBG No configuration file found at /tmp/localai/upload/uploadedFiles.json
api-1  | 9:50PM DBG No configuration file found at /tmp/localai/config/assistants.json
api-1  | 9:50PM DBG No configuration file found at /tmp/localai/config/assistantsFile.json
api-1  | 9:50PM INF LocalAI API is listening! Please connect to the endpoint for API documentation. endpoint=http://0.0.0.0:8080
api-1  | 9:50PM DBG Request received: {"model":"gte-qwen","language":"","translate":false,"n":0,"top_p":null,"top_k":null,"temperature":null,"max_tokens":null,"echo":false,"batch":0,"ignore_eos":false,"repeat_penalty":0,"repeat_last_n":0,"n_keep":0,"frequency_penalty":0,"presence_penalty":0,"tfz":null,"typical_p":null,"seed":null,"negative_prompt":"","rope_freq_base":0,"rope_freq_scale":0,"negative_prompt_scale":0,"use_fast_tokenizer":false,"clip_skip":0,"tokenizer":"","file":"","size":"","prompt":null,"instruction":"","input":"Your text string goes here","stop":null,"messages":null,"functions":null,"function_call":null,"stream":false,"mode":0,"step":0,"grammar":"","grammar_json_functions":null,"grammar_json_name":null,"backend":"","model_base_name":""}
api-1  | 9:50PM DBG guessDefaultsFromFile: not a GGUF file
api-1  | 9:50PM DBG Parameter Config: &{PredictionOptions:{Model:Alibaba-NLP/gte-Qwen2-7B-instruct Language: Translate:false N:0 TopP:0x4000630b90 TopK:0x4000630b68 Temperature:0x4000630a18 Maxtokens:0x4000630fc8 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 RepeatLastN:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0x4000630fc0 TypicalP:0x4000630f08 Seed:0x40006310a0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:gte-qwen F16:0x4000630cb0 Threads:0x4000630cb8 Debug:0x4000585ab0 Roles:map[] Embeddings:0x4000630fe9 Backend:huggingface-embeddings TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions: UseTokenizerTemplate:false JoinChatMessagesByCharacter:<nil>} PromptStrings:[] InputStrings:[Your text string goes here] InputToken:[] functionCallString: functionCallNameString: ResponseFormat: ResponseFormatMap:map[] FunctionsConfig:{DisableNoAction:false GrammarConfig:{ParallelCalls:false DisableParallelNewLines:false MixedMode:false NoMixedFreeString:false NoGrammar:false Prefix: ExpectStringsAfterJSON:false PropOrder:} NoActionFunctionName: NoActionDescriptionName: ResponseRegex:[] JSONRegexMatch:[] ReplaceFunctionResults:[] ReplaceLLMResult:[] CaptureLLMResult:[] FunctionName:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0x4000630f00 MirostatTAU:0x4000630ee8 Mirostat:0x4000630ee0 NGPULayers:0x4000630fe0 MMap:0x4000630a17 MMlock:0x4000630fe9 LowVRAM:0x4000630fe9 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0x4000630c30 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 MMProj: FlashAttention:false NoKVOffloading:false RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} TTSConfig:{Voice: VallE:{AudioPath:}} CUDA:false DownloadFiles:[] Description: Usage:}
api-1  | 9:50PM INF Loading model 'Alibaba-NLP/gte-Qwen2-7B-instruct' with backend huggingface-embeddings
api-1  | 9:50PM DBG Loading model in memory from file: /models/Alibaba-NLP/gte-Qwen2-7B-instruct
api-1  | 9:50PM DBG Loading Model Alibaba-NLP/gte-Qwen2-7B-instruct with gRPC (file: /models/Alibaba-NLP/gte-Qwen2-7B-instruct) (backend: huggingface-embeddings): {backendString:huggingface-embeddings model:Alibaba-NLP/gte-Qwen2-7B-instruct threads:8 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0x4000239b08 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh openvoice:/build/backend/python/openvoice/run.sh parler-tts:/build/backend/python/parler-tts/run.sh petals:/build/backend/python/petals/run.sh rerankers:/build/backend/python/rerankers/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false}
api-1  | 9:50PM DBG Loading external backend: /build/backend/python/sentencetransformers/run.sh
api-1  | 9:50PM DBG Loading GRPC Process: /build/backend/python/sentencetransformers/run.sh
api-1  | 9:50PM DBG GRPC Service for Alibaba-NLP/gte-Qwen2-7B-instruct will be running at: '127.0.0.1:33329'
api-1  | 9:50PM DBG GRPC Service state dir: /tmp/go-processmanager1272549319
api-1  | 9:50PM DBG GRPC Service Started
api-1  | 9:50PM DBG GRPC(Alibaba-NLP/gte-Qwen2-7B-instruct-127.0.0.1:33329): stdout Initializing libbackend for build
api-1  | 9:50PM DBG GRPC(Alibaba-NLP/gte-Qwen2-7B-instruct-127.0.0.1:33329): stdout virtualenv created
**api-1  | 9:50PM DBG GRPC(Alibaba-NLP/gte-Qwen2-7B-instruct-127.0.0.1:33329): stderr /build/backend/python/sentencetransformers/../common/libbackend.sh: line 78: uv: command not found**
**api-1  | 9:50PM DBG GRPC(Alibaba-NLP/gte-Qwen2-7B-instruct-127.0.0.1:33329): stderr /build/backend/python/sentencetransformers/../common/libbackend.sh: line 83: /build/backend/python/sentencetransformers/venv/bin/activate: No such file or directory**
**api-1  | 9:50PM DBG GRPC(Alibaba-NLP/gte-Qwen2-7B-instruct-127.0.0.1:33329): stderr /build/backend/python/sentencetransformers/../common/libbackend.sh: line 155: exec: python: not found**
api-1  | 9:50PM DBG GRPC(Alibaba-NLP/gte-Qwen2-7B-instruct-127.0.0.1:33329): stdout virtualenv activated
api-1  | 9:50PM DBG GRPC(Alibaba-NLP/gte-Qwen2-7B-instruct-127.0.0.1:33329): stdout activated virtualenv has been ensured
api-1  | 9:51PM ERR failed starting/connecting to the gRPC service error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:33329: connect: connection refused\""
api-1  | 9:51PM DBG GRPC Service NOT ready
api-1  | 9:51PM ERR Server error error="grpc service not ready" ip=192.168.65.1 latency=40.12671406s method=POST status=500 url=/embeddings

I've highlighted the lines that stood out to me. It would be good to have customized model files with examples using different backends.
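
For what it's worth, a sketch of what such a model file might look like for the embeddings setup in this log; the name, backend, and model are read from the parameter dump above, while the overall layout is an assumption:

name: gte-qwen
backend: huggingface-embeddings # dispatched to backend/python/sentencetransformers per the log above
embeddings: true
parameters:
  model: Alibaba-NLP/gte-Qwen2-7B-instruct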