luoweb closed this issue 4 months ago
LocalAI version:
v1.21
root@63429046747f:/build# ./local-ai --version
LocalAI version 4548473 (4548473)
Environment, CPU architecture, OS, and Version: Intel x86, CentOS
Describe the bug: gRPC error
To Reproduce
Expected behavior: Output response
Logs
llmai-api-1 | 3:04AM DBG Loading model 'openllama7b' greedly
llmai-api-1 | 3:04AM DBG [llama] Attempting to load
llmai-api-1 | 3:04AM DBG Loading model llama from openllama7b
llmai-api-1 | 3:04AM DBG Loading model in memory from file: /models/openllama7b
llmai-api-1 | 3:04AM DBG Loading GRPC Model%!(EXTRA string=llama, model.Options={llama openllama7b 4 /tmp/localai/backend_data 0xc0000400b0 0xc000296a20})
llmai-api-1 | 3:04AM DBG Loading GRPC Process%!(EXTRA string=/tmp/localai/backend_data/backend-assets/grpc/llama)
llmai-api-1 | 3:04AM DBG GRPC Service for 'llama' (openllama7b) will be running at: 'localhost:37785'
llmai-api-1 | 3:04AM DBG GRPC Service Started
llmai-api-1 | rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:37785: connect: connection refused"
llmai-api-1 | 3:04AM DBG GRPC(llama-openllama7b-localhost:37785): stderr 2023/07/19 03:04:00 gRPC Server listening at 127.0.0.1:37785
llmai-api-1 | 3:04AM DBG GRPC Service Ready
llmai-api-1 | 3:04AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:} sizeCache:0 unknownFields:[] Model:/models/openllama7b ContextSize:512 Seed:0 NBatch:512 F16Memory:false MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:0 MainGPU: TensorSplit: Threads:4 LibrarySearchPath:}
Additional context
The problem was solved. The error is normal, and I misunderstood the word "err". Maybe the log should be clearer: when the model is not configured correctly, it generates many log lines starting with "rpc error: code = Unavailable desc = connection error".
Hi! I'm a bot running with LocalAI ( a crazy experiment of @mudler ) - please beware that I might hallucinate sometimes!
...but I can also be funny or helpful 😸, and I can generally provide good tips or places to look in the documentation or in the code based on what you wrote in the issue.
Don't engage in conversation with me; I don't support replying (yet)!
The issue seems to be related to the gRPC service not starting correctly on the LocalAI server. The error message indicates that the connection was refused when trying to connect to the gRPC port on localhost. This could be due to a misconfiguration of the gRPC server or a conflict with other services running on the same port. To troubleshoot this issue, you can try the following steps:
- Enable debug mode (--debug or DEBUG=true) to see more detailed logs.
I had the same problem. The model I used was alpaca.13b.ggmlv3.q8_0.bin; with another model, ggml-model-gpt4all-falcon-q4_0.bin, it worked. Could it be because the alpaca.13b.ggmlv3.q8_0.bin model is a GPU model?
@luoweb, can you share what was configured incorrectly? I'm seeing this error too: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp [::1]:43883: connect: connection refused". Can I just ignore it? I'm using the default configuration.
Same problem as above
localai-api-1 | rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:35733: connect: connection refused"
Default guide: "Example: Use GPT4ALL-J model with docker-compose"
I'm trying to use this fantastic technology in Google Colab with the following command, because Google Colab does not support Docker and I want to use a GPU. It works fine on Gitpod; the only problem is that it's too slow, so I moved to Google Colab.
./local-ai --models-path ./models --address 127.0.0.1:8889 --threads 4
I'm using dolly-v2-12b-q4_0.bin, but I got the following response:
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:44399: connect: connection refused"
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:34133: connect: connection refused"
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:37763: connect: connection refused"
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:46833: connect: connection refused"
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:39095: connect: connection refused"
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:33899: connect: connection refused"
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:45317: connect: connection refused"
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:32935: connect: connection refused"
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:43033: connect: connection refused"
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:41829: connect: connection refused"
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:37653: connect: connection refused"
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:40749: connect: connection refused"
Hello @gris-gris
I noticed a similar problem on my side when the .tmpl files were mixed up.
I am using container: local-ai:v1.21.0
Can you try adding the following lines to your docker-compose?
environment:
  - DEBUG=true
  - PRELOAD_MODELS=[{"url":"github:go-skynet/model-gallery/gpt4all-j.yaml"}]
  - MODELS_PATH=/models
  - REBUILD=false
volumes:
  - ./models:/models:cached
docker-compose up now looks like this:
api_1 | @@@@@
api_1 | Skipping rebuild
api_1 | @@@@@
api_1 | If you are experiencing issues with the pre-compiled builds, try setting REBUILD=true
api_1 | If you are still experiencing issues with the build, try setting CMAKE_ARGS and disable the instructions set as needed:
api_1 | CMAKE_ARGS="-DLLAMA_F16C=OFF -DLLAMA_AVX512=OFF -DLLAMA_AVX2=OFF -DLLAMA_FMA=OFF"
api_1 | see the documentation at: https://localai.io/basics/build/index.html
api_1 | Note: See also https://github.com/go-skynet/LocalAI/issues/288
api_1 | @@@@@
api_1 | 9:01AM DBG no galleries to load
api_1 | 9:01AM INF Starting LocalAI using 14 threads, with models path: /models
api_1 | 9:01AM INF LocalAI version: v1.21.0 (fb6cce487fb53d9de1c1a6b3414261f52b5cdbe0)
api_1 | 9:01AM DBG Model: gpt4all-j (config: {PredictionOptions:{Model:ggml-gpt4all-j.bin Language: N:0 TopP:0.7 TopK:80 Temperature:0.2 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0} Name:gpt4all-j StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:1024 F16:false NUMA:false Threads:0 Debug:false Roles:map[] Embeddings:false Backend:gpt4all-j TemplateConfig:{Completion:gpt4all-completion Functions: Chat:gpt4all-chat Edit:} MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false TensorSplit: MainGPU: ImageGenerationAssets: PromptCachePath: PromptCacheAll:false PromptCacheRO:false Grammar: PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:}})
api_1 | 9:01AM DBG Extracting backend assets files to /tmp/localai/backend_data
api_1 | 9:01AM DBG Checking "ggml-gpt4all-j.bin" exists and matches SHA
api_1 | 9:01AM DBG File "ggml-gpt4all-j.bin" already exists and matches the SHA. Skipping download
api_1 | 9:01AM DBG Prompt template "gpt4all-completion" written
api_1 | 9:01AM DBG Prompt template "gpt4all-chat" written
api_1 | 9:01AM DBG Written config file /models/gpt4all-j.yaml
api_1 |
api_1 | ┌───────────────────────────────────────────────────┐
api_1 | │ Fiber v2.48.0 │
api_1 | │ http://127.0.0.1:8080 │
api_1 | │ (bound on host 0.0.0.0 and port 8080) │
api_1 | │ │
api_1 | │ Handlers ............ 33 Processes ........... 1 │
api_1 | │ Prefork ....... Disabled PID ................. 7 │
api_1 | └───────────────────────────────────────────────────┘
And my models folder on the host looks like this:
-rw-rw-r-- 1 root root 3785248281 Jul 23 10:36 ggml-gpt4all-j.bin
-rw-r--r-- 1 root root 179 Jul 23 11:01 gpt4all-chat.tmpl
-rw-r--r-- 1 root root 57 Jul 23 11:01 gpt4all-completion.tmpl
-rw-r--r-- 1 root root 202 Jul 23 11:01 gpt4all-j.yaml
To test:
LOCALAI=http://localhost:8080
curl $LOCALAI/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "gpt4all-j",
"messages": [{"role": "user", "content": "How are you?"}],
"temperature": 2
}'
Hope that helps.
I'm getting the same error!
2023-07-23 12:09:45 ┌───────────────────────────────────────────────────┐
2023-07-23 12:09:45 │ Fiber v2.48.0 │
2023-07-23 12:09:45 │ http://127.0.0.1:8080 │
2023-07-23 12:09:45 │ (bound on host 0.0.0.0 and port 8080) │
2023-07-23 12:09:45 │ │
2023-07-23 12:09:45 │ Handlers ............ 31 Processes ........... 1 │
2023-07-23 12:09:45 │ Prefork ....... Disabled PID ................. 7 │
2023-07-23 12:09:45 └───────────────────────────────────────────────────┘
2023-07-23 12:09:45
2023-07-23 12:11:51 rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:42001: connect: connection refused"
2023-07-23 12:11:54 rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:42001: connect: connection refused"
2023-07-23 12:11:54 rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:42391: connect: connection refused"
2023-07-23 12:11:59 rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:42391: connect: connection refused"
2023-07-23 12:11:59 rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:34697: connect: connection refused"
2023-07-23 12:12:04 rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:34697: connect: connection refused"
2023-07-23 12:12:04 rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:41587: connect: connection refused"
2023-07-23 12:12:18 rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:41587: connect: connection refused"
2023-07-23 12:12:18 rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:43633: connect: connection refused"
That's my docker-compose:
version: '3.6'
services:
  api:
    image: quay.io/go-skynet/local-ai:latest
    build:
      context: .
      dockerfile: Dockerfile
    ports:
After a minute or so I get "Error: Request failed with status code 500" in Flowise.
This is the debug output:
2023-07-23 12:29:27 10:29AM DBG Loading model bert-embeddings from bert-MiniLM-L6-v2q4_0.bin
2023-07-23 12:29:27 10:29AM DBG Model already loaded in memory: bert-MiniLM-L6-v2q4_0.bin
2023-07-23 12:29:27 10:29AM DBG GRPC Model not responding%!(EXTRA string=bert-MiniLM-L6-v2q4_0.bin)
2023-07-23 12:29:27 10:29AM DBG GRPC Process is not responding%!(EXTRA string=bert-MiniLM-L6-v2q4_0.bin)
2023-07-23 12:29:27 10:29AM DBG Loading model in memory from file: /models/bert-MiniLM-L6-v2q4_0.bin
2023-07-23 12:29:27 10:29AM DBG Loading GRPC Model%!(EXTRA string=bert-embeddings, model.Options={bert-embeddings bert-MiniLM-L6-v2q4_0.bin 4 /tmp/localai/backend_data 0xc000126000 0xc00109fa70})
2023-07-23 12:29:27 10:29AM DBG Loading GRPC Process%!(EXTRA string=/tmp/localai/backend_data/backend-assets/grpc/bert-embeddings)
2023-07-23 12:29:27 10:29AM DBG GRPC Service for 'bert-embeddings' (bert-MiniLM-L6-v2q4_0.bin) will be running at: 'localhost:38181'
2023-07-23 12:29:27 10:29AM DBG GRPC Service Started
2023-07-23 12:29:27 rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:38181: connect: connection refused"
2023-07-23 12:29:27 10:29AM DBG GRPC(bert-embeddings-bert-MiniLM-L6-v2q4_0.bin-localhost:38181): stderr 2023/07/23 10:29:27 gRPC Server listening at 127.0.0.1:38181
2023-07-23 12:29:28 10:29AM DBG GRPC Service Ready
2023-07-23 12:29:28 10:29AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:
@finger42, what about the Helm chart? I'm working with k8sgpt and LocalAI as the AI backend. When k8sgpt is up, I start to see those "connection refused" errors in the local-ai pod logs. However, I'm able to curl LocalAI from a curl pod. It worked for me before with LocalAI and k8sgpt, with the same configuration.
I'm getting "DBG no galleries to load"; what could be the solution for this?
I started it successfully without the GPU binary, since with it the load may fail repeatedly. For the model configuration, see the model-gallery repo: https://github.com/go-skynet/model-gallery/blob/main/openllama_7b.yaml
Not sure if this is useful, but this is what I see:
* could not load model: rpc error: code = Unknown desc = stat /models/ggml-gpt4all-j: no such file or directory
* could not load model: rpc error: code = Unknown desc = stat /models/ggml-gpt4all-j: no such file or directory
* could not load model: rpc error: code = Unknown desc = unsupported model type /models/ggml-gpt4all-j (should end with .onnx)
* backend unsupported: /build/extra/grpc/huggingface/huggingface.py
Same error; I'm using dual Xeons. I tried REBUILD=true, but that failed (maybe telling). I'm using an Intel(R) Xeon(R) CPU X5570, which is not modern, but I would still expect it to basically work. I tried many different models, but I always get the same RPC error when calling completion. I guess the RPC error is not specific and just means the internal endpoint didn't start for some other reason. Happy to try debugging steps if someone can guide me.
Thank you for the debugging suggestion. The following are the logs from two "How are you?" requests identical to the general example:
Chris, can you try following https://github.com/go-skynet/LocalAI/issues/288#issuecomment-1666811444? I think REBUILD=true is failing because it doesn't have all the required flags.
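Which flags need to be turned off depends on the CPU. Older Xeons such as the X5570 and E5620 mentioned in this thread predate AVX, so builds that assume AVX/AVX2/FMA/F16C support are likely to fail on them.

Below is a minimal Go sketch that prints a CMAKE_ARGS value disabling whichever of the llama.cpp options quoted in this thread the CPU cannot use. The options are LLAMA_AVX, LLAMA_AVX2, LLAMA_AVX512, LLAMA_FMA and LLAMA_F16C. The sketch assumes Linux and the /proc/cpuinfo flag names avx, avx2, avx512f, fma and f16c. The flag-to-option mapping is an assumption, not taken from LocalAI's build scripts.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// Assumed mapping from /proc/cpuinfo flag names to the llama.cpp CMake
// options quoted in this thread.
var cmakeFlagFor = []struct{ cpuFlag, cmakeOpt string }{
	{"avx", "LLAMA_AVX"},
	{"avx2", "LLAMA_AVX2"},
	{"avx512f", "LLAMA_AVX512"},
	{"fma", "LLAMA_FMA"},
	{"f16c", "LLAMA_F16C"},
}

// cmakeArgsFor returns "-D<opt>=OFF" for every instruction set missing from
// a cpuinfo "flags" line (the part after the colon is used if present).
func cmakeArgsFor(flagsLine string) string {
	if _, rest, ok := strings.Cut(flagsLine, ":"); ok {
		flagsLine = rest
	}
	have := map[string]bool{}
	for _, f := range strings.Fields(flagsLine) {
		have[f] = true
	}
	var args []string
	for _, m := range cmakeFlagFor {
		if !have[m.cpuFlag] {
			args = append(args, "-D"+m.cmakeOpt+"=OFF")
		}
	}
	return strings.Join(args, " ")
}

func main() {
	f, err := os.Open("/proc/cpuinfo") // Linux only
	if err != nil {
		fmt.Println("cannot read /proc/cpuinfo:", err)
		return
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if line := sc.Text(); strings.HasPrefix(line, "flags") {
			fmt.Printf("CMAKE_ARGS=\"%s\"\n", cmakeArgsFor(line))
			return
		}
	}
	fmt.Println("no flags line found in /proc/cpuinfo")
}
```

On a CPU without AVX, this disables all five options, which matches the full CMAKE_ARGS line used in the Dockerfile patch later in this thread.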
Thank you! I tried building and running with these variables and got a different response, but it still crashes. Now the cause is an assertion: GGML_ASSERT: /build/go-ggml-transformers/ggml.cpp/src/ggml.c:4128: wtype != GGML_TYPE_COUNT.
Docker container logs: https://paste.gg/p/anonymous/7a01dde6725748cd8a85beea77fab165
However, the logs state: localai-api-1 | I CMAKE_ARGS: (with no value).
The file that declares the value read by make is ./go-ggllm/Makefile. This means the variable's value may not be passed through to the environment where it is needed. The variable was set in .env, and the default Docker Compose file was used.
The llama.cpp backend definitely looks healthy now; it no longer fails on load. To me it looks like a corrupted model file:
localai-api-1 | 3:29PM DBG GRPC(ggml-gpt4all-j-127.0.0.1:38783): stderr create_gpt_params: loading model /models/ggml-gpt4all-j
localai-api-1 | 3:29PM DBG GRPC(ggml-gpt4all-j-127.0.0.1:38783): stderr llama.cpp: loading model from /models/ggml-gpt4all-j
localai-api-1 | 3:30PM DBG GRPC(ggml-gpt4all-j-127.0.0.1:38783): stderr error loading model: unexpectedly reached end of file
localai-api-1 | 3:30PM DBG GRPC(ggml-gpt4all-j-127.0.0.1:38783): stderr llama_load_model_from_file: failed to load model
localai-api-1 | 3:30PM DBG GRPC(ggml-gpt4all-j-127.0.0.1:38783): stderr llama_init_from_gpt_params: error: failed to load model '/models/ggml-gpt4all-j'
The error you are referring to comes from another backend, not from llama.cpp, because local-ai tries to load the model with all available backends if the configured backend fails.
I didn't use REBUILD=true, though; I just modified the Dockerfile directly to pass the correct parameters and built a new container. This way is easier for me, and I only need to wait for the build once rather than on every start of the container. Here is the patch file:
diff --git a/Dockerfile b/Dockerfile
index f67a1f3..0650ff8 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -13,6 +13,7 @@ ARG TARGETVARIANT
ENV BUILD_TYPE=${BUILD_TYPE}
ENV EXTERNAL_GRPC_BACKENDS="huggingface-embeddings:/build/extra/grpc/huggingface/huggingface.py"
ARG GO_TAGS="stablediffusion tts"
+ARG BUILD_PARALLELISM
RUN apt-get update && \
apt-get install -y ca-certificates cmake curl patch pip
@@ -80,6 +81,8 @@ ENV GO_TAGS=${GO_TAGS}
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
ENV NVIDIA_REQUIRE_CUDA="cuda>=${CUDA_MAJOR_VERSION}.0"
ENV NVIDIA_VISIBLE_DEVICES=all
+ENV CMAKE_ARGS="-DLLAMA_F16C=OFF -DLLAMA_AVX512=OFF -DLLAMA_AVX2=OFF -DLLAMA_AVX=OFF -DLLAMA_FMA=OFF -DLLAMA_F16C=OFF"
+ENV BUILD_PARALLELISM=${BUILD_PARALLELISM}
WORKDIR /build
@@ -90,7 +93,7 @@ RUN make prepare
COPY . .
COPY .git .
-RUN ESPEAK_DATA=/build/lib/Linux-$(uname -m)/piper_phonemize/lib/espeak-ng-data make build
+RUN ESPEAK_DATA=/build/lib/Linux-$(uname -m)/piper_phonemize/lib/espeak-ng-data make build -j${BUILD_PARALLELISM:-1}
Thank you! It seems like -DLLAMA_F16C is defined twice. Also, may I ask you to share the exact steps you took to get it working?
I removed everything from ./model/, as well as the Docker container and image, and tried building the image with a modified Dockerfile. Yet the model still fails to load after these two slightly modified requests, with the error 'Model format not supported (no matching implementation found)'.
What might it be? I would appreciate a suggestion.
What I should point out here is that the CMake options above target llama.cpp, and that backend works fine. It is possible that the options have no effect on the GPT backend and you end up in the same situation, or it could be a completely different issue with the GPT backend.
I wonder whether you are able to change the backend to llama?
I'm running this in a Proxmox VM that runs a k8s worker, inside a pod that runs local-ai. Here is an example of my working model config file. I'm using a Llama 2 7B model, but llama.cpp supports GPT models too:
name: gpt-3.5-turbo
backend: llama
parameters:
  model: llama-2-7b-chat.ggmlv3.q4_0.bin
  top_k: 40
  temperature: 0.3
  top_p: 1
context_size: 4096
🟡 Partially solved by downloading the model again.
Thank you very much for sharing. I am still trying to grasp exactly how to run it all. Currently, even when I do get it running without a crash like this one, it runs the whole day (until manually stopped) at ~4×100% load on a relatively old Xeon E5620 CPU (8 physical cores in total), with nothing in the result. It's as if it gets stuck in an infinite loop somewhere, and the same happens with the "gpt4all-j" and "ggml-gpt4all-j.bin" attempts:
@pasqualesimonetti, may I ask how you got it all running from the very beginning? For example, do you compile local-ai on your machine in the first place (i.e. a modified Dockerfile, REBUILD='true', etc.)? If so, do you modify any environment variables or other Docker configuration/dependencies? What does your docker-compose.yaml file look like, and how did you install items into ./model/?
I run it locally and not on Docker. I just downloaded the model again.
# Download gpt4all-j to models/
wget https://gpt4all.io/models/ggml-gpt4all-j.bin -O models/ggml-gpt4all-j
Probably you have to download the Docker image again.
Anyway, even though I have been able to make simple curl requests (such as curl http://localhost:8080/models/available), I still get RPC errors when trying to run https://github.com/aorumbayev/autogpt4all or https://github.com/go-skynet/LocalAI/tree/master/examples/autoGPT.
I got the same error. I have tried the following two methods, but neither worked: https://github.com/go-skynet/LocalAI/issues/771#issuecomment-1675026506 and https://github.com/go-skynet/LocalAI/issues/771#issuecomment-1646789212
Same error. That's a bad experience for a simple quick-start example.
Maybe this is caused by running out of memory. I tried Docker with a 2 GB memory limit while monitoring docker stats. When memory usage reached 2 GB, it soon went down and this error message appeared.
Yes. I rebooted my MacBook Pro 2018 (Intel i7) to free up memory, and then LocalAI loaded the model into memory and it worked, even though performance on my laptop was terribly slow.
Anyway, I need to test it more thoroughly, so I cannot provide a definitive solution yet.
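A rough sanity check for the out-of-memory theory is possible. When a model is read fully into RAM (the logged configs in this thread show MMap:false), it needs at least its file size in memory, plus context buffers and runtime overhead. For example, the ggml-gpt4all-j.bin in the models-folder listing earlier in this thread is 3,785,248,281 bytes (about 3.5 GiB). That already exceeds a 2 GB container limit.

Below is a minimal Go sketch that assumes cgroup v2. It reads the container's limit from /sys/fs/cgroup/memory.max, where the value max means unlimited. The helper names are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseMemoryMax parses the contents of a cgroup v2 memory.max file.
// It returns limited=false when the file says "max" (no limit).
func parseMemoryMax(s string) (limit int64, limited bool, err error) {
	s = strings.TrimSpace(s)
	if s == "max" {
		return 0, false, nil
	}
	limit, err = strconv.ParseInt(s, 10, 64)
	if err != nil {
		return 0, false, err
	}
	return limit, true, nil
}

// fitsLowerBound reports whether the model file alone fits under the limit.
// Passing this check does not guarantee the model will load: context buffers
// and runtime overhead come on top of the file size.
func fitsLowerBound(modelBytes, limitBytes int64) bool {
	return modelBytes < limitBytes
}

func main() {
	const modelBytes = 3785248281 // ggml-gpt4all-j.bin size from the listing in this thread

	raw, err := os.ReadFile("/sys/fs/cgroup/memory.max")
	if err != nil {
		fmt.Println("no cgroup v2 memory limit readable:", err)
		return
	}
	limit, limited, err := parseMemoryMax(string(raw))
	if err != nil {
		fmt.Println("unexpected memory.max contents:", err)
		return
	}
	if !limited {
		fmt.Println("no container memory limit set")
		return
	}
	fmt.Printf("limit %.2f GiB, model %.2f GiB, fits: %v\n",
		float64(limit)/(1<<30), float64(modelBytes)/(1<<30), fitsLowerBound(modelBytes, limit))
}
```

A check like this only rules a model out; the Hugging Face memory calculators linked later in this thread give fuller estimates.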
I'm getting the same on Ubuntu 20.04 with 16 GB RAM available:
api_1 | rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:40173: connect: connection refused"
Hi, bro. How did you solve this problem?
BUMP!
I wish there was a book for LocalAI dummies, because I need it right now.
I'm new to the world of AI and really want to get this working. I have great hardware, so that shouldn't be an issue.
Same with Ubuntu 22.04 LTS
Related? https://github.com/go-skynet/LocalAI/issues/1109
I can get models to work (very slowly on my dual Xeons), but so far I can't get a blip out of my GPU while watching nvtop. On the same Docker instance I am able to run kestr3l/stable-diffusion-webui, and nvidia-smi works in the host VM and in the containers.
I'm getting the same issue with:
{
  "url": "github:go-skynet/model-gallery/llama2-7b-chat-gguf.yaml",
  "name": "llama2-7b-chat-gguf",
  "overrides": {
    "f16": "true",
    "gpu_layers": "30"
  }
}
Probably user error on my part, but I've been trying for a couple of days to find a working web frontend for LocalAI. ghcr.io/mckaywrigley/chatbot-ui can't select models and only sees GPT-3.5 (CPU only), and quay.io/go-skynet/localai-frontend is choking on CORS errors and doesn't seem to work at all. I got LibreChat to work (https://github.com/danny-avila/LibreChat/issues/1027), but I can't get any GPU-backed models running.
Thanks for the great container, I really love the idea. Any help is appreciated :)
To those who have "out of memory" issues, I suggest these Hugging Face Spaces, https://huggingface.co/spaces/hf-accelerate/model-memory-usage and https://huggingface.co/spaces/Vokturz/can-it-run-llm, to figure out which models your hardware can run.
Also getting a similar issue here.
.env
THREADS=8
CONTEXT_SIZE=4096
GALLERIES=[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}, {"url": "github:go-skynet/model-gallery/huggingface.yaml","name":"huggingface"}]
MODELS_PATH=/models
DEBUG=true
COMPEL=0
SINGLE_ACTIVE_BACKEND=true
BUILD_TYPE=cublas
REBUILD=true
GO_TAGS=stablediffusion
IMAGE_PATH=/tmp
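Because GALLERIES packs JSON into a single .env line, a stray quote or bracket is easy to miss. Below is a minimal Go sketch that validates the value before starting the container. It assumes, as in the .env above, that the value is a JSON array of objects with name and url keys. The type and function names are hypothetical, and this is not necessarily how LocalAI itself parses the variable.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// gallery mirrors the {"name": ..., "url": ...} objects in the GALLERIES value.
type gallery struct {
	Name string `json:"name"`
	URL  string `json:"url"`
}

// parseGalleries checks that raw is a JSON array of gallery objects, each with a URL.
func parseGalleries(raw string) ([]gallery, error) {
	var gs []gallery
	if err := json.Unmarshal([]byte(raw), &gs); err != nil {
		return nil, fmt.Errorf("GALLERIES is not a valid JSON array: %w", err)
	}
	for i, g := range gs {
		if g.URL == "" {
			return nil, fmt.Errorf("gallery %d has no url", i)
		}
	}
	return gs, nil
}

func main() {
	raw := os.Getenv("GALLERIES")
	if raw == "" {
		fmt.Println("GALLERIES is not set")
		return
	}
	gs, err := parseGalleries(raw)
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, g := range gs {
		fmt.Printf("%s -> %s\n", g.Name, g.URL)
	}
}
```

Running it with the same .env loaded (for example, inside the container) lists each gallery or reports where the JSON breaks.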
docker-compose.yaml
version: '3.6'
services:
  api:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    image: quay.io/go-skynet/local-ai:sha-238fec2-cublas-cuda12-ffmpeg-core
    tty: true # enable colorized logs
    restart: always # should this be on-failure ?
    ports:
      - 8080:8080
    env_file:
      - .env
    volumes:
      - ./models:/models
      - ./images/:/tmp/generated/images/
    command: ["/usr/bin/local-ai" ]
Request & Error
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "llama2-7b-chat-gguf",
"messages": [{"role": "user", "content": "How are you?"}],
"temperature": 0.9
}'
{"error":{"code":500,"message":"could not load model: rpc error: code = Unknown desc = failed loading model","type":""}}
Container Logs
2023-12-03 19:58:44 12:58AM ERR error processing message {SystemPrompt:You are a helpful assistant, below is a conversation, please respond with the next message and do not ask follow-up questions Role:User: RoleName:user Content:How are you? MessageIndex:0} using template "llama2-7b-chat-gguf-chat": template: prompt:3:5: executing "prompt" at <.Input>: can't evaluate field Input in type model.ChatMessageTemplateData. Skipping!
2023-12-03 19:58:44 rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:46545: connect: connection refused"
Debug
2023-12-03 20:06:12 [127.0.0.1]:39930 200 - GET /readyz
2023-12-03 20:06:52 1:06AM DBG Request received:
2023-12-03 20:06:52 1:06AM DBG Configuration read: &{PredictionOptions:{Model: Language: N:0 TopP:0.7 TopK:80 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:llama2-7b-chat-gguf F16:false Threads:8 Debug:true Roles:map[assistant:Assitant: assistant_function_call:Function Call: function:Function Result: system:System: user:User:] Embeddings:false Backend:llama TemplateConfig:{Chat: ChatMessage:llama2-7b-chat-gguf-chat Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt:You are a helpful assistant, below is a conversation, please respond with the next message and do not ask follow-up questions TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:4096 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
2023-12-03 20:06:52 1:06AM DBG Parameters: &{PredictionOptions:{Model: Language: N:0 TopP:0.7 TopK:80 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:llama2-7b-chat-gguf F16:false Threads:8 Debug:true Roles:map[assistant:Assitant: assistant_function_call:Function Call: function:Function Result: system:System: user:User:] Embeddings:false Backend:llama TemplateConfig:{Chat: ChatMessage:llama2-7b-chat-gguf-chat Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt:You are a helpful assistant, below is a conversation, please respond with the next message and do not ask follow-up questions TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:4096 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
2023-12-03 20:06:52 1:06AM ERR error processing message {SystemPrompt:You are a helpful assistant, below is a conversation, please respond with the next message and do not ask follow-up questions Role:User: RoleName:user Content:How are you? MessageIndex:0} using template "llama2-7b-chat-gguf-chat": template: prompt:3:5: executing "prompt" at <.Input>: can't evaluate field Input in type model.ChatMessageTemplateData. Skipping!
2023-12-03 20:06:52 1:06AM DBG Prompt (before templating): User:How are you?
2023-12-03 20:06:52 1:06AM DBG Template failed loading: failed loading a template for
2023-12-03 20:06:52 1:06AM DBG Prompt (after templating): User:How are you?
2023-12-03 20:06:52 1:06AM DBG Loading model llama from
2023-12-03 20:06:52 1:06AM DBG Stopping all backends except ''
2023-12-03 20:06:52 1:06AM DBG Loading model in memory from file: /models
2023-12-03 20:06:52 1:06AM DBG Loading Model with gRPC (file: /models) (backend: llama): {backendString:llama model: threads:8 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc0001da5a0 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:true parallelRequests:false}
2023-12-03 20:06:52 1:06AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama
2023-12-03 20:06:52 1:06AM DBG GRPC Service for will be running at: '127.0.0.1:34533'
2023-12-03 20:06:52 1:06AM DBG GRPC Service state dir: /tmp/go-processmanager3341423294
2023-12-03 20:06:52 1:06AM DBG GRPC Service Started
2023-12-03 20:06:53 rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:34533: connect: connection refused"
2023-12-03 20:06:53 1:06AM DBG GRPC(-127.0.0.1:34533): stderr 2023/12/04 01:06:53 gRPC Server listening at 127.0.0.1:34533
2023-12-03 20:06:55 1:06AM DBG GRPC Service Ready
2023-12-03 20:06:55 1:06AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model: ContextSize:4096 Seed:0 NBatch:512 F16Memory:false MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:0 MainGPU: TensorSplit: Threads:8 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0}
2023-12-03 20:06:55 1:06AM DBG GRPC(-127.0.0.1:34533): stderr create_gpt_params_cuda: loading model /models
2023-12-03 20:06:55 1:06AM DBG GRPC(-127.0.0.1:34533): stderr ggml_init_cublas: found 1 CUDA devices:
2023-12-03 20:06:55 1:06AM DBG GRPC(-127.0.0.1:34533): stderr Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9
2023-12-03 20:06:55 1:06AM DBG GRPC(-127.0.0.1:34533): stderr gguf_init_from_file: invalid magic number 00000000
2023-12-03 20:06:55 1:06AM DBG GRPC(-127.0.0.1:34533): stderr error loading model: llama_model_loader: failed to load model from /models
2023-12-03 20:06:55 1:06AM DBG GRPC(-127.0.0.1:34533): stderr
2023-12-03 20:06:55 1:06AM DBG GRPC(-127.0.0.1:34533): stderr llama_load_model_from_file: failed to load model
2023-12-03 20:06:55 1:06AM DBG GRPC(-127.0.0.1:34533): stderr llama_init_from_gpt_params: error: failed to load model '/models'
2023-12-03 20:06:55 1:06AM DBG GRPC(-127.0.0.1:34533): stderr load_binding_model: error: unable to load model
2023-12-03 20:06:55 [172.18.0.1]:54898 500 - POST /v1/chat/completions
2023-12-03 20:07:12 [127.0.0.1]:37870 200 - GET /readyz
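In the log above, the configuration has an empty `Model:` field, so LocalAI appears to load the directory `/models` itself ("Loading model in memory from file: /models"). llama.cpp's `gguf_init_from_file` then rejects it with `invalid magic number 00000000`. A valid GGUF file starts with the four ASCII bytes `GGUF`. The following is a minimal sketch for checking a path before pointing a model config at it. `looks_like_gguf` is a hypothetical helper and is not part of LocalAI.

```python
import os

# GGUF model files begin with these four ASCII bytes.
GGUF_MAGIC = b"GGUF"


def looks_like_gguf(path: str) -> bool:
    """Return True if `path` is a regular file whose first four bytes are the GGUF magic."""
    if not os.path.isfile(path):
        # Directories (e.g. /models) and missing paths cannot be model files.
        return False
    with open(path, "rb") as f:
        return f.read(4) == GGUF_MAGIC
```

For example, `looks_like_gguf("/models")` returns `False` because `/models` is a directory. The check only inspects the header, so a truncated or corrupted file that still starts with `GGUF` would pass.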
If I change to the lunademo model from the model gallery (also used in the model setup how-to), I get many more errors in debug:
2023-12-03 20:12:13 [127.0.0.1]:51240 200 - GET /readyz
2023-12-03 20:12:28 1:12AM DBG Request received:
2023-12-03 20:12:28 1:12AM DBG Configuration read: &{PredictionOptions:{Model:luna-ai-llama2-uncensored.Q4_K_M.gguf Language: N:0 TopP:0 TopK:0 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:lunademo F16:false Threads:10 Debug:true Roles:map[] Embeddings:false Backend:llama TemplateConfig:{Chat:luna-chat-message ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:4096 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
2023-12-03 20:12:28 1:12AM DBG Parameters: &{PredictionOptions:{Model:luna-ai-llama2-uncensored.Q4_K_M.gguf Language: N:0 TopP:0 TopK:0 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:lunademo F16:false Threads:10 Debug:true Roles:map[] Embeddings:false Backend:llama TemplateConfig:{Chat:luna-chat-message ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:4096 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
2023-12-03 20:12:28 1:12AM DBG Prompt (before templating): How are you?
2023-12-03 20:12:28 1:12AM DBG Template found, input modified to: How are you?
2023-12-03 20:12:28
2023-12-03 20:12:28 ASSISTANT:
2023-12-03 20:12:28
2023-12-03 20:12:28 1:12AM DBG Prompt (after templating): How are you?
2023-12-03 20:12:28
2023-12-03 20:12:28 ASSISTANT:
2023-12-03 20:12:28
2023-12-03 20:12:28 1:12AM DBG Loading model llama from luna-ai-llama2-uncensored.Q4_K_M.gguf
2023-12-03 20:12:28 1:12AM DBG Stopping all backends except 'luna-ai-llama2-uncensored.Q4_K_M.gguf'
2023-12-03 20:12:28 1:12AM DBG [single-backend] Stopping
2023-12-03 20:12:28 1:12AM DBG Loading model in memory from file: /models/luna-ai-llama2-uncensored.Q4_K_M.gguf
2023-12-03 20:12:28 1:12AM DBG Loading Model luna-ai-llama2-uncensored.Q4_K_M.gguf with gRPC (file: /models/luna-ai-llama2-uncensored.Q4_K_M.gguf) (backend: llama): {backendString:llama model:luna-ai-llama2-uncensored.Q4_K_M.gguf threads:10 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc0001da5a0 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:true parallelRequests:false}
2023-12-03 20:12:28 1:12AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama
2023-12-03 20:12:28 1:12AM DBG GRPC Service for luna-ai-llama2-uncensored.Q4_K_M.gguf will be running at: '127.0.0.1:45223'
2023-12-03 20:12:28 1:12AM DBG GRPC Service state dir: /tmp/go-processmanager3150385545
2023-12-03 20:12:28 1:12AM DBG GRPC Service Started
2023-12-03 20:12:28 rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:45223: connect: connection refused"
2023-12-03 20:12:28 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr 2023/12/04 01:12:28 gRPC Server listening at 127.0.0.1:45223
2023-12-03 20:12:30 1:12AM DBG GRPC Service Ready
2023-12-03 20:12:30 1:12AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:luna-ai-llama2-uncensored.Q4_K_M.gguf ContextSize:4096 Seed:0 NBatch:512 F16Memory:false MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:0 MainGPU: TensorSplit: Threads:10 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/luna-ai-llama2-uncensored.Q4_K_M.gguf Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0}
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr create_gpt_params_cuda: loading model /models/luna-ai-llama2-uncensored.Q4_K_M.gguf
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr ggml_init_cublas: found 1 CUDA devices:
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /models/luna-ai-llama2-uncensored.Q4_K_M.gguf (version GGUF V2 (latest))
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 0: token_embd.weight q4_K [ 4096, 32000, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 1: blk.0.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 2: blk.0.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 3: blk.0.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 4: blk.0.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 5: blk.0.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 6: blk.0.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 7: blk.0.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 8: blk.0.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 9: blk.0.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 10: blk.1.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 11: blk.1.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 12: blk.1.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 13: blk.1.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 14: blk.1.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 15: blk.1.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 16: blk.1.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 17: blk.1.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 18: blk.1.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 19: blk.2.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 20: blk.2.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 21: blk.2.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 22: blk.2.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 23: blk.2.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 24: blk.2.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 25: blk.2.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 26: blk.2.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 27: blk.2.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 28: blk.3.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 29: blk.3.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 30: blk.3.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 31: blk.3.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 32: blk.3.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 33: blk.3.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 34: blk.3.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 35: blk.3.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 36: blk.3.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 37: blk.4.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 38: blk.4.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 39: blk.4.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 40: blk.4.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 41: blk.4.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 42: blk.4.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 43: blk.4.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 44: blk.4.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 45: blk.4.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 46: blk.5.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 47: blk.5.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 48: blk.5.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 49: blk.5.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 50: blk.5.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 51: blk.5.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 52: blk.5.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 53: blk.5.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 54: blk.5.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 55: blk.6.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 56: blk.6.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 57: blk.6.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 58: blk.6.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 59: blk.6.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 60: blk.6.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 61: blk.6.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 62: blk.6.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 63: blk.6.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 64: blk.7.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 65: blk.7.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 66: blk.7.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 67: blk.7.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 68: blk.7.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 69: blk.7.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 70: blk.7.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 71: blk.7.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 72: blk.7.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 73: blk.8.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 74: blk.8.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 75: blk.8.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 76: blk.8.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 77: blk.8.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 78: blk.8.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 79: blk.8.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 80: blk.8.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 81: blk.8.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 82: blk.9.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 83: blk.9.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 84: blk.9.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 85: blk.9.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 86: blk.9.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 87: blk.9.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 88: blk.9.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 89: blk.9.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 90: blk.9.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 91: blk.10.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 92: blk.10.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 93: blk.10.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 94: blk.10.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 95: blk.10.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 96: blk.10.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 97: blk.10.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 98: blk.10.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 99: blk.10.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 100: blk.11.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 101: blk.11.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 102: blk.11.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 103: blk.11.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 104: blk.11.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 105: blk.11.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 106: blk.11.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 107: blk.11.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 108: blk.11.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 109: blk.12.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 110: blk.12.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 111: blk.12.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 112: blk.12.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 113: blk.12.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 114: blk.12.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 115: blk.12.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 116: blk.12.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 117: blk.12.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 118: blk.13.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 119: blk.13.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 120: blk.13.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 121: blk.13.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 122: blk.13.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 123: blk.13.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 124: blk.13.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 125: blk.13.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 126: blk.13.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 127: blk.14.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 128: blk.14.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 129: blk.14.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 130: blk.14.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 131: blk.14.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 132: blk.14.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 133: blk.14.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 134: blk.14.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 135: blk.14.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 136: blk.15.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 137: blk.15.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 138: blk.15.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 139: blk.15.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 140: blk.15.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 141: blk.15.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 142: blk.15.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 143: blk.15.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 144: blk.15.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 145: blk.16.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 146: blk.16.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 147: blk.16.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 148: blk.16.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 149: blk.16.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 150: blk.16.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 151: blk.16.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 152: blk.16.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 153: blk.16.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 154: blk.17.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 155: blk.17.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 156: blk.17.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 157: blk.17.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 158: blk.17.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 159: blk.17.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 160: blk.17.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 161: blk.17.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 162: blk.17.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 163: blk.18.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 164: blk.18.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 165: blk.18.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 166: blk.18.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 167: blk.18.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 168: blk.18.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 169: blk.18.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 170: blk.18.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 171: blk.18.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 172: blk.19.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 173: blk.19.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 174: blk.19.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 175: blk.19.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 176: blk.19.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 177: blk.19.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 178: blk.19.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 179: blk.19.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 180: blk.19.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 181: blk.20.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 182: blk.20.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 183: blk.20.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 184: blk.20.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 185: blk.20.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 186: blk.20.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 187: blk.20.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 188: blk.20.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 189: blk.20.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 190: blk.21.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 191: blk.21.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 192: blk.21.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 193: blk.21.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 194: blk.21.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 195: blk.21.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 196: blk.21.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 197: blk.21.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 198: blk.21.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 199: blk.22.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 200: blk.22.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 201: blk.22.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 202: blk.22.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 203: blk.22.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 204: blk.22.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 205: blk.22.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 206: blk.22.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 207: blk.22.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 208: blk.23.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 209: blk.23.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 210: blk.23.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 211: blk.23.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 212: blk.23.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 213: blk.23.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 214: blk.23.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 215: blk.23.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 216: blk.23.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 217: blk.24.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 218: blk.24.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 219: blk.24.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 220: blk.24.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 221: blk.24.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 222: blk.24.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 223: blk.24.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 224: blk.24.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 225: blk.24.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 226: blk.25.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 227: blk.25.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 228: blk.25.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 229: blk.25.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 230: blk.25.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 231: blk.25.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 232: blk.25.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 233: blk.25.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 234: blk.25.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 235: blk.26.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 236: blk.26.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 237: blk.26.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 238: blk.26.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 239: blk.26.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 240: blk.26.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 241: blk.26.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 242: blk.26.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 243: blk.26.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 244: blk.27.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 245: blk.27.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 246: blk.27.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 247: blk.27.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 248: blk.27.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 249: blk.27.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 250: blk.27.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 251: blk.27.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 252: blk.27.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 253: blk.28.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 254: blk.28.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 255: blk.28.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 256: blk.28.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 257: blk.28.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 258: blk.28.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 259: blk.28.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 260: blk.28.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 261: blk.28.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 262: blk.29.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 263: blk.29.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 264: blk.29.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 265: blk.29.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 266: blk.29.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 267: blk.29.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 268: blk.29.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 269: blk.29.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 270: blk.29.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 271: blk.30.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 272: blk.30.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 273: blk.30.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 274: blk.30.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 275: blk.30.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 276: blk.30.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 277: blk.30.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 278: blk.30.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 279: blk.30.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 280: blk.31.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 281: blk.31.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 282: blk.31.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 283: blk.31.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 284: blk.31.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 285: blk.31.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 286: blk.31.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 287: blk.31.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 288: blk.31.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 289: output_norm.weight f32 [ 4096, 1, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - tensor 290: output.weight q6_K [ 4096, 32000, 1, 1 ]
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - kv 0: general.architecture str
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - kv 1: general.name str
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - kv 2: llama.context_length u32
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - kv 3: llama.embedding_length u32
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - kv 4: llama.block_count u32
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - kv 5: llama.feed_forward_length u32
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - kv 6: llama.rope.dimension_count u32
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - kv 7: llama.attention.head_count u32
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - kv 8: llama.attention.head_count_kv u32
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - kv 10: general.file_type u32
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - kv 11: tokenizer.ggml.model str
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - kv 12: tokenizer.ggml.tokens arr
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - kv 13: tokenizer.ggml.scores arr
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - kv 14: tokenizer.ggml.token_type arr
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - kv 18: general.quantization_version u32
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - type f32: 65 tensors
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - type q4_K: 193 tensors
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_model_loader: - type q6_K: 33 tensors
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llm_load_print_meta: format = GGUF V2 (latest)
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llm_load_print_meta: arch = llama
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llm_load_print_meta: vocab type = SPM
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llm_load_print_meta: n_vocab = 32000
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llm_load_print_meta: n_merges = 0
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llm_load_print_meta: n_ctx_train = 2048
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llm_load_print_meta: n_ctx = 4096
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llm_load_print_meta: n_embd = 4096
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llm_load_print_meta: n_head = 32
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llm_load_print_meta: n_head_kv = 32
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llm_load_print_meta: n_layer = 32
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llm_load_print_meta: n_rot = 128
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llm_load_print_meta: n_gqa = 1
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llm_load_print_meta: f_norm_eps = 0.0e+00
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llm_load_print_meta: f_norm_rms_eps = 1.0e-05
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llm_load_print_meta: n_ff = 11008
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llm_load_print_meta: freq_base = 10000.0
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llm_load_print_meta: freq_scale = 1
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llm_load_print_meta: model type = 7B
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llm_load_print_meta: model ftype = mostly Q4_K - Medium
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llm_load_print_meta: model params = 6.74 B
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llm_load_print_meta: model size = 3.80 GiB (4.84 BPW)
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llm_load_print_meta: general.name = tap-m_luna-ai-llama2-uncensored
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llm_load_print_meta: BOS token = 1 '<s>'
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llm_load_print_meta: EOS token = 2 '</s>'
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llm_load_print_meta: UNK token = 0 '<unk>'
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llm_load_print_meta: PAD token = 0 '<unk>'
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llm_load_print_meta: LF token = 13 '<0x0A>'
2023-12-03 20:12:30 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llm_load_tensors: ggml ctx size = 3891.34 MB
2023-12-03 20:12:32 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llm_load_tensors: using CUDA for GPU acceleration
2023-12-03 20:12:32 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llm_load_tensors: mem required = 3891.34 MB (+ 4096.00 MB per state)
2023-12-03 20:12:32 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llm_load_tensors: offloading 0 repeating layers to GPU
2023-12-03 20:12:32 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llm_load_tensors: offloaded 0/35 layers to GPU
2023-12-03 20:12:32 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llm_load_tensors: VRAM used: 0 MB
2023-12-03 20:12:35 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr ..................................................................................................
2023-12-03 20:12:37 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_new_context_with_model: kv self size = 4096.00 MB
2023-12-03 20:12:37 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_new_context_with_model: compute buffer total size = 281.47 MB
2023-12-03 20:12:37 1:12AM DBG GRPC(luna-ai-llama2-uncensored.Q4_K_M.gguf-127.0.0.1:45223): stderr llama_new_context_with_model: VRAM scratch buffer: 280.00 MB
2023-12-03 20:13:13 [127.0.0.1]:34806 200 - GET /readyz
-rw-rw-r-- 1 root root 3785248281 Jul 23 10:36 ggml-gpt4all-j.bin
-rw-r--r-- 1 root root        179 Jul 23 11:01 gpt4all-chat.tmpl
-rw-r--r-- 1 root root         57 Jul 23 11:01 gpt4all-completion.tmpl
-rw-r--r-- 1 root root        202 Jul 23 11:01 gpt4all-j.yaml
@finger42 Could you please show the contents of the three text files (tmpl, yaml)?
Here is what I did:
# Download gpt4all-j to models/
LocalAI$ wget https://gpt4all.io/models/ggml-gpt4all-j.bin -O models/ggml-gpt4all-j
LocalAI$ sha1sum models/ggml-gpt4all-j
ccfbaf3e2406f442f3b4fefb8672e706aafd522a models/ggml-gpt4all-j
# Use a template from the examples
LocalAI$ cp -rf prompt-templates/ggml-gpt4all-j.tmpl models/
LocalAI$ docker run -p 8080:8080 -v $PWD/models:/models -ti --rm quay.io/go-skynet/local-ai:latest --models-path /models --context-size 700 --threads 4
I'm still having this issue. It should not have been marked as completed.
@TheRealAlexV My bad, that must have been a misclick. You're right, it should not have been closed. Did you follow the how-tos on the site?
I had the same problem. The model I used was alpaca.13b.ggmlv3.q8_0.bin; with another model, ggml-model-gpt4all-falcon-q4_0.bin, it worked. Could it be because the alpaca.13b.ggmlv3.q8_0.bin model is a GPU model?
Does it work with other models?
I just tried the LLM mentioned in the example (luna-ai) and have this problem too.
I'm also encountering this problem. I have tried several models, including ones supported in the model gallery, and get the same error:
Loading model 'ggml-gpt4all-j-v1.3-groovy.bin' with backend bert-embeddings
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:34109: connect: connection refused"
Using this command to launch within a VM:
docker run -p 8080:8080 -v $PWD/models:/models -ti --rm quay.io/go-skynet/local-ai:latest --models-path /models --context-size 700 --threads 4
I could understand any LocalAI Dev or contributor being frustrated seeing so many reports of:
transport: Error while dialing: dial tcp 127.0.0.1:nnnnn
...since it's clear this is not an underlying error, but a symptom of the back-end failing, for any number of reasons.
This does highlight a design issue in the way that failures are surfaced from LocalAI. This will continue to be a pain-point, unless a way can be found to surface the 'real' errors better. I think that deserves to be a priority: without a resolution, I guarantee LocalAI will continue shedding users, no matter how good the rest of the features.
I concur. To me this is clearly unexpected behavior: the error message we see has to be more specific than that. I still have very little idea why I have never succeeded in running LocalAI. I have been trying for months now (every new release, different models, different builds) but always gave up, because finding the root cause without a clear error message is too inefficient.
Well, I don't know much about Go, but I tried both the LocalAI Docker image and the binary. I found the help text in the binary, but I could not run the binary itself. What I want to say is that you can add --debug true to the docker run command to get more debug info, for example:
docker run -p 8080:8080 -v $PWD/models:/models -ti --rm quay.io/go-skynet/local-ai:latest --models-path /models --context-size 700 --threads 4 --debug true
I did try adding every debug flag I could find; it never got me past that error message.
Same here. I don't know enough about LocalAI's architecture to describe it better but it's evident from Issue reports here that there's an entire layer of error reporting being effectively hidden from users, at least in the Docker image setup.
I remember that debug mode gives more info, so I ran it again to show the difference. This is only to prove that turning debug on shows more info; I am not sure how to set the values of the other params (context_size, f16, top_p...) or the tmpl.
gpt-3.5-turbo-16k.yaml
name: gpt-3.5-turbo-16k-0613
parameters:
  model: ggml-gpt4all-j
  top_k: 80
  temperature: 0.2
  top_p: 0.7
  f16: true
context_size: 16384
template:
  completion: ggml-gpt4all-j
  chat: ggml-gpt4all-j
ggml-gpt4all-j.tmpl
The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate response.
### Prompt:
{{.Input}}
### Response:
$ curl http://localhost:8080/v1/models
{"object":"list","data":[{"id":"open_llama_7b_v2","object":"model"},{"id":"gpt-3.5-turbo-16k-0613","object":"model"},{"id":"WizardLM-7B-uncensored-GGUF","object":"model"},{"id":"alpaca-cache","object":"model"},{"id":"luna-ai-llama2","object":"model"},{"id":"models","object":"model"}]}
$ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
> "model": "gpt-3.5-turbo-16k-0613",
> "messages": [{"role": "user", "content": "How are you?"}],
> "temperature": 0.9
> }'
{"created":1704535407,"object":finish_reason":"stop","message"
$ docker run -p 8080:8080 -v $PWD/models:/models -ti --rm quay.io/go-skynet/local-ai:latest --models-path /models --context-size 16000 --threads 10
@@@@@
Skipping rebuild
@@@@@
If you are experiencing issues with the pre-compiled builds, try setting REBUILD=true
If you are still experiencing issues with the build, try setting CMAKE_ARGS and disable the instructions set as needed:
CMAKE_ARGS="-DLLAMA_F16C=OFF -DLLAMA_AVX512=OFF -DLLAMA_AVX2=OFF -DLLAMA_FMA=OFF"
see the documentation at: https://localai.io/basics/build/index.html
Note: See also https://github.com/go-skynet/LocalAI/issues/288
@@@@@
CPU info:
model name : Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60GHz
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid md_clear pconfig flush_l1d arch_capabilities
CPU: AVX found OK
CPU: AVX2 found OK
CPU: AVX512 found OK
@@@@@
10:03AM INF Starting LocalAI using 10 threads, with models path: /models
10:03AM INF LocalAI version: v1.40.0 (6ef7ea2635ae5371be4e6eef516c2ad4afd9f4a3)
┌───────────────────────────────────────────────────┐
│ Fiber v2.50.0 │
│ http://127.0.0.1:8080 │
│ (bound on host 0.0.0.0 and port 8080) │
│ │
│ Handlers ............ 73 Processes ........... 1 │
│ Prefork ....... Disabled PID ................ 14 │
└───────────────────────────────────────────────────┘
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:37487: connect: connection refused"
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:38119: connect: connection refused"
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:32775: connect: connection refused"
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:44117: connect: connection refused"
$ curl http://localhost:8080/v1/models
{"object":"list","data":[{"id":"gpt-3.5-turbo-16k-0613","object":"model"},{"id":"open_llama_7b_v2","object":"model"},{"id":"WizardLM-7B-uncensored-GGUF","object":"model"},{"id":"alpaca-cache","object":"model"},{"id":"luna-ai-llama2","object":"model"},{"id":"models","object":"model"}]}
$ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
> "model": "gpt-3.5-turbo-16k-0613",
> "messages": [{"role": "user", "content": "How are you?"}],
> "temperature": 0.9
> }'
{"created":1704535772,"object":"chat.completion","id":"bdd5fa1b-35b7-4b56-b25b-95bc3d86af6b","model":"gpt-3.5-turbo-16k-0613","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"I'm doing well. How about you?"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
$ docker run -p 8080:8080 -v $PWD/models:/models -ti --rm quay.io/go-skynet/local-ai:latest --models-path /models --context-size 16000 --threads 10 --debug true
@@@@@
Skipping rebuild
@@@@@
If you are experiencing issues with the pre-compiled builds, try setting REBUILD=true
If you are still experiencing issues with the build, try setting CMAKE_ARGS and disable the instructions set as needed:
CMAKE_ARGS="-DLLAMA_F16C=OFF -DLLAMA_AVX512=OFF -DLLAMA_AVX2=OFF -DLLAMA_FMA=OFF"
see the documentation at: https://localai.io/basics/build/index.html
Note: See also https://github.com/go-skynet/LocalAI/issues/288
@@@@@
CPU info:
model name : Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60GHz
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid md_clear pconfig flush_l1d arch_capabilities
CPU: AVX found OK
CPU: AVX2 found OK
CPU: AVX512 found OK
@@@@@
10:09AM INF Starting LocalAI using 10 threads, with models path: /models
10:09AM INF LocalAI version: v1.40.0 (6ef7ea2635ae5371be4e6eef516c2ad4afd9f4a3)
10:09AM DBG Model: gpt-3.5-turbo-16k-0613 (config: {PredictionOptions:{Model:ggml-gpt4all-j Language: N:0 TopP:0.7 TopK:80 Temperature:0.2 Maxtokens:0 Echo:false Batch:0 F16:true IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:gpt-3.5-turbo-16k-0613 F16:false Threads:0 Debug:false Roles:map[] Embeddings:false Backend: TemplateConfig:{Chat:ggml-gpt4all-j ChatMessage: Completion:ggml-gpt4all-j Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:16384 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false DraftModel: NDraft:0 Quantization:} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}})
10:09AM DBG Model: open_llama_7b_v2 (config: {PredictionOptions:{Model:openlm-research-open_llama_7b_v2-Q8_0.gguf Language: N:0 TopP:0 TopK:0 Temperature:0.3 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:open_llama_7b_v2 F16:false Threads:10 Debug:false Roles:map[assistant:### Response: system:### System Instruction: user:### Instruction:] Embeddings:false Backend:llama TemplateConfig:{Chat:llama2-chat-message ChatMessage: Completion:completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath:alpaca-cache PromptCacheAll:true PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[HUMAN: ### Response:] Cutstrings:[] TrimSpace:[] ContextSize:512 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false DraftModel: NDraft:0 Quantization:} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}})
10:09AM DBG Extracting backend assets files to /tmp/localai/backend_data
┌───────────────────────────────────────────────────┐
│ Fiber v2.50.0 │
│ http://127.0.0.1:8080 │
│ (bound on host 0.0.0.0 and port 8080) │
│ │
│ Handlers ............ 74 Processes ........... 1 │
│ Prefork ....... Disabled PID ................ 14 │
└───────────────────────────────────────────────────┘
[172.17.0.1]:34754 200 - GET /v1/models
10:10AM DBG Request received:
10:10AM DBG Configuration read: &{PredictionOptions:{Model:ggml-gpt4all-j Language: N:0 TopP:0.7 TopK:80 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:true IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:gpt-3.5-turbo-16k-0613 F16:false Threads:10 Debug:true Roles:map[] Embeddings:false Backend: TemplateConfig:{Chat:ggml-gpt4all-j ChatMessage: Completion:ggml-gpt4all-j Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:16384 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false DraftModel: NDraft:0 Quantization:} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
10:10AM DBG Parameters: &{PredictionOptions:{Model:ggml-gpt4all-j Language: N:0 TopP:0.7 TopK:80 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:true IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:gpt-3.5-turbo-16k-0613 F16:false Threads:10 Debug:true Roles:map[] Embeddings:false Backend: TemplateConfig:{Chat:ggml-gpt4all-j ChatMessage: Completion:ggml-gpt4all-j Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:16384 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false DraftModel: NDraft:0 Quantization:} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
10:10AM DBG Prompt (before templating): How are you?
10:10AM DBG Template found, input modified to: The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate response.
### Prompt:
How are you?
### Response:
10:10AM DBG Prompt (after templating): The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate response.
### Prompt:
How are you?
### Response:
10:10AM DBG Loading model 'ggml-gpt4all-j' greedly from all the available backends: llama-cpp, llama-stable, llama, gpt4all, gptneox, bert-embeddings, falcon-ggml, gptj, gpt2, dolly, mpt, replit, starcoder, rwkv, whisper, stablediffusion, piper, /build/extra/grpc/diffusers/backend_diffusers.py, /build/extra/grpc/exllama/exllama.py, /build/extra/grpc/vall-e-x/ttsvalle.py, /build/extra/grpc/vllm/backend_vllm.py, /build/extra/grpc/huggingface/huggingface.py, /build/extra/grpc/autogptq/autogptq.py, /build/extra/grpc/bark/ttsbark.py
10:10AM DBG [llama-cpp] Attempting to load
10:10AM DBG Loading model llama-cpp from ggml-gpt4all-j
10:10AM DBG Loading model in memory from file: /models/ggml-gpt4all-j
10:10AM DBG Loading GRPC Model llama-cpp: {backendString:llama-cpp model:ggml-gpt4all-j threads:10 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc00060d6c0 externalBackends:map[autogptq:/build/extra/grpc/autogptq/autogptq.py bark:/build/extra/grpc/bark/ttsbark.py diffusers:/build/extra/grpc/diffusers/backend_diffusers.py exllama:/build/extra/grpc/exllama/exllama.py huggingface-embeddings:/build/extra/grpc/huggingface/huggingface.py vall-e-x:/build/extra/grpc/vall-e-x/ttsvalle.py vllm:/build/extra/grpc/vllm/backend_vllm.py] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false}
10:10AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama-cpp
10:10AM DBG GRPC Service for ggml-gpt4all-j will be running at: '127.0.0.1:45957'
10:10AM DBG GRPC Service state dir: /tmp/go-processmanager2459821549
10:10AM DBG GRPC Service Started
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:45957: connect: connection refused"
10:10AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:45957): stdout Server listening on 127.0.0.1:45957
10:10AM DBG GRPC Service Ready
10:10AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:ggml-gpt4all-j ContextSize:16384 Seed:0 NBatch:512 F16Memory:false MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:0 MainGPU: TensorSplit: Threads:10 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/ggml-gpt4all-j Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 Tokenizer: LoraBase: LoraAdapter: NoMulMatQ:false DraftModel: AudioPath: Quantization:}
10:10AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:45957): stderr gguf_init_from_file: invalid magic characters lmgg.
10:10AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:45957): stderr error loading model: llama_model_loader: failed to load model from /models/ggml-gpt4all-j
10:10AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:45957): stderr
10:10AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:45957): stderr llama_load_model_from_file: failed to load model
10:10AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:45957): stderr llama_init_from_gpt_params: error: failed to load model '/models/ggml-gpt4all-j'
10:10AM DBG [llama-cpp] Fails: could not load model: rpc error: code = Canceled desc =
10:10AM DBG [llama-stable] Attempting to load
10:10AM DBG Loading model llama-stable from ggml-gpt4all-j
10:10AM DBG Loading model in memory from file: /models/ggml-gpt4all-j
10:10AM DBG Loading GRPC Model llama-stable: {backendString:llama-stable model:ggml-gpt4all-j threads:10 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc00060d6c0 externalBackends:map[autogptq:/build/extra/grpc/autogptq/autogptq.py bark:/build/extra/grpc/bark/ttsbark.py diffusers:/build/extra/grpc/diffusers/backend_diffusers.py exllama:/build/extra/grpc/exllama/exllama.py huggingface-embeddings:/build/extra/grpc/huggingface/huggingface.py vall-e-x:/build/extra/grpc/vall-e-x/ttsvalle.py vllm:/build/extra/grpc/vllm/backend_vllm.py] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false}
10:10AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama-stable
10:10AM DBG GRPC Service for ggml-gpt4all-j will be running at: '127.0.0.1:45383'
10:10AM DBG GRPC Service state dir: /tmp/go-processmanager3031435554
10:10AM DBG GRPC Service Started
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:45383: connect: connection refused"
10:10AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:45383): stderr 2024/01/06 10:10:05 gRPC Server listening at 127.0.0.1:45383
10:10AM DBG GRPC Service Ready
10:10AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:ggml-gpt4all-j ContextSize:16384 Seed:0 NBatch:512 F16Memory:false MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:0 MainGPU: TensorSplit: Threads:10 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/ggml-gpt4all-j Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 Tokenizer: LoraBase: LoraAdapter: NoMulMatQ:false DraftModel: AudioPath: Quantization:}
10:10AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:45383): stderr create_gpt_params: loading model /models/ggml-gpt4all-j
10:10AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:45383): stderr llama.cpp: loading model from /models/ggml-gpt4all-j
10:10AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:45383): stderr error loading model: unexpectedly reached end of file
10:10AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:45383): stderr llama_load_model_from_file: failed to load model
10:10AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:45383): stderr llama_init_from_gpt_params: error: failed to load model '/models/ggml-gpt4all-j'
10:10AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:45383): stderr load_binding_model: error: unable to load model
10:10AM DBG [llama-stable] Fails: could not load model: rpc error: code = Unknown desc = failed loading model
10:10AM DBG [llama] Attempting to load
10:10AM DBG Loading model llama from ggml-gpt4all-j
10:10AM DBG Loading model in memory from file: /models/ggml-gpt4all-j
10:10AM DBG Loading GRPC Model llama: {backendString:llama model:ggml-gpt4all-j threads:10 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc00060d6c0 externalBackends:map[autogptq:/build/extra/grpc/autogptq/autogptq.py bark:/build/extra/grpc/bark/ttsbark.py diffusers:/build/extra/grpc/diffusers/backend_diffusers.py exllama:/build/extra/grpc/exllama/exllama.py huggingface-embeddings:/build/extra/grpc/huggingface/huggingface.py vall-e-x:/build/extra/grpc/vall-e-x/ttsvalle.py vllm:/build/extra/grpc/vllm/backend_vllm.py] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false}
10:10AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama
10:10AM DBG GRPC Service for ggml-gpt4all-j will be running at: '127.0.0.1:45727'
10:10AM DBG GRPC Service state dir: /tmp/go-processmanager1207006948
10:10AM DBG GRPC Service Started
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:45727: connect: connection refused"
10:10AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:45727): stderr 2024/01/06 10:10:16 gRPC Server listening at 127.0.0.1:45727
10:10AM DBG GRPC Service Ready
10:10AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:ggml-gpt4all-j ContextSize:16384 Seed:0 NBatch:512 F16Memory:false MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:0 MainGPU: TensorSplit: Threads:10 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/ggml-gpt4all-j Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 Tokenizer: LoraBase: LoraAdapter: NoMulMatQ:false DraftModel: AudioPath: Quantization:}
10:10AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:45727): stderr create_gpt_params: loading model /models/ggml-gpt4all-j
10:10AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:45727): stderr gguf_init_from_file: invalid magic number 67676d6c
10:10AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:45727): stderr error loading model: llama_model_loader: failed to load model from /models/ggml-gpt4all-j
10:10AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:45727): stderr
10:10AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:45727): stderr llama_load_model_from_file: failed to load model
10:10AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:45727): stderr llama_init_from_gpt_params: error: failed to load model '/models/ggml-gpt4all-j'
10:10AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:45727): stderr load_binding_model: error: unable to load model
10:10AM DBG [llama] Fails: could not load model: rpc error: code = Unknown desc = failed loading model
10:10AM DBG [gpt4all] Attempting to load
10:10AM DBG Loading model gpt4all from ggml-gpt4all-j
10:10AM DBG Loading model in memory from file: /models/ggml-gpt4all-j
10:10AM DBG Loading GRPC Model gpt4all: {backendString:gpt4all model:ggml-gpt4all-j threads:10 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc00060d6c0 externalBackends:map[autogptq:/build/extra/grpc/autogptq/autogptq.py bark:/build/extra/grpc/bark/ttsbark.py diffusers:/build/extra/grpc/diffusers/backend_diffusers.py exllama:/build/extra/grpc/exllama/exllama.py huggingface-embeddings:/build/extra/grpc/huggingface/huggingface.py vall-e-x:/build/extra/grpc/vall-e-x/ttsvalle.py vllm:/build/extra/grpc/vllm/backend_vllm.py] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false}
10:10AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/gpt4all
10:10AM DBG GRPC Service for ggml-gpt4all-j will be running at: '127.0.0.1:34513'
10:10AM DBG GRPC Service state dir: /tmp/go-processmanager1993281322
10:10AM DBG GRPC Service Started
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:34513: connect: connection refused"
10:10AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:34513): stderr 2024/01/06 10:10:18 gRPC Server listening at 127.0.0.1:34513
10:10AM DBG GRPC Service Ready
10:10AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:ggml-gpt4all-j ContextSize:16384 Seed:0 NBatch:512 F16Memory:false MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:0 MainGPU: TensorSplit: Threads:10 LibrarySearchPath:/tmp/localai/backend_data/backend-assets/gpt4all RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/ggml-gpt4all-j Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 Tokenizer: LoraBase: LoraAdapter: NoMulMatQ:false DraftModel: AudioPath: Quantization:}
10:10AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:34513): stdout gptj_model_load: loading model from '/models/ggml-gpt4all-j' - please wait ...
10:10AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:34513): stdout gptj_model_load: n_vocab = 50400
10:10AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:34513): stdout gptj_model_load: n_ctx = 2048
10:10AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:34513): stdout gptj_model_load: n_embd = 4096
10:10AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:34513): stdout gptj_model_load: n_head = 16
10:10AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:34513): stdout gptj_model_load: n_layer = 28
10:10AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:34513): stdout gptj_model_load: n_rot = 64
10:10AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:34513): stdout gptj_model_load: f16 = 2
10:10AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:34513): stdout gptj_model_load: ggml ctx size = 5401.45 MB
10:10AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:34513): stdout gptj_model_load: kv self size = 896.00 MB
10:10AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:34513): stdout gptj_model_load: ................................... done
10:10AM DBG GRPC(ggml-gpt4all-j-127.0.0.1:34513): stdout gptj_model_load: model size = 3609.38 MB / num tensors = 285
10:10AM DBG [gpt4all] Loads OK
10:10AM DBG Response: {"created":1704535772,"object":"chat.completion","id":"bdd5fa1b-35b7-4b56-b25b-95bc3d86af6b","model":"gpt-3.5-turbo-16k-0613","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"I'm doing well. How about you?"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
[172.17.0.1]:34764 200 - POST /v1/chat/completions
[127.0.0.1]:50926 200 - GET /readyz
[127.0.0.1]:50952 200 - GET /readyz
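In this log, the backend stderr lines show the real cause. `gguf_init_from_file: invalid magic number 67676d6c` means the file does not begin with the GGUF magic. 0x67676d6c is the ASCII bytes of "ggml", which marks a legacy GGML file. That is why `llama-stable` and `llama` reject `/models/ggml-gpt4all-j`, while the `gpt4all` backend (`gptj_model_load`) loads it. Below is a minimal sketch for checking a model file's container format before pointing LocalAI at it. It assumes only the magic values listed in the table, which is not exhaustive. `model_format` and `KNOWN_MAGICS` are hypothetical helpers for illustration, not part of LocalAI.

```python
import struct

# Hypothetical lookup table: little-endian uint32 value of a model file's first
# four bytes -> format label. Not exhaustive; only these magics are assumed here.
KNOWN_MAGICS = {
    0x46554747: "gguf",                              # bytes on disk: b"GGUF"
    0x67676A74: "ggjt (legacy GGML, versioned)",
    0x67676D66: "ggmf (legacy GGML, versioned)",
    0x67676D6C: "ggml (legacy GGML, unversioned)",   # printed as 67676d6c in the log
}


def model_format(path: str) -> str:
    """Return a label for the container format of the model file at `path`."""
    with open(path, "rb") as f:
        head = f.read(4)
    if len(head) < 4:
        # A file this short cannot hold any known header.
        return "truncated (fewer than 4 bytes)"
    # Read the magic the same way the log reports it: as a little-endian uint32.
    (magic,) = struct.unpack("<I", head)
    return KNOWN_MAGICS.get(magic, f"unknown (0x{magic:08x})")
```

For example, `print(model_format("/models/ggml-gpt4all-j"))`. Based on the log above, this file would be expected to report one of the legacy `ggml` labels. In that case you need a backend that still reads GGML (here `gpt4all`) or a GGUF version of the model.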
I have the same issue. I have been battling this for two days without finding a solution. Clearly there is a bug somewhere, given how many users are reporting it. I see this with both the released AVX binaries and Docker on WSL2.
Same issue here.
I followed this simple tutorial to be sure and got the same error:
https://dev.to/worldlinetech/introducing-localai-4gg1
curl $HOST/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "luna-ai-llama2", "messages": [{"role": "user", "content": "Why is the Earth round?"}], "temperature": 0.9 }'
localai-api-1 | 5:27AM INF Loading model 'luna-ai-llama2' greedly from all the available backends: llama-cpp, llama-ggml, llama, gpt4all, gptneox, bert-embeddings, falcon-ggml, gptj, gpt2, dolly, mpt, replit, starcoder, rwkv, whisper, stablediffusion, tinydream, piper, /build/backend/python/sentencetransformers/run.sh, /build/backend/python/bark/run.sh, /build/backend/python/diffusers/run.sh, /build/backend/python/exllama/run.sh, /build/backend/python/sentencetransformers/run.sh, /build/backend/python/autogptq/run.sh, /build/backend/python/coqui/run.sh, /build/backend/python/vllm/run.sh, /build/backend/python/vall-e-x/run.sh, /build/backend/python/transformers-musicgen/run.sh, /build/backend/python/petals/run.sh, /build/backend/python/transformers/run.sh, /build/backend/python/exllama2/run.sh
localai-api-1 | 5:27AM INF [llama-cpp] Attempting to load
localai-api-1 | 5:27AM INF Loading model 'luna-ai-llama2' with backend llama-cpp
localai-api-1 | rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:44363: connect: connection refused"
localai-api-1 | rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:44363: connect: connection refused"
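The repeated `connection refused` lines are the loader dialing the backend's port before the backend process has opened it. In the earlier log, each backend prints `GRPC Service Started`, then one refused dial, then `gRPC Server listening` and `GRPC Service Ready`. The loader options also show `grpcAttempts:20 grpcAttemptsDelay:2`. On their own, these lines are not the failure. What matters is whether a backend `stderr` error or a `Fails:` line follows them. Below is a minimal Python sketch of this kind of poll loop, for illustration only. It is not LocalAI's code: `wait_for_backend` is a hypothetical name, and treating the delay as seconds is an assumption.

```python
import socket
import time


def wait_for_backend(host: str, port: int, attempts: int = 20, delay: float = 2.0) -> int:
    """Poll host:port until a TCP connection succeeds (hypothetical helper).

    Returns the 1-based attempt number that succeeded, or raises TimeoutError
    if every attempt fails. The defaults mirror grpcAttempts:20 and
    grpcAttemptsDelay:2 from the log; reading the delay as seconds is assumed.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            # A refused connection here is expected while the backend process
            # is still starting; it is only a problem if it never stops.
            with socket.create_connection((host, port), timeout=1.0):
                return attempt
        except OSError as err:
            last_error = err
            if attempt < attempts:
                time.sleep(delay)
    raise TimeoutError(f"{host}:{port} not reachable after {attempts} attempts: {last_error}")
```

Any early attempts refused before one succeeds correspond to the `connection refused` lines in the logs. Only a `TimeoutError` (all attempts exhausted) would point to the backend never starting. When the backend does start but the model fails to load, the cause shows up in the backend's own stderr instead.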