mudler / LocalAI

:robot: The free, Open Source alternative to OpenAI, Claude and others. Self-hosted and local-first. Drop-in replacement for OpenAI, running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many more models architectures. Features: Generate Text, Audio, Video, Images, Voice Cloning, Distributed inference
https://localai.io
MIT License

Issue regarding falcon-7b quantized #728

Closed Pablo1107 closed 1 year ago

Pablo1107 commented 1 year ago

LocalAI version: v1.20.1-dirty (92614b91d7b2e5ceb4db28c640314df7fec3d96f)

Environment, CPU architecture, OS, and Version: Linux t14s 6.4.1-arch1-1 #1 SMP PREEMPT_DYNAMIC Sat, 01 Jul 2023 16:17:21 +0000 x86_64 GNU/Linux

Describe the bug Running LocalAI with falcon7b-instruct.ggmlv3.fp16.bin from TheBloke runs out of memory on a machine with 16 GB of RAM. So I tried falcon7b-instruct.ggmlv3.q8_0.bin, which needs somewhat less RAM but segfaults in the backend.
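A rough back-of-the-envelope estimate shows why the fp16 file is tight on 16 GB while the q8_0 file fits. This is only a sketch: the parameter count of about 7.2 billion is an assumption, and the estimate covers weights only (no KV cache, scratch buffers, or tensors kept at higher precision). The Q8_0 figure uses ggml's block layout of 32 int8 weights plus one 2-byte scale.

```python
# Rough weight-memory estimate; an illustration only, not LocalAI code.

Q8_0_BLOCK_WEIGHTS = 32  # weights stored per ggml Q8_0 block
Q8_0_BLOCK_BYTES = 34    # 32 int8 values + one 2-byte fp16 scale

# Assumption: approximate parameter count for falcon-7b.
N_PARAMS = 7.2e9


def weight_bytes(n_params: float, bytes_per_weight: float) -> float:
    """Return the bytes needed to hold the weights alone."""
    return n_params * bytes_per_weight


def to_gib(n_bytes: float) -> float:
    """Convert bytes to GiB."""
    return n_bytes / 2**30


fp16_bytes = weight_bytes(N_PARAMS, 2.0)
q8_0_bytes = weight_bytes(N_PARAMS, Q8_0_BLOCK_BYTES / Q8_0_BLOCK_WEIGHTS)

print(f"fp16 weights: ~{to_gib(fp16_bytes):.1f} GiB")
print(f"q8_0 weights: ~{to_gib(q8_0_bytes):.1f} GiB")
```

Under these assumptions the fp16 weights alone take most of a 16 GB machine, and the q8_0 estimate is in the same range as the `ggml ctx size = 7313.92 MB` reported in the log.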

To Reproduce 1) Download this version of falcon-7b. 2) Run any prompt.

Expected behavior The backend should not segfault.
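For step 2, a request along these lines should be enough to trigger the load. It is a minimal sketch that mirrors the endpoint, address, model name, and sampling values visible in the debug log; the helper names `build_chat_request` and `send` are hypothetical, and the prompt text is arbitrary.

```python
import json
import urllib.request

# Address and endpoint taken from the debug log; adjust for your setup.
BASE_URL = "http://127.0.0.1:8080"


def build_chat_request(model: str, prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST to the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.1,
        "top_p": 1,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `send(build_chat_request("falcon7b-instruct.ggmlv3.q8_0.bin", "concat two .bin files into one"))` would issue the call; on the affected version the backend is expected to crash rather than answer.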

Logs

Expand ``` ❯ local-ai --debug Starting LocalAI using 4 threads, with models path: /home/pablo/.local/share/local-ai/models unexpected end of JSON input ┌───────────────────────────────────────────────────┐ │ Fiber v2.47.0 │ │ http://127.0.0.1:8080 │ │ (bound on host 0.0.0.0 and port 8080) │ │ │ │ Handlers ............ 32 Processes ........... 1 │ │ Prefork ....... Disabled PID .............. 8181 │ └───────────────────────────────────────────────────┘ 12:55PM DBG Request received: {"model":"falcon7b-instruct.ggmlv3.q8_0.bin","file":"","language":"","response_format":"","size":"","prompt":null,"instruction":"","input":null,"stop":null,"messages":[{"role":"user","content":"###\nRole name: shell\nProvide only zsh commands for Linux/Arch Linux without any description.\nIf there is a lack of details, provide most logical solution.\nEnsure the output is a valid shell command.\nIf multiple steps required try to combine them together.\n\nRequest: concat two .bin files into one\n###\nCommand:"}],"stream":true,"echo":false,"top_p":1,"top_k":0,"temperature":0.1,"max_tokens":0,"n":0,"batch":0,"f16":false,"ignore_eos":false,"repeat_penalty":0,"n_keep":0,"mirostat_eta":0,"mirostat_tau":0,"mirostat":0,"frequency_penalty":0,"tfz":0,"seed":0,"mode":0,"step":0,"typical_p":0} 12:55PM DBG Parameter Config: &{OpenAIRequest:{Model:falcon7b-instruct.ggmlv3.q8_0.bin File: Language: ResponseFormat: Size: Prompt: Instruction: Input: Stop: Messages:[] Stream:false Echo:false TopP:1 TopK:80 Temperature:0.1 Maxtokens:512 N:0 Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 Seed:0 Mode:0 Step:0 TypicalP:0} Name: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:512 F16:false NUMA:false Threads:4 Debug:true Roles:map[] Embeddings:false Backend: TemplateConfig:{Completion: Chat: Edit:} MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false TensorSplit: MainGPU: ImageGenerationAssets: 
PromptCachePath: PromptCacheAll:false PromptCacheRO:false PromptStrings:[] InputStrings:[] InputToken:[]} 12:55PM DBG Stream request received [127.0.0.1]:43774 200 - POST /v1/chat/completions 12:55PM DBG Loading model 'falcon7b-instruct.ggmlv3.q8_0.bin' greedly 12:55PM DBG [llama] Attempting to load 12:55PM DBG Loading model llama from falcon7b-instruct.ggmlv3.q8_0.bin 12:55PM DBG Loading model in memory from file: /home/pablo/.local/share/local-ai/models/falcon7b-instruct.ggmlv3.q8_0.bin 12:55PM DBG Sending chunk: {"object":"chat.completion.chunk","model":"falcon7b-instruct.ggmlv3.q8_0.bin","choices":[{"delta":{"role":"assistant"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}} llama.cpp: loading model from /home/pablo/.local/share/local-ai/models/falcon7b-instruct.ggmlv3.q8_0.bin llama_model_load_internal: format = ggjt v3 (latest) llama_model_load_internal: n_vocab = 65024 llama_model_load_internal: n_ctx = 512 llama_model_load_internal: n_embd = 4544 llama_model_load_internal: n_mult = 71 llama_model_load_internal: n_head = 1 llama_model_load_internal: n_layer = 32 llama_model_load_internal: n_rot = 7 llama_model_load_internal: ftype = 7 (mostly Q8_0) llama_model_load_internal: n_ff = 12141 llama_model_load_internal: model size = 7B llama_model_load_internal: ggml ctx size = 7313.92 MB error loading model: llama.cpp: tensor 'tok_embeddings.weight' is missing from model llama_load_model_from_file: failed to load model 12:55PM DBG [llama] Fails: failed loading model 12:55PM DBG [gpt4all] Attempting to load 12:55PM DBG Loading model gpt4all from falcon7b-instruct.ggmlv3.q8_0.bin 12:55PM DBG Loading model in memory from file: /home/pablo/.local/share/local-ai/models/falcon7b-instruct.ggmlv3.q8_0.bin falcon_model_load: loading model from '/home/pablo/.local/share/local-ai/models/falcon7b-instruct.ggmlv3.q8_0.bin' - please wait ... 
falcon_model_load: n_vocab = 65024 falcon_model_load: n_embd = 4544 falcon_model_load: n_head = 71 falcon_model_load: n_head_kv = 1 falcon_model_load: n_layer = 32 falcon_model_load: ftype = 7 falcon_model_load: qntvr = 0 falcon_model_load: ggml ctx size = 7313.92 MB falcon_model_load: memory_size = 32.00 MB, n_mem = 65536 falcon_model_load: ........................ done falcon_model_load: model size = 7313.87 MB / num tensors = 196 12:55PM DBG [gpt4all] Loads OK fatal error: unexpected signal during runtime execution [signal SIGSEGV: segmentation violation code=0x1 addr=0x80000000027 pc=0xc53820] runtime stack: runtime.throw({0xe263d8?, 0xc49d6b?}) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/panic.go:1047 +0x5d fp=0x7f38c77e5610 sp=0x7f38c77e55e0 pc=0x47b4dd runtime.sigpanic() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/signal_unix.go:825 +0x3e9 fp=0x7f38c77e5670 sp=0x7f38c77e5610 pc=0x491989 goroutine 51 [syscall]: runtime.cgocall(0xb62770, 0xc0003f1238) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/cgocall.go:157 +0x5c fp=0xc0003f1210 sp=0xc0003f11d8 pc=0x44a2bc github.com/nomic-ai/gpt4all/gpt4all-bindings/golang._Cfunc_model_prompt(0x7f38b840e950, 0x7f38b83fdb90, 0xc0003f8600, 0xa, 0x3f99999a, 0x400, 0x200, 0x50, 0x3f800000, 0x3dcccccd, ...) 
_cgo_gotypes.go:127 +0x45 fp=0xc0003f1238 sp=0xc0003f1210 pc=0x8ee9c5 github.com/nomic-ai/gpt4all/gpt4all-bindings/golang.(*Model).Predict.func1(0xe172f0?, 0x19?, {0xc0003f8600, 0x4539ea?, 0xc0003e6480?}, {0x400, 0xa, 0x200, 0x50, 0x1, ...}) /home/runner/work/LocalAI/LocalAI/gpt4all/gpt4all-bindings/golang/gpt4all.go:61 +0x185 fp=0xc0003f12e0 sp=0xc0003f1238 pc=0x8ef465 github.com/nomic-ai/gpt4all/gpt4all-bindings/golang.(*Model).Predict(0x0?, {0xc0005f6000, 0x135}, {0xc0003f1508, 0x4, 0xd0?}) /home/runner/work/LocalAI/LocalAI/gpt4all/gpt4all-bindings/golang/gpt4all.go:61 +0x225 fp=0xc0003f1440 sp=0xc0003f12e0 pc=0x8ef0e5 github.com/go-skynet/LocalAI/api.ModelInference.func11() /home/runner/work/LocalAI/LocalAI/api/prediction.go:523 +0x270 fp=0xc0003f1538 sp=0xc0003f1440 pc=0xaa2a30 github.com/go-skynet/LocalAI/api.ModelInference.func14() /home/runner/work/LocalAI/LocalAI/api/prediction.go:585 +0x1aa fp=0xc0003f15f0 sp=0xc0003f1538 pc=0xaa228a github.com/go-skynet/LocalAI/api.ComputeChoices({0xc0005f6000, 0x135}, 0xc0001f0140, 0xc0001d4b00, 0xc0003a81a0?, 0xc000339dd0?, 0x1579460, 0xc0000123c0?) /home/runner/work/LocalAI/LocalAI/api/prediction.go:609 +0x246 fp=0xc0003f1eb0 sp=0xc0003f15f0 pc=0xaa55e6 github.com/go-skynet/LocalAI/api.chatEndpoint.func1({0xc0005f6000, 0x135}, 0xc0001f0140, 0xd10b60?, 0xc0001bed20?, 0xc000182240) /home/runner/work/LocalAI/LocalAI/api/openai.go:357 +0x1db fp=0xc0003f1fa0 sp=0xc0003f1eb0 pc=0xa9bd3b github.com/go-skynet/LocalAI/api.chatEndpoint.func2.3() /home/runner/work/LocalAI/LocalAI/api/openai.go:428 +0x3f fp=0xc0003f1fe0 sp=0xc0003f1fa0 pc=0xa9bb1f runtime.goexit() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc0003f1fe8 sp=0xc0003f1fe0 pc=0x4ad401 created by github.com/go-skynet/LocalAI/api.chatEndpoint.func2 /home/runner/work/LocalAI/LocalAI/api/openai.go:428 +0x7f1 goroutine 1 [IO wait]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) 
/opt/hostedtoolcache/go/1.20.5/x64/src/runtime/proc.go:381 +0xd6 fp=0xc0001271d8 sp=0xc0001271b8 pc=0x47e236 runtime.netpollblock(0xc000127220?, 0x44994f?, 0x0?) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/netpoll.go:527 +0xf7 fp=0xc000127210 sp=0xc0001271d8 pc=0x476a37 internal/poll.runtime_pollWait(0x7f38e1624df8, 0x72) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/netpoll.go:306 +0x89 fp=0xc000127230 sp=0xc000127210 pc=0x4a7ca9 internal/poll.(*pollDesc).wait(0xc0001fc480?, 0x1001272a0?, 0x0) /opt/hostedtoolcache/go/1.20.5/x64/src/internal/poll/fd_poll_runtime.go:84 +0x32 fp=0xc000127258 sp=0xc000127230 pc=0x5254f2 internal/poll.(*pollDesc).waitRead(...) /opt/hostedtoolcache/go/1.20.5/x64/src/internal/poll/fd_poll_runtime.go:89 internal/poll.(*FD).Accept(0xc0001fc480) /opt/hostedtoolcache/go/1.20.5/x64/src/internal/poll/fd_unix.go:614 +0x2bd fp=0xc000127300 sp=0xc000127258 pc=0x52adfd net.(*netFD).accept(0xc0001fc480) /opt/hostedtoolcache/go/1.20.5/x64/src/net/fd_unix.go:172 +0x35 fp=0xc0001273b8 sp=0xc000127300 pc=0x5ad2b5 net.(*TCPListener).accept(0xc0001a2660) /opt/hostedtoolcache/go/1.20.5/x64/src/net/tcpsock_posix.go:148 +0x25 fp=0xc0001273e0 sp=0xc0001273b8 pc=0x5c3665 net.(*TCPListener).Accept(0xc0001a2660) /opt/hostedtoolcache/go/1.20.5/x64/src/net/tcpsock.go:297 +0x3d fp=0xc000127410 sp=0xc0001273e0 pc=0x5c275d github.com/valyala/fasthttp.acceptConn(0xc0003a4200, {0x1628f70, 0xc0001a2660}, 0xc000127608) /home/runner/go/pkg/mod/github.com/valyala/fasthttp@v1.48.0/server.go:1928 +0x62 fp=0xc0001274f0 sp=0xc000127410 pc=0x80f562 github.com/valyala/fasthttp.(*Server).Serve(0xc0003a4200, {0x1628f70?, 0xc0001a2660}) /home/runner/go/pkg/mod/github.com/valyala/fasthttp@v1.48.0/server.go:1821 +0x4f4 fp=0xc000127638 sp=0xc0001274f0 pc=0x80eb74 github.com/gofiber/fiber/v2.(*App).Listen(0xc0001dd680, {0xe03ccd?, 0x7?}) /home/runner/go/pkg/mod/github.com/gofiber/fiber/v2@v2.47.0/listen.go:88 +0x11d fp=0xc000127698 sp=0xc000127638 pc=0x8a5a5d 
main.main.func1(0xc0003a6160?) /home/runner/work/LocalAI/LocalAI/main.go:161 +0x825 fp=0xc000127950 sp=0xc000127698 pc=0xad4845 github.com/urfave/cli/v2.(*Command).Run(0xc0003a6160, 0xc0001def00, {0xc0001aa000, 0x2, 0x2}) /home/runner/go/pkg/mod/github.com/urfave/cli/v2@v2.25.7/command.go:274 +0x9eb fp=0xc000127bf0 sp=0xc000127950 pc=0xac190b github.com/urfave/cli/v2.(*App).RunContext(0xc0003a2000, {0x1629478?, 0xc000198030}, {0xc0001aa000, 0x2, 0x2}) /home/runner/go/pkg/mod/github.com/urfave/cli/v2@v2.25.7/app.go:332 +0x616 fp=0xc000127c60 sp=0xc000127bf0 pc=0xabe236 github.com/urfave/cli/v2.(*App).Run(...) /home/runner/go/pkg/mod/github.com/urfave/cli/v2@v2.25.7/app.go:309 main.main() /home/runner/work/LocalAI/LocalAI/main.go:165 +0x12b6 fp=0xc000127f80 sp=0xc000127c60 pc=0xad3f56 runtime.main() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/proc.go:250 +0x207 fp=0xc000127fe0 sp=0xc000127f80 pc=0x47de07 runtime.goexit() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000127fe8 sp=0xc000127fe0 pc=0x4ad401 goroutine 2 [force gc (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/proc.go:381 +0xd6 fp=0xc000084fb0 sp=0xc000084f90 pc=0x47e236 runtime.goparkunlock(...) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/proc.go:387 runtime.forcegchelper() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/proc.go:305 +0xb0 fp=0xc000084fe0 sp=0xc000084fb0 pc=0x47e070 runtime.goexit() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000084fe8 sp=0xc000084fe0 pc=0x4ad401 created by runtime.init.6 /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/proc.go:293 +0x25 goroutine 3 [GC sweep wait]: runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/proc.go:381 +0xd6 fp=0xc000085780 sp=0xc000085760 pc=0x47e236 runtime.goparkunlock(...) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/proc.go:387 runtime.bgsweep(0x0?) 
/opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgcsweep.go:319 +0xde fp=0xc0000857c8 sp=0xc000085780 pc=0x46a33e runtime.gcenable.func1() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:178 +0x26 fp=0xc0000857e0 sp=0xc0000857c8 pc=0x45f586 runtime.goexit() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc0000857e8 sp=0xc0000857e0 pc=0x4ad401 created by runtime.gcenable /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:178 +0x6b goroutine 4 [GC scavenge wait]: runtime.gopark(0x17f724bd142?, 0x3ba297ad?, 0x0?, 0x0?, 0x0?) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/proc.go:381 +0xd6 fp=0xc000085f70 sp=0xc000085f50 pc=0x47e236 runtime.goparkunlock(...) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/proc.go:387 runtime.(*scavengerState).park(0x1b543a0) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgcscavenge.go:400 +0x53 fp=0xc000085fa0 sp=0xc000085f70 pc=0x4681f3 runtime.bgscavenge(0x0?) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgcscavenge.go:633 +0x65 fp=0xc000085fc8 sp=0xc000085fa0 pc=0x4687e5 runtime.gcenable.func2() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:179 +0x26 fp=0xc000085fe0 sp=0xc000085fc8 pc=0x45f526 runtime.goexit() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000085fe8 sp=0xc000085fe0 pc=0x4ad401 created by runtime.gcenable /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:179 +0xaa goroutine 18 [finalizer wait]: runtime.gopark(0x1a0?, 0x1b55080?, 0xa0?, 0x61?, 0xc000084770?) 
/opt/hostedtoolcache/go/1.20.5/x64/src/runtime/proc.go:381 +0xd6 fp=0xc000084628 sp=0xc000084608 pc=0x47e236 runtime.runfinq() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mfinal.go:193 +0x107 fp=0xc0000847e0 sp=0xc000084628 pc=0x45e5c7 runtime.goexit() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc0000847e8 sp=0xc0000847e0 pc=0x4ad401 created by runtime.createfing /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mfinal.go:163 +0x45 goroutine 19 [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/proc.go:381 +0xd6 fp=0xc000080750 sp=0xc000080730 pc=0x47e236 runtime.gcBgMarkWorker() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:1275 +0xf1 fp=0xc0000807e0 sp=0xc000080750 pc=0x4612f1 runtime.goexit() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc0000807e8 sp=0xc0000807e0 pc=0x4ad401 created by runtime.gcBgMarkStartWorkers /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:1199 +0x25 goroutine 5 [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/proc.go:381 +0xd6 fp=0xc000086750 sp=0xc000086730 pc=0x47e236 runtime.gcBgMarkWorker() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:1275 +0xf1 fp=0xc0000867e0 sp=0xc000086750 pc=0x4612f1 runtime.goexit() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc0000867e8 sp=0xc0000867e0 pc=0x4ad401 created by runtime.gcBgMarkStartWorkers /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:1199 +0x25 goroutine 20 [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) 
/opt/hostedtoolcache/go/1.20.5/x64/src/runtime/proc.go:381 +0xd6 fp=0xc000080f50 sp=0xc000080f30 pc=0x47e236 runtime.gcBgMarkWorker() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:1275 +0xf1 fp=0xc000080fe0 sp=0xc000080f50 pc=0x4612f1 runtime.goexit() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000080fe8 sp=0xc000080fe0 pc=0x4ad401 created by runtime.gcBgMarkStartWorkers /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:1199 +0x25 goroutine 6 [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/proc.go:381 +0xd6 fp=0xc000086f50 sp=0xc000086f30 pc=0x47e236 runtime.gcBgMarkWorker() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:1275 +0xf1 fp=0xc000086fe0 sp=0xc000086f50 pc=0x4612f1 runtime.goexit() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000086fe8 sp=0xc000086fe0 pc=0x4ad401 created by runtime.gcBgMarkStartWorkers /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:1199 +0x25 goroutine 21 [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/proc.go:381 +0xd6 fp=0xc000081750 sp=0xc000081730 pc=0x47e236 runtime.gcBgMarkWorker() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:1275 +0xf1 fp=0xc0000817e0 sp=0xc000081750 pc=0x4612f1 runtime.goexit() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc0000817e8 sp=0xc0000817e0 pc=0x4ad401 created by runtime.gcBgMarkStartWorkers /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:1199 +0x25 goroutine 22 [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) 
/opt/hostedtoolcache/go/1.20.5/x64/src/runtime/proc.go:381 +0xd6 fp=0xc000081f50 sp=0xc000081f30 pc=0x47e236 runtime.gcBgMarkWorker() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:1275 +0xf1 fp=0xc000081fe0 sp=0xc000081f50 pc=0x4612f1 runtime.goexit() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000081fe8 sp=0xc000081fe0 pc=0x4ad401 created by runtime.gcBgMarkStartWorkers /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:1199 +0x25 goroutine 34 [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/proc.go:381 +0xd6 fp=0xc000114750 sp=0xc000114730 pc=0x47e236 runtime.gcBgMarkWorker() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:1275 +0xf1 fp=0xc0001147e0 sp=0xc000114750 pc=0x4612f1 runtime.goexit() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc0001147e8 sp=0xc0001147e0 pc=0x4ad401 created by runtime.gcBgMarkStartWorkers /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:1199 +0x25 goroutine 23 [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/proc.go:381 +0xd6 fp=0xc000082750 sp=0xc000082730 pc=0x47e236 runtime.gcBgMarkWorker() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:1275 +0xf1 fp=0xc0000827e0 sp=0xc000082750 pc=0x4612f1 runtime.goexit() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc0000827e8 sp=0xc0000827e0 pc=0x4ad401 created by runtime.gcBgMarkStartWorkers /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:1199 +0x25 goroutine 35 [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) 
/opt/hostedtoolcache/go/1.20.5/x64/src/runtime/proc.go:381 +0xd6 fp=0xc000114f50 sp=0xc000114f30 pc=0x47e236 runtime.gcBgMarkWorker() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:1275 +0xf1 fp=0xc000114fe0 sp=0xc000114f50 pc=0x4612f1 runtime.goexit() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000114fe8 sp=0xc000114fe0 pc=0x4ad401 created by runtime.gcBgMarkStartWorkers /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:1199 +0x25 goroutine 36 [GC worker (idle)]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/proc.go:381 +0xd6 fp=0xc000115750 sp=0xc000115730 pc=0x47e236 runtime.gcBgMarkWorker() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:1275 +0xf1 fp=0xc0001157e0 sp=0xc000115750 pc=0x4612f1 runtime.goexit() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc0001157e8 sp=0xc0001157e0 pc=0x4ad401 created by runtime.gcBgMarkStartWorkers /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:1199 +0x25 goroutine 7 [GC worker (idle)]: runtime.gopark(0x17f724a65e0?, 0x3?, 0x6d?, 0xb0?, 0x0?) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/proc.go:381 +0xd6 fp=0xc000087750 sp=0xc000087730 pc=0x47e236 runtime.gcBgMarkWorker() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:1275 +0xf1 fp=0xc0000877e0 sp=0xc000087750 pc=0x4612f1 runtime.goexit() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc0000877e8 sp=0xc0000877e0 pc=0x4ad401 created by runtime.gcBgMarkStartWorkers /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:1199 +0x25 goroutine 24 [GC worker (idle)]: runtime.gopark(0x17f72452981?, 0x0?, 0x0?, 0x0?, 0x0?) 
/opt/hostedtoolcache/go/1.20.5/x64/src/runtime/proc.go:381 +0xd6 fp=0xc000082f50 sp=0xc000082f30 pc=0x47e236 runtime.gcBgMarkWorker() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:1275 +0xf1 fp=0xc000082fe0 sp=0xc000082f50 pc=0x4612f1 runtime.goexit() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000082fe8 sp=0xc000082fe0 pc=0x4ad401 created by runtime.gcBgMarkStartWorkers /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:1199 +0x25 goroutine 37 [GC worker (idle)]: runtime.gopark(0x17f724125fb?, 0x0?, 0x0?, 0x0?, 0x0?) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/proc.go:381 +0xd6 fp=0xc000115f50 sp=0xc000115f30 pc=0x47e236 runtime.gcBgMarkWorker() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:1275 +0xf1 fp=0xc000115fe0 sp=0xc000115f50 pc=0x4612f1 runtime.goexit() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000115fe8 sp=0xc000115fe0 pc=0x4ad401 created by runtime.gcBgMarkStartWorkers /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:1199 +0x25 goroutine 25 [GC worker (idle)]: runtime.gopark(0x17f724482d8?, 0x1?, 0x8f?, 0x4c?, 0x0?) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/proc.go:381 +0xd6 fp=0xc000083750 sp=0xc000083730 pc=0x47e236 runtime.gcBgMarkWorker() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:1275 +0xf1 fp=0xc0000837e0 sp=0xc000083750 pc=0x4612f1 runtime.goexit() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc0000837e8 sp=0xc0000837e0 pc=0x4ad401 created by runtime.gcBgMarkStartWorkers /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:1199 +0x25 goroutine 38 [GC worker (idle)]: runtime.gopark(0x17f72412225?, 0x1?, 0xd3?, 0xbe?, 0x0?) 
/opt/hostedtoolcache/go/1.20.5/x64/src/runtime/proc.go:381 +0xd6 fp=0xc000116750 sp=0xc000116730 pc=0x47e236 runtime.gcBgMarkWorker() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:1275 +0xf1 fp=0xc0001167e0 sp=0xc000116750 pc=0x4612f1 runtime.goexit() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc0001167e8 sp=0xc0001167e0 pc=0x4ad401 created by runtime.gcBgMarkStartWorkers /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:1199 +0x25 goroutine 8 [GC worker (idle)]: runtime.gopark(0x17f72448671?, 0x0?, 0x0?, 0x0?, 0x0?) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/proc.go:381 +0xd6 fp=0xc000087f50 sp=0xc000087f30 pc=0x47e236 runtime.gcBgMarkWorker() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:1275 +0xf1 fp=0xc000087fe0 sp=0xc000087f50 pc=0x4612f1 runtime.goexit() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000087fe8 sp=0xc000087fe0 pc=0x4ad401 created by runtime.gcBgMarkStartWorkers /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/mgc.go:1199 +0x25 goroutine 26 [select]: runtime.gopark(0xc0001126b0?, 0x2?, 0x0?, 0x0?, 0xc000112674?) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/proc.go:381 +0xd6 fp=0xc000093c20 sp=0xc000093c00 pc=0x47e236 runtime.selectgo(0xc000093eb0, 0xc000112670, 0x0?, 0x0, 0x0?, 0x1) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/select.go:327 +0x7be fp=0xc000093d60 sp=0xc000093c20 pc=0x48ddbe github.com/go-skynet/LocalAI/api.(*galleryApplier).start.func1() /home/runner/work/LocalAI/LocalAI/api/gallery.go:78 +0xee fp=0xc000093fe0 sp=0xc000093d60 pc=0xa9718e runtime.goexit() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000093fe8 sp=0xc000093fe0 pc=0x4ad401 created by github.com/go-skynet/LocalAI/api.(*galleryApplier).start /home/runner/work/LocalAI/LocalAI/api/gallery.go:76 +0xaa goroutine 27 [sleep]: runtime.gopark(0x182f0ca74b7?, 0xc0001ac3f0?, 0x0?, 0x0?, 0x0?) 
/opt/hostedtoolcache/go/1.20.5/x64/src/runtime/proc.go:381 +0xd6 fp=0xc000112f00 sp=0xc000112ee0 pc=0x47e236 time.Sleep(0x12a05f200) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/time.go:195 +0x135 fp=0xc000112f40 sp=0xc000112f00 pc=0x4aa275 github.com/valyala/fasthttp.(*FS).initRequestHandler.func1() /home/runner/go/pkg/mod/github.com/valyala/fasthttp@v1.48.0/fs.go:482 +0x13c fp=0xc000112fe0 sp=0xc000112f40 pc=0x7da75c runtime.goexit() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000112fe8 sp=0xc000112fe0 pc=0x4ad401 created by github.com/valyala/fasthttp.(*FS).initRequestHandler /home/runner/go/pkg/mod/github.com/valyala/fasthttp@v1.48.0/fs.go:459 +0x4d6 goroutine 28 [sleep]: runtime.gopark(0x182f0cb4066?, 0xc0001ac8c0?, 0x0?, 0x0?, 0x0?) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/proc.go:381 +0xd6 fp=0xc000113700 sp=0xc0001136e0 pc=0x47e236 time.Sleep(0x12a05f200) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/time.go:195 +0x135 fp=0xc000113740 sp=0xc000113700 pc=0x4aa275 github.com/valyala/fasthttp.(*FS).initRequestHandler.func1() /home/runner/go/pkg/mod/github.com/valyala/fasthttp@v1.48.0/fs.go:482 +0x13c fp=0xc0001137e0 sp=0xc000113740 pc=0x7da75c runtime.goexit() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc0001137e8 sp=0xc0001137e0 pc=0x4ad401 created by github.com/valyala/fasthttp.(*FS).initRequestHandler /home/runner/go/pkg/mod/github.com/valyala/fasthttp@v1.48.0/fs.go:459 +0x4d6 goroutine 29 [sleep]: runtime.gopark(0x181c6b7c637?, 0xc000113f88?, 0xc5?, 0xd5?, 0xc0001bed50?) 
/opt/hostedtoolcache/go/1.20.5/x64/src/runtime/proc.go:381 +0xd6 fp=0xc000113f58 sp=0xc000113f38 pc=0x47e236 time.Sleep(0x2540be400) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/time.go:195 +0x135 fp=0xc000113f98 sp=0xc000113f58 pc=0x4aa275 github.com/valyala/fasthttp.(*workerPool).Start.func2() /home/runner/go/pkg/mod/github.com/valyala/fasthttp@v1.48.0/workerpool.go:67 +0x56 fp=0xc000113fe0 sp=0xc000113f98 pc=0x81c056 runtime.goexit() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000113fe8 sp=0xc000113fe0 pc=0x4ad401 created by github.com/valyala/fasthttp.(*workerPool).Start /home/runner/go/pkg/mod/github.com/valyala/fasthttp@v1.48.0/workerpool.go:59 +0xdd goroutine 50 [select]: runtime.gopark(0xc000123a08?, 0x3?, 0x34?, 0x0?, 0xc0001239da?) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/proc.go:381 +0xd6 fp=0xc000123860 sp=0xc000123840 pc=0x47e236 runtime.selectgo(0xc000123a08, 0xc0001239d4, 0x5ab9a9?, 0x0, 0xc0000c3000?, 0x1) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/select.go:327 +0x7be fp=0xc0001239a0 sp=0xc000123860 pc=0x48ddbe github.com/valyala/fasthttp/fasthttputil.(*pipeConn).readNextByteBuffer(0xc0001f0958, 0x1) /home/runner/go/pkg/mod/github.com/valyala/fasthttp@v1.48.0/fasthttputil/pipeconns.go:188 +0x1b3 fp=0xc000123a48 sp=0xc0001239a0 pc=0x7ccd73 github.com/valyala/fasthttp/fasthttputil.(*pipeConn).read(0xc0001f0958, {0xc0000c6000, 0x1000, 0xc0001a2bb8?}, 0x0?) 
/home/runner/go/pkg/mod/github.com/valyala/fasthttp@v1.48.0/fasthttputil/pipeconns.go:165 +0x3a fp=0xc000123a78 sp=0xc000123a48 pc=0x7ccaba github.com/valyala/fasthttp/fasthttputil.(*pipeConn).Read(0x1a94880?, {0xc0000c6000?, 0xc4?, 0x1000?}) /home/runner/go/pkg/mod/github.com/valyala/fasthttp@v1.48.0/fasthttputil/pipeconns.go:148 +0x88 fp=0xc000123af8 sp=0xc000123a78 pc=0x7cc9a8 github.com/valyala/fasthttp.writeBodyChunked(0xc000194930?, {0x7f38e0dceb20, 0xc0001f0958}) /home/runner/go/pkg/mod/github.com/valyala/fasthttp@v1.48.0/http.go:2062 +0x95 fp=0xc000123b68 sp=0xc000123af8 pc=0x807bd5 github.com/valyala/fasthttp.(*Response).writeBodyStream(0xc000194930, 0xc000123c48?, 0x1) /home/runner/go/pkg/mod/github.com/valyala/fasthttp@v1.48.0/http.go:1974 +0x1f1 fp=0xc000123be0 sp=0xc000123b68 pc=0x807431 github.com/valyala/fasthttp.(*Response).Write(0xc0000c3000?, 0x1625260?) /home/runner/go/pkg/mod/github.com/valyala/fasthttp@v1.48.0/http.go:1875 +0x157 fp=0xc000123c38 sp=0xc000123be0 pc=0x8070b7 github.com/valyala/fasthttp.writeResponse(0xc000194600?, 0x1aa7868?) 
/home/runner/go/pkg/mod/github.com/valyala/fasthttp@v1.48.0/server.go:2575 +0x5b fp=0xc000123c58 sp=0xc000123c38 pc=0x8126fb github.com/valyala/fasthttp.(*Server).serveConn(0xc0003a4200, {0x162c658?, 0xc0005c6008}) /home/runner/go/pkg/mod/github.com/valyala/fasthttp@v1.48.0/server.go:2416 +0x1667 fp=0xc000123ec8 sp=0xc000123c58 pc=0x811527 github.com/valyala/fasthttp.(*Server).serveConn-fm({0x162c658?, 0xc0005c6008?}) :1 +0x39 fp=0xc000123ef0 sp=0xc000123ec8 pc=0x820959 github.com/valyala/fasthttp.(*workerPool).workerFunc(0xc0001bed20, 0xc000036020) /home/runner/go/pkg/mod/github.com/valyala/fasthttp@v1.48.0/workerpool.go:224 +0xa9 fp=0xc000123fa0 sp=0xc000123ef0 pc=0x81cb89 github.com/valyala/fasthttp.(*workerPool).getCh.func1() /home/runner/go/pkg/mod/github.com/valyala/fasthttp@v1.48.0/workerpool.go:196 +0x38 fp=0xc000123fe0 sp=0xc000123fa0 pc=0x81c8f8 runtime.goexit() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000123fe8 sp=0xc000123fe0 pc=0x4ad401 created by github.com/valyala/fasthttp.(*workerPool).getCh /home/runner/go/pkg/mod/github.com/valyala/fasthttp@v1.48.0/workerpool.go:195 +0x1b0 goroutine 52 [chan receive]: runtime.gopark(0x4b7c25?, 0x1a94400?, 0xa0?, 0x2b?, 0xc0003fe000?) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/proc.go:381 +0xd6 fp=0xc0003f5d08 sp=0xc0003f5ce8 pc=0x47e236 runtime.chanrecv(0xc000182240, 0xc0003f5f10, 0x1) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/chan.go:583 +0x49d fp=0xc0003f5d98 sp=0xc0003f5d08 pc=0x44d07d runtime.chanrecv2(0xc0005e2200?, 0xc0005e2200?) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/chan.go:447 +0x18 fp=0xc0003f5dc0 sp=0xc0003f5d98 pc=0x44cbb8 github.com/go-skynet/LocalAI/api.chatEndpoint.func2.1(0x0?) 
/home/runner/work/LocalAI/LocalAI/api/openai.go:432 +0xc5 fp=0xc0003f5fa0 sp=0xc0003f5dc0 pc=0xa9b745 github.com/valyala/fasthttp.NewStreamReader.func1() /home/runner/go/pkg/mod/github.com/valyala/fasthttp@v1.48.0/stream.go:44 +0x38 fp=0xc0003f5fe0 sp=0xc0003f5fa0 pc=0x814b18 runtime.goexit() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc0003f5fe8 sp=0xc0003f5fe0 pc=0x4ad401 created by github.com/valyala/fasthttp.NewStreamReader /home/runner/go/pkg/mod/github.com/valyala/fasthttp@v1.48.0/stream.go:43 +0x37c goroutine 53 [sleep]: runtime.gopark(0x1840621e730?, 0xd1cb00?, 0x98?, 0x27?, 0x0?) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/proc.go:381 +0xd6 fp=0xc0005dff88 sp=0xc0005dff68 pc=0x47e236 time.Sleep(0x3b9aca00) /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/time.go:195 +0x135 fp=0xc0005dffc8 sp=0xc0005dff88 pc=0x4aa275 github.com/valyala/fasthttp.updateServerDate.func1() /home/runner/go/pkg/mod/github.com/valyala/fasthttp@v1.48.0/header.go:2274 +0x1e fp=0xc0005dffe0 sp=0xc0005dffc8 pc=0x81cfde runtime.goexit() /opt/hostedtoolcache/go/1.20.5/x64/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc0005dffe8 sp=0xc0005dffe0 pc=0x4ad401 created by github.com/valyala/fasthttp.updateServerDate /home/runner/go/pkg/mod/github.com/valyala/fasthttp@v1.48.0/header.go:2272 +0x25 ```

Additional context

yunghoy commented 1 year ago

Same here. I think the segmentation fault is not related to memory size. I have 64 GB of RAM, and the Docker container can use half of it. The segmentation fault happens with the falcon-7b model.

Pablo1107 commented 1 year ago

> Same here. I think the segmentation fault is not related to memory size. I have 64 GB of RAM, and the Docker container can use half of it. The segmentation fault happens with the falcon-7b model.

What exact file are you using?

yunghoy commented 1 year ago

I tried all the files in TheBloke's repository and the gpt4all-falcon file. I think MPT and Falcon models do not work. GPT4ALL is working. I think this GitHub repository is not maintained properly.

Obviously, we can only use MPT or Falcon; we cannot use llama or gpt4all because of licensing issues. Talking about llama and gpt4all under K8S is therefore meaningless. Since the llama and gpt4all models are only for personal work or research, there is no use for K8S. :p

mudler commented 1 year ago

> I tried all the files in TheBloke's repository and the gpt4all-falcon file. I think MPT and Falcon models do not work. GPT4ALL is working. I think this GitHub repository is not maintained properly.

Please file issues for the problems you find - this is how it works. If you keep what works and what doesn't to yourself, things will never get fixed. This is a community, open source project - so everyone is trying to help each other here!

> Obviously, we can only use MPT or Falcon; we cannot use llama or gpt4all because of licensing issues. Talking about llama and gpt4all under K8S is therefore meaningless. Since the llama and gpt4all models are only for personal work or research, there is no use for K8S. :p

You are wrong here: there are OpenLLaMA-based models that can be used freely, and gpt4all models based on GPT-J. MPT with gpt4all should work.


I haven't tried Falcon or MPT recently, as I'm busy with #726, but I think the model you are trying is not the one I tested; it looks somewhat newer.

bnusunny commented 1 year ago

@mudler Thanks for building this great project. Could you share the Falcon 7B model file you tested with (#516)? That would unblock us in using Falcon with this nice tool.

mudler commented 1 year ago

I had a quick look at the current state, and it seems most of the work to support Falcon went into ggllm.cpp. I quickly took a shot at creating bindings, and they seem to work with wizardlm-uncensored: https://github.com/mudler/go-ggllm.cpp - I will integrate it into LocalAI soon; that should give support for 7b and 40b at least, plus GPU support.

mudler commented 1 year ago

I'm having a closer look at it this weekend; a quick attempt seems to work here with falcon-7b. I'm looking into refactoring the backends first to get rid of some hacks, but this shouldn't take long.

mudler commented 1 year ago

Falcon should now work on master. I've been testing locally with: https://huggingface.co/TheBloke/WizardLM-Uncensored-Falcon-7B-GGML/tree/main

I've also kept the old ggml implementation as a fallback in the falcon-ggml backend.

Note: you need to be extra careful to use a matching prompt. Without it, the model hallucinates pretty quickly.
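As an illustration of what a "matching prompt" means in practice, the sketch below wraps the user's instruction in a fixed template before it is sent to the model. The template string is an assumption and not confirmed anywhere in this thread; check the model card of the file you actually use and substitute its documented format. `build_prompt` is a hypothetical helper.

```python
# Assumed template for illustration only; verify it against the model card.
PROMPT_TEMPLATE = "{instruction}\n### Response:"


def build_prompt(instruction: str) -> str:
    """Wrap a raw instruction in the template the model was fine-tuned on."""
    return PROMPT_TEMPLATE.format(instruction=instruction.strip())


print(build_prompt("Concatenate two .bin files into one."))
```

Keeping the template in one place makes it easy to swap when switching models, since each fine-tune tends to expect its own format.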