mudler / LocalAI

:robot: The free, Open Source OpenAI alternative. Self-hosted, community-driven and local-first. Drop-in replacement for OpenAI running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many more model architectures. Generates text, audio, video and images, with voice-cloning capabilities.
https://localai.io
MIT License

Falcon 40B: is there any way to run it with LocalAI? #1353

Closed: netandreus closed this issue 7 months ago

netandreus commented 7 months ago

Is there any way to run a Falcon 40B model with LocalAI? I'm trying these models:

and these backends:

without any success.

My model config.yml file:

context_size: 2000
f16: true
gpu_layers: 1
name: wizardlm-uncensored-falcon-40b
parameters:
  model: wizardlm-uncensored-falcon-40b.ggccv1.q4_0.bin
  temperature: 0.9
  top_k: 40
  top_p: 0.65
embeddings: true

Model = wizardlm-uncensored-falcon-40b.ggccv1.q4_0.bin, backend = falcon

Error: stderr error loading model: falcon.cpp: tensor 'transformer.word_embeddings.weight' has wrong shape; expected  8192 x 65024, got  8192 x 65025

Model = wizardlm-uncensored-falcon-40b.ggccv1.q4_0.bin, backend = falcon-ggml

Error: stderr falcon_model_load: invalid model file '/Users/andrey/sandbox/llm/current/models/wizardlm-uncensored-falcon-40b.ggccv1.q4_0.bin' (bad magic)

Model = wizardlm-uncensored-falcon-40b.ggccv1.q4_0.bin, backend line commented out

Error: stderr gguf_init_from_file: invalid magic number 67676363

Model = falcon-40b-instruct.ggccv1.q4_0.bin, backend = falcon

Error: stderr -[MTLComputePipelineDescriptorInternal setComputeFunction:withType:]:692: failed assertion `computeFunction must not be nil.'

Model = falcon-40b-instruct.ggccv1.q4_0.bin, backend = falcon-ggml

Error: stderr falcon_model_load: invalid model file '/Users/andrey/sandbox/llm/current/models/falcon-40b-instruct.ggccv1.q4_0.bin' (bad magic)

Model = falcon-40b-instruct.ggccv1.q4_0.bin, without backend in config

Error: stderr error loading model: unknown (magic, version) combination: 67676363, 0000000a; is this really a GGML file?
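
Side note, going only by the magic numbers in the errors above: 67676363 spells "ggcc", which presumably is the GGCC container written by the falcon.cpp/ggllm.cpp quantization tools (the files are named .ggccv1), and that would explain why both the stock gguf loader and the falcon-ggml backend reject them as bad magic. A quick, hedged way to check what a file actually is:

# dump the first four bytes of the model file (path taken from the logs above)
xxd -l 4 /Users/andrey/sandbox/llm/current/models/wizardlm-uncensored-falcon-40b.ggccv1.q4_0.bin
# a real gguf file starts with the literal bytes "GGUF"; anything else
# (such as the 0x67676363 / "ggcc" magic reported above) is not gguf,
# so the default llama-cpp backend will refuse to load it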

LocalAI version:

commit 8034ed3473fb1c8c6f5e3864933c442b377be52e (HEAD -> master, origin/master, origin/HEAD)
Author: Jesús Espino <jespinog@gmail.com>
Date:   Sun Oct 15 09:17:41 2023 +0200

Environment, CPU architecture, OS, and Version:

macOS Ventura 13.5.2 (22G91)
Apple Silicon M2

Describe the bug

Errors when trying to chat with the model.

To Reproduce (curl):

curl --location 'http://localhost:8080/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Accept: application/json' \
--data '{
    "model": "falcon-40b-instruct-q4",
    "messages": [
        {
            "role": "user",
            "content": "What is the capital of UAE?"
        }
    ],
    "temperature": 0.9,
    "top_p": 1,
    "n": 1,
    "stream": false,
    "max_tokens": 250,
    "presence_penalty": 0,
    "frequency_penalty": 0
}'

Expected behavior: some response.

Logs: full logs attached:

falcon_40b_local_ai_full_logs.txt

localai-bot commented 7 months ago

:warning::warning::warning::warning::warning:

Hi! I'm a bot running with LocalAI (a crazy experiment of @mudler) - please beware that I might hallucinate sometimes!

but... I can also be funny or helpful :smile_cat: and I can give generally good tips, or point to places to look in the documentation or the code, based on what you wrote in the issue.

Don't engage in conversation with me; I don't support replying (yet)!

:warning::warning::warning::warning::warning:

ERROR: The prompt size exceeds the context window size and cannot be processed.

Sources:

netandreus commented 7 months ago

@mudler, do you have any idea what is wrong?

mudler commented 7 months ago

@netandreus did you try falcon gguf files? GGML files are quite outdated now; those should work with the default llama-cpp backend as of now. Also, which version of LocalAI are you trying this with?
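
For what it's worth, a minimal sketch of what such a config could look like, assuming a GGUF conversion of the model is available (the filename below is hypothetical) and omitting the backend line so LocalAI falls back to the default llama-cpp backend, as suggested above; the other values are copied from the original config:

name: falcon-40b-instruct
context_size: 2000
f16: true
gpu_layers: 1
parameters:
  # hypothetical gguf conversion of the model, placed in the models directory
  model: falcon-40b-instruct.Q4_K_M.gguf
  temperature: 0.9
  top_k: 40
  top_p: 0.65

The same curl call as above, with "model" set to "falcon-40b-instruct", should then go through the llama-cpp backend rather than the falcon ones.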