vikhyat / moondream

tiny vision language model
https://moondream.ai
Apache License 2.0

create-gguf.py and adding the model to ollama #131

Open LrsChrSch opened 2 months ago

LrsChrSch commented 2 months ago

Hey there! Had an issue a few months ago about not being able to quantize the model in order to run it using ollama.

Figured it out back then and all was good.

Today I wanted to train a different model with the 'new' (three months old) finetuning script and the new model revision. Finetuning worked like a charm, and I got a model out of it that I was happy with.

Now to add it to ollama: I used the same approach as last time, which is running create-gguf.py with the model.safetensors file and the path to the tokenizer directory (I git cloned the Hugging Face repo and pointed it there).
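For reference, the invocation looked roughly like this (paths are placeholders and the argument order is from memory, so check the script if it differs):

git clone https://huggingface.co/vikhyatk/moondream2
python create-gguf.py /path/to/finetuned/model.safetensors ./moondream2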

The resulting GGUFs also look fine to me: one is about 910 MB and the other 2770 MB.

My Modelfile now looks like this (I didn't want to quantize to Q4_0 directly, and I also saw that ollama has a --quantize flag built in):

FROM ./moondream2-mmproj-f16.gguf
FROM ./moondream2-text-model-f16.gguf

PARAMETER stop <|endoftext|>

Using ollama create [modelname] with this works just fine.
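For completeness, the create and run steps were essentially the following (moondream2-ft is just my placeholder model name; --quantize is the built-in quantization mentioned above):

ollama create moondream2-ft -f ./Modelfile
# alternatively, quantize during import instead of keeping F16:
ollama create moondream2-ft -f ./Modelfile --quantize q4_0
ollama run moondream2-ft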

Running the model using ollama run [modelname] fails with: Error: llama runner process has terminated: GGML_ASSERT(ggml_can_mul_mat(a, b)) failed

ollama show [modelname] displays the following:

Model
        arch                    phi2
        parameters              463.89M
        quantization            F16
        context length          2048
        embedding length        2048

Projector
        arch                            clip
        parameters                      463.89M
        projector type                  mlp
        embedding length                1152
        projection dimensionality       2048

Parameters
        stop            "<|endoftext|>"
        temperature     0

License
        Apache License
        Version 2.0, January 2004

This differs from what the official moondream:v2 model shows:

Model
        arch                    phi2
        parameters              1B
        quantization            Q4_0
        context length          2048
        embedding length        2048

Projector
        arch                            clip
        parameters                      454.45M
        projector type                  mlp
        embedding length                1152
        projection dimensionality       2048

Parameters
        stop            "<|endoftext|>"
        stop            "Question:"
        temperature     0

License
        Apache License
        Version 2.0, January 2004

Notice the different parameter counts: my import reports 463.89M for both the text model and the projector, as if ollama read the same model in twice, whereas the official model reports 1B and 454.45M.

I don't know whether this is an issue with create-gguf.py (it does seem to do what it's supposed to) or with ollama.

Hope you can help, and thank you so much in advance! If you need more info, I'll be happy to provide it. I'll also try poking around a bit more and see if I can find the issue myself.

LrsChrSch commented 2 months ago

Small update here:

Using model revision '2024-05-08' to train a model with the same workflow works perfectly fine. It seems like ollama cannot handle the newer revisions of this model.
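In case it helps anyone else, this is the gist of how I pin that revision when loading the model for finetuning (a sketch assuming the transformers-based finetuning script and the vikhyatk/moondream2 Hugging Face repo):

from transformers import AutoModelForCausalLM, AutoTokenizer

MD_REVISION = "2024-05-08"  # last revision that still works with ollama for me

tokenizer = AutoTokenizer.from_pretrained("vikhyatk/moondream2", revision=MD_REVISION)
model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision=MD_REVISION,
    trust_remote_code=True,  # moondream ships custom modeling code on the Hub
)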

vikhyat commented 2 months ago

The change we made to support higher resolution images hasn't been ported to llama.cpp/ollama yet: https://github.com/vikhyat/moondream/commit/ffbf8228aca7138fb55cee2119237d433f8431e2