vikhyat / moondream

tiny vision language model
https://moondream.ai
Apache License 2.0

Adding custom finetuned (and converted+quantized) model to Ollama #95

Closed LrsChrSch closed 3 months ago

LrsChrSch commented 4 months ago

Hey there!

First off: Thank you for this amazing model and all the work you put into it. So far this thing is really impressive (especially at this size).

I was messing around with it a bunch over the last couple of days because I have a project in mind. So I used LLaVA to make a more specific dataset for my task, which I then used to fine-tune the model. All of that worked pretty much flawlessly thanks to the finetuning notebook.

So now I have a .safetensors model, which I can load as usual using the transformers library. So far so good. I wanted the whole project to run on a Raspberry Pi with 4 GB of RAM. I did a test beforehand with the base model and the official Ollama integration, which also worked really well and didn't run out of RAM (yay!).

So the next step would be to create the GGUFs and make a custom Modelfile. create_gguf.py worked to create the two files for the text model and the projector. But here's where I'm currently stuck: I can create a Modelfile (basically a copy of the moondream:v2 Modelfile that I got with ollama show moondream:v2 --modelfile, but with the paths changed to point at my GGUFs), but when I try to run it I get the following error:

llama runner process has terminated: exit status 0xc0000409
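
For reference, a minimal sketch of the kind of Modelfile being described (file names are placeholders, and I'm assuming the projector is referenced by a second FROM line, as in the original moondream:v2 Modelfile; the TEMPLATE and PARAMETER lines would be carried over unchanged from the ollama show output):

# Hypothetical Modelfile -- GGUF paths are placeholders
FROM ./moondream-text-model.gguf
FROM ./moondream-mmproj.gguf
# keep the TEMPLATE and PARAMETER lines exactly as printed by
# `ollama show moondream:v2 --modelfile`

The custom model would then be built with ollama create my-moondream -f Modelfile and run with ollama run my-moondream.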

I also noticed that my files are much larger than the ones used by Ollama (~888 MB for the projector and ~2772 MB for the text model vs. 910 MB and 829 MB). I'm guessing that Ollama is not using the create_gguf.py script, or that there is some custom code for quantization and conversion. But I don't know; there's a very high chance that I'm just missing something very obvious lol.

So I guess my question is: how can I convert my fine-tuned .safetensors model to the right (small) format and add it as a custom model to Ollama?

Hope you can help and thank you so much in advance!

j0yk1ll commented 3 months ago

@LrsChrSch This seems to be a problem with Ollama; multiple people have run into it: https://github.com/ollama/ollama/issues/4457

For the time being, maybe try llama-cpp-python.

Based on their docs (https://llama-cpp-python.readthedocs.io/en/latest/#multi-modal-models):

from llama_cpp import Llama
from llama_cpp.llama_chat_format import MoondreamChatHandler

chat_handler = MoondreamChatHandler(clip_model_path="path/to/mmproj/mmproj.bin")

llm = Llama(
    model_path="./path/to/model/model.gguf",
    chat_handler=chat_handler,
    n_ctx=2048,  # n_ctx should be increased to accommodate the image embedding
)

llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an assistant who perfectly describes images."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}}
            ]
        }
    ]
)

LrsChrSch commented 3 months ago

Hey there! Thank you so much for your response!

I tried to get it to work using llama-cpp-python and it also gave me an error. This one was more descriptive though: error loading model vocabulary: cannot find tokenizer merges in model file

So there was something wrong with my model, and I also figured out what it was. create_gguf.py asks for a tokenizer input, which you can just set to "vikhyatk/moondream2" (which is what I did when I first tried it). This doesn't give an error during conversion, as the tokenizer simply gets loaded via the Hugging Face library, but it apparently also means that the merges and special tokens don't get added to the GGUF. The help text in the script correctly tells you to give it the path to a local directory, which I had overlooked.

So the solution is to download the moondream repository from Hugging Face and point create_gguf.py to the directory with all the tokenizer files and to the fine-tuned .safetensors for the model. The resulting .gguf files load perfectly in both llama-cpp-python and Ollama!
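
A minimal sketch of that download step, assuming huggingface_hub is used for it (the local_dir is just an example path):

# Download the full vikhyatk/moondream2 repo so the tokenizer files
# (merges, special tokens, config) end up in a local directory
# that create_gguf.py can be pointed at
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="vikhyatk/moondream2",
    local_dir="./moondream2",  # example path
)
print(local_path)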

The only thing that's missing now is a way to quantize the text model. I'll look into that today, and hopefully I can update and close this "issue" then :) If you have any advice on that too, it would be greatly appreciated, of course.

LrsChrSch commented 3 months ago

Alright, I think I can close this. Quantization is made really easy by the quantize.exe from the regular llama.cpp repository. This also threw an error before, but it doesn't anymore.
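
For anyone following along, the invocation looks roughly like this (file names and the Q4_K_M quantization type are just examples; on Linux/macOS the binary is called quantize, or llama-quantize in newer llama.cpp builds):

# quantize the f16 text-model GGUF down to 4 bits (example type: Q4_K_M)
./quantize moondream-text-model-f16.gguf moondream-text-model-q4_k_m.gguf Q4_K_M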

Thanks again @j0yk1ll, without your suggestion I probably wouldn't have found out what was wrong with it :D

j0yk1ll commented 3 months ago

@LrsChrSch Happy that I could help.

If you have the time, I think it would be amazing if you could create a small tutorial with some screenshots detailing the steps in a more visual way. I imagine there are more people out there who are interested in creating their own quantized version, and this would be a great help.

LrsChrSch commented 3 months ago

Yeah, I can try to do a write-up of the whole process once I'm a little more familiar with how all of this works (and once I have some more time). I think it would be really helpful too, because this model can produce really high-quality results for single tasks once it's trained on them :)