vikhyat / moondream

tiny vision language model
https://moondream.ai
Apache License 2.0

Please consider creating a quantized version or even better a CoreML model #26

Open ktsi opened 9 months ago

ktsi commented 9 months ago

Hi. Thanks for this great small model! Do you have plans to provide a GGML version? A CoreML version?

vikhyat commented 9 months ago

Would absolutely love a community contribution here. I'm currently focused on training the next version, so I won't be able to work on quantizing until after that.

ktsi commented 9 months ago

If you can point me in the general direction, I could try to mess with it. Noob here, unfortunately.

CyberTimon commented 9 months ago

I'm experimenting with quants today. Will keep you guys updated.

CyberTimon commented 9 months ago

I've tried to integrate this model with transformers but couldn't get the image embeddings implemented correctly. Text generation worked, even quantized.

The model only really becomes useful once we can run it quantized; unquantized it still uses way too much VRAM, so a quantized bigger model remains the better option. I really hope there is progress soon on quantization or llama.cpp integration.

sujitvasanth commented 9 months ago

Hi, I've got it working with transformers so you can show it items on the webcam and it will automatically run on scene change. My prompt is "Ignore mouse, pad, mat, desk; describe central object in 15 words." Here is my code (screenshot: 2024-01-28 14-16-41):

import tkinter as tk
from PIL import Image, ImageTk
from moondream import VisionEncoder, TextModel
from huggingface_hub import snapshot_download
from threading import Thread
from transformers import TextIteratorStreamer
import cv2
import numpy as np
import re
import time

model_running = False
prev_frame = None
change_detected = False
stabilization_time = 1200  # Time to wait for stabilization in milliseconds
last_change_time = 0
change_threshold = 50000   # Changed-pixel count that triggers a scene change
change_sensitivity = 10    # Per-pixel difference needed to count as "changed"

# Download and load the model outside of the function
model_path = snapshot_download("vikhyatk/moondream1")
vision_encoder = VisionEncoder(model_path)
text_model = TextModel(model_path)

# Initialize webcam capture
cap = cv2.VideoCapture(0)

# Model inference function
def moondream(prompt, img=None, vision_encoder=vision_encoder, text_model=text_model):
    global model_running
    model_running = True
    if img is None:
        # Capture a frame from the webcam
        ret, frame = cap.read()
        if not ret:
            return "Failed to capture image from the webcam."
        # Convert the frame to a PIL image
        img = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

    image_embeds = vision_encoder(img)
    streamer = TextIteratorStreamer(text_model.tokenizer, skip_special_tokens=True)
    generation_kwargs = dict(
        image_embeds=image_embeds, question=prompt, streamer=streamer
    )
    thread = Thread(target=text_model.answer_question, kwargs=generation_kwargs)
    thread.start()

    buffer = ""
    for new_text in streamer:
        if not new_text.endswith("<") and not new_text.endswith("END"):
            buffer += new_text
            update_output(buffer)
        else:
            new_text = re.sub("<$", "", re.sub("END$", "", new_text))
            buffer += new_text
            update_output(buffer)
            break
    model_running = False

# Update output in the GUI
def update_output(text):
    output_text.delete('1.0', tk.END)
    output_text.insert(tk.END, text)

# Function to handle button click
def on_submit():
    prompt_text = prompt_entry.get("1.0", "end-1c")
    # Run the moondream function in a separate thread
    if not model_running:
        Thread(target=moondream, args=(prompt_text,)).start()

# Function to update the image on the label
def update_image():
    global prev_frame, change_detected, last_change_time
    ret, frame = cap.read()
    if ret:
        frame = cv2.flip(frame, 1)
        frame = cv2.resize(frame, (480, 340))
        # Convert the image to RGB
        cv2image = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        img = Image.fromarray(cv2image)
        imgtk = ImageTk.PhotoImage(image=img)
        webcam_label.imgtk = imgtk
        webcam_label.config(image=imgtk)

        current_time = int(round(time.time() * 1000))

        # Check for scene change
        if prev_frame is not None:
            change = detect_change(prev_frame, frame)
            #print(change)
            if change > change_threshold:
                change_detected = True
                last_change_time = current_time
            elif change_detected and (current_time - last_change_time > stabilization_time):
                change_detected = False
                on_submit()

        prev_frame = frame
    # Reschedule even if the frame grab failed, so the preview keeps updating
    webcam_label.after(100, update_image)

def detect_change(frame1, frame2):
    # Count pixels whose per-channel difference exceeds the sensitivity
    diff = cv2.absdiff(frame1, frame2)
    non_zero_count = np.sum(diff > change_sensitivity)
    return non_zero_count

root = tk.Tk()  # Tkinter GUI setup
root.title("Moondream")
root.geometry("800x380")  # You can adjust the size as needed

# Webcam image label
webcam_label = tk.Label(root)
webcam_label.place(x=10, y=10, width=480, height=340)

# Prompt label and entry
prompt_label = tk.Label(root, text="Prompt")
prompt_label.place(x=500, y=10)
prompt_entry = tk.Text(root, height=3, width=40, font=("Helvetica", 10), wrap=tk.WORD)
prompt_entry.place(x=500, y=30)

# Submit button
submit_button = tk.Button(root, text="Submit", command=on_submit)
submit_button.place(x=721, y=75)

# Output label and text area
output_label = tk.Label(root, text="Response")
output_label.place(x=500, y=95)
output_text = tk.Text(root, height=10, width=40, font=("Helvetica", 10), wrap=tk.WORD)
output_text.place(x=500, y=115)
# Start capturing and updating the image
update_image()

# Start the GUI
root.mainloop()

Would be great to have a quantized version!

Yazorp commented 9 months ago

While not a fully quantized version, you can cut memory usage almost in half by switching from float32 to float16 (with no noticeable loss in output quality).

In text_model.py, add the dtype as follows:

        self.model = load_checkpoint_and_dispatch(
            self.model,
            f"{model_path}/text_model.pt",
            device_map={"": self.device.type},
            dtype=torch.float16
        )

In vision_encoder.py, change all references from torch.float32 to torch.float16.

sujitvasanth commented 9 months ago

@Yazorp Presumably that also gives a speed advantage? Can you quantify it?

also I have posted a model request for TheBloke's discord server to quantize the model. Please upvote the request here https://discord.com/channels/1111983596572520458/1201207189507932260

Yazorp commented 9 months ago

In theory, if the GPU supports FP16 processing then it should also increase performance somewhat (it generally depends on the driver / CUDA implementation). Note that switching from FP32 to FP16 will likely not work on CPU-only devices, so a branching check on the device type would be a better approach, as sketched below. INT4 / INT8 quantization should work for both and could be quite a bit faster.
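For reference, a minimal sketch of that branching check, using a hypothetical pick_dtype helper (the actual load sites are in text_model.py and vision_encoder.py as described above):

import torch

def pick_dtype(device: torch.device) -> torch.dtype:
    # fp16 halves memory on CUDA; fall back to fp32 on CPU, where
    # PyTorch's fp16 operator coverage is incomplete.
    return torch.float16 if device.type == "cuda" else torch.float32

The result can then be passed as the dtype argument to load_checkpoint_and_dispatch instead of hard-coding torch.float16.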

vikhyat commented 9 months ago

That's right. I had it default to fp32 because fp16 doesn't work on CPU with the current PyTorch implementation. The model itself was trained in 16-bit precision, so you won't see any quality loss running in fp16.

sujitvasanth commented 8 months ago

@vikhyat Thanks for moving the custom model code to Hugging Face. As you rightly say, it's much better and lighter weight. I was able to swap almost interchangeably between TinyLlama models and yours; see https://github.com/sujitvasanth/TinyLlava-Tk and https://github.com/sujitvasanth/TinyLlava-Tk/blob/main/moondreamTk.py (your model). Now that it's safetensors/Hugging Face, I tried 4-bit quantisation with bitsandbytes. It doesn't raise any errors, but the quantized model's text output is confused: it produces normal English sentences but hallucinates objects unrelated to the images. How can I fix this so I can finally try quantisation? Unquantized, TinyLlava takes 4.5 GB of GPU memory while your model takes 9 GB despite the tensor file being 3.5 GB, though yours is a little faster.

Also, Hugging Face inference currently seems to default to CPU, so I had to add .to(0) to the model creation to get it on the GPU.
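For context, a minimal sketch of the 4-bit loading pattern being described, assuming the moondream1 Hub repo and default NF4 settings (whether moondream's vision layers survive quantization is exactly the open question above):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization via bitsandbytes, computing in fp16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# trust_remote_code pulls moondream's custom modeling code from the Hub
model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream1",
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map={"": 0},  # place on GPU 0 rather than defaulting to CPU
)

If the hallucination comes from quantizing the vision layers, the llm_int8_skip_modules parameter of BitsAndBytesConfig may help keep them in higher precision, though the right module names depend on moondream's internals.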

oliverbob commented 8 months ago

(quotes sujitvasanth's webcam demo comment and code above in full)

How do we get this running?

I got an ImportError: cannot import name 'VisionEncoder' from 'moondream'.

sujitvasanth commented 8 months ago

@oliverbob I think the reason is that the Hugging Face model and GitHub repos have been updated since I posted the original code, for the better I might add: the model can now be run with trust_remote_code=True, so the GitHub repo (i.e. the reference to the moondream repo in the code I posted) is no longer needed for inference, only the Hugging Face repo. I suggest using the version on my GitHub, which is more up to date: https://github.com/sujitvasanth/TinyLlava-Tk/blob/main/moondreamTk.py. There's a video demo of it working here: https://github.com/sujitvasanth/TinyLlava-Tk/tree/main
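A minimal sketch of that loading path (method names follow the moondream2 model card of the time and may differ across revisions):

from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vikhyatk/moondream2"
# trust_remote_code=True loads the custom model class from the Hub,
# so no clone of the GitHub repo is needed
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).to(0)
tokenizer = AutoTokenizer.from_pretrained(model_id)

image = Image.open("example.jpg")  # any local test image
enc_image = model.encode_image(image)
print(model.answer_question(enc_image, "Describe this image.", tokenizer))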

You can see Vik's GitHub repo history if you want his code from the time I wrote mine. I also have frozen copies of the older versions of the moondream model on my Hugging Face repo, which I saved because I anticipated this would happen: https://huggingface.co/sujitvasanth

Regards, Sujit. PS: @vikhyat please keep going on your great project and model!

oliverbob commented 8 months ago

Has anyone been able to play around with converting the mentioned models to GGUF? TinyLlava or Moondream?

oliverbob commented 8 months ago

Do you have an example of using it with a web UI like Gradio?

sujitvasanth commented 8 months ago

There is a Gradio example in the original repo: https://github.com/vikhyat/moondream/blob/main/gradio_demo.py. It might be out of date now, but it can be amended; a stripped-down sketch is below.
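A minimal sketch of such a wrapper, assuming the moondream2 transformers API shown earlier (the official gradio_demo.py streams tokens and is more elaborate):

import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vikhyatk/moondream2"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

def answer(image, question):
    # Encode the PIL image once, then answer the free-form question
    enc_image = model.encode_image(image)
    return model.answer_question(enc_image, question, tokenizer)

gr.Interface(
    fn=answer,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Question")],
    outputs=gr.Textbox(label="Answer"),
    title="moondream",
).launch()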

KPCOFGS commented 6 months ago

Just to let you guys know, somebody has made a GGUF version of this model: https://huggingface.co/sroecker/moondream2-GGUF/tree/main
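For running a GGUF build from Python, recent llama-cpp-python releases ship a Moondream chat handler; a minimal sketch (the repo id and filename globs below follow the llama-cpp-python docs and are assumptions, so check what your chosen GGUF repo actually contains):

from llama_cpp import Llama
from llama_cpp.llama_chat_format import MoondreamChatHandler

# The mmproj GGUF holds the vision projector; the text model is a separate GGUF
chat_handler = MoondreamChatHandler.from_pretrained(
    repo_id="vikhyatk/moondream2",
    filename="*mmproj*",
)
llm = Llama.from_pretrained(
    repo_id="vikhyatk/moondream2",
    filename="*text-model*",
    chat_handler=chat_handler,
    n_ctx=2048,  # leave room for the image embedding plus the answer
)

response = llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }]
)
print(response["choices"][0]["message"]["content"])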