Open ktsi opened 9 months ago
Would absolutely love a community contribution here, I’m currently focused on training the next version so I won’t be able to work on quantizing until after that.
If you can point me in the general direction, I could try to mess with it. Noob here, unfortunately.
I'm experimenting with quants today. Will keep you guys updated.
I've tried to integrate this model with transformers but couldn't manage to correctly implement the image embeddings. Text was generating successfully even quantized.
The model would actually be useful if we could use it quantized. Otherwise it still uses way too much VRAM, so a bigger quantized model is still the better choice. I really hope there is some progress soon on quantization or llama.cpp integration.
Hi, I've got it working with transformers so you can show it items on a webcam and it will automatically run on scene change. My prompt is "Ignore mouse, pad, mat, desk; describe central object in 15 words." Here is my code:
import tkinter as tk
from tkinter import messagebox
from PIL import Image, ImageTk
from moondream import VisionEncoder, TextModel
from huggingface_hub import snapshot_download
from threading import Thread
from transformers import TextIteratorStreamer
import cv2
import numpy as np
import re
import time

model_running = False
prev_frame = None
change_detected = False
stabilization_time = 1200  # Time to wait for stabilization in milliseconds
last_change_time = 0
change_threshold = 50000  # Changed-pixel count that counts as a scene change
change_sensitivity = 10  # Per-pixel difference needed to count a pixel as changed

# Download and load the model outside of the function
model_path = snapshot_download("vikhyatk/moondream1")
vision_encoder = VisionEncoder(model_path)
text_model = TextModel(model_path)

# Initialize webcam capture
cap = cv2.VideoCapture(0)

# Model inference function
def moondream(prompt, img=None, vision_encoder=vision_encoder, text_model=text_model):
    global model_running
    model_running = True
    if img is None:
        # Capture a frame from the webcam
        ret, frame = cap.read()
        if not ret:
            model_running = False  # Reset the flag so later requests are not blocked
            return "Failed to capture image from the webcam."
        # Convert the frame to a PIL image
        img = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    image_embeds = vision_encoder(img)
    streamer = TextIteratorStreamer(text_model.tokenizer, skip_special_tokens=True)
    generation_kwargs = dict(
        image_embeds=image_embeds, question=prompt, streamer=streamer
    )
    # Generate in a background thread and stream tokens as they arrive
    thread = Thread(target=text_model.answer_question, kwargs=generation_kwargs)
    thread.start()
    buffer = ""
    for new_text in streamer:
        if not new_text.endswith("<") and not new_text.endswith("END"):
            buffer += new_text
            update_output(buffer)
        else:
            # Strip the end-of-answer marker and stop streaming
            new_text = re.sub("<$", "", re.sub("END$", "", new_text))
            buffer += new_text
            update_output(buffer)
            break
    model_running = False

# Update output in the GUI
def update_output(text):
    output_text.delete('1.0', tk.END)
    output_text.insert(tk.END, text)

# Function to handle button click
def on_submit():
    prompt_text = prompt_entry.get("1.0", "end-1c")
    # Run the moondream function in a separate thread
    if not model_running:
        Thread(target=moondream, args=(prompt_text,)).start()

# Function to update the image on the label
def update_image():
    global prev_frame, change_detected, last_change_time
    ret, frame = cap.read()
    if ret:
        frame = cv2.flip(frame, 1)
        frame = cv2.resize(frame, (480, 340))
        # Convert the image to RGB
        cv2image = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        img = Image.fromarray(cv2image)
        imgtk = ImageTk.PhotoImage(image=img)
        webcam_label.imgtk = imgtk
        webcam_label.config(image=imgtk)
        current_time = int(round(time.time() * 1000))
        # Check for scene change
        if prev_frame is not None:
            change = detect_change(prev_frame, frame)
            #print(change)
            if change > change_threshold:
                change_detected = True
                last_change_time = current_time
            elif change_detected and (current_time - last_change_time > stabilization_time):
                # The scene changed and has been stable long enough: query the model
                change_detected = False
                on_submit()
        prev_frame = frame
    webcam_label.after(100, update_image)

def detect_change(frame1, frame2):
    # Count values whose absolute difference exceeds the sensitivity
    diff = cv2.absdiff(frame1, frame2)
    non_zero_count = np.sum(diff > change_sensitivity)
    return non_zero_count

root = tk.Tk()  # Tkinter GUI setup
root.title("Moondream")
root.geometry("800x380")  # You can adjust the size as needed

# Webcam image label
webcam_label = tk.Label(root)
webcam_label.place(x=10, y=10, width=480, height=340)

# Prompt label and entry
prompt_label = tk.Label(root, text="Prompt")
prompt_label.place(x=500, y=10)
prompt_entry = tk.Text(root, height=3, width=40, font=("Helvetica", 10), wrap=tk.WORD)
prompt_entry.place(x=500, y=30)

# Submit button
submit_button = tk.Button(root, text="Submit", command=on_submit)
submit_button.place(x=721, y=75)

# Output label and text area
output_label = tk.Label(root, text="Response")
output_label.place(x=500, y=95)
output_text = tk.Text(root, height=10, width=40, font=("Helvetica", 10), wrap=tk.WORD)
output_text.place(x=500, y=115)

# Start capturing and updating the image
update_image()

# Start the GUI
root.mainloop()
would be great to have a quantized version!!
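The scene-change trigger in the webcam script above is essentially a debounce: a large frame difference arms the trigger, and the model is only queried once the scene has stayed quiet for the stabilization time. Here is a minimal sketch of just that logic, without the webcam or GUI. The class name `SceneChangeDebouncer` and its `update` method are hypothetical (they are not in the posted script); the default values are copied from it.

```python
class SceneChangeDebouncer:
    """Fires once after a burst of change has settled (hypothetical helper)."""

    def __init__(self, change_threshold=50000, stabilization_ms=1200):
        self.change_threshold = change_threshold
        self.stabilization_ms = stabilization_ms
        self.change_detected = False
        self.last_change_ms = 0

    def update(self, change, now_ms):
        """Return True when the scene changed and has since stayed stable."""
        if change > self.change_threshold:
            # Still moving: arm the trigger and restart the quiet period
            self.change_detected = True
            self.last_change_ms = now_ms
            return False
        if self.change_detected and now_ms - self.last_change_ms > self.stabilization_ms:
            # Quiet for long enough: fire once and disarm
            self.change_detected = False
            return True
        return False


debouncer = SceneChangeDebouncer()
# (changed-value count, timestamp in ms) for successive frames; made-up numbers
samples = [(0, 0), (80000, 100), (90000, 200), (10, 300), (10, 1000), (10, 1500)]
fired = [debouncer.update(change, t) for change, t in samples]
print(fired)
```

Only the last sample fires, because it is the first one more than 1200 ms after the last large change at 200 ms. Keeping this state in one object rather than in module globals would also make the thresholds easier to tune separately from the GUI code.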
While not a full quantized version, you can cut the memory usage almost in half by changing from float32 to float16 (with no noticeable loss in performance).
In text_model.py, add dtype as follows:

self.model = load_checkpoint_and_dispatch(
    self.model,
    f"{model_path}/text_model.pt",
    device_map={"": self.device.type},
    dtype=torch.float16,
)

In vision_encoder.py, change all references from torch.float32 to torch.float16.
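To see why this roughly halves memory: float32 stores each weight in 4 bytes and float16 in 2, so the weights themselves shrink by exactly half. A back-of-the-envelope sketch follows; the parameter count is an assumption used only for illustration, not a figure taken from this thread.

```python
# Bytes needed to store one parameter in each dtype
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gb(n_params, dtype):
    """Memory taken by the weights alone, in GB (10**9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


N_PARAMS = 1_600_000_000  # ASSUMPTION: illustrative parameter count

fp32_gb = weight_memory_gb(N_PARAMS, "float32")
fp16_gb = weight_memory_gb(N_PARAMS, "float16")
print(f"float32: {fp32_gb:.1f} GB, float16: {fp16_gb:.1f} GB")
```

Total usage drops by "almost" half rather than exactly half because activations, framework overhead, and the CUDA context are not all proportional to the weight dtype.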
@Yazorp presumably that gives a performance speed advantage? can you quantify it?
also I have posted a model request for TheBloke's discord server to quantize the model. Please upvote the request here https://discord.com/channels/1111983596572520458/1201207189507932260
In theory, if the GPU supports FP16 processing then it should also increase performance somewhat (it generally depends on the driver / CUDA implementation). Note that the change to use FP16 instead of FP32 will likely not work on CPU-only devices, so a branching check for a CPU device could be a better approach. INT4 / INT8 quantization should work for both and potentially be quite a bit faster.
That's right - I had it default to fp32 because fp16 doesn't work on CPU with the current PyTorch implementation. The model itself was trained in 16-bit precision, so you won't see any quality loss running in fp16.
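The branching check suggested above could look something like this: half precision on a CUDA device, full precision everywhere else. This is a sketch, not the repo's code. Both function names are hypothetical, and treating every non-CUDA device (including MPS) as float32 is an assumption.

```python
def pick_dtype_name(device_type: str) -> str:
    """Half precision on CUDA, full precision elsewhere (hypothetical helper)."""
    return "float16" if device_type == "cuda" else "float32"


def pick_torch_dtype(device_type: str):
    """Resolve the name to a torch dtype; torch is imported lazily on first use."""
    import torch

    return getattr(torch, pick_dtype_name(device_type))


print(pick_dtype_name("cuda"), pick_dtype_name("cpu"))
```

In text_model.py this would replace the hard-coded value, e.g. `dtype=pick_torch_dtype(self.device.type)`, so the same code path works on CPU-only machines.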
@vikhyat thanks for moving the custom model code to Hugging Face. As you rightly say, it is much better and lighter weight. I was able to swap between TinyLlama models and your own almost interchangeably; see https://github.com/sujitvasanth/TinyLlava-Tk and https://github.com/sujitvasanth/TinyLlava-Tk/blob/main/moondreamTk.py (your model). Now that it is safetensors/Hugging Face, I tried 4-bit quantisation with bitsandbytes. It doesn't raise any errors, but the quantized model's text output is confused: it writes normal English sentences but hallucinates objects that are unrelated to the image. How can I fix this so I can finally try quantisation? TinyLlava unquantized takes 4.5 GB of GPU memory while your model takes 9 GB despite the tensor file being 3.5 GB, but yours is a little faster.
Also, Hugging Face inference currently seems to default to CPU, so I had to add .to(0) to the model creation to get it onto the GPU.
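One possible contributor (an assumption, not a diagnosis of this model) is simply how coarse 4-bit rounding is compared to 8-bit. The toy below uses plain round-to-nearest absmax quantization on made-up weights; it is not the NF4/FP4 scheme that bitsandbytes uses, and the function names are hypothetical.

```python
def quantize_absmax(values, bits):
    """Round-to-nearest symmetric quantization, then dequantize (toy scheme).

    Assumes at least one non-zero value, otherwise the scale would be zero.
    """
    levels = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(v) for v in values) / levels
    return [round(v / scale) * scale for v in values]


def mean_abs_error(values, bits):
    """Average absolute difference between original and dequantized values."""
    dequantized = quantize_absmax(values, bits)
    return sum(abs(a - b) for a, b in zip(values, dequantized)) / len(values)


# Made-up weights, for illustration only
weights = [0.82, -0.41, 0.07, -0.95, 0.33, 0.015, -0.6, 0.22]

err8 = mean_abs_error(weights, 8)
err4 = mean_abs_error(weights, 4)
print(f"8-bit error: {err8:.4f}, 4-bit error: {err4:.4f}")
```

With only 7 positive levels, small weights collapse to zero or get rounded by a large relative amount. If the layers that carry the image embeddings are sensitive to this, keeping them in higher precision and quantizing only the rest would be one thing to try.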
How do we get this running?
I got a
ImportError: cannot import name 'VisionEncoder' from 'moondream' error.
@oliverbob I think the reason is that the Hugging Face model and GitHub repos have been updated since I posted the original code. That's for the better, I may add, as the model can now be run using trust_remote_code=True, so the GitHub repo (i.e. the reference to the Moondream repo in the code I posted) is no longer needed to run inference, only the Hugging Face repo. I suggest using the version on my GitHub, which is more up to date: https://github.com/sujitvasanth/TinyLlava-Tk/blob/main/moondreamTk.py There's a video demo of it working here: https://github.com/sujitvasanth/TinyLlava-Tk/tree/main
You can look at vik's GitHub repo history if you want to see his code as it was at the time of my writing. I have frozen copies of the older versions of the moondream model on my Hugging Face repo, which I saved because I anticipated this would happen: https://huggingface.co/sujitvasanth
regards, Sujit PS @vikhyat please keep going on your great project and model!
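For anyone hitting the same ImportError, loading through transformers with the custom code pulled from the Hugging Face repo would look roughly like this. It is a sketch under assumptions: the helper names are hypothetical, `revision` is the standard `from_pretrained` argument for pinning a commit or tag, and whether a given checkpoint still loads depends on the installed transformers version.

```python
def build_load_kwargs(revision=None):
    """Keyword arguments for from_pretrained (hypothetical helper)."""
    kwargs = {"trust_remote_code": True}  # run the model code shipped in the HF repo
    if revision is not None:
        kwargs["revision"] = revision  # pin a commit/tag so later updates don't break you
    return kwargs


def load_moondream(model_id="vikhyatk/moondream1", revision=None):
    """Load model and tokenizer; transformers is imported lazily on first use."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    kwargs = build_load_kwargs(revision)
    model = AutoModelForCausalLM.from_pretrained(model_id, **kwargs)
    tokenizer = AutoTokenizer.from_pretrained(model_id, **kwargs)
    return model, tokenizer


print(build_load_kwargs())
```

Pinning `revision` serves the same purpose as the frozen copies mentioned above: code written against one version of the repo keeps working after the upstream model is updated.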
Has anyone been able to try converting the mentioned models to GGUF? TinyLlava or Moondream?
Do you have an example of using it with a web ui like gradio?
There is a gradio example in the original repo: https://github.com/vikhyat/moondream/blob/main/gradio_demo.py It might be out of date now but can be amended.
Just to let you guys know, somebody has made a gguf version of this model: https://huggingface.co/sroecker/moondream2-GGUF/tree/main
Hi. Thanks for this great small model! Is it in your plans to provide a ggml version? A coreML version?