jabberjabberjabber / LLavaImageTagger

Creates an index of images, queries a local LLM and adds tags to the image metadata

qwen2_vl_7b does not yet have a GGUF model. Are we blocked until it is supported by KoboldCpp? #3

Closed · saket424 closed this issue 1 month ago

saket424 commented 1 month ago

https://www.reddit.com/r/LocalLLaMA/comments/1f4q0ag/qwen2_vl_7b_far_more_impressive_than_i_thought/?chainedPosts=t3_1f7cdhj
https://github.com/IuvenisSapiens/ComfyUI_Qwen2-VL-Instruct

jabberjabberjabber commented 1 month ago

We are limited to what KoboldCpp supports, since we use the KoboldCpp API directly. This script is built entirely around that concept: it removes the need to install machine learning Python libraries or compile binaries. It's one file to download and run, which removes 90% of the complexity of dealing with the LLM on our end.
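Roughly, the whole LLM interaction boils down to one HTTP call. Here's a minimal sketch (not the script's actual code) assuming a KoboldCpp instance on its default port 5001 with a vision projector loaded via `--mmproj`; the payload fields follow my reading of KoboldCpp's generate API:

```python
import base64
import requests

KOBOLD_URL = "http://localhost:5001/api/v1/generate"  # KoboldCpp's default port

def tag_image(image_path, prompt):
    """Send one image plus an instruction to a running KoboldCpp instance."""
    with open(image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode("utf-8")
    payload = {
        "prompt": prompt,
        "images": [img_b64],  # base64 images are handed to the loaded vision projector
        "max_length": 250,
        "temperature": 0.1,
    }
    resp = requests.post(KOBOLD_URL, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["results"][0]["text"]

print(tag_image("frame_0002.jpg", "List keywords for this image as a JSON object."))
```

Swap in a different model and nothing on this side changes, which is the whole point.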

We are sacrificing the bleeding edge for stability and convenience though. If you want to run the latest models you are going to be dealing with a whole lot of moving targets and the benefits are, in my opinion, not enough to make that worth it. The good stuff trickles down eventually anyway.

I think MiniCPM V 2.6 is pretty much all we need for this task. It follows instructions well enough (though not nearly as well as Phi for outputting JSON) and the image side is pretty great -- the OCR is top notch, and it is legitimately fast (~5 seconds per image on an RTX 3080). So I'm not sure what the benefit of a different model would be, except perhaps better instruction following, which the Qwen2-VL GitHub page itself flags as a limitation anyway:

> Limited Capacity for Complex Instruction: When faced with intricate multi-step instructions, the model's understanding and execution capabilities require enhancement.

Finally, I started this project with the goal of indexing large folder trees of digital camera dumps full of generically named files, where many shots are slightly different angles of the same thing. I have accumulated an unmanageable number of these over more than a decade of documenting hardware repairs, project logs, and various other things. Once the keywords are in the metadata, the images can be sorted dynamically through a program like Diffractor, without each image being tied to a spot in a database. Any of the multimodal models supported by KoboldCpp can do that adequately. Of course, everyone has a different problem to solve, but I'm not good enough at coding, or creative enough at imagining, to build anything beyond what I need for myself.
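For what it's worth, getting the keywords into the metadata so Diffractor and similar tools can see them is the easy part. Here's a sketch of one way to do it with the exiftool CLI (the script's actual implementation may differ; this assumes exiftool is on your PATH):

```python
import subprocess

def write_keywords(image_path, keywords):
    """Embed keywords as IPTC/XMP metadata via the exiftool CLI."""
    args = ["exiftool", "-overwrite_original"]
    for kw in keywords:
        # "+=" appends to list-valued tags instead of replacing them
        args += [f"-IPTC:Keywords+={kw}", f"-XMP:Subject+={kw}"]
    args.append(image_path)
    subprocess.run(args, check=True)

# keywords as parsed from the model's JSON reply
write_keywords("IMG_4021.jpg", ["hardware repair", "soldering", "workbench"])
```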

saket424 commented 1 month ago

@jabberjabberjabber I tend to agree with you that the best models will eventually make their way in and be supported. My use case is to tag a snapshot from a live camera and detect abnormal behavior, such as a fight or someone in distress, even as the scene is unfolding. The MiniCPM-V 2.6 model is plenty good as it is with your keyword captioner prompt.

(attached snapshot: frame_0002)

Output: Here's a JSON object with the generated keywords:

```json
{
  "Keywords": [
    "school", "locker room", "physical altercation", "students", "friction",
    "conflict", "uniforms", "physical education", "indoor setting", "gray",
    "blue", "yellow", "action", "schoolyard"
  ]
}
```

These keywords capture the main elements and context of the image, such as the school setting, the physical altercation between students, their uniforms, and the colors of the surroundings.
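Since the model wraps the JSON in conversational text, my plan is to pull the object out with a regex before checking for alert terms. A rough sketch of the loop I have in mind (the alert terms and reply handling are my own assumptions, not part of the script):

```python
import json
import re

ALERT_TERMS = {"physical altercation", "fight", "distress", "conflict"}

def extract_keywords(reply):
    """Pull the first JSON object out of a chatty model reply."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if not match:
        return []
    try:
        return json.loads(match.group(0)).get("Keywords", [])
    except json.JSONDecodeError:
        return []

# example: the kind of reply shown above
reply_text = """Here's a JSON object with the generated keywords:
{"Keywords": ["school", "physical altercation", "students"]}"""

keywords = extract_keywords(reply_text)
if ALERT_TERMS & {k.lower() for k in keywords}:
    print("possible incident: flag this frame for review")
```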