Integration of CogVLM2 Model Support Request

dicksensei69 commented 6 months ago

The recently launched CogVLM2 model series from THUDM offers significant improvements in image understanding and captioning capabilities. With its support for longer text inputs (up to 8K tokens) and higher image resolutions (up to 1344x1344 pixels), CogVLM2 could greatly enhance Taggui's automatic caption and tag generation feature.

Benefits:

Improved accuracy and quality of automatically generated captions and tags, leveraging CogVLM2's advanced image understanding capabilities.

Support for higher image resolutions, accommodating a wider range of use cases.

Enhanced user experience by providing access to the latest advancements in vision-language models within Taggui's familiar interface.

eraser851 commented 6 months ago

int4 version just dropped: https://huggingface.co/THUDM/cogvlm2-llama3-chat-19B-int4/tree/main

jhc13 commented 6 months ago

I plan to add this model, but unfortunately, I will be too busy for the next week or so.

jhc13 commented 6 months ago

It seems to be only available for Linux.

dicksensei69 commented 6 months ago

How strange, I thought they all ran on Transformers. I personally have a Linux system, so it won't be a problem for me, but I understand if you don't want to implement it anymore. It still looks like a powerful model, given its vision input size and Llama 3 base. Thanks for your efforts, I really like taggui :)

jhc13 commented 6 months ago

It does use Transformers, but it's not officially integrated into the library, so it contains custom code and dependencies.

I am still working on adding it for Linux users.
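For readers unfamiliar with what "custom code" means here: models that are not built into Transformers are typically loaded by passing `trust_remote_code=True`, which lets the library run the modeling code shipped in the model repository. The following is a minimal sketch under that assumption, not taggui's actual implementation. The helper names `is_supported_platform` and `load_cogvlm2` are hypothetical, the repository ID is taken from the link earlier in this thread, and the int4 checkpoint likely needs extra dependencies and a GPU that are not shown.

```python
import sys

# Repository ID taken from the link posted earlier in this thread.
MODEL_ID = "THUDM/cogvlm2-llama3-chat-19B-int4"


def is_supported_platform(platform: str = sys.platform) -> bool:
    """Return True on Linux, the only platform reported to work in this thread."""
    return platform.startswith("linux")


def load_cogvlm2(model_id: str = MODEL_ID):
    """Hypothetical helper: load the tokenizer and model through Transformers."""
    if not is_supported_platform():
        raise RuntimeError("CogVLM2 is reported to be available on Linux only.")

    # Imported lazily so the platform check works without Transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # trust_remote_code=True allows Transformers to execute the modeling code
    # stored in the model repository (assumed to be required for this model).
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    return tokenizer, model
```

Calling `load_cogvlm2()` downloads the weights on first use. Checking the platform first keeps the failure message clear on systems where the model's dependencies cannot be installed.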

jhc13 commented 6 months ago

The model has been added in v1.26.0 (Linux only).

jhc13 / taggui

Integration of CogVLM2 Model Support Request #149