Closed · gesen2egee closed this 3 months ago
I follow you completely, that's why I posted the same idea in #263 :)
I've implemented JoyCaption in my own fork if you want to give it a go (you'll need to install TagGUI manually from my fork, though), but there is one pretty big problem: JoyCaption requires Transformers 4.43+ (based on the official script), while CogVLM breaks with Transformers 4.42+. You can fix CogVLM by editing one of its Python scripts (there's a diff in the CogVLM2 LLaMA3 Chat repo on Hugging Face), but, y'know, you'd need to edit another model just to get JoyCaption working. Probably why jhc hasn't implemented it yet, tbh.
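One way to surface this conflict early is a small version gate before either model loads. This is just a hedged sketch: the 4.43 / 4.42 thresholds come from the comment above, and the helper names are made up, not anything from TagGUI.

```python
def parse_version(v: str) -> tuple:
    """Parse a dotted version string like '4.43.1' into a comparable tuple.
    Non-numeric suffixes (e.g. 'dev0') are simply dropped for this sketch."""
    return tuple(int(p) for p in v.split(".")[:3] if p.isdigit())

# Thresholds reported in the thread (not verified against release notes).
JOYCAPTION_MIN = parse_version("4.43.0")  # JoyCaption needs Transformers 4.43+
COGVLM_BREAKS_AT = parse_version("4.42.0")  # CogVLM breaks at 4.42+

def model_compat(installed: str) -> dict:
    """Report which of the two models the installed Transformers supports."""
    iv = parse_version(installed)
    return {
        "joycaption": iv >= JOYCAPTION_MIN,
        "cogvlm": iv < COGVLM_BREAKS_AT,
    }
```

With the versions from the comment, no single install satisfies both models, which is the crux of the problem.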
Some models have already required me to edit their source code at runtime, so applying that fix to CogVLM could be doable.
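The runtime source edit could look something like the sketch below: load the downloaded model script, apply a substitution, and write it back only if something changed. The function name and the example pattern are illustrative; the real fix would be the diff published in the CogVLM2 LLaMA3 Chat repo.

```python
import re
from pathlib import Path

def patch_model_source(path: Path, pattern: str, replacement: str) -> bool:
    """Apply a one-off regex substitution to a downloaded model script.

    Returns True if the file was modified, False if the pattern did not
    match (e.g. the file was already patched). Pattern and replacement
    here are placeholders, not the actual CogVLM diff.
    """
    text = path.read_text(encoding="utf-8")
    patched = re.sub(pattern, replacement, text)
    if patched == text:
        return False
    path.write_text(patched, encoding="utf-8")
    return True
```

Returning False on a no-op keeps the patch idempotent, so it's safe to run on every model load.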
The main reasons I haven't added support for Joy Caption yet are:
Closing this issue as it's a duplicate of #263.
There's an example script demo in this thread.
Perhaps this is currently the best captioning model for NSFW content. It is based on a Llama 3.1 8B LoRA adapter and SigLIP. It is currently in pre-alpha, but both the detail of its descriptions and its accuracy are excellent. It can be used with the 4-bit BNB quantization of Llama 3.1.
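Loading the Llama 3.1 backbone in 4-bit via bitsandbytes would look roughly like this config fragment. The repo id is a placeholder for whatever checkpoint JoyCaption uses, and the quantization settings are common defaults, not values taken from the official script.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Placeholder checkpoint; substitute the actual Llama 3.1 model JoyCaption expects.
MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# Typical 4-bit NF4 setup for bitsandbytes quantization.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
)
```

Quantizing only the language model keeps VRAM usage manageable while leaving the SigLIP vision side untouched.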