salesforce / LAVIS

LAVIS - A One-stop Library for Language-Vision Intelligence
BSD 3-Clause "New" or "Revised" License

Implementation of InstructBLIP for quantized models + user interface #431

Open kjerk opened 1 year ago

kjerk commented 1 year ago

Hey LAVIS team, thanks for all your work on the BLIP series and all your open source code. 🙌

I just wanted to share that I've created a small project to allow multimodal inference of InstructBLIP on quantized Vicuna models running on the text-generation-webui with an AutoGPTQ backend. This is a popular user-level application that makes it easier to run language models, maintain a context, etc.

Repo: https://github.com/kjerk/instructblip-pipeline

As someone who wanted to use InstructBLIP and experiment with instruction tuning because of the high quality output, I was running into VRAM constraints and some usability woes on the vanilla models running directly on the transformers framework. Okay for devs, but rough for users. So hopefully this helps a few more people to be able to use InstructBLIP with such large models on modest hardware (~20GB down to ~6GB).
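For a rough sense of where savings like that come from, here is a minimal back-of-envelope sketch of weight memory at different precisions. The 7-billion-parameter count is an assumed, illustrative size, not a figure from this thread, and the helper name `weight_memory_gib` is hypothetical. It counts only the language model's weights, so the vision encoder, activations, and generation cache are excluded and the numbers will not match the ~20GB and ~6GB totals exactly.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1024**3


# Assumed, illustrative size: a 7-billion-parameter language model.
PARAMS = 7e9

fp16 = weight_memory_gib(PARAMS, 16)  # half-precision weights
int4 = weight_memory_gib(PARAMS, 4)   # 4-bit quantized weights

print(f"fp16: {fp16:.1f} GiB, 4-bit: {int4:.1f} GiB")
```

Going from 16 bits to 4 bits per weight cuts the weight footprint by a factor of four, which is the bulk of the reduction; the remainder of the real-world total is overhead that quantizing the language model does not shrink.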

A cool bonus: even though InstructBLIP was fine-tuned on Vicuna (and T5), other related LLMs (detailed in the repo's readme) can actually consume the same BLIP embeddings without losing coherence, so it's not just locked to Vicuna. Super interesting!

Thanks again!

PS. Your lavis@salesforce.com email seems dead; I got a bounce from Google. I wasn't sure where else to put this. 😄

unoriginalscreenname commented 1 year ago

@kjerk nailed it here! Thank you. He's right that the instructions in this repo for setting up and running InstructBLIP don't actually work out of the box, and the non-quantized model is just too big for even a 24GB card. It might be nice if you all helped make this kind of thing a little more accessible. I have a project where I need to create some specialized captions for a lot of archival images and really wanted to work with InstructBLIP. I'm hopeful I can figure out kjerk's implementation.