Implementation of InstructBLIP for quantized models + user interface

Hey LAVIS team, thanks for all your work on the BLIP series and all your open source code. 🙌

I just wanted to share that I've created a small project that enables multimodal inference with InstructBLIP on quantized Vicuna models, running in text-generation-webui with an AutoGPTQ backend. text-generation-webui is a popular user-level application that makes it easier to run language models, maintain a context, etc.

Repo: https://github.com/kjerk/instructblip-pipeline

As someone who wanted to use InstructBLIP and experiment with instruction tuning because of its high-quality output, I kept running into VRAM constraints and usability woes with the vanilla models running directly on the transformers framework. Okay for devs, but rough for users. So hopefully this helps a few more people use InstructBLIP with such large models on modest hardware (roughly ~20GB of VRAM down to ~6GB).

A cool bonus: even though InstructBLIP was fine-tuned on Vicuna (and T5), other related LLMs (detailed in the repo's readme) can actually consume the same BLIP embeddings without losing coherence. It's not just locked to Vicuna. Super interesting!

Thanks again!

PS. Your lavis@salesforce.com email seems dead; I got a bounce from Google. I wasn't sure where else to put this. 😄
