haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Question] Support for Multi-Image Input in One-Shot/Few-Shot Learning Scenarios #1662


vedernikovphoto commented 3 months ago

Question

I'm working on a task that requires feeding multiple images into a single conversation with LLaVA, with the goal of one-shot or few-shot learning. The idea is to first show a few example images with corresponding descriptions, which the model uses as context. Afterward, I want to input a new image and have the model classify it based on the previously shown examples.

Could you provide guidance or examples on how to implement this within the existing framework? Specifically, I'm looking for the best way to maintain and use the context of multiple images throughout the interaction.

Thanks for your help!

ys-zong commented 3 months ago

Hi, we built a few-shot in-context learning repo that may be helpful for you. It includes inference code for LLaVA and many other models: https://github.com/ys-zong/VL-ICL
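For reference, below is a minimal sketch of interleaved multi-image prompting using only the utilities already in this repo (`load_pretrained_model`, `process_images`, `tokenizer_image_token`). The checkpoint, image file names, and labels are placeholder assumptions, and note that LLaVA-1.5 was not trained on interleaved multi-image data, so few-shot behavior of this kind is not guaranteed to be reliable:

```python
import torch
from PIL import Image

from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import process_images, tokenizer_image_token
from llava.model.builder import load_pretrained_model

# Assumed checkpoint; other LLaVA-1.5 checkpoints work the same way.
model_path = "liuhaotian/llava-v1.5-7b"
tokenizer, model, image_processor, _ = load_pretrained_model(
    model_path, model_base=None, model_name="llava-v1.5-7b"
)

# Hypothetical few-shot examples: (image path, label) pairs, plus one query image.
shots = [("cat_example.jpg", "cat"), ("dog_example.jpg", "dog")]
query_image = "query.jpg"

# Build one user turn that interleaves an <image> placeholder before each
# description, then ends with the query image and the classification request.
parts = [f"{DEFAULT_IMAGE_TOKEN}\nThis is a {label}." for _, label in shots]
parts.append(
    f"{DEFAULT_IMAGE_TOKEN}\nClassify this image based on the examples above. "
    "Answer with one word."
)
conv = conv_templates["llava_v1"].copy()
conv.append_message(conv.roles[0], "\n".join(parts))
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

# Preprocess all images in the same order as their <image> tokens appear;
# each <image> token is matched to one image's features at generation time.
images = [Image.open(p).convert("RGB") for p, _ in shots]
images.append(Image.open(query_image).convert("RGB"))
image_tensor = process_images(images, image_processor, model.config).to(
    model.device, dtype=torch.float16
)

input_ids = (
    tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
    .unsqueeze(0)
    .to(model.device)
)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        image_sizes=[img.size for img in images],
        do_sample=False,
        max_new_tokens=32,
    )
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```

The key constraint is that the number of `<image>` tokens in the prompt must match the number of images passed to `generate`; the model consumes them in order. For more systematic multi-image in-context evaluation, the VL-ICL repo linked above is likely the better starting point.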