I'm working on a task that requires inputting multiple images sequentially during a conversation with LLaVA, aiming to perform one-shot or few-shot learning. The idea is to start by showing a few example images with corresponding descriptions, which the model uses as context. Afterward, I want to input a new image and have the model classify it based on the previously shown examples.
Could you provide guidance or examples on how to implement this within the existing framework? Specifically, I'm looking for the best way to maintain and utilize the context of multiple images throughout the interaction. Thanks for your help!
Hi, we built a few-shot in-context learning repo that may be helpful for you. It includes inference code for LLaVA and many other models: https://github.com/ys-zong/VL-ICL
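As a starting point, here is a minimal sketch of interleaved multi-image few-shot prompting using the Hugging Face port of LLaVA (`llava-hf/llava-1.5-7b-hf`), where each `<image>` placeholder in the prompt is matched in order to an image in the list passed to the processor. The image paths and label texts are placeholders for your own data, and note that LLaVA-1.5 was trained on single-image conversations, so multi-image in-context behavior may be unreliable; this is an illustration, not the repo's official recipe.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Few-shot context: (image path, description) pairs. Placeholder data.
examples = [
    ("cat_example.jpg", "This is a cat."),
    ("dog_example.jpg", "This is a dog."),
]
query_image_path = "query.jpg"  # the new image to classify

# Build one prompt with an <image> placeholder per image; the processor
# aligns placeholders with the images list in order.
prompt = "USER: "
for _, description in examples:
    prompt += f"<image>\n{description}\n"
prompt += "<image>\nClassify this image in the same style as the examples above. ASSISTANT:"

images = [Image.open(path) for path, _ in examples]
images.append(Image.open(query_image_path))

inputs = processor(images=images, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)
output_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

The key design point is that all example images and the query image go into a single forward pass, so the model attends to the full context at once; appending images turn by turn in a chat loop works the same way as long as every prior `<image>` token and image stay in the prompt and image list.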