I'm working on a task that requires inputting multiple images sequentially during a conversation with LLaVA, aiming to perform one-shot or few-shot learning. The idea is to start by showing a few example images with corresponding descriptions, which the model uses as context. Afterward, I want to input a new image and have the model classify it based on the previously shown examples.
Could you provide guidance or examples on how to implement this within the existing framework? Specifically, I'm looking for the best way to maintain and utilize the context of multiple images throughout the interaction. Thanks for your help!
Hi, we built a few-shot in-context learning repo that may be helpful for you. It includes inference code for LLaVA and many other models: https://github.com/ys-zong/VL-ICL
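As a starting point, here is a minimal sketch of interleaved multi-image few-shot prompting using the Hugging Face port of LLaVA (`llava-hf/llava-1.5-7b-hf`), where each `<image>` placeholder in the prompt is matched in order to an image in the list passed to the processor. The image paths and label texts are placeholders for your own data, and note that LLaVA-1.5 was trained on single-image conversations, so multi-image in-context behavior may be unreliable; this is an illustration, not the repo's official recipe.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Few-shot context: (image path, description) pairs. Placeholder data.
examples = [
    ("cat_example.jpg", "This is a cat."),
    ("dog_example.jpg", "This is a dog."),
]
query_image_path = "query.jpg"  # the new image to classify

# Build one prompt with an <image> placeholder per image; the processor
# aligns placeholders with the images list in order.
prompt = "USER: "
for _, description in examples:
    prompt += f"<image>\n{description}\n"
prompt += "<image>\nClassify this image in the same style as the examples above. ASSISTANT:"

images = [Image.open(path) for path, _ in examples]
images.append(Image.open(query_image_path))

inputs = processor(images=images, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)
output_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

The key design point is that all example images and the query image go into a single forward pass, so the model attends to the full context at once; appending images turn by turn in a chat loop works the same way as long as every prior `<image>` token and image stay in the prompt and image list.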