Closed: linchen111 closed this issue 2 months ago
This is the format I used: https://github.com/sshh12/multi_token/blob/6eb9813edf2e8ddbff951bca4b2f3d65b6b1206e/scripts/llava_gpt_build_multi_image_finetune_dataset.py#L88
<image><image><image> What is happening in these frames?
Although I'm not sure how well it works, given that my training data was mainly compare/contrast rather than video understanding.
It was only trained with up to 6 images, though it may work for more.
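As a minimal sketch of the format above: the prompt is just one `<image>` token per frame, followed by the question. The `<image>` token string and the 6-image training limit come from this thread; the function name and warning are illustrative, not part of the repo's API.

```python
def build_frame_prompt(num_images: int, question: str) -> str:
    # One "<image>" placeholder per frame, then the text question,
    # matching the format in llava_gpt_build_multi_image_finetune_dataset.py.
    if num_images > 6:
        # The model was only trained with up to 6 images; more may still work.
        print("warning: model was trained with at most 6 images")
    return "<image>" * num_images + " " + question

print(build_frame_prompt(3, "What is happening in these frames?"))
```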
If my multi-image input is a sequence of consecutive images, like screenshots, what should my prompt be when I use Mistral-7B-LoRA-Multi-VisionCLIPPool-LLAVA?