sshh12 / multi_token

Embed arbitrary modalities (images, audio, documents, etc) into large language models.
Apache License 2.0
158 stars, 8 forks

Summarize video #15

Closed linchen111 closed 2 months ago

linchen111 commented 3 months ago

If my multimodal input is a sequence of consecutive images, such as screenshots, what should my prompt be when I use Mistral-7B-LoRA-Multi-VisionCLIPPool-LLAVA?

sshh12 commented 3 months ago

This is the format I used: https://github.com/sshh12/multi_token/blob/6eb9813edf2e8ddbff951bca4b2f3d65b6b1206e/scripts/llava_gpt_build_multi_image_finetune_dataset.py#L88

<image><image><image> What is happening in these frames?

I'm not sure how well it will work, though, given that my training data was mainly compare/contrast rather than video understanding.

It was only trained on up to 6 images, though it may work for more.
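
For anyone building this prompt programmatically, here is a minimal sketch of the format above: one `<image>` placeholder per frame, followed by the question. The helper name `build_frames_prompt` and the constant `MAX_TRAINED_IMAGES` are made up for illustration and are not part of multi_token; the limit of 6 is taken from the comment above.

```python
import warnings

# Assumption taken from this thread: the model was trained on up to 6 images.
MAX_TRAINED_IMAGES = 6


def build_frames_prompt(num_frames, question="What is happening in these frames?"):
    """Hypothetical helper (not part of multi_token).

    Returns a prompt with one <image> placeholder per frame, then the question.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if num_frames > MAX_TRAINED_IMAGES:
        # More frames may still work, but this is outside the training range.
        warnings.warn(
            f"{num_frames} frames exceeds the {MAX_TRAINED_IMAGES} images "
            "the model was trained on; results may degrade."
        )
    return "<image>" * num_frames + " " + question


print(build_frames_prompt(3))
# <image><image><image> What is happening in these frames?
```

The number of placeholders should match the number of images passed alongside the prompt. For a longer video, picking at most 6 frames keeps the input within the range the model saw in training.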