Summarize video - Githubissues

sshh12 / multi_token

Embed arbitrary modalities (images, audio, documents, etc) into large language models.

Apache License 2.0

158 stars, 8 forks

Closed linchen111 closed 2 months ago

linchen111 commented 3 months ago

If my multimodal input is a sequence of consecutive images, such as screenshots, what should my prompt be when I use Mistral-7B-LoRA-Multi-VisionCLIPPool-LLAVA?

sshh12 commented 3 months ago

<image><image><image> What is happening in these frames?

I'm not sure how well it will work, though, given that my training data was mainly compare/contrast rather than video understanding.

It's only trained for up to 6 images, though it may work for more.
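
For anyone starting from a longer run of screenshots, here is a minimal sketch of one way to stay within the 6-image range and build a prompt in the format shown above. The helper names (`sample_frame_indices`, `build_prompt`) and the even-spacing strategy are assumptions made for illustration; they are not part of multi_token, and the snippet only produces frame indices and the prompt string, not a model call.

```python
# Hypothetical helpers for illustration only; not part of the multi_token API.

MAX_TRAINED_IMAGES = 6  # per the comment above: trained for up to 6 images


def sample_frame_indices(total_frames, max_frames=MAX_TRAINED_IMAGES):
    """Return up to max_frames evenly spaced indices into a frame sequence."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    if total_frames <= max_frames:
        # Few enough frames: keep them all, in order.
        return list(range(total_frames))
    if max_frames == 1:
        return [0]
    # Spread the picks from the first frame to the last frame.
    step = (total_frames - 1) / (max_frames - 1)
    return [round(i * step) for i in range(max_frames)]


def build_prompt(num_images, question="What is happening in these frames?"):
    """Prefix the question with one <image> placeholder per selected frame."""
    return "<image>" * num_images + " " + question


if __name__ == "__main__":
    indices = sample_frame_indices(11)
    print(indices)
    print(build_prompt(len(indices)))
```

Even spacing is just one choice; picking frames around scene changes might suit some videos better. Whether more than 6 placeholders gives useful output is untested here.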