sshh12 / multi_token

Embed arbitrary modalities (images, audio, documents, etc.) into large language models.

QWen2 Audio + Visual #28


matbee-eth commented 2 months ago

It would be great if you worked out a system to let us fine-tune Qwen2-VL (rather than LLaVA) from your custom projector setup. They have Qwen2-Audio and Qwen2-VL, but no combined audio + vision (A+VL) model.

sshh12 commented 1 month ago

I haven't had the time to upgrade this, but I'm happy to advise anyone who wants to try it.

In theory, you'd just need to add Qwen2 as a language model, following https://github.com/sshh12/multi_token/blob/main/multi_token/language_models/mistral.py, and then train with a dataset that includes both audio and vision (combining modalities in one example is supported). A rough sketch of what that might look like is below.
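To make that concrete, here is a minimal sketch of a hypothetical `multi_token/language_models/qwen2.py`, assuming `mistral.py` follows the usual LLaVA-style pattern of pairing the Hugging Face model classes with the repo's LMM mixins. The `Qwen2Config`/`Qwen2Model`/`Qwen2ForCausalLM` classes are real `transformers` classes (>= 4.37); the `LMMMetaModel`/`LMMMetaForCausalLM` names, the `base_model` module path, and the `"qwen2-lmm"` registration string are assumptions modeled on that pattern, not verified against the repo.

```python
# Hypothetical multi_token/language_models/qwen2.py, mirroring mistral.py.
# The LMMMetaModel / LMMMetaForCausalLM mixins and their module path are
# ASSUMED to match whatever mistral.py imports; adjust to the real names.
import torch.nn as nn
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    Qwen2Config,
    Qwen2Model,
    Qwen2ForCausalLM,
)

from multi_token.language_models.base_model import (  # assumed path
    LMMMetaModel,
    LMMMetaForCausalLM,
)


class Qwen2LMMConfig(Qwen2Config):
    model_type = "qwen2-lmm"


class Qwen2LMMModel(LMMMetaModel, Qwen2Model):
    config_class = Qwen2LMMConfig


class Qwen2LMMForCausalLM(Qwen2ForCausalLM, LMMMetaForCausalLM):
    config_class = Qwen2LMMConfig

    def __init__(self, config):
        # Skip Qwen2ForCausalLM.__init__ so we can install the LMM-aware
        # backbone instead of the stock Qwen2Model (the LLaVA-style trick).
        super(Qwen2ForCausalLM, self).__init__(config)
        self.model = Qwen2LMMModel(config)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
        self.post_init()

    def get_model(self):
        return self.model


# Register the new type so AutoModelForCausalLM.from_pretrained can load it.
AutoConfig.register("qwen2-lmm", Qwen2LMMConfig)
AutoModelForCausalLM.register(Qwen2LMMConfig, Qwen2LMMForCausalLM)
```

Note that `mistral.py` also overrides `forward()` and the generation input preparation to splice projector outputs into the token embeddings; those overrides would need to be mirrored here as well, and the new class would have to be registered wherever the repo enumerates its language models. Training would then just point at a dataset whose examples carry both audio and image modalities.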