matbee-eth opened this issue 2 months ago
I haven't had the time to upgrade this myself, but I'm happy to advise anyone who wants to try it.
In theory, you'd just need to add Qwen2 as a language model, following the pattern in https://github.com/sshh12/multi_token/blob/main/multi_token/language_models/mistral.py, and then train with a dataset that includes both audio and vision (training on both together is supported).
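For anyone attempting this, here's a rough sketch of what a `multi_token/language_models/qwen2.py` could look like, assuming it mirrors the structure of `mistral.py`. The mixin names (`LMMMetaModel`, `LMMMetaForCausalLM`) and their import path are assumptions based on that file and should be verified against the actual repo; the `transformers` Qwen2 classes are real and available since v4.37:

```python
# Hypothetical multi_token/language_models/qwen2.py -- a sketch mirroring
# mistral.py. The multi_token mixin names and import path below are
# assumptions; check them against the real mistral.py before using this.
import torch.nn as nn
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    Qwen2Config,
    Qwen2ForCausalLM,
    Qwen2Model,
)

# Assumed location of the shared multimodal mixins that mistral.py uses.
from multi_token.language_models.base_model import (
    LMMMetaForCausalLM,
    LMMMetaModel,
)


class Qwen2LMMConfig(Qwen2Config):
    model_type = "qwen2-lmm"


class Qwen2LMMModel(LMMMetaModel, Qwen2Model):
    config_class = Qwen2LMMConfig


class Qwen2LMMForCausalLM(Qwen2ForCausalLM, LMMMetaForCausalLM):
    config_class = Qwen2LMMConfig

    def __init__(self, config):
        # Skip Qwen2ForCausalLM.__init__ so we can install the LMM-aware
        # backbone instead of the plain Qwen2Model (the usual LLaVA pattern).
        super(Qwen2ForCausalLM, self).__init__(config)
        self.model = Qwen2LMMModel(config)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
        self.post_init()

    def get_model(self):
        return self.model

    # forward() and prepare_inputs_for_generation() are omitted here: in
    # mistral.py their bodies only call shared mixin hooks (splicing the
    # projected modality embeddings into the text sequence), so they should
    # port over with just the class names changed.


# Register with the Auto* classes so the training/loading code can resolve
# the new model type, matching how mistral.py registers "mistral-lmm".
AutoConfig.register("qwen2-lmm", Qwen2LMMConfig)
AutoModelForCausalLM.register(Qwen2LMMConfig, Qwen2LMMForCausalLM)
```

If the wrapper works, the existing projector and training code should pick it up wherever the repo maps model types to classes, and training on a dataset whose examples carry both audio and image modalities would give the combined A+VL model.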
It would be great if you could work out a way for us to fine-tune Qwen2-VL (rather than LLaVA) with your custom projector setup. The Qwen team has Qwen2-Audio and Qwen2-VL, but no combined audio + vision-language model.