For Qwen2Audio, the methodology is the same as LLaVa's, but it operates on audio rather than images. A large pre-trained audio encoder (Whisper) generates embeddings, which are projected into the language model's embedding space and then injected into the language model's input sequence.
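To make that flow concrete, here's a minimal, hypothetical sketch of the LLaVa-style "encoder + projector" pattern. The class and function names are illustrative only, not the actual modules in modeling_qwen2_audio.py:

```python
import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    """Maps audio-encoder features into the language model's hidden size."""
    def __init__(self, audio_hidden: int, text_hidden: int):
        super().__init__()
        self.linear = nn.Linear(audio_hidden, text_hidden)

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        # (batch, audio_seq_len, audio_hidden) -> (batch, audio_seq_len, text_hidden)
        return self.linear(audio_features)

def inject_audio_embeddings(text_embeds, audio_embeds, audio_mask):
    """Overwrite placeholder <audio> token positions with projected audio embeddings.

    Assumes the number of True positions in audio_mask equals the total
    number of projected audio frames.
    """
    # audio_mask: (batch, text_seq_len) bool, True where <audio> placeholders sit
    merged = text_embeds.clone()
    merged[audio_mask] = audio_embeds.reshape(-1, audio_embeds.shape[-1])
    return merged
```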
Looking into it, the language model seems to differ a bit in architecture from the standard Qwen2 models (judging by the config.json), so this may need some work on the Unsloth side.
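One quick way to see the difference (a sketch assuming the public Qwen/Qwen2-Audio-7B-Instruct checkpoint): the top-level config is a composite object with nested audio and text sub-configs, rather than a flat Qwen2 config, which is likely why Unsloth's existing patching wouldn't apply as-is.

```python
from transformers import AutoConfig

# The composite config wraps separate audio and text sub-configs instead
# of being a flat Qwen2Config.
config = AutoConfig.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
print(type(config).__name__)               # Qwen2AudioConfig
print(type(config.audio_config).__name__)  # Whisper-style encoder config
print(type(config.text_config).__name__)   # Qwen2-style language model config
```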
Qwen2Audio Hugging Face docs: https://huggingface.co/docs/transformers/model_doc/qwen2_audio
I see there have been a couple of requests for vision-language model support like LLaVa:
https://github.com/unslothai/unsloth/issues/491
https://github.com/unslothai/unsloth/issues/158
I think we can still get some performance benefits by using Unsloth just for the language model for now (many of these models keep the vision/audio towers frozen anyway), unless you see conflicts or areas where it could break: https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2_audio/modeling_qwen2_audio.py#L858-L870
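A rough sketch of that approach, assuming the attribute names in the linked modeling file (audio_tower, multi_modal_projector, language_model) and treating this as untested rather than a verified recipe:

```python
import torch
from transformers import Qwen2AudioForConditionalGeneration

model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct", torch_dtype=torch.bfloat16
)

# Freeze the audio tower and projector (commonly frozen during fine-tuning
# anyway), leaving only the language model trainable.
for module in (model.audio_tower, model.multi_modal_projector):
    for p in module.parameters():
        p.requires_grad = False

# model.language_model is a standard Qwen2-style causal LM, which is the
# part Unsloth's fast kernels / LoRA patching would target.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable params: {trainable:,}")
```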