For Qwen2Audio, the methodology is the same as LLaVa's, but it operates on audio rather than images. A large pre-trained audio encoder (Whisper) generates embeddings, which are projected into the language model's embedding space and then injected into the language model's input sequence.
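To make that flow concrete, here's a minimal, hypothetical sketch of the LLaVa-style "encoder + projector" pattern. The class and function names are illustrative only, not the actual modules in modeling_qwen2_audio.py:

```python
import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    """Maps audio-encoder features into the language model's hidden size."""
    def __init__(self, audio_hidden: int, text_hidden: int):
        super().__init__()
        self.linear = nn.Linear(audio_hidden, text_hidden)

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        # (batch, audio_seq_len, audio_hidden) -> (batch, audio_seq_len, text_hidden)
        return self.linear(audio_features)

def inject_audio_embeddings(text_embeds, audio_embeds, audio_mask):
    """Overwrite placeholder <audio> token positions with projected audio embeddings.

    Assumes the number of True positions in audio_mask equals the total
    number of projected audio frames.
    """
    # audio_mask: (batch, text_seq_len) bool, True where <audio> placeholders sit
    merged = text_embeds.clone()
    merged[audio_mask] = audio_embeds.reshape(-1, audio_embeds.shape[-1])
    return merged
```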
Looking into it, the language model seems to differ a bit in architecture from the standard Qwen2 models (judging by the config.json), so this may need some work on the Unsloth side.
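One quick way to see the difference (a sketch assuming the public Qwen/Qwen2-Audio-7B-Instruct checkpoint): the top-level config is a composite object with nested audio and text sub-configs, rather than a flat Qwen2 config, which is likely why Unsloth's existing patching wouldn't apply as-is.

```python
from transformers import AutoConfig

# The composite config wraps separate audio and text sub-configs instead
# of being a flat Qwen2Config.
config = AutoConfig.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
print(type(config).__name__)               # Qwen2AudioConfig
print(type(config.audio_config).__name__)  # Whisper-style encoder config
print(type(config.text_config).__name__)   # Qwen2-style language model config
```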
Qwen2Audio Hugging Face docs: https://huggingface.co/docs/transformers/model_doc/qwen2_audio
I see there have been a couple of requests for vision-language model support like LLaVa:
https://github.com/unslothai/unsloth/issues/491
https://github.com/unslothai/unsloth/issues/158
I think we can still get some performance benefits by using Unsloth just for the language model for now (many of these models keep the vision/audio towers frozen anyway), unless you see conflicts or areas where it could break: https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2_audio/modeling_qwen2_audio.py#L858-L870
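A rough sketch of that approach, assuming the attribute names in the linked modeling file (audio_tower, multi_modal_projector, language_model) and treating this as untested rather than a verified recipe:

```python
import torch
from transformers import Qwen2AudioForConditionalGeneration

model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct", torch_dtype=torch.bfloat16
)

# Freeze the audio tower and projector (commonly frozen during fine-tuning
# anyway), leaving only the language model trainable.
for module in (model.audio_tower, model.multi_modal_projector):
    for p in module.parameters():
        p.requires_grad = False

# model.language_model is a standard Qwen2-style causal LM, which is the
# part Unsloth's fast kernels / LoRA patching would target.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable params: {trainable:,}")
```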