unslothai / unsloth

Finetune Llama 3.1, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0
15.83k stars 1.07k forks

Add support for Qwen2Audio #1018

Open jonflynng opened 1 week ago

jonflynng commented 1 week ago

Qwen2Audio huggingface docs

I see there have been a couple of requests for vision-language model support like LLaVA:

https://github.com/unslothai/unsloth/issues/491 https://github.com/unslothai/unsloth/issues/158

For Qwen2Audio, the methodology is the same as LLaVA's but operates on audio rather than images. It uses a large pre-trained audio encoder (Whisper) to generate embeddings, which are projected into the language model's embedding space and then injected into the language model's input sequence.
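As a rough illustration of that injection step (a minimal sketch with made-up dimensions and a linear stand-in for the Whisper encoder, not Qwen2Audio's actual code), the audio embeddings replace placeholder audio-token positions in the text embedding sequence:

```python
import torch
import torch.nn as nn

# Hypothetical sizes chosen for illustration only.
AUDIO_DIM, TEXT_DIM, VOCAB = 32, 64, 100
AUDIO_TOKEN_ID = 99  # placeholder token id marking where audio goes

audio_encoder = nn.Linear(AUDIO_DIM, AUDIO_DIM)  # stand-in for Whisper
projector = nn.Linear(AUDIO_DIM, TEXT_DIM)       # maps audio -> LM space
embed_tokens = nn.Embedding(VOCAB, TEXT_DIM)     # the LM's token embeddings

def merge_audio_into_inputs(input_ids, audio_features):
    # 1) Encode the audio and project it into the LM embedding space.
    audio_embeds = projector(audio_encoder(audio_features))  # (n_audio, TEXT_DIM)
    # 2) Embed the text tokens normally.
    inputs_embeds = embed_tokens(input_ids).clone()          # (seq, TEXT_DIM)
    # 3) Overwrite the placeholder positions with the audio embeddings.
    mask = input_ids == AUDIO_TOKEN_ID
    inputs_embeds[mask] = audio_embeds
    return inputs_embeds

input_ids = torch.tensor([1, 2, AUDIO_TOKEN_ID, AUDIO_TOKEN_ID, 3])
audio_features = torch.randn(2, AUDIO_DIM)
merged = merge_audio_into_inputs(input_ids, audio_features)
print(merged.shape)  # torch.Size([5, 64])
```

The language model then consumes `merged` as `inputs_embeds` instead of `input_ids`, which is the same trick LLaVA uses for image patches.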

I think we can still get some performance benefits by using Unsloth just for the language model for now, since many of these models keep the vision/audio towers frozen anyway. Unless you see some conflicts or areas where it could break? https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2_audio/modeling_qwen2_audio.py#L858-L870
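The "train only the LM" setup could look something like the sketch below. The toy class just mirrors the top-level attribute names from transformers' `modeling_qwen2_audio.py` (`audio_tower`, `multi_modal_projector`, `language_model`); the layers themselves are dummies, and whether Unsloth's patches can then be applied cleanly to the unfrozen LM is exactly the open question here:

```python
import torch.nn as nn

# Toy stand-in mirroring Qwen2Audio's top-level structure; real layers
# (Whisper encoder, Qwen2 decoder) are replaced with dummy Linears.
class ToyQwen2Audio(nn.Module):
    def __init__(self):
        super().__init__()
        self.audio_tower = nn.Linear(8, 8)
        self.multi_modal_projector = nn.Linear(8, 16)
        self.language_model = nn.Linear(16, 16)

def freeze_non_lm(model):
    # Freeze the audio tower and projector so only the language model
    # trains; that is the part Unsloth's optimizations would target.
    for module in (model.audio_tower, model.multi_modal_projector):
        for p in module.parameters():
            p.requires_grad_(False)

model = ToyQwen2Audio()
freeze_non_lm(model)
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only language_model.* parameters remain trainable
```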

jonflynng commented 1 week ago

Looking into it, the language model's architecture seems to differ a bit from the standard Qwen2 models (judging by the config.json), so this may need work on the Unsloth side.

danielhanchen commented 1 week ago

Will look into this!