Feature request
Enable PPOTrainer and DPOTrainer to work with audio-language models such as Qwen2Audio. The architecture of this model is identical to vision-language models like LLaVA: embeddings from the audio encoder are projected by a simple linear layer into the language model's embedding space.
The audio tower is usually frozen during training, which leaves only the language model, which is already well supported, and one linear projection layer to be trained. On paper this seems simple, but I'm unfamiliar with TRL's API, so I'm not sure how much effort it would take to implement.
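To make the "freeze the audio tower, train the projector and LM" setup concrete, here is a minimal sketch. It assumes the submodule naming used by `transformers`' `Qwen2AudioForConditionalGeneration` (`audio_tower`, `multi_modal_projector`, `language_model`); the toy module below only stands in for the real model so the freezing logic can be shown without loading weights.

```python
import torch.nn as nn

def freeze_by_prefix(model: nn.Module, prefix: str = "audio_tower") -> int:
    """Freeze every parameter whose name starts with `prefix`;
    return how many parameter tensors were frozen."""
    frozen = 0
    for name, param in model.named_parameters():
        if name.startswith(prefix):
            param.requires_grad_(False)
            frozen += 1
    return frozen

# Toy stand-in mirroring Qwen2Audio's submodule layout; the layer shapes
# here are made up purely for illustration.
class ToyAudioLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.audio_tower = nn.Linear(16, 8)           # frozen audio encoder
        self.multi_modal_projector = nn.Linear(8, 4)  # trainable projector
        self.language_model = nn.Linear(4, 4)         # trainable LM stand-in

model = ToyAudioLM()
freeze_by_prefix(model, "audio_tower")
# After freezing, only the projector and language-model parameters
# still have requires_grad=True.
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
```

The same `freeze_by_prefix(model, "audio_tower")` call would apply unchanged to the real Qwen2Audio model before handing it to a trainer, since freezing is just a `requires_grad` toggle on named parameters.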
https://github.com/huggingface/trl/issues/1784
Motivation
I want to experiment with PPO on Qwen2Audio.
Your contribution
I realise this is probably not a highly requested feature, and I see on the LLaVA issue that there are no plans to integrate PPO with it. I can probably take a look at this at some point; I'll first see if I can get it working by extending the necessary classes.