sshh12 / multi_token

Embed arbitrary modalities (images, audio, documents, etc) into large language models.
Apache License 2.0

How to train Mixtral MoE? #18

Open tommarques56 opened 1 month ago

tommarques56 commented 1 month ago

Hi, I just want to know if somebody has successfully trained a Mixtral model like the 8x7B? I'm asking because when I try, the output is random (unreadable).

Thanks!

sshh12 commented 1 month ago

My guess would be that the architecture is different enough that this code would not work: https://github.com/sshh12/multi_token/blob/main/multi_token/language_models/mistral.py. You could potentially duplicate that file and add a Mixtral variant based on the Hugging Face implementation.
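Untested, but a duplicated file for Mixtral would roughly follow the sketch below (requires a transformers version that ships the Mixtral classes, >= 4.36). The class names are placeholders, and the library's own modality mixins from mistral.py still need to be mixed in the same way they are for Mistral:

```python
# Untested sketch: a Mixtral counterpart to multi_token/language_models/mistral.py.
# Class names here are placeholders; the real modality mixins (whatever mistral.py
# mixes into MistralModel / MistralForCausalLM) still need to be added the same way.
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    MixtralConfig,
    MixtralForCausalLM,
    MixtralModel,
)


class MixtralLMMConfig(MixtralConfig):
    # New model_type so the checkpoint round-trips through AutoConfig/AutoModel.
    model_type = "mixtral-lmm"


class MixtralLMMModel(MixtralModel):  # + the library's multimodal model mixin
    config_class = MixtralLMMConfig


class MixtralLMMForCausalLM(MixtralForCausalLM):  # + the library's causal-LM mixin
    config_class = MixtralLMMConfig

    def __init__(self, config):
        super().__init__(config)
        # Swap in the multimodal-aware backbone.
        self.model = MixtralLMMModel(config)
        self.post_init()

    def get_model(self):
        return self.model


# Register so from_pretrained() can resolve the new architecture by name.
AutoConfig.register("mixtral-lmm", MixtralLMMConfig)
AutoModelForCausalLM.register(MixtralLMMConfig, MixtralLMMForCausalLM)
```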

tommarques56 commented 1 month ago

Do you think it's equivalent (or better) to fine-tune Mixtral directly, or to take a Mistral 7B fine-tuned for vision, create a MoE from it with tools like mergoo (please take a look at mergoo, because sshh12 + mergoo could be a life changer), and then fine-tune the resulting MoE?

sshh12 commented 1 month ago

Hm, my guess would be that merging after training the modality projector wouldn't work (at least not out of the box with this library, simply because of all the custom torch modules that get strapped onto the model). However, it should definitely be doable to take an existing merge and add the modality to it by adding that HF architecture, as I mentioned above.