pytorch / torchtune

A Native-PyTorch Library for LLM Fine-tuning
BSD 3-Clause "New" or "Revised" License

[Feature Request] Support multimodal LLM, e.g., llava #811

Open StarCycle opened 2 months ago

StarCycle commented 2 months ago

Hello,

Would you like to support multimodal LLMs (MLLMs) like LLaVA?

RdoubleA commented 2 months ago

Hi @StarCycle, thanks for the feature request. Multimodal support is something we are still exploring. Would love to learn more about what you would like to use it for. And of course we welcome any initial prototype, if you're interested in contributing this :)

StarCycle commented 2 months ago

Hi @RdoubleA,

Currently I am training LLaVA with XTuner, which is similar to torchtune. They support fine-tuning, evaluation, and deployment of LLaVA models (and it's easy to add custom modifications to the models). Integration of LLaVA 1.6 and video input is on the way. You could take their implementation as a reference :)

But they rely on HuggingFace transformers... I guess torchtune has fewer dependencies, which would be quite good!
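For context on what "supporting LLaVA" would involve, here is a minimal sketch of the core LLaVA idea in plain PyTorch: a small projector maps features from a frozen vision encoder into the LLM's token-embedding space, and the projected "image tokens" are concatenated with the text embeddings before going into the LLM. The class name, dimensions, and token counts below are illustrative assumptions, not the real model's; the two-layer MLP projector follows the LLaVA 1.5 design.

```python
import torch
from torch import nn

class VisionProjector(nn.Module):
    """Two-layer MLP projector (LLaVA 1.5 style); dimensions here are made up."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.net(vision_feats)

# Toy sizes for illustration only (real models use e.g. 1024 -> 4096).
vision_dim, llm_dim = 64, 128
projector = VisionProjector(vision_dim, llm_dim)

# Fake inputs: one image yielding 16 patch features, plus 8 embedded text tokens.
image_feats = torch.randn(1, 16, vision_dim)
text_embeds = torch.randn(1, 8, llm_dim)

image_tokens = projector(image_feats)                 # (1, 16, llm_dim)
llm_input = torch.cat([image_tokens, text_embeds], dim=1)
print(llm_input.shape)  # torch.Size([1, 24, 128])
```

In LLaVA-style training, the vision encoder is frozen, the projector is trained first, and the LLM is then fine-tuned with the projector; that staged recipe is what a torchtune integration would need to express.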