sshh12 / multi_token

Embed arbitrary modalities (images, audio, documents, etc) into large language models.
Apache License 2.0

Thanks for the great work #6

Closed: codybum closed this issue 4 months ago

codybum commented 6 months ago

I am very excited about your work and its potential applications. I have been following the multi-modal LLM space for a while and am glad to see things moving beyond images and text. I was really impressed by LLaVA and the recent Med-LLaVA work, but without modification they are limited to images, and single images at that. There have been a few publications related to generalized inputs and LLMs, but I have not seen much actionable code or models, so I was very happy to see your work.

Our primary interest is in the application of multi-modal models to medicine for research and academic purposes. The inputs might be signal, tabular, image, multi-image, text, or other data. The majority of our applications require several independent inputs, like a set of images, and potentially in-context learning where the output of the first request programmatically impacts the second and so on. The Otter project (https://github.com/Luodian/Otter), which I am not associated with, has some interesting in-context features. A related project, OtterHD (https://github.com/Luodian/Otter/blob/main/docs/OtterHD.md), allows for much larger image sizes than you would typically get with extracted image features. The downside of these projects is that they seem to rely on specific models for both embeddings and the underlying LLM.

Being in an academic setting, we have access to computational resources for training and testing. I would be happy to contribute to your effort through model training and potentially code contributions. I would especially be interested in training something for the new Mixtral model, which in our limited experience seems to work very well, with lower computational requirements than a 70B model.

On a related note, I think the use of LoRA adapters will be extremely important as modular inference engines are developed. There are efforts underway on several projects to allow the run-time loading of adapters on the same base model, allowing potentially hundreds or thousands of models to be run from a single server.
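
For example, with Hugging Face PEFT the runtime pattern looks roughly like this (the model and adapter names below are hypothetical placeholders, not real artifacts):

```python
# Sketch: serving multiple LoRA adapters from one base model with Hugging Face PEFT.
# The base model ID works, but the adapter repos/names are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Load the first adapter, then register additional ones under named slots.
model = PeftModel.from_pretrained(base, "org/medical-imaging-lora", adapter_name="imaging")
model.load_adapter("org/eeg-lora", adapter_name="eeg")

# Switch adapters per request without reloading the base weights.
model.set_adapter("eeg")
inputs = tokenizer("Describe the EEG findings:", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```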

I would be interested to hear your thoughts.

sshh12 commented 6 months ago

Hey, that sounds great and there's a ton of overlap here.

Assuming the independent inputs each have some encoder (image = CLIP, text = some document embedding model or just dumped in context, tabular (?), signal (?)), it should be very doable to train a multi-domain, multi-input Mixtral.
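
Roughly, the pattern is a frozen per-modality encoder plus a small trainable projector that maps its features into the LLM's token-embedding space (the class and dimensions below are just illustrative, not the library's actual code):

```python
# Illustrative encoder + projector pattern for extra modalities; names and
# dimensions are placeholders, not multi_token's actual API.
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Maps frozen-encoder features into the LLM embedding space as N virtual tokens."""
    def __init__(self, feature_dim: int, llm_dim: int, num_tokens: int = 4):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Sequential(
            nn.Linear(feature_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim * num_tokens),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, feature_dim) -> (batch, num_tokens, llm_dim)
        out = self.proj(features)
        return out.view(features.shape[0], self.num_tokens, -1)

# e.g. 768-d CLIP features projected into a 4096-d LLM embedding space
image_projector = ModalityProjector(feature_dim=768, llm_dim=4096)
fake_clip_features = torch.randn(2, 768)
virtual_tokens = image_projector(fake_clip_features)  # shape: (2, 4, 4096)
print(virtual_tokens.shape)
```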

The main limitations I mentioned in the post are datasets and compute. Do you have a specific dataset + set of modalities + model you are interested in?

> On a related note, I think the use of LoRA adapters will be extremely important as modular inference engines are developed.

The library is currently only set up to do LoRA-based LLM training, so that works! I'll also note that it doesn't support multi-GPU training, but this is potentially not a bottleneck unless the datasets are much larger (1M+ examples), since LoRA and LMM projectors train fairly fast.
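
For reference, the LoRA side is roughly a standard PEFT setup; the hyperparameters below are generic defaults rather than the library's exact settings:

```python
# Generic LoRA configuration with Hugging Face PEFT; values are illustrative
# defaults, not the exact settings multi_token uses.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA weights (plus any projector) train
```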

codybum commented 6 months ago

Longer term, it would be interesting to train an MoE where the various experts are trained on different things. While details are limited, this is what they claim to be doing here: https://huggingface.co/CausalLM/8x7B-MoE-test-NOT-MIXTRAL

Retraining and/or merging LLaVA-Med (https://github.com/microsoft/LLaVA-Med) would be a good first step. We also have other datasets, like EEG, which could be very interesting.

Multi-GPU and multi-node training will be important, as these datasets can be huge. Accelerate claims to make multi-GPU easier, but you might have other constraints.
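
For what it's worth, wrapping a training loop with Accelerate is roughly this much code (a generic sketch, not tied to this library, launched via `accelerate launch train.py`):

```python
# Generic training loop wrapped with Hugging Face Accelerate for multi-GPU /
# multi-node runs; not specific to multi_token.
import torch
from accelerate import Accelerator

def train(model, dataloader, epochs: int = 1, lr: float = 2e-4):
    accelerator = Accelerator()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    # prepare() moves everything to the right devices and shards the dataloader.
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    model.train()
    for _ in range(epochs):
        for batch in dataloader:
            outputs = model(**batch)
            accelerator.backward(outputs.loss)  # replaces loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```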

sshh12 commented 6 months ago

Sounds good. After the holidays I could create a quick script to convert LLaVA-Med into the right format and send over the suggested training CLI commands for you to try when you get a chance.
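
As a rough idea of what that script might look like (the output schema below is hypothetical; the exact fields the library expects would need to be confirmed against its docs):

```python
# Rough sketch: flatten LLaVA-Med style conversation JSON into simple
# instruction/response records. The output schema is hypothetical and would
# need to be adjusted to whatever format the training code actually expects.
import json

def convert(llava_json_path: str, out_path: str) -> None:
    with open(llava_json_path) as f:
        records = json.load(f)

    converted = []
    for rec in records:
        conv = rec.get("conversations", [])
        # LLaVA-style records alternate "human" / "gpt" turns.
        for human, gpt in zip(conv[::2], conv[1::2]):
            converted.append({
                "image": rec.get("image"),
                "instruction": human["value"].replace("<image>", "").strip(),
                "response": gpt["value"].strip(),
            })

    with open(out_path, "w") as f:
        json.dump(converted, f, indent=2)

if __name__ == "__main__":
    convert("llava_med_instruct.json", "llava_med_converted.json")
```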

The high-level implementation is very similar to LLaVA, which does support multi-GPU, so I definitely think it's doable, but it would require some trial and error in a multi-GPU/multi-node environment.

sshh12 commented 5 months ago

It turns out that my post-holiday commitments are more than I expected. I won't be able to provide the preprocessing script, but I am happy to provide guidance or library debugging if you are still interested in using this for your work!