sshh12 / multi_token

Embed arbitrary modalities (images, audio, documents, etc) into large language models.
Apache License 2.0

Training with no pretrained encoder - just projection from ready embeddings #20

Open tehila17-meet opened 1 month ago

tehila17-meet commented 1 month ago

Do you have an example of training a modality that has no pretrained encoder? I want to only train the projector on ready embeddings.

My use case is a dataset of arrays of numbers (each number indicating the intensity of a voxel from fMRI data) paired with corresponding English sentences. I want to treat each voxel array as an embedding vector that needs to be projected into a higher dimension, trained against the textual embeddings of its corresponding sentence.

Any help would be appreciated.

sshh12 commented 1 month ago

Hey that sounds super cool!

I don't have an example off hand, but this is very doable. You'd essentially have "preprocess rows" return the raw voxel data, have "forward" do nothing and return the voxel data unchanged, and then have the build projector function create a custom torch module that converts your voxel data into the same shape as the tokens (your custom embedding + a dense layer to get it to the right token shape).
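A minimal sketch of that last piece, assuming the projector only needs to map a flat voxel vector to a fixed number of LM-sized token embeddings. The class and argument names (`VoxelProjector`, `n_tokens`, `lm_hidden_size`) are illustrative, not part of the multi_token API:

```python
import torch
import torch.nn as nn

class VoxelProjector(nn.Module):
    """Hypothetical projector: raw voxel vector -> n_tokens LM token embeddings."""

    def __init__(self, voxel_dim: int = 249, n_tokens: int = 8, lm_hidden_size: int = 4096):
        super().__init__()
        self.n_tokens = n_tokens
        self.lm_hidden_size = lm_hidden_size
        # Small MLP: custom embedding + dense layer to reach the token shape.
        self.net = nn.Sequential(
            nn.Linear(voxel_dim, 1024),
            nn.GELU(),
            nn.Linear(1024, n_tokens * lm_hidden_size),
        )

    def forward(self, voxels: torch.Tensor) -> torch.Tensor:
        # (batch, voxel_dim) -> (batch, n_tokens, lm_hidden_size)
        out = self.net(voxels)
        return out.view(-1, self.n_tokens, self.lm_hidden_size)
```

The output shape `(batch, 8, 4096)` is what gets spliced in where the modality's tokens sit in the LM input sequence.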

tehila17-meet commented 3 weeks ago

Hey, so it works, but with a relatively high loss. I'm thinking it's because the input is an embedding of size 249 and it's being projected into a dimension of [8, 4096] (8 tokens). Do you have any ideas how I can optimize this projector?

sshh12 commented 3 weeks ago

More data? In theory 249 to 8 tokens will actually overfit easily (so low training loss but high test loss).

You can also try pre-training the projector on some proxy task (e.g. train 249 -> part of projector -> classifier, then chop the classifier off). This could help debug the embedding quality as well.
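A sketch of that proxy-task idea: attach a throwaway classifier head to the projector trunk, train on some supervised proxy labels, then keep only the trunk. Everything here (dimensions, the proxy labels, the trunk/head split) is hypothetical:

```python
import torch
import torch.nn as nn

voxel_dim, hidden, n_classes = 249, 512, 10

# "part of projector": the trunk we want to keep after pretraining.
trunk = nn.Sequential(nn.Linear(voxel_dim, hidden), nn.GELU())
# Throwaway classifier head, chopped off after pretraining.
head = nn.Linear(hidden, n_classes)
model = nn.Sequential(trunk, head)

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, voxel_dim)           # stand-in voxel batch
y = torch.randint(0, n_classes, (32,))   # stand-in proxy labels
for _ in range(5):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

# Chop the classifier off: reuse only the trunk as the projector's front end.
features = trunk(x)  # (32, hidden)
```

If the proxy task itself won't learn, that's a sign the voxel embeddings may not carry the signal you need, which is the debugging value mentioned above.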

sshh12 commented 3 weeks ago

Will also note that loss, especially in the context of LoRA fine-tuning like this, can be misleading / not an accurate representation of quality. It's worth just sampling/testing your weights and seeing what gets spit out and whether it's at all coherent.

tehila17-meet commented 2 weeks ago

thanks for replying :)

I have another question regarding the generate parameters: is there a reason you didn't configure top_p, top_k, and a specific temperature?

sshh12 commented 2 weeks ago

This library was mainly a proof of concept for these different modalities, so I didn't mess with decoding params much. No particular reason they're not included (they'd work the same as with any other Hugging Face model).
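For reference, the transforms those parameters apply before sampling can be shown self-contained; Hugging Face's `generate()` does the equivalent internally when you pass `do_sample=True`, `temperature`, `top_k`, and `top_p`. The function below is an illustrative re-implementation, not the library's code:

```python
import torch

def filter_next_token_logits(logits: torch.Tensor,
                             temperature: float = 0.7,
                             top_k: int = 50,
                             top_p: float = 0.9) -> torch.Tensor:
    """Apply temperature, top-k, and top-p (nucleus) filtering to logits."""
    logits = logits / temperature  # <1 sharpens, >1 flattens the distribution
    # top-k: mask everything below the k-th largest logit.
    if top_k > 0:
        kth = torch.topk(logits, min(top_k, logits.size(-1))).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    # top-p: keep the smallest prefix of sorted tokens whose cumulative
    # probability exceeds top_p (always keeping the most likely token).
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
    remove = cum_probs > top_p
    remove[..., 1:] = remove[..., :-1].clone()
    remove[..., 0] = False
    mask = remove.scatter(-1, sorted_idx, remove)
    return logits.masked_fill(mask, float("-inf"))
```

Tokens left at `-inf` get zero probability under the final softmax, so sampling only ever draws from the surviving set.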