w4nderlust opened 1 year ago
Something like the Dataset concept in torch_geometric would be great and could accommodate for things like this. https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.data.Dataset.html#torch_geometric.data.Dataset That would conceptually move the transformation section to be a property of the dataset. I understand however that the yaml-based workflow better fits the proposed additions.
I am here to learn more about Ludwig, and I can start working on it. I am sure that with your guidance I can successfully merge a PR. Do I need to get assigned first?
@JaynouOliver sorry for the slow answer. @arnavgarg1 is starting to work on pretraining, you should sync and figure out a self-contained PR for it :) thank you so much for proposing to help, your message slipped through the cracks for me
Thanks for the reply. @arnavgarg1 can you help me to proceed with this?
Hi @JaynouOliver - let's grab some time to sync on this. Have you joined the Ludwig Slack?
Yes already joined Slack and taking part in the contest too! I will ping you
Self-supervised pre-training is the mechanism through which LLMs are pre-trained, and also fine-tuned when instruction datasets or human preferences are unavailable but text documents are.
The mechanism is actually simple: treat each document as a long sequence of tokens, create windows of tokens as inputs, and use the token following each window as the output to predict.
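The windowing transformation described above could look roughly like the sketch below. This is purely illustrative, not Ludwig's actual API: the function name and parameters (`window_size`, `stride`) are hypothetical.

```python
# Hypothetical sketch of the proposed windowing transformation:
# slide a fixed-size window over a document's token sequence and
# pair each window with the next token as the prediction target.

def make_windows(tokens, window_size, stride=1):
    """Yield (input_window, next_token) pairs from one document."""
    examples = []
    for start in range(0, len(tokens) - window_size, stride):
        window = tokens[start : start + window_size]
        target = tokens[start + window_size]  # next token after the window
        examples.append((window, target))
    return examples

# Toy "document" of token ids.
doc = [10, 11, 12, 13, 14]
print(make_windows(doc, window_size=3))
# -> [([10, 11, 12], 13), ([11, 12, 13], 14)]
```

In practice this transformation would be applied to every document in the dataset, turning a corpus of documents into a flat table of (window, next-token) training examples.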
In Ludwig right now we have the capability to train models in such a way, but we don't have the windowing data transformation that makes it easy.
So the proposed feature is to make it seamless for users to pre-train and fine-tune on a set of documents.
Ideally the feature could be broken down into: