ludwig-ai / ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models
http://ludwig.ai
Apache License 2.0

Self-supervised pre-training of LLMs #3665

Open w4nderlust opened 1 year ago

w4nderlust commented 1 year ago

Self-supervised pre-training is the mechanism through which LLMs are pre-trained, and also fine-tuned when no instruction datasets or human preferences are available but text documents are.

The mechanism is actually simple: treat each document as one long sequence of tokens, slide a window over it to create the inputs, and use the token that follows each window as the output to predict.
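To make the windowing concrete, here is a minimal sketch in plain Python. It is not Ludwig code: the function name `next_token_windows`, its `window_size` and `stride` parameters, and the whitespace "tokenizer" are all assumptions made up for illustration.

```python
from typing import Iterator, List, Tuple


def next_token_windows(
    token_ids: List[int], window_size: int, stride: int = 1
) -> Iterator[Tuple[List[int], int]]:
    """Yield (input_window, next_token) pairs from one tokenized document.

    Hypothetical helper, not part of Ludwig.
    """
    if window_size < 1 or stride < 1:
        raise ValueError("window_size and stride must be positive")
    # Stop early enough that every window still has one token after it,
    # since that following token is the prediction target.
    for start in range(0, len(token_ids) - window_size, stride):
        window = token_ids[start : start + window_size]
        target = token_ids[start + window_size]
        yield window, target


# Toy document: whitespace tokens mapped to integer ids.
# This stands in for a real tokenizer and is only an assumption.
text = "the quick brown fox jumps over the lazy dog"
vocab = {}
token_ids = [vocab.setdefault(tok, len(vocab)) for tok in text.split()]

pairs = list(next_token_windows(token_ids, window_size=4, stride=2))
for window, target in pairs:
    print(window, "->", target)
```

A stride smaller than the window size gives overlapping windows (more training pairs per document), while a stride equal to the window size gives non-overlapping chunks. Documents shorter than `window_size + 1` tokens produce no pairs in this sketch.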

In Ludwig right now we have the capability to train models in such a way, but we don't have the windowing data transformation that makes it easy.

So the proposed feature is to make it seamless for users to pre-train and fine-tune on a set of documents.

Ideally the feature could be broken down into:

RaulPPelaez commented 1 year ago

Something like the Dataset concept in torch_geometric would be great and could accommodate things like this: https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.data.Dataset.html#torch_geometric.data.Dataset That would conceptually move the transformation section to be a property of the dataset. I understand, however, that the YAML-based workflow better fits the proposed additions.

JaynouOliver commented 11 months ago

I am here to learn more about Ludwig, and I can start working on this. I am sure that with your guidance I can successfully merge a PR. Do I need to get assigned first?

w4nderlust commented 10 months ago

@JaynouOliver sorry for the slow answer. @arnavgarg1 is starting to work on pretraining, so you should sync and figure out a self-contained PR for it :) Thank you so much for offering to help, your message slipped through the cracks for me.

JaynouOliver commented 10 months ago

Thanks for the reply. @arnavgarg1 can you help me to proceed with this?

arnavgarg1 commented 10 months ago

Hi @JaynouOliver - let's grab some time to sync on this. Have you joined the Ludwig Slack?

JaynouOliver commented 10 months ago

Yes, I already joined Slack and am taking part in the contest too! I will ping you.