w4nderlust opened 1 year ago
Something like the Dataset concept in torch_geometric would be great and could accommodate for things like this. https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.data.Dataset.html#torch_geometric.data.Dataset That would conceptually move the transformation section to be a property of the dataset. I understand however that the yaml-based workflow better fits the proposed additions.
I am here to learn more about Ludwig, and I can start working on it. I am sure that with your guidance I can successfully merge a PR. Do I need to get assigned first?
@JaynouOliver sorry for the slow answer. @arnavgarg1 is starting to work on pretraining, you should sync and figure out a self-contained PR for it :) thank you so much for proposing to help, your message slipped through the cracks for me
Thanks for the reply. @arnavgarg1 can you help me to proceed with this?
Hi @JaynouOliver - let's grab some time to sync on this. Have you joined the Ludwig Slack?
Yes already joined Slack and taking part in the contest too! I will ping you
Self-supervised pre-training is the mechanism through which LLMs are pre-trained, and also fine-tuned when instruction datasets or human preferences are unavailable but text documents are.
The mechanism is actually simple: treat each document as a long sequence of tokens, create windows of tokens as inputs, and use the token following each window as the output to predict.
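The windowing transformation described above could look roughly like the sketch below. This is purely illustrative, not Ludwig's actual API: the function name and parameters (`window_size`, `stride`) are hypothetical.

```python
# Hypothetical sketch of the proposed windowing transformation:
# slide a fixed-size window over a document's token sequence and
# pair each window with the next token as the prediction target.

def make_windows(tokens, window_size, stride=1):
    """Yield (input_window, next_token) pairs from one document."""
    examples = []
    for start in range(0, len(tokens) - window_size, stride):
        window = tokens[start : start + window_size]
        target = tokens[start + window_size]  # next token after the window
        examples.append((window, target))
    return examples

# Toy "document" of token ids.
doc = [10, 11, 12, 13, 14]
print(make_windows(doc, window_size=3))
# -> [([10, 11, 12], 13), ([11, 12, 13], 14)]
```

In practice this transformation would be applied to every document in the dataset, turning a corpus of documents into a flat table of (window, next-token) training examples.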
In Ludwig right now we have the capability to train models in such a way, but we don't have the windowing data transformation that makes it easy.
So the proposed feature is to make it seamless for users to pre-train and fine-tune on a set of documents.
Ideally the feature could be broken down into: