togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0
4.53k stars 346 forks

will there be a trained model? #19

Closed rozek closed 1 year ago

rozek commented 1 year ago

First of all: thank you very much for your contribution!

That said, I still have a question: in order to really "democratise" AI, a trained model will be needed that may be used for (fine-tuning and) inference - not too many people have the resources to train a new model from scratch.

Will such a model be made available? And, if yes, do you have any idea when?

Thanks in advance for your effort!

Andreas Rozek

fnpanic commented 1 year ago

As you stated correctly, training a model on this dataset from scratch requires a massive amount of compute resources. According to the blog post from Together, I would guess a model is already in training:

Having reproduced the pre-training data, the next step is to train a strong base model. As part of the INCITE program, with support from Oak Ridge Leadership Computing Facility (OLCF), we are training a full suite of models, with the first becoming available in the coming weeks.

source: https://www.together.xyz/blog/redpajama

rozek commented 1 year ago

Great! That sounds promising! Thank you very much for this hint!