Closed calmitchell617 closed 17 hours ago
Hi @calmitchell617, welcome to the repo and thanks for opening this! Sample packing and unstructured datasets for CPT have been at the top of my to-do list for some time. Do you mind sharing some examples of text corpora you might want to train with?
@RdoubleA, thanks for the fast response.
One good example might be The Stack V1, as it is a helpful starting point when training coding assistants. I know a few people personally who would appreciate seeing an example using that dataset in particular.
I would be happy to attempt a PR. I have already spent a good part of today reading the code, and I will keep working on the issue on my own either way.
Yeah, that looks like a great example, and it would be good to have it in the repo. Since it's a massive dataset, you would need to add streaming support via `load_dataset(..., streaming=True)`. If you're interested in opening a PR, I'm more than happy to take a look at it and work with you on it.
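For reference, a rough sketch of what streaming could look like with the Hugging Face `datasets` library. The dataset id `bigcode/the-stack`, the `data/python` subdirectory, and the `content` column are assumptions based on the public dataset card, and `stream_texts` and `take` are hypothetical helper names, not part of this repo. The dataset is gated, so this would also need an authenticated Hugging Face login.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def stream_texts(
    name: str = "bigcode/the-stack",   # assumed dataset id
    data_dir: str = "data/python",     # assumed per-language subdirectory
    text_column: str = "content",      # assumed name of the raw-text column
) -> Iterator[str]:
    """Lazily yield raw text without downloading the full dataset."""
    # Imported here so the module can be loaded even without `datasets` installed.
    from datasets import load_dataset

    # streaming=True returns an iterable dataset that fetches rows on demand.
    ds = load_dataset(name, data_dir=data_dir, split="train", streaming=True)
    for row in ds:
        yield row[text_column]


def take(items: Iterable[str], n: int) -> List[str]:
    """Materialize only the first n items of a (possibly huge) stream."""
    return list(islice(items, n))
```

Something like `take(stream_texts(), 3)` would then pull just three files over the network, which is handy for a quick sanity check before wiring the stream into a training loop.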
Great, I will make a PR in the next few days.
Looks like it was solved with the PR. Please feel free to reopen this issue. Thanks!
Hello, thanks very much for the excellent work on this repo.
There are several examples showing how to create a question-response style dataset, but I can't immediately tell how to continue pretraining with, for example, a corpus of unstructured text.
Are there any examples showing how to pack text examples and continue pretraining?
Thank you
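To make the question concrete, here is a minimal sketch of the kind of packing I have in mind. It is not this repo's implementation; `pack_sequences`, `max_seq_len`, and `eos_id` are hypothetical names, and the inputs are assumed to be already tokenized. Documents are concatenated with an end-of-sequence token between them and cut into fixed-length blocks, so a document may be split across two blocks and the trailing partial block is dropped.

```python
from typing import Iterable, Iterator, List


def pack_sequences(
    token_streams: Iterable[List[int]],
    max_seq_len: int,
    eos_id: int,
) -> Iterator[List[int]]:
    """Greedily pack tokenized documents into blocks of exactly max_seq_len tokens."""
    buffer: List[int] = []
    for tokens in token_streams:
        # Append the document followed by an EOS token marking its boundary.
        buffer.extend(tokens)
        buffer.append(eos_id)
        # Emit as many full blocks as the buffer currently holds.
        while len(buffer) >= max_seq_len:
            yield buffer[:max_seq_len]
            buffer = buffer[max_seq_len:]
    # Any leftover tokens shorter than max_seq_len are discarded in this sketch.


docs = [[1, 2, 3], [4, 5], [6]]
packed = list(pack_sequences(docs, max_seq_len=4, eos_id=0))
```

Because it consumes any iterable, the same function would work on a streamed dataset. It does not mask attention across document boundaries, which a real implementation might want.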