pytorch / torchtune

A Native-PyTorch Library for LLM Fine-tuning
BSD 3-Clause "New" or "Revised" License

Seeking guidance on continuing pretraining #809

Closed calmitchell617 closed 17 hours ago

calmitchell617 commented 2 months ago

Hello, thanks very much for the excellent work on this repo.

There are several examples showing how to create a question-response style dataset, but I can't immediately tell how to continue pretraining with, for example, a corpus of unstructured text.

Are there any examples showing how to pack text examples and continue pretraining?

Thank you

RdoubleA commented 2 months ago

Hi @calmitchell617, welcome to the repo and thanks for opening this! Sample packing and unstructured datasets for CPT have been at the top of my to-do list for some time. Do you mind sharing some examples of text corpora you might want to train with?
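For anyone landing here, "sample packing" here means concatenating tokenized documents (separated by an EOS token) into one stream and slicing it into fixed-length training blocks, so short documents don't waste padding. A minimal illustration with made-up token IDs, not torchtune's actual implementation:

```python
def pack(docs, block_size, eos_id):
    """Concatenate tokenized docs (lists of token ids), inserting eos_id
    between documents, then slice the stream into fixed-length blocks.
    The ragged tail shorter than block_size is dropped."""
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)
    return [
        stream[i : i + block_size]
        for i in range(0, len(stream) - block_size + 1, block_size)
    ]

# Three toy "documents" packed into blocks of 4 tokens:
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = pack(docs, block_size=4, eos_id=0)
# blocks -> [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

Real implementations also track document boundaries so attention masks can prevent tokens from attending across documents; this sketch only shows the slicing.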

calmitchell617 commented 2 months ago

@RdoubleA, thanks for the fast response.

One good example might be The Stack V1, as it is a helpful starting point when training coding assistants. I know a few people personally who would appreciate seeing an example using that dataset in particular.

I would be happy to attempt contributing a PR. I have already looked at the code quite a bit today and will continue working on the issue on my own anyway.

RdoubleA commented 2 months ago

Yeah, that looks like a great example, and it would be good to have in the repo. Since it's a massive dataset, you would need to add streaming support via `load_dataset(..., streaming=True)`. If you're interested in opening up a PR, I'm more than happy to take a look and work with you on it.
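A streaming dataset is just an iterable of records, so downstream code can consume it lazily without downloading the whole corpus. A rough sketch of the consumption pattern, using a fake in-memory stream as a stand-in (the dataset name and `content` field below are illustrative, and this is not torchtune's dataset API):

```python
from itertools import islice

def iter_text(records, field="content"):
    """Yield raw text from an iterable of records.

    With the real dataset, `records` would come from something like:
        records = load_dataset("bigcode/the-stack", streaming=True, split="train")
    which returns a lazy iterable rather than materializing the corpus.
    """
    for rec in records:
        yield rec[field]

# Stand-in for a streamed dataset: a generator of dicts, never fully in memory.
fake_stream = ({"content": f"file {i}"} for i in range(1_000_000))

# Pull only the first three records; the rest are never produced.
first = list(islice(iter_text(fake_stream), 3))
# first -> ["file 0", "file 1", "file 2"]
```

The key point is that nothing here indexes or measures the dataset up front, which is what makes the approach viable for a corpus the size of The Stack.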

calmitchell617 commented 2 months ago

Great, I will make a PR in the next few days.

felipemello1 commented 17 hours ago

Looks like it was solved with the PR. Please feel free to reopen this issue. Thanks!