Hello,
I am a bit confused by the pipeline.
When I look at the "Enrico" data mixtures, is that the final format used to train the model (besides tokenization, of course), or are there other steps that need to be run to uniformize the various data sources? I ask because I see "Patterns" in templates.py.
If so, which script needs to be run to process that data and produce ready-to-train datasets?
Thanks, and sorry if this sounds stupid.