google-research / FLAN

Apache License 2.0

[Question] Training data formatting for other custom models #84

Open vince62s opened 1 year ago

vince62s commented 1 year ago

Hello, I am a bit confused by the pipeline. When I look at the "Enrico" data mixtures, is that the final format used to train the model (besides tokenization, of course)? Or are there other steps that need to be run to make the various data sources uniform, since I see "Patterns" in templates.py? If so, which script needs to be run to process the data and output ready-to-train datasets? Thanks, and sorry if this sounds stupid.