hiyouga / LLaMA-Factory

Unify Efficient Fine-Tuning of 100+ LLMs

Question about data preprocess #4570

Closed · HackGiter closed 2 days ago

HackGiter commented 2 days ago

I just noticed that the dataset preprocessing uses the `preprocess_pretrain_dataset` function to concatenate different examples and chunk them into blocks of `cutoff_len`. Does this mean that preprocessing throws away the trailing part of the concatenated sequence whose length is shorter than `cutoff_len`? It seems wasteful to discard up to `cutoff_len` tokens for every 1000 examples. I'm not sure, is this a common way to handle sequences for pretraining?

hiyouga commented 2 days ago

https://huggingface.co/learn/nlp-course/chapter7/6?fw=pt#preparing-the-dataset
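The linked course chapter demonstrates exactly this concatenate-and-chunk scheme. A minimal sketch of the pattern, with the remainder-dropping step marked (the `group_texts` name follows the course; the values are illustrative, and this is not LLaMA-Factory's actual implementation):

```python
from itertools import chain

cutoff_len = 8  # block size; LLaMA-Factory calls this cutoff_len (illustrative value)


def group_texts(examples):
    """Concatenate tokenized examples, then split into fixed cutoff_len blocks."""
    # Flatten all tokenized sequences in the batch into one long list per key.
    concatenated = {k: list(chain(*examples[k])) for k in examples}
    total_len = len(next(iter(concatenated.values())))
    # Drop the trailing remainder shorter than cutoff_len -- this is the
    # "wasted" tail the question refers to. It loses at most cutoff_len - 1
    # tokens per tokenization batch.
    total_len = (total_len // cutoff_len) * cutoff_len
    return {
        k: [t[i : i + cutoff_len] for i in range(0, total_len, cutoff_len)]
        for k, t in concatenated.items()
    }


# Tiny demo: three "tokenized" examples totalling 19 tokens yield two blocks
# of 8, and the final 3 tokens are discarded.
batch = {"input_ids": [list(range(7)), list(range(7)), list(range(5))]}
print(group_texts(batch)["input_ids"])
```

Dropping the remainder rather than padding it keeps every training block fully packed with real tokens, and since the loss is bounded by `cutoff_len - 1` tokens per batch, the waste is negligible at pretraining scale.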