Problem during data preprocessing

jiahe7ay / MINI_LLM

This is a repository used by individuals to experiment and reproduce the pre-training process of LLM.

348 stars 53 forks

Open zhangrui17 opened 5 months ago

zhangrui17 commented 5 months ago

lines.append(text + "<|im_end|>")
chunk_data = split_txt_corpus_to_chunk_en(lines)

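With this code, if the length of the preceding sample happens to be close to 2048, can the characters of "<|im_end|>" be cut apart and end up in two different samples?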

zhangrui17 commented 5 months ago

Also, this function is not suitable for English data, because it cuts English words apart in the middle.
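To make the two concerns concrete, here is a minimal sketch. It assumes the chunking is a plain fixed-width character cut; `naive_chunks` is a hypothetical stand-in written for illustration, not the repository's actual `split_txt_corpus_to_chunk_en`, and the small `max_len` replaces 2048 only to keep the output readable.

```python
END = "<|im_end|>"

def naive_chunks(lines, max_len):
    """Hypothetical stand-in: join all samples, then cut every max_len characters."""
    joined = "".join(lines)
    return [joined[i:i + max_len] for i in range(0, len(joined), max_len)]

lines = ["hello world" + END, "second sample" + END]
chunks = naive_chunks(lines, max_len=16)

for chunk in chunks:
    print(repr(chunk))
# 'hello world<|im_'   <- the end marker is cut here
# 'end|>second samp'   <- the word "sample" is cut here
# 'le<|im_end|>'
```

If the real function behaves like this stand-in, a tokenizer would no longer see the first "<|im_end|>" as a single special token, because its characters sit in two different chunks.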

A1yez1 commented 4 months ago

Yes, this problem does exist. You can write your own splitting code.
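One possible way to write such splitting code is sketched below. It is not the repository's implementation: `split_on_boundaries` and `UNIT_RE` are hypothetical names, the limit is counted in characters (the real pipeline may count tokens), and it assumes whitespace-delimited text such as English. The idea is to break the corpus into atomic units (the end marker, whitespace runs, and words) and pack whole units greedily, so a cut can only fall between units.

```python
import re

END = "<|im_end|>"
_E = re.escape(END)
# An atomic unit is the end marker, a run of whitespace,
# or a run of non-whitespace characters that stops before the marker.
UNIT_RE = re.compile(rf"{_E}|\s+|(?:(?!{_E})\S)+")

def split_on_boundaries(lines, max_len=2048):
    """Pack whole units into chunks of at most max_len characters."""
    chunks, current = [], ""
    for unit in UNIT_RE.findall("".join(lines)):
        # Start a new chunk if this unit would overflow the current one.
        if current and len(current) + len(unit) > max_len:
            chunks.append(current)
            current = ""
        # Fallback: a single unit longer than max_len is hard-cut.
        while len(unit) > max_len:
            chunks.append(unit[:max_len])
            unit = unit[max_len:]
        current += unit
    if current:
        chunks.append(current)
    return chunks

lines = ["hello world" + END, "second sample" + END]
chunks = split_on_boundaries(lines, max_len=16)
print(chunks)
# ['hello world', '<|im_end|>second', ' sample', '<|im_end|>']
```

Concatenating the chunks reproduces the input exactly, and every "<|im_end|>" stays in one piece as long as `max_len` is at least the marker's length. A marker can still land in the chunk after its sample; if each sample must end with its own marker, the packing would need to treat "last word plus marker" as one unit instead.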