jiahe7ay / MINI_LLM

A repository for individuals to experiment with and reproduce the LLM pre-training process.
327 stars 52 forks

Question about pre-training wiki data processing #22

Closed: daneren closed this issue 4 months ago

daneren commented 5 months ago

https://github.com/jiahe7ay/MINI_LLM/blob/6115a596b60bab2ed4e0c91e8f00dfe068ca7b29/dataset_utils/generate_data.py#L194C5-L194C55

The code chunks wikipedia-cn-20230720-filtered.json at this line. Why is this step needed? Does it improve model quality, or is it there for some other purpose?

jiahe7ay commented 5 months ago

It is only there to push training efficiency as high as possible, so that no chance to fill GPU memory is wasted.
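
To illustrate the point in the reply above, here is a minimal sketch of how chunking can cut padding waste. It is not the code from generate_data.py: the function names `pack_into_chunks` and `padding_fraction`, the `eos_id` separator, and the assumption that documents are already tokenized into lists of integer IDs are all hypothetical choices made for this example. The idea is to concatenate documents of uneven length and slice the stream into fixed-size chunks, so nearly every position in a batch holds a real token instead of padding.

```python
from typing import Iterable, List, Sequence


def pack_into_chunks(
    docs: Iterable[Sequence[int]], chunk_size: int, eos_id: int
) -> List[List[int]]:
    """Concatenate tokenized docs (separated by eos_id) and cut fixed-size chunks.

    Hypothetical helper for illustration; the repository's own chunking
    logic may differ (for example, it may operate on raw text).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks: List[List[int]] = []
    buffer: List[int] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(eos_id)  # mark the document boundary
        # Emit every full chunk currently available in the buffer.
        while len(buffer) >= chunk_size:
            chunks.append(buffer[:chunk_size])
            buffer = buffer[chunk_size:]
    if buffer:
        chunks.append(buffer)  # the final chunk may be shorter than chunk_size
    return chunks


def padding_fraction(seqs: Sequence[Sequence[int]], length: int) -> float:
    """Fraction of positions that are padding when each sequence is
    truncated or padded to exactly `length` tokens."""
    total = len(seqs) * length
    real = sum(min(len(s), length) for s in seqs)
    return 1.0 - real / total


if __name__ == "__main__":
    docs = [[1, 2, 3], [4], [5, 6, 7, 8, 9]]
    packed = pack_into_chunks(docs, chunk_size=4, eos_id=0)
    print("packed chunks:", packed)
    print("padding without packing:", padding_fraction(docs, 4))
    print("padding with packing:", padding_fraction(packed, 4))
```

In this toy input, padding each document separately to length 4 leaves a third of the positions empty, while the packed chunks have none. The trade-off is that a chunk can span document boundaries, which is why the sketch inserts a separator token between documents.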