jiahe7ay / MINI_LLM

A repository for individuals to experiment with and reproduce the LLM pre-training process.
348 stars, 53 forks

Pre-training data question #13

Open lumiere-ml opened 7 months ago

lumiere-ml commented 7 months ago

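I'd like to ask: the sky dataset is very large, around 500 GB to download in full. Could you describe which data was used to train the model, and how many tokens there were in total?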

jiahe7ay commented 7 months ago

I downloaded the first 20.

lumiere-ml commented 7 months ago

Could that lead to biased data or similar issues? Does choosing the first 20 versus a random 20 make much of a difference?
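
For reference, the two selection strategies being compared can be sketched roughly as below. This is only an illustration, not the repository's actual code: the shard file names, the total shard count, and the helper names `first_k` and `random_k` are all hypothetical.

```python
import random

def first_k(shards, k):
    """Take the first k shards in listing order."""
    return list(shards[:k])

def random_k(shards, k, seed=0):
    """Take k distinct shards uniformly at random; a fixed seed keeps it reproducible."""
    rng = random.Random(seed)
    return sorted(rng.sample(list(shards), k))

# Hypothetical shard names; the real dataset's file names and count may differ.
shards = [f"shard_{i:04d}.jsonl" for i in range(200)]

head = first_k(shards, 20)
sampled = random_k(shards, 20, seed=42)

print("first 20 :", head[:3], "...")
print("random 20:", sampled[:3], "...")
```

If the shards happen to be ordered by source or crawl date, the first 20 may not be representative of the whole set, whereas a seeded random sample draws from the entire listing and can still be reproduced later.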