jiahe7ay / MINI_LLM

A personal repository for experimenting with and reproducing the pre-training process of an LLM.
348 stars 53 forks

Could you please provide a data sample for reference? #5

Open kingpingyue opened 7 months ago

kingpingyue commented 7 months ago

Could you please provide a data sample for reference? I'd like to understand the data processing part.

jiahe7ay commented 7 months ago

"text": xxxxxxxxx (maximum length 512). im_end is used to separate two texts, and I try to fill each sample up to the maximum length.

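A minimal sketch of the packing idea described above, assuming the 512 limit is counted in tokens and that texts are simply concatenated with a separator id between them. The names `pack_texts`, `toy_encode` and `IM_END_ID` are hypothetical and are not taken from the repository; a real run would use the project's tokenizer and its actual im_end token id.

```python
MAX_LEN = 512  # maximum sample length mentioned in the comment above


def pack_texts(texts, encode, sep_id, max_len=MAX_LEN):
    """Concatenate tokenized texts, separated by sep_id, and cut the
    resulting stream into chunks of at most max_len tokens."""
    stream = []
    for text in texts:
        stream.extend(encode(text))
        stream.append(sep_id)  # marks the boundary between two texts
    return [stream[i:i + max_len] for i in range(0, len(stream), max_len)]


def toy_encode(text):
    # Toy stand-in for a real tokenizer: one id per character,
    # shifted by 1 so that id 0 stays free for the separator.
    return [ord(ch) + 1 for ch in text]


IM_END_ID = 0  # hypothetical id for the im_end separator token

# Small max_len so the chunking is easy to see.
samples = pack_texts(["first article", "second article"], toy_encode, IM_END_ID, max_len=8)
for sample in samples:
    print(len(sample), sample)
```

Every chunk except possibly the last is filled to `max_len`, which matches the "fill up to the maximum length" remark; how the final short chunk is handled (padded, dropped, or kept) is left open here.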
kingpingyue commented 7 months ago

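I mean, take an article for example: how do I process it into data that can be used to train the model? I didn't really understand the code.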

kingpingyue commented 7 months ago

input_ids = [np.array(item) for item in outputs["input_ids"]]

I don't understand why this line is needed.

kingpingyue commented 7 months ago

Why does it need to be converted to np.array?

jiahe7ay commented 7 months ago

If the vocabulary size is less than 65535, the ids are stored as uint16 to save disk space; otherwise they are stored as uint32.
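A minimal sketch of that dtype choice, assuming NumPy and using the threshold stated above. The helper name `to_compact_arrays` and the sample ids are made up for illustration; the repository's own code may pick the dtype elsewhere.

```python
import numpy as np


def to_compact_arrays(batch_input_ids, vocab_size):
    """Convert lists of token ids to NumPy arrays using the smallest
    unsigned integer type that can hold every id."""
    # uint16 holds values 0..65535, so it is enough for a small vocabulary.
    dtype = np.uint16 if vocab_size < 65535 else np.uint32
    return [np.array(ids, dtype=dtype) for ids in batch_input_ids]


# Same shape as a tokenizer's batched output; the ids are made up.
outputs = {"input_ids": [[1, 5, 9], [2, 7]]}
input_ids = to_compact_arrays(outputs["input_ids"], vocab_size=32000)

print(input_ids[0].dtype, input_ids[0].nbytes)  # uint16 6
```

Three ids take 6 bytes as uint16, compared with 12 bytes as uint32 or 24 bytes as int64, which is where the disk saving comes from once the arrays are written out.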

kingpingyue commented 7 months ago

Oh, I see. So it is really similar to input_batch = [] followed by input_batch.append(input_ids), and specifying the data type is what saves disk space.