大哥麻烦给个数据样本参考一下

jiahe7ay / MINI_LLM

This is a repository used by individuals to experiment and reproduce the pre-training process of LLM.

348 stars 53 forks source link

Open kingpingyue opened 7 months ago

kingpingyue commented 7 months ago

大哥麻烦给个数据样本参考一下，我想了解一下数据处理部分

jiahe7ay commented 7 months ago

"text":xxxxxxxxx （最长为512）im_end来区分两个文本，我是尽量填充到最大长度的

kingpingyue commented 7 months ago

就是例如一篇文章，我怎么把这篇文章处理成可以训练模型的数据，代码我没太看懂

kingpingyue commented 7 months ago

input_ids = [np.array(item) for item in outputs["input_ids"]]

这句我没看懂是为什么

kingpingyue commented 7 months ago

为啥要转np.arrary啊

jiahe7ay commented 7 months ago

如果词表大小小于 65535 用uint16存储，节省磁盘空间，否则用uint32存储

kingpingyue commented 7 months ago

哦哦其实 input_batch = [] input_batch.append(input_ids)类似，指定数据类型会节省磁盘空间