Dataset details - Githubissues

thunlp / LegalPLMs

Source code and checkpoints for legal pre-trained language models.


Closed: mengzixing closed this issue 2 years ago

mengzixing commented 3 years ago

The code does not include the training data. Is there any demo data for train_files/valid_files, or a description of the data format?

HongliMeng commented 3 years ago

Is there a demo available?

xcjthu commented 2 years ago

Sorry for the missing information. We use the tools developed by NVIDIA to store the pre-training data (specifically, https://github.com/NVIDIA/Megatron-LM/blob/main/tools/preprocess_data.py). The textual data must first be converted into token ids.
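
For readers wondering what "convert textual data into token ids" might look like, here is a minimal sketch. It is not the repository's pipeline and does not reproduce the binary files that preprocess_data.py writes. It assumes the input is JSON lines with one `"text"` field per line (the loose JSON layout that Megatron-LM documents for that script), and it uses a hypothetical character-level vocabulary (`build_vocab`, `encode`) in place of the model's real tokenizer.

```python
import json


def build_vocab(texts):
    """Hypothetical character-level vocabulary (stand-in for a real tokenizer)."""
    vocab = {"[UNK]": 0}
    for text in texts:
        for ch in text:
            if ch not in vocab:
                vocab[ch] = len(vocab)
    return vocab


def encode(text, vocab):
    """Map each character to its id; unknown characters map to [UNK]."""
    return [vocab.get(ch, vocab["[UNK]"]) for ch in text]


def to_jsonl(texts):
    """Serialize documents as JSON lines, one {"text": ...} object per line."""
    return "\n".join(json.dumps({"text": t}, ensure_ascii=False) for t in texts)


# Two toy documents standing in for the legal corpus (assumed, not real data).
texts = ["合同纠纷", "合同法"]
jsonl = to_jsonl(texts)
vocab = build_vocab(texts)

# Read the JSON lines back and convert each document into a list of token ids.
token_ids = [encode(json.loads(line)["text"], vocab) for line in jsonl.splitlines()]
print(token_ids)
```

In practice the toy vocabulary would be replaced by the tokenizer that matches the checkpoint, and the JSON lines file would be passed to preprocess_data.py rather than encoded by hand.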