microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

[Question] How to generate a merge file and a vocab file #123

Closed Zhang-kg closed 1 year ago

Zhang-kg commented 1 year ago

I want to use the Megatron framework to pre-train on Chinese NLP tasks. I currently have a Chinese corpus and a vocab.txt file, but most frameworks seem to require vocab.json and merge.txt instead. Can I generate these two files from my Chinese corpus? If so, how? Sorry, I haven't found a suitable tutorial on Google.

whalefa1I commented 1 year ago

Extract them from the tokenizer model. A trained BPE tokenizer contains both the vocabulary and the merge rules.
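
For illustration, here is a minimal sketch of that extraction. It assumes the tokenizer model is a Hugging Face `tokenizers`-style `tokenizer.json` with a BPE model (a `model` object holding `vocab` and `merges`); the thread does not say which tokenizer format is meant, and the function name `export_vocab_and_merges` is hypothetical.

```python
import json
from pathlib import Path


def export_vocab_and_merges(tokenizer_json, out_dir):
    """Write vocab.json and merges.txt from a BPE tokenizer.json.

    Assumption: the input follows the Hugging Face `tokenizers` JSON layout,
    i.e. {"model": {"vocab": {token: id}, "merges": [...]}}.
    """
    data = json.loads(Path(tokenizer_json).read_text(encoding="utf-8"))
    model = data["model"]
    if "vocab" not in model or "merges" not in model:
        raise ValueError("tokenizer.json does not contain a BPE vocab and merges")

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # vocab.json maps each token string to its integer id.
    vocab_path = out / "vocab.json"
    vocab_path.write_text(
        json.dumps(model["vocab"], ensure_ascii=False), encoding="utf-8"
    )

    # merges.txt lists one "left right" pair per line, in priority order.
    # Depending on the library version, a merge is stored either as the
    # string "a b" or as the pair ["a", "b"]; both are handled here.
    lines = [m if isinstance(m, str) else " ".join(m) for m in model["merges"]]
    merges_path = out / "merges.txt"
    # GPT-2 style merge files start with a version header line.
    merges_path.write_text(
        "#version: 0.2\n" + "\n".join(lines) + "\n", encoding="utf-8"
    )
    return vocab_path, merges_path
```

Calling `export_vocab_and_merges("tokenizer.json", "out")` would leave `out/vocab.json` and `out/merges.txt` on disk. If your tokenizer is stored in a different format (for example a SentencePiece model), this sketch does not apply as written.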

lonelydancer commented 9 months ago

@Zhang-kg Did you find a solution?
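
In case it helps anyone landing here: if there is no trained tokenizer yet, the two files can be produced by learning BPE merges from the corpus. The following is only a character-level sketch of that idea in plain Python, not Megatron's own procedure. The function names `train_bpe` and `save_bpe` are hypothetical, and the sketch skips the byte-to-character mapping that GPT-2 style byte-level tokenizers apply, so the output illustrates the algorithm and file layout rather than a drop-in replacement. For a real corpus, a dedicated tokenizer library would be the practical choice.

```python
import json
from collections import Counter
from pathlib import Path


def train_bpe(lines, num_merges):
    """Learn BPE merges from text lines (character-level sketch)."""
    # Each whitespace-separated chunk is one "word"; unsegmented Chinese
    # text therefore becomes one long chunk per line.
    words = Counter(tuple(chunk) for line in lines for chunk in line.split())
    vocab = {}
    for ch in sorted({ch for word in words for ch in word}):
        vocab[ch] = len(vocab)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        token = best[0] + best[1]
        if token not in vocab:
            vocab[token] = len(vocab)
        # Replace every occurrence of the best pair with the merged symbol.
        merged = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(token)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    return vocab, merges


def save_bpe(vocab, merges, out_dir):
    """Write vocab.json and merges.txt in the GPT-2 style file layout."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "vocab.json").write_text(
        json.dumps(vocab, ensure_ascii=False), encoding="utf-8"
    )
    body = "".join(f"{left} {right}\n" for left, right in merges)
    (out / "merges.txt").write_text("#version: 0.2\n" + body, encoding="utf-8")
```

A call such as `save_bpe(*train_bpe(open("corpus.txt", encoding="utf-8"), 30000), "out")` would write both files, where `corpus.txt` and the merge count are placeholders. This naive loop rescans the whole corpus on every merge, so it is only suitable for small experiments.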