I want to use the Megatron framework to pre-train a Chinese NLP model. I currently have a Chinese corpus and a vocab.txt file. However, most frameworks seem to require vocab.json and merge.txt instead. Can these two files be generated from a Chinese corpus? If so, how? I haven't been able to find a suitable tutorial on Google.
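To make the question concrete, here is a minimal sketch of what the two files usually hold in a GPT-2 style BPE setup: vocab.json maps each token to an integer id, and the merges file lists the learned pair merges in order. This is a toy character-level BPE trainer in plain Python, not Megatron's or any library's actual implementation. The function names `train_bpe` and `save_bpe_files`, the whitespace pre-splitting, and the tie-breaking rule are all assumptions made for illustration. Real GPT-2 files are byte-level, so the output of this sketch is probably not a drop-in replacement.

```python
import json
from collections import Counter


def train_bpe(lines, num_merges):
    """Toy character-level BPE (hypothetical helper, for illustration only).

    Returns (vocab, merges): vocab maps token -> id, merges is an ordered
    list of (left, right) pairs.
    """
    # Split on whitespace; unsegmented Chinese text becomes one long "word" per line.
    words = Counter()
    for line in lines:
        for word in line.split():
            words[tuple(word)] += 1

    # Initial vocabulary: every single character, numbered in order of first appearance.
    vocab = {}
    for word in words:
        for ch in word:
            vocab.setdefault(ch, len(vocab))

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for left, right in zip(word, word[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break

        # Most frequent pair wins; ties are broken by the pair itself (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = best[0] + best[1]
        vocab.setdefault(merged, len(vocab))

        # Rewrite every word with the chosen pair fused into one symbol.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words

    return vocab, merges


def save_bpe_files(vocab, merges, vocab_path, merges_path):
    """Write the two files in the layout GPT-2 style tokenizers commonly use."""
    with open(vocab_path, "w", encoding="utf-8") as f:
        json.dump(vocab, f, ensure_ascii=False, indent=2)
    with open(merges_path, "w", encoding="utf-8") as f:
        f.write("#version: 0.2\n")  # header line found in GPT-2 merges files
        for left, right in merges:
            f.write(f"{left} {right}\n")
```

Calling `train_bpe(open("corpus.txt", encoding="utf-8"), 1000)` and then `save_bpe_files(vocab, merges, "vocab.json", "merges.txt")` would produce both files, where `corpus.txt` and the merge count are placeholders. For a real corpus, a library trainer such as `ByteLevelBPETokenizer` from the Hugging Face `tokenizers` package (its `train` and `save_model` methods) is likely the more practical route, since it writes a byte-level vocab.json and merges.txt directly.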