Open LydiaXiaohongLi opened 4 years ago
Hi LydiaXiaohongLi,
I recommend looking at Google's released vocab files first. There are several versions: English-Cased, English-Uncased, Multilingual-Cased, Multilingual-Uncased, etc.
The existence of both cased and uncased versions shows that lower-casing is optional. (answer to question 1)
And if you check those vocabs, you will see that punctuation is included, so you don't need to remove it. (answer to question 2)
If you build a vocab with my project or others, it will be ordered by frequency, except for some special tokens at the top of the vocab. (answer to question 4)
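To illustrate the ordering described above, here is a minimal sketch of a frequency-ordered vocab with special tokens placed first. It is not this project's code: `build_vocab` and `max_size` are hypothetical names, and the special-token list is assumed to be the usual BERT set.

```python
from collections import Counter

# Assumption: the usual BERT special tokens, kept at the top of the vocab.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

def build_vocab(tokens, max_size=None):
    """Hypothetical helper: special tokens first, then tokens by frequency."""
    counts = Counter(tokens)
    # Sort by descending count; break ties alphabetically so output is stable.
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    vocab = SPECIAL_TOKENS + [t for t in ordered if t not in SPECIAL_TOKENS]
    return vocab[:max_size] if max_size else vocab

vocab = build_vocab("the cat , the dog , the end".split())
print(vocab[5:])  # ['the', ',', 'cat', 'dog', 'end']
```

Note that "," is kept as an ordinary token and simply lands wherever its frequency puts it.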
Thanks kwonmha. A follow-up on the punctuation removal question: if I don't remove punctuation from the corpus file, the vocab ends up with words followed by punctuation as single tokens, e.g. "hello,". So I want to ask whether I should build the vocab from a corpus without punctuation, then add the punctuation back manually as separate standalone tokens?
Thanks Regards
A subword vocab-building algorithm will automatically separate "hello," into "hello" and ",", because "," follows many other words, as in "wow," and "well,". So it won't be tied to other tokens unless the corpus contains plenty of "hello,"s.
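The effect can be seen with a toy byte-pair-encoding (BPE) learner. This is only a sketch to show the frequency argument: it is plain BPE on a made-up word-frequency table, not necessarily the algorithm this project uses, and `learn_bpe` is a hypothetical name.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Every word starts out as a tuple of single characters.
    words = {tuple(w): f for w, f in word_freqs.items()}
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Highest count wins; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merged = {}
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = merged
    # Map each original word to its final segmentation.
    return {"".join(symbols): symbols for symbols in words}

# Made-up counts: "hello" is common, "hello," is rare, "," follows several words.
segments = learn_bpe({"hello": 3, "hello,": 1, "wow,": 1, "well,": 1}, num_merges=4)
print(segments["hello,"])  # ('hello', ',')
```

With these counts the four merges are spent assembling "hello", and the pair ("hello", ",") occurs only once, so it is never merged. If "hello," were far more frequent in the corpus, it could end up as a single token.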
Hi @kwonmha, the vocab file that I generated has an issue with punctuation.
-(Q). (Proc. (Price, (Poon (Polyak, (Polyak (PoPPCA) (Pinto (Photo (Pham (Petersen (Perron, (Pearl, (Pati (Palatucci (Paccanaro (PSD) (PMF).
Could you please suggest how I can separate the punctuation? Does that involve further preprocessing?
I fixed this problem. Check if it works. Thank you
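For anyone who wants to split punctuation as a separate preprocessing step instead, a minimal regex-based sketch is shown below. This is not the fix applied in the repo; `split_punctuation` and its `lowercase` flag are hypothetical, and the rule "every non-word, non-space character is its own token" is an assumption that may be too coarse for some corpora.

```python
import re

def split_punctuation(text, lowercase=False):
    """Hypothetical helper: split text into word runs and single punctuation marks."""
    if lowercase:
        text = text.lower()
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = split_punctuation("(Price, (PSD) (PMF).")
print(tokens)  # ['(', 'Price', ',', '(', 'PSD', ')', '(', 'PMF', ')', '.']
```

Running the corpus through a step like this before building the vocab keeps entries such as "(Price," from appearing as single tokens.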
Hi Kwonmha, thanks for open-sourcing the repo. Can I ask whether the general preprocessing steps for the vocab builder, for an uncased BERT model, are as follows:
Thanks! Regards