Corpus preprocessing steps

kwonmha / bert-vocab-builder

Builds a WordPiece (subword) vocabulary compatible with Google Research's BERT


Corpus preprocessing steps #13

Open LydiaXiaohongLi opened 4 years ago

LydiaXiaohongLi commented 4 years ago

Hi Kwonmha, thanks for open-sourcing the repo. Can I ask what the general preprocessing steps for the vocab builder are? For an uncased BERT model, is it as follows:

1. Convert the corpus text file to lower case.
2. Remove punctuation from the corpus text file?
3. Build the vocab.
4. Match the vocab file to the BERT model configuration, e.g. take the top 30k lines (as the vocab should be ordered by descending frequency?), and manually adjust the vocab file so that it contains punctuation (i.e. entries for . , ? ! ##. ##, ##? ##! etc.)?
5. Use the vocab file for later pretraining of the BERT model; the pretraining corpus needs to be lower-cased, but without removing punctuation?

Let me know if my understanding is not correct.

Thanks! Regards

kwonmha commented 4 years ago

Hi LydiaXiaohongLi,

I recommend looking at Google's vocab files first. There are several versions: English cased, English uncased, multilingual cased, multilingual uncased, etc.

The fact that both cased and uncased vocabs exist means lower-casing is optional. (answer to question 1)

And if you check those vocabs, you will see that punctuation is included. You don't need to remove punctuation. (answer to question 2)

If you build a vocab with my project or others, it will be ordered by frequency, except for some special tokens at the top of the vocab. (answer to question 4)
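As a rough illustration of the points above, here is a minimal sketch of trimming a frequency-ordered vocab to a target size while keeping the special tokens on top, and of checking which punctuation marks have a standalone entry. The helper names `trim_vocab` and `missing_punctuation` are hypothetical and not part of this repo; the special-token list is an assumption based on the names used in BERT's released vocab files.

```python
import string

# Assumed special tokens (names as used in BERT's released vocab files).
SPECIAL_TOKENS = ("[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]")


def trim_vocab(tokens, size, specials=SPECIAL_TOKENS):
    """Keep the special tokens first, then the most frequent other tokens.

    `tokens` is assumed to be ordered by descending frequency already.
    """
    rest = [t for t in tokens if t not in specials]
    return list(specials) + rest[: max(size - len(specials), 0)]


def missing_punctuation(tokens, marks=string.punctuation):
    """Return the punctuation marks that have no standalone vocab entry."""
    present = set(tokens)
    return [m for m in marks if m not in present]


# Toy vocab standing in for the lines of a generated vocab file.
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
         "the", ",", ".", "##s", "hello"]
trimmed = trim_vocab(vocab, 8)
print(trimmed)
print(missing_punctuation(trimmed, ",.?!"))
```

If the check reports missing marks on a real corpus, that is more likely a sign of a preprocessing or tokenization problem than a reason to hand-edit the vocab.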

LydiaXiaohongLi commented 4 years ago

Thanks kwonmha. A follow-up on the punctuation removal question: if I don't remove punctuation from the corpus file, I will see vocab entries where a word followed by punctuation becomes a single token, e.g. "hello,". So should I build the vocab from a corpus without punctuation, then add punctuation back manually as separate standalone tokens?

Thanks Regards

kwonmha commented 4 years ago

The subword vocab building algorithm will automatically separate "hello," into "hello" and ",", because "," follows many other words, as in "wow," and "well,". So it won't be tied to a particular word unless there are plenty of "hello,"s in the corpus.
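The frequency argument can be illustrated with a toy count. This is only a sketch of the intuition, not the actual vocab building algorithm used by this repo; the corpus string is made up for the example.

```python
from collections import Counter

# Made-up toy corpus, split on whitespace only.
corpus = "hello, world wow, that is nice well, hello again hello there".split()

# How often each whole whitespace token occurs, punctuation attached.
whole = Counter(corpus)

# How often a comma appears at the end of any token.
trailing_comma = sum(1 for tok in corpus if tok.endswith(","))

# How often each word occurs once the trailing comma is stripped.
stems = Counter(tok.rstrip(",") for tok in corpus)

print(whole["hello,"], trailing_comma, stems["hello"])
```

Here "hello," occurs once as a whole token, while a trailing comma occurs three times and "hello" three times, so the separate pieces are far better supported by the counts than the fused token.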

sahelimukherjee92 commented 4 years ago

Hi @kwonmha, the vocab file that I generated has an issue with punctuation.

-(Q). (Proc. (Price, (Poon (Polyak, (Polyak (PoPPCA) (Pinto (Photo (Pham (Petersen (Perron, (Pearl, (Pati (Palatucci (Paccanaro (PSD) (PMF).

Could you please suggest how I can separate the punctuation? Does that involve further preprocessing?

kwonmha commented 3 years ago

I fixed this problem. Check if it works. Thank you
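For readers who still see entries like "(Polyak," with another vocab builder, one possible workaround is to split punctuation into standalone tokens before building the vocab. The sketch below is an assumption about what such preprocessing could look like, in the spirit of the punctuation splitting done by BERT's basic tokenizer; it is not the fix applied in this repo, and the function names are hypothetical.

```python
import unicodedata


def is_punctuation(ch):
    """Treat ASCII symbols and Unicode P* categories as punctuation."""
    cp = ord(ch)
    if 33 <= cp <= 47 or 58 <= cp <= 64 or 91 <= cp <= 96 or 123 <= cp <= 126:
        return True
    return unicodedata.category(ch).startswith("P")


def split_on_punctuation(text):
    """Split on whitespace, then make every punctuation mark its own token."""
    tokens = []
    for word in text.split():
        current = ""
        for ch in word:
            if is_punctuation(ch):
                # Flush the word collected so far, then emit the mark alone.
                if current:
                    tokens.append(current)
                    current = ""
                tokens.append(ch)
            else:
                current += ch
        if current:
            tokens.append(current)
    return tokens


print(split_on_punctuation("(Polyak, (PoPPCA) (PMF)."))
```

Joining the resulting tokens with spaces and writing them back out gives a corpus in which "(" and "," can only be counted as separate units.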