Findings from the tokenizer experiments in notebook 09:
We do not yet know how to tackle misspellings.
I think this is well under control. Two links below with some useful input and suggestions:
https://github.com/google-research/bert#pre-training-tips-and-caveats
https://stackoverflow.com/questions/54938815/data-preprocessing-for-nlp-pre-training-models-e-g-elmo-bert
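A minimal sketch of why misspellings are largely under control, assuming the Hugging Face `transformers` package and the `bert-base-uncased` vocabulary (neither is confirmed as the project setup): BERT's WordPiece tokenizer falls back to smaller subword pieces instead of mapping a misspelled word to an unknown token.

```python
# Sketch: WordPiece handling of a misspelled word (assumes `transformers` is installed).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A known word is split into familiar subword pieces, e.g. ['token', '##ization'].
print(tokenizer.tokenize("tokenization"))

# A misspelled variant is still covered, just broken into more / smaller pieces
# rather than producing an [UNK] token.
print(tokenizer.tokenize("tokenizaton"))
```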
Traditional preprocessing techniques known from classical machine learning, such as stemming and lemmatization, are not applied in this context: BERT works on raw text with its own WordPiece subword tokenization and contextual representations, so it does not rely on this kind of normalization.
Text data should still be examined before use, though, since it can contain artifacts that need to be cleaned up.
Examples of this are HTML expressions (e.g. leftover HTML tags) or characters like `/`.
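A minimal cleanup sketch along these lines; the regex, the helper name `clean_text`, and the sample string are illustrative assumptions, not the project's actual preprocessing:

```python
# Sketch: strip leftover HTML before tokenization (standard library only).
import html
import re

def clean_text(text: str) -> str:
    text = html.unescape(text)            # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)  # drop leftover HTML tags
    text = re.sub(r"\s+", " ", text)      # collapse the resulting whitespace
    return text.strip()

print(clean_text("Price<br>is 3/4 &amp; rising"))  # -> 'Price is 3/4 & rising'
```

Whether characters like `/` should be removed or kept (e.g. in fractions or dates) would still need to be decided case by case after inspecting the data.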