Findings from the tokenizer experiments in notebook 09:
We do not yet know how to tackle misspellings.
I think this is well under control. Two links below with some useful input and suggestions:
https://github.com/google-research/bert#pre-training-tips-and-caveats
https://stackoverflow.com/questions/54938815/data-preprocessing-for-nlp-pre-training-models-e-g-elmo-bert
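A minimal sketch of why misspellings are largely under control, assuming the Hugging Face `transformers` package and the `bert-base-uncased` vocabulary (neither is confirmed as the project setup): BERT's WordPiece tokenizer falls back to smaller subword pieces instead of mapping a misspelled word to an unknown token.

```python
# Sketch: WordPiece handling of a misspelled word (assumes `transformers` is installed).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A known word is split into familiar subword pieces, e.g. ['token', '##ization'].
print(tokenizer.tokenize("tokenization"))

# A misspelled variant is still covered, just broken into more / smaller pieces
# rather than producing an [UNK] token.
print(tokenizer.tokenize("tokenizaton"))
```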
Traditional preprocessing techniques known from classical machine learning, such as stemming and lemmatization, are not applied in this context: BERT works on raw text with its own WordPiece subword tokenization and contextual representations, so it does not rely on this kind of normalization.
Text data should still be examined before use, though, since it can contain artifacts that need to be cleaned up.
Examples of this are HTML expressions (e.g. leftover HTML tags) or characters like `/`.
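A minimal cleanup sketch along these lines; the regex, the helper name `clean_text`, and the sample string are illustrative assumptions, not the project's actual preprocessing:

```python
# Sketch: strip leftover HTML before tokenization (standard library only).
import html
import re

def clean_text(text: str) -> str:
    text = html.unescape(text)            # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)  # drop leftover HTML tags
    text = re.sub(r"\s+", " ", text)      # collapse the resulting whitespace
    return text.strip()

print(clean_text("Price<br>is 3/4 &amp; rising"))  # -> 'Price is 3/4 & rising'
```

Whether characters like `/` should be removed or kept (e.g. in fractions or dates) would still need to be decided case by case after inspecting the data.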