tarrade / proj_multilingual_text_classification

Explore multilingual text classification using embeddings, BERT and deep learning architectures
Apache License 2.0

What preprocessing steps are required / possible when using BERT? #7

Closed · vluechinger closed this 4 years ago

vluechinger commented 4 years ago

Traditional preprocessing techniques known from classical machine learning, such as stemming and lemmatization, are not applied in this context, since BERT works on raw text with its own subword tokenization rather than on normalized word forms.

The text data should still be examined first, though, since it can contain structures that need to be taken care of.

Examples of this are HTML expressions (e.g. leftover tags) or characters like `/`.
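
A minimal sketch of that kind of light-touch cleaning before tokenization might look like the following (the function, the regex replacements and the multilingual checkpoint are illustrative assumptions, not code from this repository):

```python
import re

from transformers import BertTokenizer

def light_clean(text: str) -> str:
    """Strip leftover HTML tags and collapse whitespace; leave everything else untouched."""
    text = re.sub(r"<[^>]+>", " ", text)       # drop stray HTML tags
    text = re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace and newlines
    return text

# Hypothetical usage with a multilingual BERT checkpoint.
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
raw = "First line<br />second part / third part"
print(tokenizer.tokenize(light_clean(raw)))
```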

vluechinger commented 4 years ago

Findings from the tokenizer experiments in notebook 09:

We do not yet know how to tackle misspellings.
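
One reason misspellings are not a complete showstopper is that the WordPiece tokenizer falls back to subword pieces instead of an unknown token. A small illustration (the checkpoint name is an assumption, not taken from notebook 09):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

# Both words are split into subword pieces; the misspelled variant just gets a
# less natural segmentation instead of being mapped to [UNK].
for word in ["classification", "clasification"]:
    print(word, "->", tokenizer.tokenize(word))
```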

tarrade commented 4 years ago

I think this is well under control. Two links below with some nice input and suggestions:
https://github.com/google-research/bert#pre-training-tips-and-caveats
https://stackoverflow.com/questions/54938815/data-preprocessing-for-nlp-pre-training-models-e-g-elmo-bert