Robustness to missing accents, all-caps text and other deviations from well-edited text

jbrry / Irish-BERT

Repository to store helper scripts for creating an Irish BERT model.

Robustness to missing accents, all-caps text and other deviations from well-edited text #30

Open jowagner opened 4 years ago

jowagner commented 4 years ago

To make our BERT model more robust to the deviations from well-edited text that are typical of real-world input, we could augment the training corpus with synthetic text derived from the current corpus. Possible transformations include removing some accents from characters in a way that mimics social media content, converting text to all-caps, removing punctuation and/or spaces, substituting short forms used in Irish text messages, and inserting common spelling errors.
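A minimal sketch of three of these transformations (accent removal, all-caps, punctuation removal), using only the Python standard library. The function names and the default probabilities are hypothetical placeholders, not values taken from this repository; short forms and spelling errors would need Irish-specific word lists and are left out.

```python
import random
import string
import unicodedata


def strip_accents(text, prob, rng):
    """Drop the accent from each accented character with probability `prob`."""
    out = []
    for ch in text:
        decomposed = unicodedata.normalize("NFD", ch)
        if len(decomposed) > 1 and rng.random() < prob:
            # Keep the base character, discard the combining marks.
            out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
        else:
            out.append(ch)
    return "".join(out)


def drop_punctuation(text, prob, rng):
    """Delete each ASCII punctuation character with probability `prob`."""
    return "".join(
        ch for ch in text
        if not (ch in string.punctuation and rng.random() < prob)
    )


def augment(sentence, rng, accent_prob=0.5, caps_prob=0.1, punct_prob=0.3):
    """Return a noisy copy of `sentence`; all probabilities are assumed values."""
    noisy = strip_accents(sentence, accent_prob, rng)
    if rng.random() < caps_prob:
        noisy = noisy.upper()
    return drop_punctuation(noisy, punct_prob, rng)
```

Calling `augment(line, random.Random(seed))` on each corpus line and appending the result to the training data would give one synthetic variant per sentence. The probabilities would ideally be estimated from real social media text rather than chosen by hand.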

Related: issue #124

jowagner commented 4 years ago

It may work better to apply these changes not as a pre-processing step but inside the BERT model, on the input side only. In the case of accents, for example, the model would receive input with the accents removed and be asked to restore them.
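One possible reading of this idea, sketched at the word level: the input side is corrupted while the prediction target stays clean, so the model learns to restore accents. The helper names are hypothetical, and a real implementation would operate on BERT's subword tokens inside the training data pipeline rather than on whitespace-separated words.

```python
import unicodedata


def remove_accents(word):
    """Return `word` with all combining marks removed."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", stripped)


def make_restoration_example(sentence):
    """Build (noisy inputs, clean targets, changed positions) for one sentence."""
    targets = sentence.split()
    inputs = [remove_accents(w) for w in targets]
    # Positions where the model has to restore an accent.
    changed = [i for i, (a, b) in enumerate(zip(inputs, targets)) if a != b]
    return inputs, targets, changed
```

The `changed` positions could be used to restrict the loss to words that actually lost an accent, analogous to computing the masked language model loss only at masked positions. Whether that helps is an open question here.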

jowagner commented 4 years ago

Issue #32 observes that input text may contain character entity references such as &quot; instead of plain characters. As this can also happen in test input, it is desirable to make our BERT model robust to such encoding errors. It may be possible to achieve this by augmenting the training data with additional text in which some characters are replaced with entities. The replacement probabilities should be set to mimic the frequencies observed in raw text.
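A rough sketch of such an entity-injection step. The table of characters, entities and probabilities below is an assumption for illustration only; the actual probabilities would have to be estimated from the raw text, as noted above.

```python
import html
import random

# Hypothetical table: character -> (entity, replacement probability).
# The probabilities are placeholders, not measured values.
ENTITY_PROBS = {
    '"': ("&quot;", 0.05),
    "&": ("&amp;", 0.05),
    "'": ("&#39;", 0.02),
}


def inject_entities(text, rng, table=ENTITY_PROBS):
    """Replace characters listed in `table` with their entity at the given rate."""
    out = []
    for ch in text:
        entity, prob = table.get(ch, (ch, 0.0))
        out.append(entity if rng.random() < prob else ch)
    return "".join(out)


def is_reversible(original, noisy):
    """Sanity check: decoding the entities should give back the original text."""
    return html.unescape(noisy) == original
```

Because the loop iterates over the original characters only, the `&` of an inserted entity is never escaped a second time. Doubly escaped text such as `&amp;quot;` may also occur in real data and would need a separate pass.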