jbrry / Irish-BERT

Repository to store helper scripts for creating an Irish BERT model.

Try to find a filtering sweet spot #9

Open fosterjen opened 4 years ago

fosterjen commented 4 years ago

With very aggressive filtering, we don't see improvements over the unfiltered results:

https://docs.google.com/spreadsheets/d/1ssKM8xQZSTED_-mhVsmhercU9zmMxYxHmxB06wZM-wY/edit#gid=1677680531

See what happens when we keep more, e.g. sentences containing titles in English.

jbrry commented 4 years ago

I lowered the filter thresholds from 1 to 0.5 for CharacterScoreFilter and for LanguageIDFilter (for both language-ID tools, langid and cld2).

The following lines in tests/data/ga_clean_examples.txt are now kept (they were removed with the previous threshold of 1). This should allow us to include sentences that contain code-mixing or foreign characters, e.g. Greek or Arabic script. After that, we can try a stricter threshold of around 0.7 to see whether that helps.

Léaráid ón leabhar A Manual of Diseases of the Nervous System , ón bhliain 1886 .
Tá an tSind ( Sindis : سنڌ , Urdais : سندھ , Araibis : السند ) ar cheann de cheithre cúigí na Pacastáine agus go stairiúil is í tír dhúchais na Sindigh í .
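
To illustrate why a threshold of 1 drops a line like the Sindh example while 0.5 keeps it, here is a minimal sketch of a character-script ratio check. It is not the CharacterScoreFilter implementation; the function names `latin_char_ratio` and `keep_sentence` are hypothetical, and "Latin" is approximated by checking the Unicode character name. The language-ID side (langid, cld2) is not modelled here.

```python
import unicodedata


def latin_char_ratio(sentence: str) -> float:
    """Fraction of alphabetic characters whose Unicode name starts with LATIN.

    Assumption: this only approximates a script-based character score.
    Digits, punctuation and whitespace are ignored.
    """
    letters = [ch for ch in sentence if ch.isalpha()]
    if not letters:
        # No letters at all: treat as fully in-script (arbitrary choice).
        return 1.0
    latin = sum(
        1 for ch in letters if unicodedata.name(ch, "").startswith("LATIN")
    )
    return latin / len(letters)


def keep_sentence(sentence: str, threshold: float) -> bool:
    """Keep the sentence if its Latin-letter ratio reaches the threshold."""
    return latin_char_ratio(sentence) >= threshold


if __name__ == "__main__":
    # Shortened version of the second example line above.
    mixed = "Tá an tSind ( Sindis : سنڌ , Urdais : سندھ , Araibis : السند )"
    for threshold in (1.0, 0.7, 0.5):
        print(threshold, keep_sentence(mixed, threshold))
```

With a threshold of 1, a single non-Latin letter is enough to reject a sentence, so any intermediate value trades some noise for better coverage of mixed-script text.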