Try to find a filtering sweet spot

I lowered the filter thresholds from 1 to 0.5 for CharacterScoreFilter and for LanguageIDFilter (for both language-ID tools, langid and cld2).

This should allow us to include sentences which contain code-mixing or foreign characters, e.g. Greek or Arabic script. After that, we can try a stricter threshold of 0.7 or so to see whether that helps.

The following lines in tests/data/ga_clean_examples.txt are now kept (they had been removed before with a threshold of 1):

Léaráid ón leabhar A Manual of Diseases of the Nervous System , ón bhliain 1886 .
Tá an tSind ( Sindis : سنڌ , Urdais : سندھ , Araibis : السند ) ar cheann de cheithre cúigí na Pacastáine agus go stairiúil is í tír dhúchais na Sindigh í .
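To make the effect of the threshold concrete, here is a rough sketch of a character-score check. It is not the CharacterScoreFilter implementation: `latin_score` and `keep` are hypothetical helpers, and I am assuming the score is the proportion of alphabetic characters that are in Latin script.

```python
import unicodedata


def latin_score(sentence: str) -> float:
    """Proportion of alphabetic characters whose Unicode name marks them as Latin.

    Assumption: this approximates a character-score filter; the real filter
    may define the score differently.
    """
    letters = [ch for ch in sentence if ch.isalpha()]
    if not letters:
        # No alphabetic characters at all: treat as passing.
        return 1.0
    latin = sum(1 for ch in letters if unicodedata.name(ch, "").startswith("LATIN"))
    return latin / len(letters)


def keep(sentence: str, threshold: float) -> bool:
    """Keep the sentence if its Latin-script proportion reaches the threshold."""
    return latin_score(sentence) >= threshold


sind = (
    "Tá an tSind ( Sindis : سنڌ , Urdais : سندھ , Araibis : السند ) ar cheann de "
    "cheithre cúigí na Pacastáine agus go stairiúil is í tír dhúchais na Sindigh í ."
)

for threshold in (1.0, 0.7, 0.5):
    print(threshold, keep(sind, threshold))
```

Under this assumption the Sind sentence fails at a threshold of 1 because of the Arabic-script words, but passes at both 0.5 and 0.7. The first example line is entirely Latin script, so it was presumably removed by the language ID filter rather than the character score.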
