Open fosterjen opened 4 years ago
I lowered the filter thresholds from 1 to 0.5 for CharacterScoreFilter and LanguageIDFilter (for both tools, langid and cld2).
This should allow us to include sentences that contain code-mixing or foreign characters, e.g. Greek or Arabic symbols. After that, we can try a stricter threshold of around 0.7 to see whether that helps. The following lines in tests/data/ga_clean_examples.txt are now kept (they were removed with a threshold of 1):
Léaráid ón leabhar A Manual of Diseases of the Nervous System , ón bhliain 1886 .
Tá an tSind ( Sindis : سنڌ , Urdais : سندھ , Araibis : السند ) ar cheann de cheithre cúigí na Pacastáine agus go stairiúil is í tír dhúchais na Sindigh í .
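To illustrate how a threshold of 1 can reject a sentence like the second one while 0.5 or 0.7 keeps it, here is a minimal sketch of a character-ratio check. It is not the library's implementation of CharacterScoreFilter: the helper names `latin_ratio` and `keep` are hypothetical, and the score is assumed to be the share of alphabetic characters that belong to the Latin script.

```python
import unicodedata


def latin_ratio(sentence: str) -> float:
    """Share of alphabetic characters whose Unicode name marks them as Latin.

    Hypothetical helper for illustration only; digits, punctuation and
    whitespace are ignored.
    """
    letters = [c for c in sentence if c.isalpha()]
    if not letters:
        # Assumption: a sentence without letters is treated as fully valid.
        return 1.0
    latin = sum(1 for c in letters if unicodedata.name(c, "").startswith("LATIN"))
    return latin / len(letters)


def keep(sentence: str, threshold: float) -> bool:
    """Keep the sentence if its Latin-script ratio reaches the threshold."""
    return latin_ratio(sentence) >= threshold


# Shortened version of the second example sentence above.
mixed = (
    "Tá an tSind ( Sindis : سنڌ , Urdais : سندھ , Araibis : السند ) "
    "ar cheann de cheithre cúigí na Pacastáine ."
)

for threshold in (1.0, 0.7, 0.5):
    print(threshold, keep(mixed, threshold))
```

Under these assumptions the handful of Arabic letters pulls the ratio just below 1, so the sentence fails only at the strictest setting. The first example sentence is entirely Latin-script, so a check of this kind would not explain its removal; that one presumably depends on the language-identification score instead.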
With very aggressive filtering, we don't see improvements over the unfiltered results:
https://docs.google.com/spreadsheets/d/1ssKM8xQZSTED_-mhVsmhercU9zmMxYxHmxB06wZM-wY/edit#gid=1677680531
Next, see what happens when we keep more sentences, e.g. those containing titles in English.