Closed kefniark closed 2 years ago
I started investigating the idea of pre-building small n-gram dictionaries to identify grams unique to one language within a family.
Example: take families such as Spanish-Portuguese or English-Dutch-German, and identify the grams unique to each language in those families.
I tried it, and it worked slightly for some pairs of languages but not for others, and it even caused an accuracy drop for some. Overall the result was far from useful: only +0.25% accuracy for a lot of dedicated code and data. I decided to give up on this and focus on other areas for the moment.
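For readers who want to try the same experiment, here is a minimal sketch of how such a dictionary could be pre-built. It is not the code from this experiment: the function names, the character-trigram choice, the `min_count` threshold, and the tiny corpora are all assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of a lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def unique_grams(corpora, n=3, min_count=2):
    """Map each language to the grams seen at least `min_count` times in its
    own corpus and never in the corpus of any other language of the family."""
    counts = {
        lang: Counter(g for sentence in sentences for g in char_ngrams(sentence, n))
        for lang, sentences in corpora.items()
    }
    result = {}
    for lang, counter in counts.items():
        # Collect every gram observed in the other languages of the family.
        others = set()
        for other_lang, other_counter in counts.items():
            if other_lang != lang:
                others.update(other_counter)
        result[lang] = {
            g for g, k in counter.items() if k >= min_count and g not in others
        }
    return result


# Hypothetical toy family; a real dictionary would need far larger corpora.
family = {
    "es": ["la niña", "el niño"],
    "pt": ["a menina", "o menino"],
}
uniq = unique_grams(family)
```

With corpora this small the resulting sets are noisy, which is consistent with the mixed results reported above: a gram that looks unique in the training data can still show up in the other language at detection time.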
Description
Some pairs of languages are always at the top of the detection errors.
All of them make sense: Dutch and English are really close, and the same goes for Portuguese and Spanish.
The idea is to find a way to reduce the error rate by putting some extra weight on grams that appear in only one language of the pair.
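One way the extra weight could be applied is as a re-scoring step on top of the detector's existing output. The sketch below is only an illustration under assumptions: `rescore`, the `bonus` value, the input scores, and the unique-gram sets are all hypothetical and do not come from the library.

```python
def rescore(text, scores, unique, n=3, bonus=0.05):
    """Add `bonus` to a language's score for each distinct gram of `text`
    found in that language's unique-gram set, then renormalise to sum to 1."""
    padded = f" {text.lower()} "
    grams = {padded[i:i + n] for i in range(len(padded) - n + 1)}
    boosted = {
        lang: score + bonus * len(grams & unique.get(lang, set()))
        for lang, score in scores.items()
    }
    total = sum(boosted.values())
    if not total:
        return boosted
    return {lang: score / total for lang, score in boosted.items()}


# Hypothetical detector output where the close pair is almost tied.
base_scores = {"es": 0.48, "pt": 0.52}
# Hypothetical grams assumed to occur in only one language of the pair.
unique_sets = {"es": {"niñ", "ño "}, "pt": {"ão ", "nho"}}

adjusted = rescore("el niño", base_scores, unique_sets)
best = max(adjusted, key=adjusted.get)
```

In this toy case the two Spanish-only grams are enough to flip the ranking. The bonus has to be tuned per pair: too large and a single noisy gram overrides an otherwise correct detection, which may explain accuracy drops on some pairs.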