Open anjackson opened 7 years ago
I would maybe recommend using CLD2 instead of Tika for language recognition. It has a lot more languages, is more accurate and orders of magnitude faster.

Gaelic is included:

{"scots_gaelic", "gd", SCOTS_GAELIC + W10, 0}

and Welsh:

{"welsh", "cy", WELSH + W10, 0}

https://github.com/CLD2Owners/cld2

For a (quite old) comparison between the different systems, I can offer this article: http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
Thanks for this. I'm aware of CLD2 but chose not to use it as it requires bundling a native library rather than being something I can trivially reuse from Java.
Of course this is not insurmountable, but I'm not currently sure how best to package these kinds of dependencies, especially when running map-reduce jobs.
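For reference, one common way to ship a native library with Java code is to bundle the binary inside the JAR as a classpath resource, copy it to a temporary file at startup, and load it from there. The sketch below is only an illustration of that pattern, not something this project does: the class name `NativeLibLoader` and the example resource path are hypothetical, and it assumes a binary built for the right platform is already on the classpath.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: extracts a bundled native library and loads it.
final class NativeLibLoader {

    // Copies the given stream to a temporary file and returns its path.
    static Path extract(InputStream in, String suffix) {
        try {
            Path tmp = Files.createTempFile("nativelib-", suffix);
            tmp.toFile().deleteOnExit(); // best-effort clean-up when the JVM exits
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Loads a library from a classpath resource, e.g. the (assumed) path
    // "/native/linux-x86_64/libcld2.so".
    static void load(String resourcePath) {
        try (InputStream in = NativeLibLoader.class.getResourceAsStream(resourcePath)) {
            if (in == null) {
                throw new IllegalStateException("Not on classpath: " + resourcePath);
            }
            int dot = resourcePath.lastIndexOf('.');
            String suffix = dot >= 0 ? resourcePath.substring(dot) : "";
            System.load(extract(in, suffix).toAbsolutePath().toString());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the binary travels inside the job JAR, each task JVM in a map-reduce job could in principle call `load` once before first use. Whether that is acceptable on a given cluster (temp directory permissions, platform mix) would still need checking.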
Noting also there appears to be a new set of libraries in multiple programming languages, with clearer support for adding new natural languages:
Adding that CommonCrawl have a Java wrapper for CLD2, but it's a bit of a pain to work with as it has to be built locally and doesn't bundle binaries etc. https://github.com/commoncrawl/language-detection-cld2
To summarize what happened with Scots Gaelic, I did work up a contribution to Optimaize but that project appears to be dead: https://github.com/optimaize/language-detector/issues/81
However, IIRC, the detector did not appear to be good at distinguishing Scots Gaelic from Irish Gaelic. The new tricks used by Lingua might pay off.
The library used by Tika already spots Welsh, but needs to be taught to spot Scots Gaelic (gd). Detailed instructions here.
If we can get a reasonable chunk of text from our NLS colleagues, we should be able to add this easily enough. We might also be able to improve the Welsh language detection by providing data from a larger corpus.
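To give a rough idea of what "teaching" a detector from a chunk of text involves: n-gram based detectors such as Optimaize build a frequency profile of short character sequences from the training corpus. The sketch below is a simplified stand-in for that step, not the library's actual training code; the class name `NgramProfile` and the normalisation choices are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical, simplified profile builder for illustration only.
final class NgramProfile {

    // Counts overlapping character n-grams of length n in the text.
    static Map<String, Integer> count(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        // Lower-case and collapse whitespace so layout differences
        // do not split otherwise identical n-grams.
        String normalised = text.toLowerCase(Locale.ROOT)
                .replaceAll("\\s+", " ")
                .trim();
        for (int i = 0; i + n <= normalised.length(); i++) {
            counts.merge(normalised.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }
}
```

Calling `NgramProfile.count(corpusText, 3)` on a large sample would give trigram counts for that language. A larger and more varied corpus gives a more representative profile, which is why more Welsh data might also improve the existing Welsh detection.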
EDIT: It would be interesting to teach it Scots too, but there might be technical barriers as the language detector appears to use two-character ISO 639-1 codes and Scots doesn't have one of those! The same applies to Scottish English, but that would probably be rather hard to spot anyway.
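The code constraint can be checked quickly from Java itself. As a minimal sketch, `Locale.getISOLanguages()` returns the two-letter ISO 639 language codes the JDK knows about, so it can show which candidate languages have one. The class name `IsoCodeCheck` is hypothetical; `sco` is the three-letter code for Scots.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper for checking two-letter language codes.
final class IsoCodeCheck {

    // True if the code is one of the two-letter ISO 639 codes known to the JDK.
    static boolean hasTwoLetterCode(String code) {
        return Arrays.asList(Locale.getISOLanguages()).contains(code);
    }

    public static void main(String[] args) {
        for (String code : new String[] {"gd", "cy", "sco"}) {
            System.out.println(code + " -> " + hasTwoLetterCode(code));
        }
    }
}
```

Here `gd` and `cy` pass the check while `sco` does not, so supporting Scots would mean either extending the detector to accept three-letter codes or mapping it to some non-standard two-letter placeholder.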