ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

Teach Tika to spot Scots Gaelic #94

Open anjackson opened 7 years ago

anjackson commented 7 years ago

The library used by Tika already spots Welsh, but needs to be taught to spot Scots Gaelic (gd). Detailed instructions here.

The training text should be rather clean; it is a good idea to remove parts written in other languages (like English phrases, or Latin-script content in a Cyrillic text, for example). Some also like to remove proper nouns such as (international) place names in case there are too many. It's up to you how far you go. As a general rule, the cleaner the text is, the better its profile will be. If you scrape text from Wikipedia then please only use the main content, without the left-side navigation etc.
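For illustration, the training step with the Optimaize library (the detector that Tika wraps) looks roughly like the sketch below. This is a minimal sketch based on my reading of its public API; the corpus file name and output directory are placeholders, and the exact writer method is worth double-checking against the detailed instructions.

```java
import com.optimaize.langdetect.i18n.LdLocale;
import com.optimaize.langdetect.ngram.NgramExtractors;
import com.optimaize.langdetect.profiles.LanguageProfile;
import com.optimaize.langdetect.profiles.LanguageProfileBuilder;
import com.optimaize.langdetect.profiles.LanguageProfileWriter;
import com.optimaize.langdetect.text.CommonTextObjectFactories;
import com.optimaize.langdetect.text.TextObject;
import com.optimaize.langdetect.text.TextObjectFactory;

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BuildGdProfile {
    public static void main(String[] args) throws Exception {
        // Read the cleaned Scots Gaelic training corpus (placeholder file name).
        String corpus = new String(
                Files.readAllBytes(Paths.get("gd-corpus.txt")), StandardCharsets.UTF_8);

        TextObjectFactory factory = CommonTextObjectFactories.forIndexingCleanText();
        TextObject text = factory.create().append(corpus);

        // Build an n-gram profile keyed to the ISO 639-1 code "gd".
        LanguageProfile profile = new LanguageProfileBuilder(LdLocale.fromString("gd"))
                .ngramExtractor(NgramExtractors.standard())
                .minimalFrequency(5) // drop very rare n-grams from the profile
                .addText(text)
                .build();

        // Write the profile where the detector's loadModels() can pick it up.
        new LanguageProfileWriter().writeToDirectory(profile, new File("profiles"));
    }
}
```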

If we can get a reasonable chunk of text from our NLS colleagues, we should be able to add this easily enough. We might also be able to improve the Welsh language detection by providing data from a larger corpus.
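Once a gd profile is in place, exercising it through Tika's detector API should look something like this; a minimal sketch, assuming the tika-langdetect module and its OptimaizeLangDetector (the sample sentence is just illustrative Scots Gaelic):

```java
import org.apache.tika.langdetect.OptimaizeLangDetector;
import org.apache.tika.language.detect.LanguageDetector;
import org.apache.tika.language.detect.LanguageResult;

public class DetectWithTika {
    public static void main(String[] args) throws Exception {
        // Loads the n-gram profiles bundled with tika-langdetect;
        // Welsh ("cy") ships by default, and "gd" is what this issue would add.
        LanguageDetector detector = new OptimaizeLangDetector().loadModels();
        LanguageResult result = detector.detect("Tha mi a' dol dhan bhùth an-diugh.");
        System.out.println(result.getLanguage()
                + " (reasonably certain: " + result.isReasonablyCertain() + ")");
    }
}
```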

EDIT: It would be interesting to teach it Scots too, but there might be technical barriers as the language detector appears to use two-character ISO 639-1 codes and Scots doesn't have one of those! The same applies to Scottish English, but that would probably be rather hard to spot anyway.
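The two-letter constraint is easy to check against the JDK's own ISO 639 tables; a small sketch using only java.util.Locale:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class IsoCodeCheck {
    public static void main(String[] args) {
        // Locale.getISOLanguages() returns the ISO 639-1 two-letter codes.
        List<String> twoLetterCodes = Arrays.asList(Locale.getISOLanguages());

        // Scots Gaelic has a two-letter code, so the detector can represent it.
        System.out.println(twoLetterCodes.contains("gd"));  // true

        // Scots only has the three-letter ISO 639-3 code "sco", which a
        // two-letter-only detector has no way to express.
        System.out.println(twoLetterCodes.contains("sco")); // false
    }
}
```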

ymaurer commented 7 years ago

I would maybe recommend using CLD2 instead of Tika for language recognition. It has a lot more languages, is more accurate, and is orders of magnitude faster. Gaelic is included:

{"scots_gaelic", "gd", SCOTS_GAELIC + W10, 0}

and so is Welsh:

{"welsh", "cy", WELSH + W10, 0}

https://github.com/CLD2Owners/cld2

For a (quite old) comparison between the different systems, I can offer this article: http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html

anjackson commented 7 years ago

Thanks for this. I'm aware of CLD2 but chose not to use it as it requires bundling a native library rather than being something I can trivially reuse from Java.

Of course this is not insurmountable, but I'm not currently sure how best to package these kinds of dependencies, especially when running map-reduce jobs.
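For what it's worth, the bundling step itself could probably lean on Hadoop's distributed cache; a sketch under the assumption that a libcld2.so has already been compiled for the cluster nodes (the file and class names here are hypothetical):

```java
import java.io.File;

public class NativeCld2Loader {
    static {
        // Hypothetical setup: ship the library alongside the job, e.g.
        //   hadoop jar job.jar ... -files libcld2.so
        // Hadoop symlinks distributed-cache files into each task's working
        // directory, so the library can be loaded from there at start-up.
        System.load(new File("libcld2.so").getAbsolutePath());
    }

    private NativeCld2Loader() {}

    /** Call once before touching any JNI-backed CLD2 wrapper class. */
    public static void ensureLoaded() {
        // Loading this class runs the static initializer; nothing else to do.
    }
}
```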

anjackson commented 1 year ago

Noting also that there appears to be a new set of libraries in multiple programming languages, with clearer support for adding new natural languages: see the Lingua project mentioned below.

anjackson commented 1 year ago

Adding that CommonCrawl have a Java wrapper for CLD2, but it's a bit of a pain to work with as it has to be built locally and doesn't bundle binaries etc.: https://github.com/commoncrawl/language-detection-cld2

anjackson commented 1 year ago

To summarize what happened with Scots Gaelic, I did work up a contribution to Optimaize but that project appears to be dead: https://github.com/optimaize/language-detector/issues/81

However, IIRC, the detector did not appear to be good at distinguishing Scots Gaelic from Irish Gaelic. The new tricks used by Lingua might pay off.
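For reference, Lingua is usable directly from the JVM; a minimal sketch of its API as I understand it. Whether Scots Gaelic is in its language list would need checking, so this restricts the candidate set to IRISH, WELSH and ENGLISH, which I believe are supported (the sample sentence is just illustrative Welsh):

```java
import com.github.pemistahl.lingua.api.Language;
import com.github.pemistahl.lingua.api.LanguageDetector;
import com.github.pemistahl.lingua.api.LanguageDetectorBuilder;

public class LinguaSketch {
    public static void main(String[] args) {
        // Restricting the candidate set to a few closely related languages
        // is where Lingua's combined rule-based and statistical approach
        // is supposed to help over pure n-gram profiles.
        LanguageDetector detector = LanguageDetectorBuilder
                .fromLanguages(Language.IRISH, Language.WELSH, Language.ENGLISH)
                .build();
        Language detected = detector.detectLanguageOf("Mae hen wlad fy nhadau yn annwyl i mi.");
        System.out.println(detected);
    }
}
```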