netarchivesuite / solrwayback

A search interface and wayback machine for the UKWA Solr based warc-indexer framework.
Apache License 2.0

Lack of language recognition for Sámi languages #379

Open joncto opened 1 year ago

joncto commented 1 year ago

We are testing indexing of our archive with this bundle to build a service for researchers. In the initial test, we found that texts in Sámi languages are identified as Estonian (content_language:"et").

We have stopword lists for five Sámi languages that we would like to use, but we are not sure how this is best implemented. Would you be interested in adding support for Sámi languages to the bundle?

All resources are provided by Giellatekno at the University of Tromsø and licensed under the GNU General Public License, version 3:

- Northern Sámi ("sme")
- Lule Sámi ("smj")
- Southern Sámi ("sma")
- Skolt Sámi ("sms")
- Inari Sámi ("smn")

thomasegense commented 1 year ago

Language detection is done in the warc-indexer. See config3.xml next to the warc-indexer. Add the language code if it is not there. I don't know how good the support for Sámi is, though. Also, moving the languages you expect to find most often to the top will speed up indexing a little, since it can match faster.

tokee commented 1 year ago

@joncto language detection is handled by optimaize/language-detector 0.6 and unfortunately it does not seem to support Sámi languages. If you only index content in Sámi languages, you might want to turn off detection. If you have some content in other languages, e.g. English, trimming the langdetectprofiles down to those other languages should reduce the number of false positives.

Stopwords are not supported directly in SolrWayback or webarchive-discovery but can be added to the Solr schema, which should have the intended effect. See the Solr documentation on Stop Filter for details.

The schema.xml provided by the webarchive-discovery project, as well as its mirror in the SolrWayback bundle, uses stopwords for the path field. That filter can serve as a sample for applying a similar one to the text_general field type in the schema, using the lists you link to.
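For illustration, here is a minimal sketch of what such a stop filter could look like, assuming the Northern Sámi list has been saved as stopwords_sme.txt in the same conf directory as schema.xml. The file name and the surrounding analyzer chain are examples only, not the bundle's actual text_general definition:

```xml
<!-- Illustrative field type with a Sámi stop filter added; adapt the existing
     text_general definition in the bundle's schema.xml rather than replacing it. -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- stopwords_sme.txt is an assumed file name: one stopword per line, UTF-8,
         placed next to schema.xml. -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_sme.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Several StopFilterFactory entries (one per Sámi stopword list) can be chained in the same analyzer if all five languages should be covered.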

tokee commented 1 year ago

Addendum: @anjackson and I have talked about making webarchive-discovery more easily extensible for special processing. If you are aware of an external language detector that supports Sámi languages and have time to implement it, this could be a good case for building such an extension mechanism.

If such a detector is available in Java, it should be possible to write a custom webarchive-discovery plugin, although I don't have experience with the process - Andy might be able to give hints here?

anjackson commented 1 year ago

FWIW, when trying to detect Scots Gaelic, I made some notes here: https://github.com/ukwa/webarchive-discovery/issues/94

As noted there, the Optimaize project appears to be dead. There is a newer alternative called Lingua (https://github.com/pemistahl/lingua) that seems to be active, so one option would be to add Sámi support there and then add a Tika or webarchive-discovery module that uses it. If anyone has working Java code, I can try making a suitable module. (cc'ing @tballison in case he has any plans for Tika in this regard...)
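For reference, the basic Lingua API (as shown in the project's README) looks roughly like the sketch below in Java. The three languages are arbitrary examples from Lingua's supported set; Sámi would only become usable here if support were first contributed upstream.

```java
import com.github.pemistahl.lingua.api.Language;
import com.github.pemistahl.lingua.api.LanguageDetector;
import com.github.pemistahl.lingua.api.LanguageDetectorBuilder;

public class LinguaSketch {
    public static void main(String[] args) {
        // Restrict the detector to the languages actually expected in the corpus;
        // fewer candidates means fewer false positives and faster detection.
        LanguageDetector detector = LanguageDetectorBuilder
                .fromLanguages(Language.ENGLISH, Language.FINNISH, Language.ESTONIAN)
                .build();

        // Returns Language.UNKNOWN if no candidate is a reliable match.
        Language detected = detector.detectLanguageOf("text extracted from a WARC record");
        System.out.println(detected);
    }
}
```

A module along these lines would then map the detected language to the code stored in content_language, mirroring what the current Optimaize-based analyser does.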

Here's an example text analyser class. This could be updated to use something else instead of Tika's wrapped version of Optimaize: https://github.com/ukwa/webarchive-discovery/blob/master/warc-indexer/src/main/java/uk/bl/wa/analyser/text/LanguageAnalyser.java

Addendum: The only issue with adding modules to Tika or webarchive-discovery is that if they require complex dependencies, that can become a bit of a nightmare. That's part of the reason why I'm interested in supporting an additional workflow that generates JSONL and/or CDXJ files containing the extracted text in one pass, and then writing more modular tools that consume and enrich those files.