joncto opened this issue 1 year ago
Language detection is done in the warc-indexer. See config3.xml next to the warc-indexer and add the language code if it is not there. I don't know how good the support for Sámi is, though. Also, moving the languages you expect to find most often to the top will speed up indexing a little, since they can be matched faster.
@joncto language detection is handled by optimaize/language-detector 0.6 and unfortunately it does not seem to support Sámi languages. If you only index content using Sámi languages, you might want to turn off detection. If you have some content in other languages, e.g. English, trimming the langdetectprofiles down to those other languages should reduce the number of false positives.
Stopwords are not supported directly in SolrWayback or webarchive-discovery but can be added to the Solr schema, which should have the intended effect. See the Solr documentation on Stop Filter for details.
The schema.xml provided by the webarchive-discovery project, as well as its mirror in the SolrWayback bundle, uses stopwords for the path field. This can be used as a sample for applying a similar filter to the text_general field type in the schema, using the lists you link to.
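As a rough sketch (not taken from the bundled schema), a stopword filter added to text_general could look like the following; stopwords_sme.txt is a placeholder name for one of the Giellatekno lists, placed next to schema.xml in the Solr conf directory:

```xml
<!-- Sketch: text_general with an added Sámi stopword filter.
     "stopwords_sme.txt" is a placeholder; put the Giellatekno list
     (one stopword per line) next to schema.xml in the Solr conf dir. -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_sme.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Multiple words files (one per Sámi language) or separate index- and query-time analyzer sections would also work. Note that a change to index-time analysis only affects documents indexed after the change, so a re-index is needed for existing content.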
Addendum: @anjackson and I have talked about making webarchive-discovery more easily extensible for special processing. If you are aware of an external language detector that supports Sámi languages and have time for an implementation, this could be a fine task to drive such an extension mechanism.
If such a detector is available in Java, it should be possible to write a custom webarchive-discovery plugin, although I don't have experience with that process; perhaps Andy can give hints here?
FWIW, when trying to detect Scots Gaelic, I made some notes here: https://github.com/ukwa/webarchive-discovery/issues/94
As noted there, the Optimaize project appears to be dead. There is a newer alternative called Lingua (https://github.com/pemistahl/lingua) that seems to be actively maintained, so one option would be to add Sámi support there and then add a Tika or webarchive-discovery module that uses it. If anyone has working Java code, I can try making a suitable module. (CCing @tballison in case he has any plans for Tika in this regard...)
Here's an example text analyser class. This could be updated to use something else instead of Tika's wrapped version of Optimaize: https://github.com/ukwa/webarchive-discovery/blob/master/warc-indexer/src/main/java/uk/bl/wa/analyser/text/LanguageAnalyser.java
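For illustration, here is a minimal sketch of what detection via Lingua looks like with its current Java API. Lingua does not ship Sámi models today, so the languages listed below (chosen to match the misdetection mentioned in this issue) are stand-ins until Sámi support is contributed upstream:

```java
import com.github.pemistahl.lingua.api.Language;
import com.github.pemistahl.lingua.api.LanguageDetector;
import com.github.pemistahl.lingua.api.LanguageDetectorBuilder;

public class LinguaDetectionSketch {

    public static void main(String[] args) {
        // Restrict the detector to the languages actually expected in the
        // archive; fewer candidates means fewer false positives (the same
        // idea as trimming langdetectprofiles for Optimaize).
        LanguageDetector detector = LanguageDetectorBuilder
                .fromLanguages(Language.ENGLISH, Language.FINNISH, Language.ESTONIAN)
                .build();

        Language detected = detector.detectLanguageOf("some extracted record text");

        // The ISO 639-1 code is what would feed a field like content_language;
        // Language.UNKNOWN is returned when no candidate matches well enough.
        System.out.println(detected.getIsoCode639_1());
    }
}
```

A replacement analyser along the lines of the class linked above would build the detector once and reuse it per record, since loading the language models is the expensive part.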
Addendum: The only issue with adding modules to Tika or webarchive-discovery is that if they require complex dependencies, that can become a bit of a nightmare. That's part of the reason why I'm interested in supporting an additional workflow that generates JSONL and/or CDXJ files containing the extracted text in one pass, and then writing more modular tools that consume and enrich those files.
We are testing indexing our archive with this bundle in order to build a service for researchers. In the initial test, we found that texts in Sámi languages are identified as Estonian (content_language:"et").
We have stopword lists for five Sámi languages that we would like to use, but we are not sure how this is best implemented. Would you be interested in implementing support for Sámi languages in the bundle?
All resources are provided by Giellatekno at the University of Tromsø and licensed under the GNU General Public License, version 3:
- Northern Sámi ("sme")
- Lule Sámi ("smj")
- Southern Sámi ("sma")
- Skolt Sámi ("sms")
- Inari Sámi ("smn")