Open mielvds opened 4 years ago
Hi @mielvds,
It calculates some multilinguality metrics based on the language tags available in the JSON or XML, such as
dc:subject: [ "library"@en, "bibliotheek"@nl ]
This is a multilingual field value with two languages.
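To make the idea concrete, here is a minimal sketch of counting language tags within one field. It is not the API's actual implementation or its real metrics; the class `LanguageTagCount` and the type `TaggedValue` are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LanguageTagCount {

    /** A literal with an optional language tag, e.g. "library"@en (hypothetical type). */
    static final class TaggedValue {
        final String value;
        final String lang;

        TaggedValue(String value, String lang) {
            this.value = value;
            this.lang = lang;
        }
    }

    /** Counts how many values of a single field carry each language tag. */
    static Map<String, Integer> countByLanguage(List<TaggedValue> values) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (TaggedValue v : values) {
            // Values without a tag are grouped under a placeholder key.
            String key = (v.lang == null || v.lang.isEmpty()) ? "(no tag)" : v.lang;
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The dc:subject example from above.
        List<TaggedValue> subject = Arrays.asList(
            new TaggedValue("library", "en"),
            new TaggedValue("bibliotheek", "nl"));
        System.out.println(countByLanguage(subject));
    }
}
```

A field with more than one key in the resulting map would count as multilingual in this sketch.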
The API calculates metrics at field level and at record level.
Field level metrics:
Record level metrics:
hmm ok, then I can't use it right now. There are no language tags.
I'm looking for language detection basically, because we have fields that are mixed and I want to figure out the distribution.
I once experimented with language detection, and the code still contains a dependency on a library in that area. I stopped at some point because language detection usually did not work well for the very short texts typical of metadata records, but we can start experimenting with it again.
<dependency>
  <groupId>com.optimaize.languagedetector</groupId>
  <artifactId>language-detector</artifactId>
  <version>0.6</version>
</dependency>
Seems this lib has not been developed since 2016 (https://github.com/optimaize/language-detector).
I've done something similar in Python. I can see whether I can implement something similar here. Would that be a LanguageDetectionCalculator, by implementing https://github.com/pkiraly/metadata-qa-api/blob/a104aa3457ff68ffb997615654d77f5f70de7167/src/main/java/de/gwdg/metadataqa/api/interfaces/Calculator.java? It could be an extension of the current LanguageCalculator as well, for example: new LanguageCalculator(schema, true), where true means languages are detected and not extracted from tags.
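A rough sketch of how such a flag could behave, assuming a pluggable detector function. None of this is the real Calculator interface or the existing LanguageCalculator: the class `LanguageDistributionSketch`, the type `Value`, and the toy detector are all hypothetical, and a real detection library would replace the toy function.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

public class LanguageDistributionSketch {

    /** A field value with its text and an optional language tag (hypothetical type). */
    static final class Value {
        final String text;
        final String tag;

        Value(String text, String tag) {
            this.text = text;
            this.tag = tag;
        }
    }

    private final boolean detect;
    private final Function<String, String> detector;

    /** detect = true: call the detector on the text; false: read the existing tag. */
    LanguageDistributionSketch(boolean detect, Function<String, String> detector) {
        this.detect = detect;
        this.detector = detector;
    }

    /** Returns the number of values per language code. */
    Map<String, Integer> distribution(List<Value> values) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Value v : values) {
            String lang = detect ? detector.apply(v.text) : v.tag;
            if (lang == null || lang.isEmpty()) {
                lang = "und"; // "und" is the code for an undetermined language
            }
            counts.merge(lang, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Toy detector for demonstration only; a stand-in for a real library.
        Function<String, String> toy = text -> text.contains("ee") ? "nl" : "en";

        List<Value> untagged = Arrays.asList(
            new Value("library", null),
            new Value("bibliotheek", null));

        System.out.println(new LanguageDistributionSketch(true, toy).distribution(untagged));
        System.out.println(new LanguageDistributionSketch(false, toy).distribution(untagged));
    }
}
```

Keeping the detector behind a function makes it easy to swap libraries once one proves reliable on short metadata values.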
Sounds promising. You might take a look at this discussion, which gives a comparison of some language detection libraries: https://github.com/optimaize/language-detector/issues/107. I am not an expert in this, so there might be other relevant libraries.
Hi @pkiraly
this feature is super interesting! But
Cheers