pkiraly / metadata-qa-api

Metadata Quality Assessment Framework API
http://pkiraly.github.io/
GNU General Public License v3.0

Help with multilinguality #56

Open mielvds opened 4 years ago

mielvds commented 4 years ago

Hi @pkiraly

this feature is super interesting! But

Cheers

pkiraly commented 3 years ago

Hi @mielvds,

It calculates some multilinguality metrics based on the language tags available in the JSON or XML, such as

dc:subject: [ "library"@en, "bibliotheek"@nl ]

This is a multilingual field value with two languages.
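As a rough illustration of what "counting languages from tags" means, here is a minimal sketch. It is not the API's actual code or metric set: the class `LanguageTagCounter`, the `"value"@lang` string notation, and the `(no tag)` placeholder are all assumptions made for this example.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper for illustration only; not part of metadata-qa-api. */
class LanguageTagCounter {

  /** Extracts the tag from a literal written as "value"@lang; returns null if there is none. */
  static String languageOf(String literal) {
    int at = literal.lastIndexOf("\"@");
    if (at < 0 || at + 2 >= literal.length()) {
      return null;
    }
    String tag = literal.substring(at + 2).trim();
    return tag.isEmpty() ? null : tag;
  }

  /** Counts how many instances of one field carry each language tag. */
  static Map<String, Integer> countByLanguage(List<String> literals) {
    Map<String, Integer> counts = new TreeMap<>();
    for (String literal : literals) {
      String lang = languageOf(literal);
      // Untagged values are grouped under a placeholder key chosen for this sketch.
      counts.merge(lang == null ? "(no tag)" : lang, 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    List<String> subject = List.of("\"library\"@en", "\"bibliotheek\"@nl");
    Map<String, Integer> counts = countByLanguage(subject);
    System.out.println(counts);            // {en=1, nl=1}
    System.out.println(counts.size() > 1); // true: more than one language in the field
  }
}
```

A field with more than one distinct tag is multilingual in the sense described above; values without any tag contribute nothing to such a count, which is why tag-based metrics need tagged data.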

The API calculates metrics at the field level and at the record level.

Field level metrics:

Record level metrics:

mielvds commented 3 years ago

hmm ok, then I can't use it right now. There are no language tags.

I'm looking for language detection basically, because we have fields that are mixed and I want to figure out the distribution.

pkiraly commented 3 years ago

I once experimented with language detection, and the code still contains a dependency on a library in that area. I stopped at some point because language detection usually did not work well for the very short texts typical of metadata records, but we can start experimenting with it again.

    <dependency>
      <groupId>com.optimaize.languagedetector</groupId>
      <artifactId>language-detector</artifactId>
      <version>0.6</version>
    </dependency>

It seems this library has not been developed since 2016 (https://github.com/optimaize/language-detector).

mielvds commented 3 years ago

I've done something similar in Python. I can see whether I can implement it here as well.

Would that be a LanguageDetectionCalculator implementing https://github.com/pkiraly/metadata-qa-api/blob/a104aa3457ff68ffb997615654d77f5f70de7167/src/main/java/de/gwdg/metadataqa/api/interfaces/Calculator.java? It could also be an extension of the current LanguageCalculator, for example new LanguageCalculator(schema, true), where true means languages are detected rather than extracted from tags.
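To make the "detect instead of extract" switch concrete, here is a minimal sketch under stated assumptions. None of these types come from metadata-qa-api: `LanguageSourceSketch`, `Value`, and the `Function`-based detector are hypothetical stand-ins, and the real Calculator interface is deliberately not implemented because its methods are not shown in this thread. Unknown languages are reported as `und`, the standard "undetermined" language code.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;
import java.util.function.Function;

/** Hypothetical sketch; it only illustrates the extract-vs-detect flag. */
class LanguageSourceSketch {

  /** One field instance: its text and its language tag (null when untagged). */
  static final class Value {
    final String text;
    final String languageTag;

    Value(String text, String languageTag) {
      this.text = text;
      this.languageTag = languageTag;
    }
  }

  private final boolean detect;
  private final Function<String, Optional<String>> detector;

  LanguageSourceSketch(boolean detect, Function<String, Optional<String>> detector) {
    this.detect = detect;
    this.detector = detector;
  }

  /** Returns the language of one value, or "und" when it cannot be determined. */
  String languageOf(Value value) {
    Optional<String> language = detect
        ? detector.apply(value.text)
        : Optional.ofNullable(value.languageTag);
    return language.orElse("und");
  }

  /** Counts values per language, i.e. the language distribution within a field. */
  Map<String, Integer> distribution(List<Value> values) {
    Map<String, Integer> counts = new TreeMap<>();
    for (Value value : values) {
      counts.merge(languageOf(value), 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    // Toy detector for the demo only; a real one would wrap a detection library.
    Function<String, Optional<String>> toy = text ->
        text.contains("bibliotheek") ? Optional.of("nl") : Optional.empty();

    List<Value> values = List.of(new Value("bibliotheek", null), new Value("library", "en"));
    System.out.println(new LanguageSourceSketch(false, toy).distribution(values)); // {en=1, und=1}
    System.out.println(new LanguageSourceSketch(true, toy).distribution(values));  // {nl=1, und=1}
  }
}
```

Passing the detector in as a function keeps the choice of detection library open, and the explicit `und` bucket makes it visible how often detection fails, which matters for the short strings typical of metadata records.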

pkiraly commented 3 years ago

Sounds promising. You might take a look at this discussion, which gives a comparison of some language detector libraries: https://github.com/optimaize/language-detector/issues/107. I am not an expert in this, so there might be other relevant libraries.