Localisation support for string metadata

janik-martin commented 2 years ago

We want to use LinkML in our ontology workflow: define the schema in YAML, then export a JSON-LD context, a JSON-LD ontology, and probably artefacts for TS or Java...

What we would really like to see is some sort of localisation support for writing class and slot metadata, like title and description, in multiple languages. It would then get exported as @langcode into the resulting ontology. Example:

pper:PhysicalPerson a rdf:Class, owl:Class ;
    rdfs:title "The class of Physical Person"@en ;
    rdfs:title "Trieda fyzickej osoby"@sk ;

This was already discussed a bit with @cmungall. What I can think of is to have external language files, one per localisation, which would be key-value text files: the keys would be the actual strings from the source YAML, and the values would be the translated strings. A general default language setting would also be useful. If a translation is missing in some language, it would simply be missing in the resulting OWL file too. If no language files are provided, the behaviour is unchanged, so people not interested in localisations would not be forced to create a default language file or the like. Also, generators where localisation makes no sense (e.g. TypeScript or Java) would use the strings directly from the YAML, or use the default language property to export strings from that language file.

This way,

the model of cardinality would be untouched (keeping title single-valued),
even existing schemas could get localisations (even maybe just for the most important classes and slots),
and the actual localisation files could be created by some other external tools, which is necessary when managing large schemas with possibly lots of translations.
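A minimal sketch of the language-file idea above, not an existing LinkML feature. The `source = translation` file format and the names `parse_lang_file` and `tagged_literals` are assumptions made up for illustration, and `json.dumps` is used only as a rough stand-in for proper Turtle string escaping.

```python
import json


def parse_lang_file(text):
    """Parse a hypothetical 'source string = translated string' file into a dict."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        source, sep, translated = line.partition("=")
        if sep:  # ignore lines without a separator
            table[source.strip()] = translated.strip()
    return table


def tagged_literals(source, lang_tables, default_lang=None):
    """Return Turtle-style literals for one schema string.

    lang_tables maps a language code to the dict loaded from that language file.
    With no language files the string is emitted untagged (behaviour unchanged).
    A language with no entry for the string is simply left out.
    """
    if not lang_tables:
        return [json.dumps(source, ensure_ascii=False)]
    literals = []
    if default_lang:
        literals.append(json.dumps(source, ensure_ascii=False) + "@" + default_lang)
    for lang in sorted(lang_tables):
        translated = lang_tables[lang].get(source)
        if translated is not None:
            literals.append(json.dumps(translated, ensure_ascii=False) + "@" + lang)
    return literals


sk = parse_lang_file("The class of Physical Person = Trieda fyzickej osoby\n")
print(tagged_literals("The class of Physical Person", {"sk": sk}, default_lang="en"))
```

One limitation of this particular file format is that source strings containing `=` would need escaping; an established format such as gettext PO or XLIFF would avoid that.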

I would love to get some thoughts on this.. Thanks

matentzn commented 2 years ago

This is very premature, but we are currently working on a system to manage language translations for ontologies here: https://github.com/monarch-initiative/babelon. It uses a LinkML model under the hood, but its purpose is not just to capture the translated values, but the entire translations with metadata like provenance etc.

Eventually, this will be used to manage translations for HPO (multilingual: French, Dutch etc). We are currently writing parsers for common translation formats like XLIFF.

I am not quite certain whether our use cases overlap here, but if they do, happy to see if we can align on some things :)

kltm commented 2 years ago

From my experience, translation can be hard, depending on scope and desired infrastructure. I'm curious if there is meant to be a large translation push involved in this, with related infrastructure (thinking of previous HPO work (crowdin.com, IIRC) or efforts like https://launchpad.net/+tour/translation). I was also wondering if other existing infrastructures for software could be leveraged, like gettext translation / po files and the like. (It might be interesting to ping people from the international biohackathon world to get their perspective on what might be useful.)

matentzn commented 2 years ago

Yeah, we do not concern ourselves with the act of translation, nor with pushing to get more translations - Peter is doing something in this direction. Our little project is simply to parse the XLIFF files from crowdin (and various other ad hoc formats like tables) into the internal babelon form, validate it and translate it to RDF. Our scope is small currently, just proportional to our resources. :)

janik-martin commented 2 years ago

Thanks for your comments, @matentzn. Chris already pointed us to your project, and though it's not what I was primarily looking for, I can imagine it could be a 'step' in a tool chain, like exporting the ontology with LinkML and then adding translations with your tool. One downside of this, from my very first thoughts, is dealing with the various output formats for OWL (TTL, JSON-LD, XML..): doing the replacement after the export would mean the tool should be able to add translations in all of these formats. As for the input format of the localisation files, this might be a good point for expanding the functionality (support for multiple formats, maybe direct integration with online translation services etc.). As I get it, this would be in line with my suggested concept (or not?).

matentzn commented 2 years ago

Thank you @janik-martin - as of yet, I am also still uncertain about it all. In any case, I will follow your ideas moving forward, and feel free to loop me in when there are other discussions to be had. Just to be clear: my concern is to capture rich metadata about a translation, and to use LinkML's standard transformation processes to translate between OWL/JSON/RDF syntaxes. I envision these translations being merged into ontologies using other tools such as ROBOT. I have not thought of anything else so far!

ddooley commented 1 year ago

It would be great to get resolution on this. I am getting pressure to have French translations of schema slot titles, descriptions, enumerations, etc. I'd like to have a LinkML schema-centric solution, recognizing that slots often wouldn't have ontology labels associated with their slot_uri values; same for enumeration meanings etc. It seems like a mirror schema that only deals with selected string translations, like schema.fr.yaml, would be possible, based on schema.yaml (English).

ddooley commented 1 year ago

In fact, we now have the go-ahead to make DataHarmonizer bilingual, so making LinkML schemas multilingual is a necessity, hopefully not in too roundabout a way! I thought of having all translatable class, slot and enum entities have their class_uri, slot_uri and meaning properties filled in, with the application then using a JSON lookup table keyed on those URIs for multilingual labels, descriptions etc. However, the same URI can't have different labels in different situations, so that starts to make the schema.[locale].yaml solution look better.
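To illustrate the limitation of a URI-keyed lookup mentioned above, here is a small sketch. The slot names, the `ex:date` URI and the French labels are all hypothetical; the point is only that a table keyed on URIs can hold one label per URI and language, while a table keyed on schema element names (as a per-locale mirror would be) can distinguish the two slots.

```python
# Hypothetical slots: two different fields that happen to share one slot_uri.
slots = {
    "collection_date": {"slot_uri": "ex:date"},
    "received_date": {"slot_uri": "ex:date"},
}

# Option A: labels keyed by URI, so one label per URI and language.
labels_by_uri = {"ex:date": {"fr": "date"}}

# Option B: labels keyed by schema element name, as a per-locale mirror would do.
labels_by_slot = {
    "collection_date": {"fr": "date de prélèvement"},
    "received_date": {"fr": "date de réception"},
}


def label_via_uri(slot_name, lang):
    """Look up a label through the slot's URI."""
    return labels_by_uri[slots[slot_name]["slot_uri"]][lang]


def label_via_slot(slot_name, lang):
    """Look up a label through the slot's own name."""
    return labels_by_slot[slot_name][lang]


# Both slots collapse to the same label when going through the URI.
print(label_via_uri("collection_date", "fr"), "|", label_via_uri("received_date", "fr"))
# Keyed by slot name, each slot keeps its own wording.
print(label_via_slot("collection_date", "fr"), "|", label_via_slot("received_date", "fr"))
```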

ddooley commented 1 year ago

Just an update that we are working on a proposal / prototype for internationalized/multilingual schemas. Basically, we have a default (usually English) schema, and then a sparse mirror of it for all elements that have a translation in another language. Then, when one wants to provide a content view in another language, one overlays the language variant onto the default schema. (Note, however, that stored data with multilingual enum strings must stay stored in the default language.) We'll have a demo of that soon for DataHarmonizer.
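The overlay step described above could look roughly like the following. This is a sketch under stated assumptions, not the DataHarmonizer prototype: the schemas are shown as plain dicts (what loading schema.yaml and schema.fr.yaml would give), `overlay` is a hypothetical helper, and using a `title` field on enum values for the display string is an assumption. Enum keys are left untouched, so stored data stays in the default language.

```python
import copy


def overlay(default, variant):
    """Recursively lay a sparse language variant over the default schema.

    Keys present in the variant replace the default value; everything
    else is inherited. Neither input is modified.
    """
    merged = copy.deepcopy(default)
    for key, value in variant.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = overlay(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Default (English) schema fragment, hypothetical content.
schema_en = {
    "slots": {
        "collection_date": {"title": "collection date", "range": "date"},
        "host": {"title": "host", "range": "HostEnum"},
    },
    "enums": {
        "HostEnum": {"permissible_values": {"Human": {"title": "Human"}}},
    },
}

# Sparse French mirror: only translated strings, nothing structural.
schema_fr = {
    "slots": {"collection_date": {"title": "date de prélèvement"}},
    "enums": {
        "HostEnum": {"permissible_values": {"Human": {"title": "Humain"}}},
    },
}

view_fr = overlay(schema_en, schema_fr)
print(view_fr["slots"]["collection_date"])
print(view_fr["enums"]["HostEnum"]["permissible_values"])
```

Untranslated elements such as the `host` slot fall back to the default-language strings, which matches the "sparse mirror" idea.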

linkml / linkml

Localisation support for string metadata #694