information-artifact-ontology / ontology-metadata

OBO Metadata Ontology
Creative Commons Zero v1.0 Universal

Document the default language in OBO ontologies #128

Open matentzn opened 1 year ago

matentzn commented 1 year ago

In OBO we make the assumption that "no language tag means English". This is fine internally and practical, given the English-language label requirement in our admission process, but it would be prudent to explicitly document the default language (i.e. the language that should be assumed for all literals without a language tag) at the ontology metadata level.

Looking at MOD 2.0,

I think the suggestion is to do this:

<http://purl.obolibrary.org/obo/mondo.owl> dcterms:language "en".

The real range of dcterms:language is http://purl.org/dc/terms/LinguisticSystem, which I don't know how to use, but I think a simple language code should be fine and much easier than some convoluted IRI representing the language. If people are tripped up by this we can do:

<http://purl.obolibrary.org/obo/mondo.owl> dce:language "en".

as well. But I prefer the former, to help phase out the dce namespace.
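To make the intended semantics concrete, here is a minimal sketch of how a consumer might resolve the language of a literal once a default is declared. The function name `effective_language` is hypothetical and not part of any OBO tool; the only assumption taken from the proposal is that an untagged literal falls back to the single ontology-level value.

```python
from typing import Optional


def effective_language(literal_lang: Optional[str],
                       ontology_default: Optional[str]) -> Optional[str]:
    """Return the language to assume for a literal.

    literal_lang: the literal's own language tag, or None if untagged.
    ontology_default: the value declared via dcterms:language on the
        ontology, or None if nothing is declared (hypothetical input).
    """
    if literal_lang:
        # An explicit tag always wins; tags are compared case-insensitively.
        return literal_lang.lower()
    # Untagged literal: fall back to the declared default, if any.
    return ontology_default.lower() if ontology_default else None
```

With "en" declared, an untagged label resolves to "en" while a literal tagged "fr" stays "fr"; with nothing declared, an untagged literal stays unresolved rather than being silently assumed to be English.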

cmungall commented 1 year ago

I support this, but dcterms only. no dce.

what is the expected cardinality? presumably 0..1?

What is the expected behavior of robot merge and extract?

Can we come up with some validation rules?

strawperson:

the objective here is to have predictable behavior for retrieving single-valued properties like label, definition, etc in a multivalued context

matentzn commented 1 year ago

I agree with all that you say! 0..1 cardinality.

What is the expected behavior of robot merge and extract?

Is it important to discuss this here and now? merge is going to be super problematic to get right, but I don't see why we need to deal with extract specifically right now. Another hard part is robot report in this context.

I guess the point I want to make is: the proposal is to document a common practice. If we start tying this to the difficulty of implementing tool support it will become harder to push this issue.

jonquet commented 1 year ago

Hello, MOD suggests using dct:language to identify the languages in which labels can be found inside the ontology. We have also identified doap:language, omv:naturalLanguage, and schema:inLanguage. In AgroPortal we use URI values from Lexvo, e.g., http://lexvo.org/id/iso639-3/eng

Note that there is no "default" natural language property in MOD. In fact, I never really thought about the need to express one (and only one) default language and possibly multiple other ones, probably because a situation can occur where none of the declared natural languages covers the full ontology.

So if the group decides to have a property for the "default language", it also needs to decide on a property for all the other ones. In that case, I would suggest:

mod:defaultLanguage, subproperty of dct:language => to encode the 0..1 default language (new property in MOD or IAO)

dct:language => to encode the other languages

Two notes: In AgroPortal, we need to know all the natural languages of an ontology to implement the multilingual capability (currently being developed => https://github.com/agroportal/project-management/issues/307). Also, as it is not multilingual yet, we have set up the portal with a default language (en).

graybeal commented 1 year ago

The DataCite approach is that if a resource only has one language, that is the only one declared; they do not provide any mechanism to mark one language as 'default' over the others. But I think it is useful to declare a primary language if that is the case (and it is for OBO ontologies). For our metadata files on one project we followed your pattern of generating a 'defaultLanguage' property; that feels like a good solution to me. ("Primary language used to present the data file (if multiple languages are present, the Other Languages field may be used to add additional languages).")

matentzn commented 1 year ago

Thank you @graybeal and @jonquet ; it seems like a property of this kind would be universally useful. If we make it a child of dc:language, I am worried that people start crying about the range violation; do you think @jonquet this would be a problem (maybe we have enough tissues)? I would be fine with it.

It remains to be seen what the right home for it is; mod and omo are certainly possibilities. Any opinions here? I would have thought that skos or perhaps skosxl would have been good homes too, as languages seem to be a universal concern in these domains as well?

matentzn commented 1 year ago

(comment to self: Protégé has had that for many years:

DataPropertyAssertion(<http://protege.stanford.edu/plugins/owl/protege#defaultLanguage> <http://www.co-ode.org/ontologies/pizza/2005/10/18/pizza.owl> "en"^^xsd:string)

)

cmungall commented 4 months ago

Any further thoughts here?

matentzn commented 4 months ago

The easiest way to move forward here is creating an OMO property (I can do it in 5 min), but since we may want to use this for other kinds of semantic artefacts like schemas, I guess the remaining question is: where should we request this property to be added?

jonquet commented 4 months ago

I really recommend going with dct:language or extending it in the mod namespace. MOD is here, waiting for maturity, adopters and contributors. Also, on another track: since December 2023 (my last post was in April) AgroPortal has been multilingual (see: https://doc.jonquetlab.lirmm.fr/share/e6158eda-c109-4385-852c-51a42de9a412/doc/release-notes-btKjZk5tU2) and we rely on http://omv.ontoware.org/2005/05/ontology#naturalLanguage (which in our case was chosen to stay consistent with BioPortal's historical choice to rely on OMV).

alanruttenberg commented 4 months ago

I'm not fond of assuming, in released ontologies, that xsd:string means @en. It seems like a good idea to announce policy with a property as suggested but I wonder whether builds could incorporate a step where they change xsd:strings in annotations known to have language-specific values (definitions, comments, editorial notes, maybe labels) to language tagged literals?

matentzn commented 4 months ago

I wonder whether builds could incorporate a step where they change xsd:strings in annotations known to have language-specific values (definitions, comments, editorial notes, maybe labels) to language tagged literals?

It is an option for the OWL formats - not sure what it will do to the other serialisations like OBO, but for OWL this is definitely an option!
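For the OWL side, the build step discussed above could look roughly like the sketch below. It operates on plain tuples rather than a real OWL API, and the set of "language-specific" annotation properties and the name `tag_untagged_literals` are assumptions made for illustration only.

```python
from typing import Iterable, List, Optional, Tuple

# Assumed set of annotation properties whose values are language-specific.
LANGUAGE_SPECIFIC = {
    "http://www.w3.org/2000/01/rdf-schema#label",
    "http://www.w3.org/2000/01/rdf-schema#comment",
    "http://purl.obolibrary.org/obo/IAO_0000115",  # definition
    "http://purl.obolibrary.org/obo/IAO_0000116",  # editor note
}

# Assumed layout: (subject, property, value, language tag or None)
Annotation = Tuple[str, str, str, Optional[str]]


def tag_untagged_literals(annotations: Iterable[Annotation],
                          default_lang: str = "en") -> List[Annotation]:
    """Add default_lang to untagged values of language-specific properties.

    Literals that already carry a tag, and values of any other property
    (identifiers, cross-references, etc.), are passed through unchanged.
    """
    return [
        (s, p, v, default_lang if lang is None and p in LANGUAGE_SPECIFIC else lang)
        for s, p, v, lang in annotations
    ]
```

Restricting the rewrite to a known property list keeps values such as identifiers untagged; which properties belong on that list would need to be agreed separately.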

matentzn commented 4 months ago

@jonquet I don't mind adding the property to MOD due to its wide applicability beyond OBO, but re "extension" - are you not concerned about the range restriction on dct:language? It is supposed to be https://www.dublincore.org/specifications/dublin-core/dcmi-terms/#LinguisticSystem, which according to their spec is supposed to be a class. To make things easier for us I really think the value of "defaultLanguage" should be an ISO language string, like en, fr, etc.

jonquet commented 4 months ago

The DCT spec declares the range with something more flexible than an RDF range, namely "Range Includes": https://www.dublincore.org/specifications/dublin-core/dcmi-terms/#http://purl.org/dc/terms/language

"Range Includes" is defined here: https://www.dublincore.org/specifications/dublin-core/dcmi-terms/


So to me it is OK to extend (rdfs:subPropertyOf) a DCT property and refine the range, as the range of the super property is "flexible".

As an aside to this discussion: AgroPortal (which uses omv:naturalLanguage for backward compatibility) enforces the use of URIs from Lexvo with ISO-639-1 values. We have tried ISO-639-3, but it's too much stuff and not really used.