Indicators for I2: (meta)data use vocabularies that follow FAIR principles

makxdekkers commented 5 years ago

Points raised in online meeting 3 on 18 June 2019

It is not explicit which of the FAIR principles should be tested, and how, to determine if (and how much) a (meta)data vocabulary complies with FAIR principles.
It is not explicit how to measure the FAIR compliance of vocabularies.
Apply the maturity indicators to the vocabulary itself

GCoen1 commented 5 years ago

Hi Makx, I've been studying this problem at DANS since the end of 2017 as part of a Knowledge Organization Systems (KOS) project with Peter Doorn & Richard Smiraglia. I also work on the topic of 'FAIR Semantics' and interoperability with Yann le Franc as part of the FAIRsFAIR project.

One particularly useful way of viewing this requirement is to consider KOS (a.k.a. semantic artefacts/resources; ontologies; vocabularies; taxonomies; etc.) themselves as digital objects which need to be FAIR. I think that to achieve the above requirement of using "vocabularies that follow the FAIR principles" we need to start with making the semantic stack FAIR, working upwards from Unicode. Simultaneously we need to work with all communities to make sure that their vocabularies begin to move along Tim Berners-Lee's 5-star Open Data model, while also making scientists aware of the available in-use KOS for their domain/need. Registries of KOSs such as BARTOC.org and OBO Foundry need to be integrated into the researcher's toolbox, and the citation of all resources (Data, Software, KOSs, etc.) used to develop a digital object (scholarly output) needs to become the norm. We should also encourage the developers of these resources to adopt an as-yet-undefined set of standards and policies related to issues such as version control, stability (in terms of update frequency), maintenance/sustainability, and archival.

Translating the FAIR metrics into meaningful resource-specific metrics for KOS is already underway at DANS. Of course, machine/automated assessment, monitoring and validation of the FAIRness of KOS should be the goal for EOSC. If I can provide you with something more concrete related to the I in FAIR, please let me know.

Gerry Coen

makxdekkers commented 5 years ago

Many thanks Gerry! I think you are right that FAIRness of KOS needs to be addressed first and foremost by the developers and maintainers of KOSs. Your work on this at DANS should be very useful in defining the set of standards and policies that will ensure that KOSs become as FAIR as possible. Links to your work would be very helpful.

One thing that may need to be considered, however, is how much can be expected from creators and curators of metadata for datasets. It seems to me that requiring that all metadata be based on fully FAIR KOS might be too much to ask. If, as you write, many KOSs currently still need to adopt FAIR standards and policies, is there maybe a subset of FAIR principles that, for the moment, could be sufficient when evaluating the FAIRness of the metadata of a dataset? For example, at least being publicly accessible and free to use?

I was thinking that, if the requirement is not easy to satisfy, the result could be that very few data/metadata can be considered FAIR, or that metadata creators and curators could be discouraged from using vocabularies that are widely recognised and/or useful for the (re)user, if those KOS somehow fail to be considered fully FAIR.

So, is it possible to identify the crucial aspects of FAIRness of KOSs that could be used as an indicator in the evaluation of the FAIRness of the dataset/metadata that uses the KOS?

makxdekkers commented 5 years ago

This is also related to issue #14.

GCoen1 commented 5 years ago

The DANS KOS Observatory is a student project of mine and there is a journal article and two conference papers pending for that. For the observatory data itself I am working on making it searchable (and Open & FAIR). The project has many moving parts in the context of the growing FAIR environment. I think that for the I in FAIR it is too early to talk about requirements and assessment in any strict sense of the word, but guidelines and advice are definitely a good start.

For establishing whether a 'vocabulary' is FAIR maybe we could begin by considering something like:

Findable: F4 - Is published in an established registry e.g:

BARTOC.org http://bartoc.org/ | Basel Register of Thesauri, Ontologies & Classifications (of which I am an editor).
Linked Open Vocabularies (LOV) https://lov.linkeddata.es/dataset/lov | Provides a choice of several hundreds of LOD vocabularies, based on quality requirements including URI stability and availability on the Web, use of standard formats and publication best practices, quality metadata and documentation, identifiable and trustable publication body, proper versioning policy.
BioPortal http://bioportal.bioontology.org/ | the world's most comprehensive repository of biomedical ontologies
FINTO http://finto.fi/en/ | a Finnish thesaurus and ontology service, which enables both the publication and browsing of vocabularies.
Heritage Data http://www.heritagedata.org/blog/vocabularies-provided/ | Linked Data Vocabularies for Cultural Heritage.
Conservation controlled vocabularies https://www.ligatus.org.uk/lcd/controlled-vocabularies | by Linked Conservation Data consortium
Library of Congress Linked Data Services http://id.loc.gov/ | – Authorities and Vocabularies
EU Vocabularies https://publications.europa.eu/en/web/eu-vocabularies | Access to vocabularies managed by the EU institutions and bodies. This includes controlled vocabularies, schemas, ontologies, data models, etc. (The team of Denis Dechandon is also working on multilingual knowledge graphs and semantic interoperability).
Getty Vocabularies LOD http://vocab.getty.edu/ | Provides multiple Getty vocabularies, (AAT, TGN, and ULAN), with a comprehensive list of query templates and documentation.
Metadata Registry http://metadataregistry.org/vocabulary/list.html | The Registry provides a means to identify, declare and publish, through registration, metadata schemas (element/property sets), schemes (controlled vocabularies) and Application Profiles (APs).
OBO Foundry http://www.obofoundry.org/ | The Open Biological and Biomedical Ontology (OBO) Foundry is a collective of ontology developers that are committed to collaboration and adherence to shared principles.

**The previous version of DataHub https://old.datahub.io/dataset still contains a large registry. I don't know why this has become the 'old' version. Incidentally DANS has set out in its 2018 Research Programme the ambition to become a reference archive for "Endangered Knowledge Organization Systems" meaning KOS at risk of disappearing due to link rot, or lost access to knowledge due to content shift/drift & semantic drift.

Accessible: A1.1. (but not A1.2) - Follows the W3C RDF 1.1 Semantics Recommendation https://www.w3.org/TR/rdf11-mt/ and implements the EU best practice guidelines on persistent Uniform Resource Identifiers (URIs) https://joinup.ec.europa.eu/collection/semantic-interoperability-community-semic/document/10-rules-persistent-uris

Interoperable: I1 & I3 - Is published in a machine-readable format which uses W3C standard RDF mapping schemas:

RDF and RDFS https://www.w3.org/TR/rdf11-primer/
W3C OWL stack (OWL 2 / Full; OWL 2 / EL; OWL 2 / QL; OWL 2 / RL) https://www.w3.org/TR/owl2-profiles/
SKOS https://www.w3.org/TR/skos-primer/

Reusable: R1 (R1.1 & R1.2) - Has metadata published according to the "Networked Knowledge Organization Systems Dublin Core Application Profile (NKOS AP)". The metadata schema to describe knowledge organization systems/services (KOS) resources, such as thesauri, classification schemes, subject heading systems, taxonomies, and ontologies. http://nkos.slis.kent.edu/nkos-ap.html
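
As a rough illustration only, the starting criteria listed above could be recorded per vocabulary as a simple checklist. This is a minimal sketch, not an agreed assessment method: the class name `KosChecklist`, its field names and the example values are all hypothetical, and each answer is assumed to be supplied by a human assessor.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class KosChecklist:
    """Hypothetical record of the starting criteria suggested above for one KOS."""

    f4_listed_in_registry: bool                   # F4: published in an established registry
    a1_1_rdf_semantics_and_persistent_uris: bool  # A1.1: RDF 1.1 Semantics, persistent URIs
    i1_i3_machine_readable_rdf_format: bool       # I1 & I3: RDF/RDFS, OWL or SKOS
    r1_metadata_follows_nkos_ap: bool             # R1: metadata according to NKOS AP

    def unmet(self) -> List[str]:
        """Return the names of the criteria that are not yet satisfied."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Invented example: a vocabulary that meets everything except the NKOS AP metadata.
example = KosChecklist(True, True, True, False)
print(example.unmet())
```

Reporting the unmet criteria, rather than a single yes/no, fits the point made earlier in this comment that guidelines and advice are a better starting point than strict requirements.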

About being "free to use": at EOSC18 Barend Mons discussed this point, highlighting that 'FAIR data is: Not a standard; Not equal to Open or Free; and Not equal to the Semantic Web'. Pulling this apart is very complicated when you consider RDF, which is why I commented 'not A1.2' above.

The pragmatic approach should be to ensure KOS are always cited/referenced etc whenever they are used. That starts the process of building a graph to establish at least that the KOS are widely recognised/accepted within communities. There is a heavy social aspect to the challenge.

For FAIRsFAIR we will have a colocated event on FAIR Semantics at RDA P14 on the 22nd and I think there will also be a related session by the Vocabulary Services IG.

I hope someone else pops up to comment on this as well.

makxdekkers commented 5 years ago

@GCoen1 I understand a possible approach could be to identify the principles F4, A1.1, I1, I3, R1.1 and R1.2 as relevant for the KOS used. The question still remains what to do with KOSs that do not meet these requirements, but are still understood and relevant for the target audience of a dataset. Would a dataset that is described with a not fully FAIR vocabulary be, as a result, not FAIR, or should the use of fully FAIR vocabularies be recommended but not mandatory?

keithjeffery commented 5 years ago

A superb discussion - thanks. Just one point: however good the W3C recommendations (like RDF) are, we must always leave open the ability to use other technologies for KOS (semantics) and for metadata syntax. In some systems, for performance reasons, a syntax using extended relational or object-oriented technology (or even Prolog) is preferable. For both metadata syntax and semantics, the use of n-tuples rather than triples can be a great advantage, especially when dealing with temporal aspects (the time period during which an assertion is true) and vocabulary cross-walks (because of the richness of relationship roles, which can even be modal, for example adding probabilities of equivalence). This all links, of course, with provenance and curation; instead of having vocabularies 'outside' the main metadata storage structure, they can be 'inside' and integrated.
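
To make the n-tuple idea concrete, here is a minimal sketch of an assertion that extends a subject/role/object triple with a validity period and a probability of equivalence. It is an illustration under stated assumptions, not a description of any existing system: the class `Assertion`, its field names, the `vocabA:`/`vocabB:` terms and the role name `closeMatch` are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Assertion:
    """A hypothetical n-tuple: a triple plus temporal and modal components."""

    subject: str
    role: str                          # relationship role between the two terms
    obj: str
    valid_from: Optional[date] = None  # None means no known start
    valid_to: Optional[date] = None    # None means still valid (open-ended)
    probability: float = 1.0           # assumed probability that the relation holds

    def holds_on(self, day: date) -> bool:
        """True if `day` falls inside the period during which the assertion is true."""
        starts_ok = self.valid_from is None or self.valid_from <= day
        ends_ok = self.valid_to is None or day <= self.valid_to
        return starts_ok and ends_ok


# Invented cross-walk entry between two invented vocabularies.
crosswalk = Assertion(
    subject="vocabA:river",
    role="closeMatch",
    obj="vocabB:watercourse",
    valid_from=date(2015, 1, 1),
    valid_to=date(2019, 12, 31),
    probability=0.8,
)
print(crosswalk.holds_on(date(2017, 6, 1)), crosswalk.holds_on(date(2020, 1, 1)))
```

In a pure triple model the validity period and the probability would each need extra statements about the statement; here they are simply further positions in the tuple.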

makxdekkers commented 5 years ago

@keithjeffery How do you feel about @GCoen1's suggestion to limit the FAIR principles for KOS to F4, A1.1, I1, I3, R1.1 and R1.2? Or would you suggest other characteristics to be relevant?

keithjeffery commented 5 years ago

Makx – Frankly, having spent some time thinking about this, I am not happy. I believe any associated KOS has to satisfy all the FAIR principles. If a vocabulary system is integrated with the formal syntax metadata (i.e. the ‘descriptive’, ‘administrative’ and ‘structural’ metadata) then it would have to obey all the FAIR principles, so the same should apply to external KOS. Best, Keith


makxdekkers commented 5 years ago

@keithjeffery Thanks. It seems to me that the preliminary conclusion of the discussion so far is that both indicators should be retained.

The next question is whether it is mandatory for both indicators to be satisfied. As I wrote above in https://github.com/RDA-FAIR/FAIR-data-maturity-model-WG/issues/24#issuecomment-506515117, it might not be useful to say that FAIR data MUST use FAIR vocabularies in all circumstances, especially as there could be very few vocabularies that are fully FAIR. It could be a situation that we want to encourage but maybe it is something that can be mandated at a future time, and can be a recommendation at this time. Again, to be practical.

keithjeffery commented 5 years ago

Makx – I agree that is a practical and feasible way forward. Best, Keith


GCoen1 commented 5 years ago

Just to be clear, as per my original message and follow-up comment "I think that for the I in FAIR it is too early to talk about requirements and assessment in any strict sense of the word, but guidelines and advice are definitely a good start." I certainly don't agree with limiting FAIR for KOS - not sure how that came across. My view is that KOS are the linchpin to the system.

There needs to be some flexibility in the transition period while FAIR policies and guidelines for KOS can be developed, adopted and implemented but ultimately all digital objects should reach 100% FAIRness, including KOS.

The point on leaving the door open for other FAIR semantic technologies is also crucial for innovation. Particularly considering EOSC, we need to be mindful not to become path dependent. I think that benchmarking could play an important role in this respect. The 5 levels for FAIRness compliance don't need to be absolute. They could be benchmarks based on the currently existing Semantic Web and Linked Data technologies and standards, and updated 'regularly'. The measures should therefore also aim to be technology agnostic as long as they achieve the FAIR principles.

micheldumontier commented 5 years ago

Some providers create their data and metadata using multiple vocabularies, and each vocabulary may exhibit a different level of FAIRness. It would be good for a maturity indicator to recognize that an all-or-nothing approach might discourage the use of not fully FAIR vocabularies, rather than engaging in a discussion with the vocabulary maintainers about what they could do to improve the FAIRness of the vocabulary, as a means to increase the FAIRness of the provider's own resource.

makxdekkers commented 5 years ago

@micheldumontier If I understand correctly, you agree that the indicator requiring fully FAIR vocabularies should not be mandatory but recommended (satisfied if possible, but not absolutely necessary).

micheldumontier commented 5 years ago

@makxdekkers Ideally, i) we use shared vocabularies, and ii) those vocabularies are fully FAIR. The main rationale for including FAIR vocabularies in the FAIR principles was precisely to get users to become direct stakeholders in the quality (FAIRness) of the vocabularies they use, and to become vocal about how these resources need to be further developed in order to maximize the FAIRness of the digital resources they produce. Therefore, I would aim for an indicator that distinguishes the lack of use of shared vocabularies, the use of some FAIR vocabularies, and the use of fully FAIR vocabularies.

markwilkinson commented 5 years ago

The maturity indicator tests I have written attempt to do what Michel is suggesting. For example, one test takes a survey of all of the metadata properties discovered in a record, and then polls each of them to determine if it is, itself, 'FAIR' (at least in some primitive way; for me, this means resolvable to something that is machine-processable). From this poll, it generates a ratio of FAIR to non-FAIR properties and reports that as its output. Pass/Fail is (for me) arbitrarily set at 50% (I think).

In any case, it is the report that is most informative for the provider.
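
The approach described above can be sketched roughly as follows. This is not the actual test code, only a minimal illustration under stated assumptions: the function `survey_properties`, the `is_machine_processable` callback, and the stub that stands in for a real resolution check are all hypothetical, the ratio is expressed as the share of distinct properties that pass, and the 50% threshold is treated as inclusive.

```python
from typing import Callable, Dict, Iterable


def survey_properties(
    property_uris: Iterable[str],
    is_machine_processable: Callable[[str], bool],
    pass_threshold: float = 0.5,  # assumed: the "arbitrary" 50% mentioned above
) -> Dict[str, object]:
    """Poll each distinct property URI and summarise the outcome as a report."""
    distinct = sorted(set(property_uris))
    per_property = {uri: bool(is_machine_processable(uri)) for uri in distinct}
    fair_count = sum(per_property.values())
    share = fair_count / len(distinct) if distinct else 0.0
    return {
        "per_property": per_property,  # the detailed, most informative part
        "fair": fair_count,
        "non_fair": len(distinct) - fair_count,
        "fair_share": share,
        "passed": share >= pass_threshold,
    }


# Stand-in for a real check (which would try to resolve each URI and parse the
# response); which URIs "resolve" here is simply asserted for the example.
ASSUMED_RESOLVABLE = {
    "http://purl.org/dc/terms/title",
    "http://purl.org/dc/terms/creator",
}


def stub_check(uri: str) -> bool:
    return uri in ASSUMED_RESOLVABLE


record_properties = [
    "http://purl.org/dc/terms/title",
    "http://purl.org/dc/terms/creator",
    "http://example.org/local#colour",  # invented local property
    "http://purl.org/dc/terms/title",   # repeated use counts once
]
report = survey_properties(record_properties, stub_check)
print(report["fair"], report["non_fair"], report["passed"])
```

Keeping the per-property results in the output reflects the remark that the report, more than the pass/fail flag, is what helps the provider.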

makxdekkers commented 5 years ago

@markwilkinson You introduce a notion of 'primitive' FAIRness for vocabularies -- at least resolving to a machine-understandable representation. Is this something that could be expressed as an indicator?

SusannaSansone commented 5 years ago

Why do we require a vocabulary to be FAIR, but not any other schema/community standard?

Regardless, this also nicely links with the work of the RDA FAIRsharing WG registry, which is now one of the formally approved RDA outputs.

FAIRsharing works to ensure that these resources are Findable (e.g., by providing DOIs), Accessible (e.g., identifying their level of openness and licence type), encouraged to be Interoperable (e.g., highlighting which repositories implement the same standards to structure and exchange data), and Reusable (e.g., knowing the coverage of a standard and its level of endorsement by a number of repositories should encourage its use or extension in neighbouring domains, rather than reinvention). More details in #29 too.

makxdekkers commented 5 years ago

@SusannaSansone I think that the 'vocabularies' mentioned in principle I2 can include both 'object vocabularies' -- the set of values to be used for a metadata element -- and 'predicate vocabularies' -- the set of descriptors or properties in a standard or schema. I see that the FAIRsharing registry includes both types of vocabularies, e.g. listing the AGRIS Application Profile alongside the AGROVOC controlled vocabulary. Are you proposing that the indicator for FAIRness of vocabularies, i.e. the second one at https://github.com/RDA-FAIR/FAIR-data-maturity-model-WG/issues/24#issue-459853125, should be a strong requirement, meaning that data can't be FAIR if the vocabularies used are not fully FAIR?

bahimc commented 5 years ago

Please find the current version of the indicator(s) and their respective maturity levels for this FAIR principle. Indicators and maturity levels will be presented, as they stand, to the next working group meeting for approval. In the meantime, any comments are still welcomed.

The editorial team will now concentrate on weighing and prioritizing these indicators. More information soon.

SusannaSansone commented 5 years ago

@makxdekkers I am saying that we need to be consistent with any standards ('object vocabularies' = models/formats, vocabularies, etc.), and clearly we use different labels and definitions. But also, what exactly does a 'FAIR vocabulary' mean? We need to define this.

makxdekkers commented 5 years ago

@SusannaSansone You are right, there is no consensus yet on how deep FAIRness for vocabularies needs to go. In the discussion above, we can see different opinions on what FAIRness means for vocabularies:

@GCoen1 : "There needs to be some flexibility in the transition period while FAIR policies and guidelines for KOS can be developed, adopted and implemented but ultimately all digital objects should reach 100% FAIRness, including KOS." I take this to mean that ultimately all KOS should meet all FAIR criteria.
@markwilkinson : "to me, this means resolvable to something that is machine-processable", which is a less demanding requirement.
@micheldumontier : "it would be a good for maturity indicator to recognize that the all or nothing might discourage the use of not fully FAIR vocabularies, rather than engaging in a discussing with them about what they could do to improve the FAIRness of the vocabulary", being more of a route to FAIRness over time.

What is your opinion?

SusannaSansone commented 5 years ago

@makxdekkers To me a 'standard vocabulary' is one that has either been defined and used by a given community, and has thereby become a de facto standard, or been created by a standards organization, and is thereby a de jure standard. For me a 'FAIR vocabulary' or more generally 'FAIR standard' (see #14 for my definition of community standards: minimal reporting requirements; terminologies; models/formats) is one that is findable (e.g. listed in FAIRsharing), accessible (e.g. it has a licence so that I know if and how I can extend it), interoperable and reusable (e.g. machine-processable, as Mark says, in one of the many metaformats). I also agree with Michel that not all vocabularies, and not all other standards, are fully FAIR, so we need to leave a level of flexibility to start with.

makxdekkers commented 5 years ago

Peter Wittenburg at https://docs.google.com/spreadsheets/d/1mkjElFrTBPBH0QViODexNur0xNGhJqau0zkL4w8RRAw/edit?disco=AAAADadg-U0

On I2-01D and I2-02D: Again, in the case of big data (structured sequences of numbers) this does not make sense, unless it is explained what is meant.

ylefranc commented 5 years ago

Hi Makx, Hi all,

This discussion is really interesting! I totally agree with @SusannaSansone that we should start defining what a FAIR vocabulary/ontology is. Currently a lot of existing semantic artefacts (thesauri, ontologies, controlled vocabularies, ...) are definitely not FAIR. In parallel with defining what FAIR Semantics is, I strongly believe that we should build simple recommendations for vocabulary and ontology developers to support the creation of FAIR-by-design or "born FAIR" ontologies. For this purpose, we are organising a co-located workshop at RDA P14 (https://www.fairsfair.eu/events/building-data-landscape-future-fair-semantics-and-fair-repositories) to brainstorm on these issues and also discuss how to evaluate the FAIRness of vocabularies/ontologies. Registration for this event is open. It would be great to have your input.

makxdekkers commented 5 years ago

@ylefranc Simple recommendations for FAIRness of the development of new semantic artefacts would be really helpful. However, the main problem as I see it at this point in time is that people use semantic artefacts (thesauri, ontologies, controlled vocabularies, etc.) for which it is not clear how FAIR they are. This group has not reached consensus on which FAIR principles are the most important for semantic artefacts to satisfy (see https://github.com/RDA-FAIR/FAIR-data-maturity-model-WG/issues/24#issuecomment-518018369). Maybe your work could also look at how existing semantic artefacts could be made FAIR(er)?

bahimc commented 5 years ago

Dear contributors,

Below you can find the indicators and their maturity levels in their current state as a result of the above discussions and workshops.

Please note that this thread is going to be closed within a short period of time. The current state of the indicators, as of early October 2019, is now frozen, with the exception of the indicators for the principles that are concerned with ‘richness’ of metadata (F2 and R1). The current indicators will be used for the further steps of this WG, which are prioritisation and scoring. Later on, they will be used in a testing phase where owners of evaluation approaches are going to be invited to compare their approaches (questionnaires, tools) against the indicators. The editorial team, in consultation with the Working Group, will define the best approach to test the indicators and evaluate their soundness. As such, the current set of indicators can be seen as an ‘alpha version’. In the first half of 2020, the indicators may be revised and improved, based on the results of the testing. If you have any further comments or suggestions regarding this discussion, please share them with us. In addition, we invite you to have a look at the following two sets of issues.

Prioritisation

• Indicators prioritisation for Findability
• Indicators prioritisation for Accessibility
• Indicators prioritisation for Interoperability
• Indicators prioritisation for Reusability

Scoring

• Indicators for FAIRness | Scoring

We thank you for your valuable input!

GCoen1 commented 4 years ago

@nichtich See conversation above. Determining indicators could be considered like the top-down approach for FAIR verification. Together with @ylefranc who is also in the GO INTER group we work the other way to see how 'semantic artefacts' could be developed/modified to be FAIR which has direct consequences for registries and repositories of these resources. I think it is more useful to go together through the newly published recommendations by Yann for FAIRsFAIR - they are v1.0 of a planned 3 versions.
