makxdekkers closed this issue 5 years ago
Points raised in online meeting 3 on 18 June 2019
To add a little more: there are many ways of recording provenance such that it can be managed automatically. The W3C PROV recommendations are not the only way. In fact, provenance information in data models other than PROV has been used for a long time in many research domains, since it is commonly critical for evaluating the re-usability (relevance, quality) of the asset. For many instances PROV is insufficient; commonly researchers need to know - in the provenance information - not only that the asset was accessed/modified but also the wider context, including e.g. the observational or experimental equipment used, its parameters (accuracy, precision, calibration), the associated methodology (lab notebook, observing diary), links to relevant publications (grey as well as white)...
@keithjeffery I am not sure the aspects you mention cannot be satisfied using PROV. As I understand it, PROV is very flexible with its Expanded and Qualified terms and might be able to express all of that. On the other hand, I think no-one is proposing (yet) for an indicator to reference PROV-O specifically.
How could an indicator be formulated? Could it enumerate some critical provenance items (like the ones you list), or should we link to existing standards/guidelines that could form the basis for the indicator? If so, which standards/guidelines would be candidates for such a reference?
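As an illustration of the "Expanded and Qualified terms" mentioned above, the kind of instrument-level context Keith describes can be attached to a PROV-O qualified usage. A minimal Turtle sketch, in which the `ex:` namespace and all resource names are hypothetical:

```turtle
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/> .   # hypothetical namespace

ex:dataset-42 a prov:Entity ;
    prov:wasGeneratedBy ex:observation-run-7 .

ex:observation-run-7 a prov:Activity ;
    prov:used ex:spectrometer-1 ;
    prov:qualifiedUsage [
        a prov:Usage ;
        prov:entity ex:spectrometer-1 ;
        # domain-specific detail attached via ordinary RDF properties
        ex:calibrationDate "2019-05-01"^^xsd:date ;
        ex:accuracy "0.01 nm"
    ] .

ex:spectrometer-1 a prov:Entity ;
    rdfs:label "Spectrometer (hypothetical instrument)" .
```

Whether this is sufficient for every research domain is exactly the open question in this thread; the sketch only shows that instrument parameters are expressible alongside PROV terms.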
Makx – My preference always is not to be prescriptive (defining which standard to use) but analytical (defining which constraints have to be satisfied) – after all there are ‘many ways to skin a cat’ and whatever method is used is more-or-less irrelevant as long as it meets the objectives. So I would say a mechanism for provenance has to satisfy certain (to be defined) criteria. Best Keith
@keithjeffery Let's see if there are suggestions for those criteria from others in the WG.
What we expect is that communities identify what provenance information is crucial to the understanding of the digital resource. Of course, we can expect general properties (e.g. who created the resource, when it was created, etc.), but there will also be provenance specific to the kind of object (e.g. which instrument was used, what chip array was used, what detector was used). We expect that in many cases communities have already specified elements of provenance in their own data formats... FAIR then simply asks that it be mapped to more general-purpose provenance languages such as PROV.
@micheldumontier Are you suggesting that an indicator be added for the mapping of object-specific or domain-specific provenance items to a more general-purpose provenance language? E.g.
Mapping of object-specific or domain-specific provenance information
I think we need to be very very careful about being prescriptive (either negatively, or positively) about any metadata element, when acting as a high-level working group. As I said during the call, a piece of ancient pottery doesn't have an author. Nor does a mammoth fossil. Nor does an animal in a zoo. Relevant metadata elements cannot be predicted, and therefore IMO, should not be within the scope of a high-level working group.
If I were to "invent a metric" for R1.2 (which I have been avoiding!! ...BECAUSE I think it is absolutely none of my business to do so! It's a community-level task!).... I would design something like this:
1) Collect all of the traditional provenance-style metadata elements (DC, DCT, DCAT, PROV, etc.), and then do a count of how many of these are being used by the Resource - a larger number is "better"
2) Of the remaining metadata elements used by the resource, do a profile of how many ontologies are being used (both in predicates and in object-position) in this metadata - where a larger number is "better".
3) A given community, using their own internal use-cases, can come to some decision about what those numbers should be, to represent "pass" vs "fail" in their context.
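The counting idea in steps 1 and 2 above could be sketched roughly as follows. This is a hypothetical illustration, not a proposed implementation: the list of "traditional provenance-style" namespaces and the sample record are invented for the example, and a real metric would need the community-agreed thresholds from step 3.

```python
# Hypothetical sketch of the "count provenance-style elements" idea.
# The namespace list and sample record are illustrative only.

PROVENANCE_NAMESPACES = (
    "http://purl.org/dc/elements/1.1/",   # DC
    "http://purl.org/dc/terms/",          # DCT
    "http://www.w3.org/ns/dcat#",         # DCAT
    "http://www.w3.org/ns/prov#",         # PROV
)

def provenance_profile(predicates):
    """Split a record's predicate URIs into provenance-style vs other."""
    prov = [p for p in predicates if p.startswith(PROVENANCE_NAMESPACES)]
    other = [p for p in predicates if p not in prov]
    return len(prov), len(other)

record = [
    "http://purl.org/dc/terms/creator",
    "http://purl.org/dc/terms/issued",
    "http://www.w3.org/ns/prov#wasDerivedFrom",
    "http://example.org/domain#instrumentUsed",  # domain-specific
]
# provenance_profile(record) -> (3, 1)
```

The second count (step 2) would additionally group the "other" predicates by ontology namespace; the pass/fail threshold for both numbers is deliberately left to each community.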
Vis-à-vis mapping: I like the idea of mapping, though I'm loath to encourage communities to continue to create new vocabularies that represent existing concepts. There's also the problem of providing a common way for agents to discover mappings - so then we end up (potentially) inventing new standards for how to publish mapping resources... which the communities then have to build (and may not have the expertise to do so, depending on how they are implemented). Mapping isn't really a trivial problem - just ask those who have spent their careers doing schema-mapping in databases and XML ;-) Nevertheless, if we had mapping-made-easy (something similar to what identifiers.org does for mapping between GUIDs of the same thing in different databases) then I am OK with this idea. Anything harder than that, I suspect, would not be sustainable. (It isn't even clear whether identifiers.org is sustainable.)
@Mark - I agree that a piece of pottery does not have an author, but it has relationships with persons: the creator (maybe unknown), the finder, the curator, the owner (maybe); with organisations (e.g. the museum); with documents (e.g. a scholarly paper or grey literature); and so on, all of which can be expressed in rich metadata. I suggest we must stop thinking in terms of attributes or properties of an asset described as metadata (DC-think) and more of relationships around an asset described as metadata. This is what RDF tries to do (encoded as Turtle or something similar).
@markwilkinson I understand your reluctance to prescribe a particular set of provenance descriptors, because it very much depends on the type of resource and the community in which the resource is used. In that sense, it could be left to community-specific guidelines. This creates maximum potential for reuse within that community. In addition, asking for mapping -- as much as possible and relevant -- to a general-purpose provenance ontology could be useful for potential cross-domain reuse. It's true that mapping is not trivial, but even if the mapping is incomplete and lossy, it could still be helpful.
It seems to me that the indicators given in the first comment above, which were based on the contributions in the collaborative document, are probably too specific. Maybe we could propose two new ones:
R1.2-01 Provenance information based on community-specific guidelines relevant for the resource
and
R1.2-02 Mapping of object-specific or domain-specific provenance information to a cross-domain language
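The kind of mapping R1.2-02 asks for could, in the simplest case, be a lookup table from domain-specific provenance fields to general-purpose terms. A hypothetical, deliberately lossy sketch (all field names and the mapping choices are invented for illustration):

```python
# Hypothetical mapping from a domain-specific provenance record to
# general-purpose terms (PROV-O / Dublin Core URIs); field names are invented.

DOMAIN_TO_GENERAL = {
    "observer":        "http://purl.org/dc/terms/creator",
    "observationDate": "http://www.w3.org/ns/prov#generatedAtTime",
    "sourceDataset":   "http://www.w3.org/ns/prov#wasDerivedFrom",
    # "detectorModel" has no general equivalent and stays domain-side only
}

def map_provenance(record):
    """Return (mapped, unmapped) views of a domain-specific record."""
    mapped = {DOMAIN_TO_GENERAL[k]: v for k, v in record.items()
              if k in DOMAIN_TO_GENERAL}
    unmapped = {k: v for k, v in record.items() if k not in DOMAIN_TO_GENERAL}
    return mapped, unmapped

mapped, unmapped = map_provenance({
    "observer": "J. Smith",
    "observationDate": "2019-06-18",
    "detectorModel": "XYZ-100",
})
```

Even this incomplete mapping would let an agent from another domain recover the creator and derivation chain, which is the cross-domain benefit argued for above.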
@Makx - I would be content with your proposal as long as the last bullet is not prescriptive (fashion in standards changes with time - right now PROV is popular, but there are other general mechanisms).
@keithjeffery The last bullet says 'e.g.' so it's not prescriptive. Would you have another example that could be included alongside PROV-O?
@Makx - Agreed. In EPOS we use CERIF, of course, for all aspects of metadata (discovery, contextualisation, curation, provenance) but I am not pushing for its inclusion. I just wanted to ensure that we (as we have elsewhere) avoid being (or being seen to be) prescriptive.
@keithjeffery absolutely. I was suggesting exactly the same thing. We need "a thick cloud" of metadata, but we cannot pre-determine what that cloud is composed of (and shouldn't try!)
This discussion also links nicely with the content of the RDA FAIRsharing WG registry, which is now one of the formally approved RDA outputs.
As detailed in #29, many domain/discipline-specific community standards (for representing/reporting digital objects) already contain some provenance, both general information and information specific to the kind of object (who created it, when and how, etc. ... what technology was used, what analytical method, etc.); these communities are not using PROV. Adding R1.2-02 would be too specific.
@SusannaSansone R1.2-02 tries not to be too specific -- it contains a reference to PROV-O only as an example. The objective was to try to encourage mapping from domain-specific approaches to more general approaches so that people in other domains can also understand the provenance information. It might indeed be that such a requirement is difficult to satisfy. However, cross-domain reusability will increase if such a mapping is provided.
Please find the current version of the indicator(s) and their respective maturity levels for this FAIR principle. Indicators and maturity levels will be presented, as they stand, to the next working group meeting for approval. In the meantime, any comments are still welcome.
The editorial team will now concentrate on weighing and prioritizing these indicators. More information soon.
@makxdekkers I understand this "from domain-specific approaches to more general approaches", but then it has to be clear that this only refers to general approaches, because there are many community-specific (which can also imply domain/discipline-specific) models/formats (expressed in one or more of metamodels, XML, TAB etc.) that include provenance information (without using PROV). Just to pick one example: https://doi.org/10.25504/FAIRsharing.s51qk5
@SusannaSansone Indicator R1.2-01M is indeed about provenance information according to community-specific guidelines or standards. Is that not sufficiently clear? If not, how could it be formulated better?
@makxdekkers If you just say "provenance information according to community-specific guidelines or standards", that is OK. My comment was on the example of PROV, which some domain-specific and community-specific standards do not use, yet these capture provenance information.
Dear contributors,
Below you can find the indicators and their maturity levels in their current state as a result of the above discussions and workshops.
Please note that this thread is going to be closed within a short period of time. The current state of the indicators, as of early October 2019, is now frozen, with the exception of the indicators for the principles that are concerned with 'richness' of metadata (F2 and R1).

The current indicators will be used for the further steps of this WG, which are prioritisation and scoring. Later on, they will be used in a testing phase where owners of evaluation approaches are going to be invited to compare their approaches (questionnaires, tools) against the indicators. The editorial team, in consultation with the Working Group, will define the best approach to test the indicators and evaluate their soundness. As such, the current set of indicators can be seen as an 'alpha version'. In the first half of 2020, the indicators may be revised and improved, based on the results of the testing.

If you have any further comments or suggestions regarding this specific discussion, please share them with us. In addition, we invite you to have a look at the following two sets of issues.
Prioritisation
• Indicators prioritisation for Findability
• Indicators prioritisation for Accessibility
• Indicators prioritisation for Interoperability
• Indicators prioritisation for Reusability
Scoring
• Indicators for FAIRness | Scoring

We thank you for your valuable input!