rd-alliance / FAIR-data-maturity-model-WG

https://www.rd-alliance.org/group/fair-data-maturity-model-wg/case-statement/fair-data-maturity-model-wg-case-statement

Knowledge representation #14

Closed makxdekkers closed 4 years ago

makxdekkers commented 5 years ago

What should be expected from knowledge representation systems in terms of syntax and semantics? How can knowledge representation systems (code lists, controlled vocabularies, ontologies) help or hinder FAIRness?

keithjeffery commented 5 years ago

Formal declared semantics are a great assistance to FAIRness since their use improves relevance and recall (to use old-fashioned information retrieval concepts). They are essential for each of F, A, I, R. Moreover, while simple vocabularies can be adequate for some purposes, formal ontological structures (not necessarily in an ontology of the W3C/RDF kind) can greatly improve F (use of related terms, including multilinguality), I (with crosswalks between terminology structures) and R.

SusannaSansone commented 5 years ago

This too links nicely with the content of the RDA FAIRsharing WG registry, which is now one of the formally approved RDA outputs.

As detailed at #29, domain/discipline-specific community standards already define their own terminologies (from CVs to ontologies that provide definitions and unambiguous identification for concepts and objects; see here), especially to formalize knowledge in datasets.

makxdekkers commented 5 years ago

@SusannaSansone As far as I can see, there is no explicit mention in any of the FAIR principles of testing for the use of terminologies (CVs/ontologies) that are commonly used in a community. This seems to be implicit in both I1 ("... shared, and broadly applicable language ...") and R1.3 ("... domain-relevant community standards ..."). Should we consider adding an indicator for this requirement to use terminologies that are common for a community?

SusannaSansone commented 5 years ago

@makxdekkers as I detail at #29, many communities consider common terminologies part of the community standards.

makxdekkers commented 5 years ago

Thank you @SusannaSansone. It seems to me, then, there is no need for a separate indicator for this.

SusannaSansone commented 5 years ago

@makxdekkers indeed, but it needs clarification that by community standards we mean terminologies, models, formats etc. Again, we may need a glossary because (as commented in other parts) we use different labels and definitions.

makxdekkers commented 5 years ago

@SusannaSansone Good point. Would you be able to propose a list of terms for which we need to agree definitions?

rwwh commented 5 years ago

Lacking any formal training in computer science, I have always tried to explain formal language for knowledge representation at a slightly less formal level as:

any format used for representing data that does not leave any ambiguity as to the meaning of the data.

This could e.g. be full-fledged RDF, but it may also be a standardized domain-specific data format that has all (meta)data fields very well defined.

This again may be dependent on the context: when health data and climate data are combined in an interdisciplinary study, the field "temperature" which may be unambiguous in either field may suddenly need more explanation (body temperature vs ambient temperature).

SusannaSansone commented 5 years ago

> @SusannaSansone Good point. Would you be able to propose a list of terms for which we need to agree definitions?

@makxdekkers unfortunately there is no widely agreed glossary. I can only report on the one used by FAIRsharing, which classifies community standards as:

Minimal reporting requirements are usually textual documents or lists. Terminologies and models/formats are machine readable and expressed in one or more metaformats (XML, RDF, TAB etc.).

makxdekkers commented 4 years ago

@rwwh @SusannaSansone

I note that both of you are co-authors of the recent article Annika Jacobsen et al., FAIR Principles: Interpretations and Implementation Considerations.

In the Guidelines document, I added this comment.

_I'd like to note that in the latest article https://doi.org/10.1162/dint_r_00024 a clarification is given that basically makes 'knowledge representation' just about the language that is used, and it gives RDF as example. It says nothing about the 'payload' of RDF, i.e. the classes and properties that are used within RDF. Also, the idea of 'reporting guidelines' seems to be more related to 'minimal information models' to which the article refers under principles F2 and R1.3. My worry is that if we define knowledge representation in the indicators differently than the FAIR authors, we're redefining the principles, which is not in our charter._

As you are members of the group of FAIR authors, I would very much appreciate your views.

rwwh commented 4 years ago

In the call yesterday @markwilkinson identified @micheldumontier as the best person to answer this.

My take on "formal language for knowledge representation" has been to tell people that this is meant to avoid all possible ambiguity. So, like said for patents, it is good if a format does not leave any room for misinterpretation for "someone skilled in the art". Hereby it should be noted that "skilled in the art" becomes harder to define for more inter-disciplinary interoperability.

Mark referred to their discussions about requiring the knowledge representation to have at least a https://en.wikipedia.org/wiki/Backus–Naur_form , but that not being sufficient. I can't comment on that since I don't have formal education in computer science.

markwilkinson commented 4 years ago

Right, so BNF ensures that a machine can unambiguously parse a message - it's a mechanism for precisely defining a syntax. It does not, however, speak to meaning. For that, we have ontologies.

So... IMO, the "formal language for knowledge representation" must be a formal syntax, combined with a shared semantic. RDF+Ontologies is one widely-used option, but there are others.

micheldumontier commented 4 years ago

agree with mark: a formal knowledge representation language articulates a machine-readable syntax and mathematical-based semantics. therefore, the information contained within can be automatically parsed by a machine, and the content itself is amenable to automated reasoning in which new implications can be derived. BNF is just one way to express the syntax of the language, but there are others.
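A minimal sketch of the distinction drawn above between syntax and semantics, under stated assumptions: the bracketed three-term statement format, the BNF-like grammar in the comment, the `is a kind of` rule and all example terms are hypothetical and chosen only for illustration. `parse` covers the syntax side (a machine can accept or reject a message unambiguously), and `infer` covers the semantics side (new implications are derived from a declared rule).

```python
import re

# Syntax (hypothetical grammar, BNF-like):
#   <statement> ::= <term> <term> <term>
#   <term>      ::= "[" <text> "]"
STATEMENT = re.compile(r"\[([^\[\]]+)\]\[([^\[\]]+)\]\[([^\[\]]+)\]")


def parse(line):
    """Return (subject, predicate, object), or raise ValueError if malformed."""
    match = STATEMENT.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"not a well-formed statement: {line!r}")
    return match.groups()


def infer(facts):
    """Apply one assumed rule until no new facts appear.

    Rule: if [p][is a kind of][q] and [s][p][o], then [s][q][o].
    """
    facts = set(facts)
    while True:
        kinds = {(s, o) for (s, p, o) in facts if p == "is a kind of"}
        derived = {
            (s, q, o)
            for (p, q) in kinds
            for (s, p2, o) in facts
            if p2 == p
        }
        if derived <= facts:
            return facts
        facts |= derived


facts = {
    parse("[dataset D][has creator][Alice]"),
    parse("[has creator][is a kind of][has contributor]"),
}
closure = infer(facts)
```

Here `closure` also contains the statement that dataset D has contributor Alice, which was never written down explicitly. A grammar alone would let a machine read the two input lines, but only the declared rule lets it derive the third.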

keithjeffery commented 4 years ago

All – I have been observing with interest. Many of you will have heard me say many times at RDA “formal syntax and declared semantics”. I am happy with BNF; for me the key thing is that the syntax should be in a notation suitable for logic processing (so one can reason about the semantics carried over the syntax). Best wishes, Keith



rwwh commented 4 years ago

For me as CS Noob: how about a properly structured CSV? HDF? or specifics like a TIFF file or even BAM? Do those files satisfy this rule?

keithjeffery commented 4 years ago

@rwwh: unfortunately CSV (or any other 'file' format) does not usually conform to BNF (of course you could put a BNF statement in a cell of a spreadsheet). The key point is that the syntax should be parsable by a computer. BNF is 'behind' all modern programming languages and relates directly to boolean logic (hence the ability to induce and deduce (probably do not need to abduce)). In the FAIR context the important thing is that the knowledge representation has formal syntax upon which semantics can be 'loaded'. Thus (having been previously declared as the grammar) can be loaded with (ideally one would add a time range (start-end) when this assertion is true and, if you want to be really fancy, add probability, which would then read: it is x% probable that between timedate 1 and timedate 2 Rob Hooft is owner of Dataset D). The problem with e.g. CSV is that if you put those three terms in columns then the relationship between the columns is not explicit. This leads to ambiguity.

rwwh commented 4 years ago

@keithjeffery thank you for the explanation.

Actually this is very much in line with my reason to add the phrasing "to someone skilled in the art". I agree that just CSV is insufficient, but in some "arts" people have agreed upon ways of representing the information in a CSV file that remove possible ambiguities for them.

Such "community defined" format definitions also can define specifications for the data fields that disambiguate them "at ontology quality level" without using an ontology.

I would argue that agreement on such a specific community-specified format would be even better than generic RDF, because if it is done well it is (a) completely automatically translatable into such RDF, and (b) at the same time it allows efficient use for analysis within the field. Maybe this (a) would be my necessary-and-sufficient criterion for a knowledge representation rather than BNF.
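Point (a) can be sketched as follows, under stated assumptions: a community publishes a field specification alongside its CSV format, and that specification alone is enough to drive the translation. The column names, the `https://example.org/` identifiers and the `SPEC` layout are hypothetical. The output is N-Triples-like text built with the standard library only, and the sketch assumes cell values contain no quotes or backslashes, so no literal escaping is done.

```python
import csv
import io

# Hypothetical community specification: which column identifies the record
# and which unambiguous property each remaining column stands for.
SPEC = {
    "key": "sample_id",
    "base": "https://example.org/sample/",
    "columns": {
        "body_temperature_celsius": "https://example.org/vocab/bodyTemperatureCelsius",
        "collected_on": "https://example.org/vocab/collectedOn",
    },
}


def to_ntriples(csv_text, spec=SPEC):
    """Translate each CSV row into one statement per specified column."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = spec["base"] + row[spec["key"]]
        for column, predicate in spec["columns"].items():
            # Simplification: values are emitted as plain string literals.
            lines.append(f'<{subject}> <{predicate}> "{row[column]}" .')
    return lines


example = "sample_id,body_temperature_celsius,collected_on\ns1,36.8,2020-02-14\n"
triples = to_ntriples(example)
```

The CSV file itself stays in the form that tools in the field already use. The translation needs nothing beyond the agreed specification, which is the property proposed above as the test for an adequate knowledge representation.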

A JPG is a reasonable example, it is so well defined that the data as well as EXIF metadata could be auto-translated into RDF, but image processing tools would be very happy if the original JPG format could be used. [I am deliberately ignoring the fact that the metadata in EXIF is rarely complete enough to be satisfying any FAIR levels, just focusing on the knowledge representation issue here]

keithjeffery commented 4 years ago

Whoops - just noticed some text disappeared. Between 'can be loaded with' and 'ideally' there should have been: <Rob Hooft

sorry about that!

keithjeffery commented 4 years ago

It has done it again. The line should have been :

best wishes Keith

keithjeffery commented 4 years ago

aha, I see the problem; the system does not like the < or > symbols. So, I substitute [ and ]: [Rob Hooft][is owner of][dataset D] best Keith
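A minimal sketch of the contrast described in this exchange, assuming a Python reading of it: a CSV row is only a list of cells, whereas an explicit statement names the role of each term and can carry the optional time range and probability. The `Assertion` class, its field names and the dates are hypothetical.

```python
import csv
import io
from dataclasses import dataclass
from typing import Optional

# A CSV row is just a list of cells. Nothing in the file says which cell is
# the subject, which is the relationship, or how the columns relate.
row = next(csv.reader(io.StringIO("Rob Hooft,is owner of,dataset D\n")))


@dataclass(frozen=True)
class Assertion:
    """An explicit statement whose roles are declared rather than implied."""

    subject: str
    predicate: str
    obj: str
    start: Optional[str] = None          # start of validity (optional)
    end: Optional[str] = None            # end of validity (optional)
    probability: Optional[float] = None  # confidence in the statement

    def describe(self):
        text = f"{self.subject} {self.predicate} {self.obj}"
        if self.start and self.end:
            text = f"between {self.start} and {self.end} {text}"
        if self.probability is not None:
            text = f"it is {self.probability:.0%} probable that {text}"
        return text


plain = Assertion("Rob Hooft", "is owner of", "dataset D")
qualified = Assertion(
    "Rob Hooft", "is owner of", "dataset D",
    start="2019-01-01", end="2020-01-01", probability=0.9,  # made-up values
)
```

Writing `Assertion(*row)` would work, but only because the reader decides which column plays which role; that decision is exactly what a bare CSV file leaves open.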

makxdekkers commented 4 years ago

@markwilkinson @rwwh @micheldumontier @keithjeffery

Thanks for your insights.

I understand from the discussion here that the main objective of principle I1 is that FAIR data uses a knowledge representation language that expresses mathematical-based semantics in a machine-readable syntax. I'd call this the "how". For people to understand how to evaluate this, we should also indicate "what" needs to be evaluated. Following discussions at the WG meetings last Thursday 13 February, I think it would be easier to understand if it were linked to the evaluation of specific aspects, e.g.

  • metadata schema
  • controlled vocabularies
  • data models and formats

Are there other aspects that need to be considered for evaluation?