Closed makxdekkers closed 4 years ago
Formal declared semantics are a great assistance to FAIRness since their use improves relevance and recall (to use old-fashioned information retrieval concepts). They are essential for each of F, A, I, R. Moreover, while simple vocabularies can be adequate for some purposes, formal ontological structures (not necessarily in an ontology of the W3C/RDF kind) can greatly improve F (use of related terms, including multilinguality), I (with crosswalks between terminology structures) and R.
This too links nicely with the content of the RDA FAIRsharing WG registry, which is now one of the formally approved RDA outputs.
As detailed at #29, domain/discipline-specific community standards already define their own terminologies (from CVs to ontologies that provide definitions and unambiguous identification for concepts and objects; see here), especially to formalize knowledge in datasets.
@SusannaSansone As far as I can see, there is no explicit mention in any of the FAIR principles of testing for the use of terminologies (CVs/ontologies) that are commonly used in a community. This seems to be implicit in both I1 ("... shared, and broadly applicable language ...") and R1.3 ("... domain-relevant community standards ..."). Should we consider adding an indicator for this requirement to use terminologies that are common for a community?
@makxdekkers as I detail at #29, many communities consider common terminologies part of the community standards.
Thank you @SusannaSansone. It seems to me, then, there is no need for a separate indicator for this.
@makxdekkers indeed, but it needs clarification that by community standards we mean terminologies, models, formats etc. Again, we may need a glossary because (as commented in other parts) we use different labels and definitions.
@SusannaSansone Good point. Would you be able to propose a list of terms for which we need to agree definitions?
Lacking any formal training in computer science, I have always tried to explain formal language for knowledge representation at a slightly less formal level as:
any format used for representing data that does not leave any ambiguity as to the meaning of the data.
This could e.g. be full-fledged RDF, but it may also be a standardized domain-specific data format that has all (meta)data fields very well defined.
This again may be dependent on the context: when health data and climate data are combined in an interdisciplinary study, the field "temperature" which may be unambiguous in either field may suddenly need more explanation (body temperature vs ambient temperature).
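To make this ambiguity concrete, here is a minimal Python sketch. The field names and the `https://example.org/...` identifiers are hypothetical, chosen only for illustration; they are not taken from any real vocabulary.

```python
# Two records that are each unambiguous within their own discipline.
health_record = {"temperature": 38.2}   # body temperature, degrees Celsius
climate_record = {"temperature": 31.5}  # ambient air temperature, degrees Celsius

# A naive merge silently loses one value because the keys collide.
naive = {**health_record, **climate_record}

# Hypothetical concept identifiers (assumptions, not real vocabulary terms).
BODY_TEMP = "https://example.org/health#bodyTemperature"
AMBIENT_TEMP = "https://example.org/climate#ambientTemperature"


def qualify(record, mapping):
    """Rename local field names to globally unambiguous identifiers."""
    return {mapping[key]: value for key, value in record.items()}


# After qualification both values survive the merge.
merged = {
    **qualify(health_record, {"temperature": BODY_TEMP}),
    **qualify(climate_record, {"temperature": AMBIENT_TEMP}),
}
```

The point is only that the disambiguation has to be declared somewhere; whether it lives in an ontology or in a well-defined format specification is the question discussed in this thread.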
@makxdekkers unfortunately there is no widely agreed glossary. I can only report on the one used by FAIRsharing, which classifies community standards as:
minimal reporting requirements (checklists or templates that outline the necessary and sufficient information vital for contextualizing and understanding a digital object); examples here.
terminologies (from CVs to ontologies that provide definitions and unambiguous identification for concepts and objects); examples here
models/formats (define the structure and relationship of information for a conceptual model or schema, and include transmission formats to facilitate the exchange of data between different systems); examples here
Minimal reporting requirements are usually textual documents or lists. Terminologies and models/formats are machine readable and expressed in one or more metaformats (XML, RDF, TAB etc.).
@rwwh @SusannaSansone
I note that both of you are co-authors of the recent article Annika Jacobsen et al., FAIR Principles: Interpretations and Implementation Considerations.
In the Guidelines document, I added this comment.
_I'd like to note that in the latest article https://doi.org/10.1162/dint_r_00024 a clarification is given that basically makes 'knowledge representation' just about the language that is used, and it gives RDF as example. It says nothing about the 'payload' of RDF, i.e. the classes and properties that are used within RDF. Also, the idea of 'reporting guidelines' seems to be more related to 'minimal information models' to which the article refers under principles F2 and R1.3. My worry is that if we define knowledge representation in the indicators differently than the FAIR authors, we're redefining the principles, which is not in our charter._
As you are members of the group of FAIR authors, I would very much appreciate your views.
In the call yesterday @markwilkinson identified @micheldumontier as the best person to answer this.
My take on "formal language for knowledge representation" has been to tell people that this is meant to avoid all possible ambiguity. So, as is said for patents, it is good if a format leaves no room for misinterpretation by "someone skilled in the art". It should be noted here that "skilled in the art" becomes harder to define as interoperability becomes more interdisciplinary.
Mark referred to their discussions about requiring the knowledge representation to have at least a https://en.wikipedia.org/wiki/Backus–Naur_form, but that not being sufficient. I can't comment on that since I don't have formal education in computer science.
Right, so BNF ensures that a machine can unambiguously parse a message - it's a mechanism for precisely defining a syntax. It does not, however, speak to meaning. For that, we have ontologies.
So... IMO, the "formal language for knowledge representation" must be a formal syntax, combined with a shared semantic. RDF+Ontologies is one widely-used option, but there are others.
Agree with Mark: a formal knowledge representation language articulates a machine-readable syntax and mathematical-based semantics. Therefore, the information contained within can be automatically parsed by a machine, and the content itself is amenable to automated reasoning in which new implications can be derived. BNF is just one way to express the syntax of the language, but there are others.
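The distinction between syntax and semantics made above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not RDF or any real ontology language: the statement format, the predicate names `is_a` and `subclass_of`, and the example terms are all invented here. A regular expression stands in for the grammar that a BNF definition would provide, and a single inference rule stands in for the semantics.

```python
import re

# Syntax: one statement per line, "<subject> <predicate> <object> ."
# This pattern plays the role that a BNF grammar would play.
STATEMENT = re.compile(r"^(\w+) (\w+) (\w+) \.$")


def parse(text):
    """Parse statements into (subject, predicate, object) tuples."""
    triples = set()
    for line in text.strip().splitlines():
        match = STATEMENT.match(line.strip())
        if match is None:
            raise ValueError(f"not a well-formed statement: {line!r}")
        triples.add(match.groups())
    return triples


def infer(triples):
    """Semantics: apply one rule until nothing new is derived.

    Rule: if (x, is_a, C) and (C, subclass_of, D) then (x, is_a, D).
    """
    derived = set(triples)
    changed = True
    while changed:
        changed = False
        for (x, p1, c) in list(derived):
            for (c2, p2, d) in list(derived):
                if p1 == "is_a" and p2 == "subclass_of" and c == c2:
                    new = (x, "is_a", d)
                    if new not in derived:
                        derived.add(new)
                        changed = True
    return derived


facts = parse("""
sample42 is_a BloodSample .
BloodSample subclass_of Specimen .
""")
closure = infer(facts)
```

Parsing alone only yields the two stated facts; the statement that `sample42` is a `Specimen` appears only after the rule is applied, which is the "new implications" part of the definition.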
All – I have been observing with interest. Many of you will have heard me say many times at RDA “formal syntax and declared semantics”. I am happy with BNF; for me the key thing is that the syntax should be in a notation suitable for logic processing (so one can reason about the semantics carried over the syntax). Best wishes, Keith
For me as CS Noob: how about a properly structured CSV? HDF? or specifics like a TIFF file or even BAM? Do those files satisfy this rule?
@rwwh: unfortunately CSV (or any other 'file' format) does not usually conform to BNF (of course you could put a BNF statement in a cell of a spreadsheet). The key point is that the syntax should be parsable by a computer. BNF is 'behind' all modern programming languages and relates directly to boolean logic (hence the ability to induce and deduce (we probably do not need to abduce)). In the FAIR context the important thing is that the knowledge representation has a formal syntax upon which semantics can be 'loaded'. Thus
@keithjeffery thank you for the explanation.
Actually this is very much in line with my reason to add the phrasing "to someone skilled in the art". I agree that just CSV is insufficient, but in some "arts" people have agreed upon ways of representing the information in a CSV file that removes possible ambiguities for them.
Such "community defined" format definitions also can define specifications for the data fields that disambiguate them "at ontology quality level" without using an ontology.
I would argue that agreement on such a community-specified format would be even better than generic RDF, because if it is done well it is (a) completely automatically translatable into such RDF, and (b) at the same time it allows efficient use for analysis within the field. Maybe (a) would be my necessary-and-sufficient criterion for a knowledge representation, rather than BNF.
A JPG is a reasonable example, it is so well defined that the data as well as EXIF metadata could be auto-translated into RDF, but image processing tools would be very happy if the original JPG format could be used. [I am deliberately ignoring the fact that the metadata in EXIF is rarely complete enough to be satisfying any FAIR levels, just focusing on the knowledge representation issue here]
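As a rough sketch of criterion (a), the following Python shows a CSV file whose columns are fixed by a community specification being translated mechanically into subject-predicate-object statements. Everything specific here is an assumption made for illustration: the column names, the `COLUMN_SPEC` mapping, and the `https://example.org/...` identifiers are hypothetical, and the output is plain tuples rather than real RDF.

```python
import csv
import io

# Hypothetical community specification: each column name is mapped to an
# unambiguous property identifier. All URIs here are made up for illustration.
COLUMN_SPEC = {
    "sample_id": None,  # used as the subject of each statement
    "body_temp_c": "https://example.org/spec#bodyTemperatureCelsius",
    "collected_on": "https://example.org/spec#collectionDate",
}
SUBJECT_PREFIX = "https://example.org/sample/"


def csv_to_triples(text):
    """Translate CSV rows into (subject, predicate, object) statements."""
    reader = csv.DictReader(io.StringIO(text))
    if set(reader.fieldnames or []) != set(COLUMN_SPEC):
        # A file that deviates from the agreed columns would be ambiguous.
        raise ValueError("columns do not match the specification")
    triples = []
    for row in reader:
        subject = SUBJECT_PREFIX + row["sample_id"]
        for column, predicate in COLUMN_SPEC.items():
            if predicate is not None:
                triples.append((subject, predicate, row[column]))
    return triples


data = "sample_id,body_temp_c,collected_on\nS1,38.2,2020-02-13\n"
triples = csv_to_triples(data)
```

The translation needs no human judgement because the specification, not the CSV syntax, carries the meaning of each column; a plain CSV without such an agreement would fail the check.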
Whoops - just noticed some text disappeared. Between 'can be loaded with' and 'ideally' there should have been:
<Rob Hooft
sorry about that!
It has done it again. The line should have been :
Aha, I see the problem; the system does not like the < or > symbols. So, I substitute [ and ]: [Rob Hooft][is owner of][dataset D] Best, Keith
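The bracketed statement is a subject-predicate-object triple. A minimal Python sketch of how a machine could parse that bracket notation follows; the function name `parse_triple` and the exact pattern are assumptions made for this illustration, not part of any standard.

```python
import re

# Exactly three bracketed parts, none of which may contain brackets itself.
TRIPLE = re.compile(r"^\[([^\[\]]+)\]\[([^\[\]]+)\]\[([^\[\]]+)\]$")


def parse_triple(statement):
    """Split '[subject][predicate][object]' into its three parts."""
    match = TRIPLE.match(statement.strip())
    if match is None:
        raise ValueError(f"not a bracketed triple: {statement!r}")
    subject, predicate, obj = match.groups()
    return subject, predicate, obj


triple = parse_triple("[Rob Hooft][is owner of][dataset D]")
```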
@markwilkinson @rwwh @micheldumontier @keithjeffery
Thanks for your insights.
I understand from the discussion here that the main objective of principle I1 is that FAIR data uses a knowledge representation language that expresses mathematical-based semantics in a machine-readable syntax. I'd call this the "how". For people to understand how to evaluate this, we should also indicate "what" needs to be evaluated. Following discussions at the WG meeting last Thursday, 13 February, I think it would be easier to understand if it were linked to the evaluation of specific aspects, e.g.
Are there other aspects that need to be considered for evaluation?
What should be expected from knowledge representation systems in terms of syntax and semantics? How can knowledge representation systems (code lists, controlled vocabularies, ontologies) help or hinder FAIRness?