tdwg / dwc

Darwin Core standard for sharing of information about biological diversity.
https://dwc.tdwg.org
Creative Commons Attribution 4.0 International

Where to get an RDFS or OWL version of DWC ? #357

Open tfrancart opened 3 years ago

tfrancart commented 3 years ago

Does an RDFS or OWL version of DWC exist (containing all terms with subClassOf, domain and range definitions, plus labels and definitions)? I can't find it. Additionally, does a UML-like diagram of DWC terms exist somewhere?

Sorry if these are stupid questions, I am new to DWC (but not to RDF/OWL ;-) )

baskaufs commented 3 years ago

Hi @tfrancart. If the system is working, you should be able to acquire various bits of the Darwin Core vocabulary RDF by dereferencing their IRIs, then following links to other resources. However, that is somewhat annoying.

You can also get dumps of whole datasets at once by dereferencing "dump" IRIs as described here. So for example, you can get the RDF about terms in the main dwc: namespace using:

http://rs.tdwg.org/dump/terms

with content negotiation or directly using

http://rs.tdwg.org/dump/terms.rdf

or

http://rs.tdwg.org/dump/terms.ttl
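For anyone who wants to script the content-negotiation route, here is a minimal Python sketch using only the standard library. The helper names `build_dump_request` and `fetch_dump` are hypothetical, and the choice of `text/turtle` as the default media type is an assumption (`application/rdf+xml` would be the usual choice for RDF/XML); only the dump IRI comes from the text above.

```python
import urllib.request

DUMP_IRI = "http://rs.tdwg.org/dump/terms"  # dump IRI quoted above


def build_dump_request(iri=DUMP_IRI, media_type="text/turtle"):
    """Build a GET request that asks for one RDF serialization via the Accept header."""
    return urllib.request.Request(iri, headers={"Accept": media_type})


def fetch_dump(iri=DUMP_IRI, media_type="text/turtle", timeout=30):
    """Dereference the IRI and return the response body as text (needs network access)."""
    request = build_dump_request(iri, media_type)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_dump()` should return the same Turtle document as the `.ttl` URL, provided the server honours the Accept header as described.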

The other alternative is to acquire the data you want using a SPARQL query at https://sparql.vanderbilt.edu/, which currently has all of the TDWG standards metadata loaded. For example, this query:

prefix skos: <http://www.w3.org/2004/02/skos/core#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix tdwgutility: <http://rs.tdwg.org/dwc/terms/attributes/>
select distinct ?termList ?label ?term
from <http://rs.tdwg.org/>
where {
  <http://www.tdwg.org/standards/450> dcterms:hasPart ?vocabulary.
  ?vocabulary a tdwgutility:Vocabulary.
  ?vocabulary dcterms:hasPart ?termList.
  ?termList dcterms:hasPart ?term.
  ?term skos:prefLabel ?label.
  filter(lang(?label) = "en")
  }
order by ?termList

will list all terms (borrowed and minted) from all vocabularies included in the Darwin Core standard.
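If you would rather run the query from a script than from the web form, something like the following Python sketch should work against any endpoint that follows the SPARQL 1.1 Protocol. The endpoint path `https://sparql.vanderbilt.edu/sparql` is an assumption (only the host is given above), and the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://sparql.vanderbilt.edu/sparql"  # assumed path; only the host is given above


def build_query_request(query, endpoint=ENDPOINT):
    """Encode the query as a GET request and ask for SPARQL JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def rows_from_results(results_json):
    """Flatten a SPARQL JSON results document into a list of {variable: value} dicts."""
    data = json.loads(results_json)
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in data["results"]["bindings"]
    ]


def run_query(query, endpoint=ENDPOINT, timeout=60):
    """Send the query and return the flattened rows (needs network access)."""
    with urllib.request.urlopen(build_query_request(query, endpoint), timeout=timeout) as resp:
        return rows_from_results(resp.read().decode("utf-8"))
```

Passing the query text above to `run_query` would give one dict per result row, with keys `termList`, `label` and `term`.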

There is a series of blog posts describing these three methods in detail: content negotiation, SPARQL queries, and dataset dumps.

You asked about RDFS/OWL. By design, terms defined by TDWG do not include the semantics you mentioned (subClassOf, domain, range). TDWG vocabularies are "bags of terms" with only simple properties that do not generate entailments. Additional semantic layers could be added as vocabulary "enhancements" but so far that has not happened for any TDWG vocabulary that conforms to the Standards Documentation Specification (SDS) (that is, the Darwin Core and Audubon Core vocabularies). See section 4.4.2.2 of the SDS for details.

There has been a lot of discussion recently about how Darwin Core terms should be grouped within the main Darwin Core classes. But so far there has not been a consensus about that and there is not yet a consensus graph/data model. So until that happens, there are not likely to be additional "enhancement" layers with range and domain declarations.

I hope I have answered your questions. Please feel free to follow up if I need to clarify anything about what I said.

tfrancart commented 3 years ago

> Hi @tfrancart. If the system is working, you should be able to acquire various bits of the Darwin Core vocabulary RDF by dereferencing their IRIs, then following links to other resources. However, that is somewhat annoying.

Agreed, hence my question ;-)

> You can also get dumps of whole datasets at once by dereferencing "dump" IRIs as described here. So for example, you can get the RDF about terms in the main dwc: namespace using:
>
> http://rs.tdwg.org/dump/terms
>
> with content negotiation or directly using
>
> http://rs.tdwg.org/dump/terms.rdf

This works

> or
>
> http://rs.tdwg.org/dump/terms.ttl

This works. Thanks.

> The other alternative is to acquire the data you want using a SPARQL query at https://sparql.vanderbilt.edu/, which currently has all of the TDWG standards metadata loaded. For example, this query:
>
> prefix skos: <http://www.w3.org/2004/02/skos/core#>
> prefix dcterms: <http://purl.org/dc/terms/>
> prefix tdwgutility: <http://rs.tdwg.org/dwc/terms/attributes/>
> select distinct ?termList ?label ?term
> from <http://rs.tdwg.org/>
> where {
>   <http://www.tdwg.org/standards/450> dcterms:hasPart ?vocabulary.
>   ?vocabulary a tdwgutility:Vocabulary.
>   ?vocabulary dcterms:hasPart ?termList.
>   ?termList dcterms:hasPart ?term.
>   ?term skos:prefLabel ?label.
>   filter(lang(?label) = "en")
>   }
> order by ?termList
>
> will list all terms (borrowed and minted) from all vocabularies included in the Darwin Core standard.
>
> There is a series of blog posts describing these three methods in detail: content negotiation, SPARQL queries, and dataset dumps.

Thank you, this is helpful. Are the download links for the dumps referenced somewhere from https://www.tdwg.org/standards/dwc/?

> You asked about RDFS/OWL. By design, terms defined by TDWG do not include the semantics you mentioned (subClassOf, domain, range). TDWG vocabularies are "bags of terms" with only simple properties that do not generate entailments. Additional semantic layers could be added as vocabulary "enhancements" but so far that has not happened for any TDWG vocabulary that conforms to the Standards Documentation Specification (SDS) (that is, the Darwin Core and Audubon Core vocabularies). See section 4.4.2.2 of the SDS for details.

I assumed that, since the term specification at https://dwc.tdwg.org/list/ is grouped by classes (Occurrence, Organism, etc.) and the IRI-valued terms clearly state domains and ranges in their definitions (e.g. for <http://rs.tdwg.org/dwc/iri/toTaxon>: "Use to link a dwc:Identification instance subject to a taxonomic entity such as a taxon"...), the equivalent domain and range declarations would be present. If I may, the definitions as they are written are misleading because they clearly use an IRI identifier ("Use to link a dwc:Identification instance ...") but this is not reflected in the formal semantics. If there are no such formal semantics, then the definition should use more generic wording, such as "Use to link a resource to a taxonomic entity such as a taxon".

> There has been a lot of discussion recently about how Darwin Core terms should be grouped within the main Darwin Core classes. But so far there has not been a consensus about that and there is not yet a consensus graph/data model. So until that happens, there are not likely to be additional "enhancement" layers with range and domain declarations.

This clarifies a lot about what DWC is and is not, many thanks! I find it weird to have defined a clear list of classes and a clear list of properties, but not the relations between the classes and properties... I hope the discussions within the community will continue toward a (minimal) graph/data model.

> I hope I have answered your questions. Please feel free to follow up if I need to clarify anything about what I said.

baskaufs commented 3 years ago

> Thank you, this is helpful. Are the download links for the dumps referenced somewhere from https://www.tdwg.org/standards/dwc/?

I don't think so. There needs to be more work on making sure that pages on the TDWG website are linked to important places. I think the main starting point is the landing page for the rs.tdwg.org GitHub repo. Most people aren't aware of the details of how the TDWG metadata are generated, but it all comes from tables in this repo. This table is the authoritative list of datasets and it is used to generate http://rs.tdwg.org/index.rdf and http://rs.tdwg.org/index.ttl, which can be used by a linked data client to "follow its nose" to all of the dataset dumps.

> I assumed that, since the term specification at https://dwc.tdwg.org/list/ is grouped by classes (Occurrence, Organism, etc.) and the IRI-valued terms clearly state domains and ranges in their definitions (e.g. for <http://rs.tdwg.org/dwc/iri/toTaxon>: "Use to link a dwc:Identification instance subject to a taxonomic entity such as a taxon"...), the equivalent domain and range declarations would be present. If I may, the definitions as they are written are misleading because they clearly use an IRI identifier ("Use to link a dwc:Identification instance ...") but this is not reflected in the formal semantics. If there are no such formal semantics, then the definition should use more generic wording, such as "Use to link a resource to a taxonomic entity such as a taxon".

I admit that this is a deficiency, but the DwC RDF guide is really just a first step towards making it possible to use DwC as RDF. There is no underlying graph model to link the main classes (although the new task group that is forming might eventually clarify that situation). There are really only a few terms, mostly in Table 3.6 of the RDF guide, where these sorts of "domain/range" relationships are spelled out. Those linking properties that don't have analogs in the dwc: namespace were minted as an alternative to trying to describe linked resources using a long list of the "convenience" properties. The whole thing is really a hack because we are starting with a vocabulary that was really designed for spreadsheets or other tabular data forms.

The decision to not include formal semantics (i.e. metadata properties that generate entailments like subclass, subdomain, range, domain) was intentional. It was made for two reasons: because in many cases people couldn't agree on what the values of those properties should be, and because what people really wanted was a way to restrict term use and not what you often actually achieve with these terms (generating entailed triples that are not what people intended or wanted). People who actually want to control how Darwin Core properties are used would probably be better off developing Shape Expressions (ShEx) profiles rather than assigning domains and ranges.
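To make the "restrict use" idea concrete, here is a small plain-Python sketch of a closed-world check. It is not ShEx (a real profile would be written in ShEx or SHACL and run by a validator); it only illustrates that a constraint reports a problem instead of inferring a new type. The constraint table, the `ex:` nodes and the function name are all hypothetical.

```python
RDF_TYPE = "rdf:type"

# Hypothetical constraint: subjects of dwciri:toTaxon must be typed dwc:Identification.
EXPECTED_SUBJECT_CLASS = {"dwciri:toTaxon": "dwc:Identification"}

# Made-up data: ex:occ1 uses the property but is typed as an Occurrence.
triples = [
    ("ex:id1", RDF_TYPE, "dwc:Identification"),
    ("ex:id1", "dwciri:toTaxon", "ex:taxon1"),
    ("ex:occ1", RDF_TYPE, "dwc:Occurrence"),
    ("ex:occ1", "dwciri:toTaxon", "ex:taxon1"),
]


def violations(triples, expected):
    """Return (subject, property) pairs whose subject lacks the expected explicit type."""
    typed = {(s, o) for s, p, o in triples if p == RDF_TYPE}
    return [
        (s, p)
        for s, p, _ in triples
        if p in expected and (s, expected[p]) not in typed
    ]


print(violations(triples, EXPECTED_SUBJECT_CLASS))
```

With an rdfs:domain declaration instead, `ex:occ1` would silently be inferred to be an Identification as well; the check above flags it.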

I think the really big picture is that Darwin Core is a very general-use vocabulary that is used by many people (most?) who don't care at all about RDF, Linked Data, Semantic Web, etc. So it's difficult to develop a vocabulary that simultaneously meets the needs of those people, while also doing what Linked Data people want. It's not impossible, but challenging.

tfrancart commented 3 years ago

> The decision to not include formal semantics (i.e. metadata properties that generate entailments like subclass, subdomain, range, domain) was intentional. It was made for two reasons: because in many cases people couldn't agree on what the values of those properties should be

I admit agreeing on the values could be difficult, but what about the domains? I see other places in DWC, like the DWC XML schemas, where an explicit relation between classes and properties is made. Quoting https://dwc.tdwg.org/xml/:

> Many Darwin Core terms (properties) are defined as being associated with another term (a class). For example, scientificName and Taxon are both Darwin Core terms, but scientificName is a property associated with the Taxon class.

Either this statement is incorrect, because no such association between the two terms exists formally, or such a domain association should be made explicit in the property declaration, or I missed something. I am confused because it seems to me that the DWC XML schemas define relations between properties and classes (though I am not sure I read the XSD correctly), but the equivalent does not exist for the RDF use case.

> and because what people really wanted was a way to restrict term use and not what you often actually achieve with these terms (generating entailed triples that are not what people intended or wanted). People who actually want to control how Darwin Core properties are used would probably be better off developing Shape Expressions (ShEx) profiles rather than assigning domains and ranges.

I agree; domains and ranges belong to the knowledge definition, while ShEx/SHACL belong to application-specific or workflow-specific constraints. My surprise is that I read definitions and usage descriptions of DWC terms (like the ones quoted above and previously) that are not aligned with the formal definitions of the terms, from a knowledge definition perspective (and not from an application-specific or data-specific perspective).

baskaufs commented 3 years ago

There are two things worth considering here. One is that the XML guide was written prior to the RDF guide. So the XML guide was not written with any RDF considerations in mind as far as I know. So I would not make any assumptions about implications for RDF based on what the XML guide says.

The second thing is that nothing prohibits a person or a group from assigning ranges, domains, subclass relationships, etc. to Darwin Core terms. What the relevant guidelines say is that these entailment-generating assertions should be added as a layer on top of the basic "bag of terms" layer that includes only basic metadata such as definition, label, examples, and notes. This would allow groups to test the utility of those assertions without imposing them on others who might have conflicting views on how relationships should be modeled or those who simply don't care because they are using spreadsheets. There is a process, as of yet unused in Darwin Core, for officially adding these higher-level layers to Darwin Core as part of a "vocabulary enhancement". But along with this ability comes a responsibility to show that the addition is both needed and also that it does something useful that requires it (the "demand" and "efficacy" requirements). The relevant specifications governing this are the Standards Documentation Specification sections 4.4.2.2 and examples in 4.4.2.3 and the Vocabulary Maintenance Specification section 4.

To delve into a few more details, the Darwin and Audubon Core terms are organized into groups in ways that help people better understand how to use them. This may include grouping terms under a class name to suggest that those terms would be appropriate properties to be used with instances of that class. However, this grouping is considered a suggestion, not a requirement and from time to time we see discussion around moving particular terms from one group to another. Formally, this grouping is achieved by the term tdwgutility:organizedInClass (http://rs.tdwg.org/dwc/terms/attributes/organizedInClass). See http://rs.tdwg.org/dwc/terms/recordedBy.ttl for an example. This organization is not normative and mostly serves as a way for the software that generates human-readable documentation to know how to place the terms into categories. In some cases (particularly in Audubon Core), the "classes" within which terms are organized are artificial ones with no practical semantic meaning. So there are actually machine-readable connections between terms and the classes they are organized within. But that connection does not have the effect you'd like to create using domain declarations.
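As a rough illustration of how documentation software could use that property, here is a Python sketch that groups terms by their tdwgutility:organizedInClass value. The three triples are illustrative only and should be checked against the actual term metadata; the function name is hypothetical.

```python
from collections import defaultdict

DWC = "http://rs.tdwg.org/dwc/terms/"
ORGANIZED_IN_CLASS = "http://rs.tdwg.org/dwc/terms/attributes/organizedInClass"

# Illustrative triples; verify against the real metadata before relying on them.
triples = [
    (DWC + "recordedBy", ORGANIZED_IN_CLASS, DWC + "Occurrence"),
    (DWC + "scientificName", ORGANIZED_IN_CLASS, DWC + "Taxon"),
    (DWC + "taxonID", ORGANIZED_IN_CLASS, DWC + "Taxon"),
]


def group_by_class(triples):
    """Map each class IRI to the terms organized under it, in input order."""
    groups = defaultdict(list)
    for subject, predicate, obj in triples:
        if predicate == ORGANIZED_IN_CLASS:
            groups[obj].append(subject)
    return dict(groups)


for cls, terms in group_by_class(triples).items():
    print(cls, "->", terms)
```

Nothing here is inferred about instance data: the grouping only decides which heading a term is listed under.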

With respect to the problems associated with the xID terms, a problem arises with their existing use as a means to designate both the primary and foreign keys for a record. In the example given in section 2.7.1 of the XML guide, let's say we would like for dwc:identificationID to link occurrences to identifications, so we assign domain occurrence and range identification. We would also like dwc:taxonID to link identifications to taxa, so we assign domain identification and range taxon. Applying these properties to the identification instance shown in example 2.7.1 would be something like this (in RDF/Turtle):

_:identification dwc:identificationID <http://guid.mvz.org/identifications/23459> .
_:identification dwc:taxonID <urn:lsid:catalogueoflife.org:taxon:d79c11aa-29c1-102b-9a4a-00304854f820:col20120721> .

As I've defined the two xID properties, the domain declarations would entail that

_:identification a dwc:Occurrence .
_:identification a dwc:Identification .

Perhaps those domain declarations are wrong, but if that isn't what they should be, then what should they be?
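The entailment can be reproduced mechanically. Below is a minimal Python sketch of the RDFS domain rule (rdfs2: from `s p o` and `p rdfs:domain C`, infer `s rdf:type C`) applied to the two triples above. The domain table encodes the hypothetical declarations from this thought experiment, not anything Darwin Core actually declares, and the function name is made up.

```python
RDF_TYPE = "rdf:type"

# Hypothetical domain declarations from the thought experiment above;
# Darwin Core itself declares no domains for these terms.
DOMAINS = {
    "dwc:identificationID": "dwc:Occurrence",
    "dwc:taxonID": "dwc:Identification",
}

# The two triples describing the identification node.
data = [
    ("_:identification", "dwc:identificationID",
     "<http://guid.mvz.org/identifications/23459>"),
    ("_:identification", "dwc:taxonID",
     "<urn:lsid:catalogueoflife.org:taxon:d79c11aa-29c1-102b-9a4a-00304854f820:col20120721>"),
]


def entail_domain_types(triples, domains):
    """Apply rule rdfs2 and return the entailed rdf:type triples, sorted."""
    return sorted({(s, RDF_TYPE, domains[p]) for s, p, _ in triples if p in domains})


for triple in entail_domain_types(data, DOMAINS):
    print(triple)
```

The single node ends up typed as both dwc:Identification and dwc:Occurrence, which is exactly the unintended result described above.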

The point is that although xID terms are included in Darwin Core, their meaning when used as RDF is unclear. People intend for them to mean "when I use property xID in my RDB or spreadsheet, I want this column to contain an identifier for an instance of class x". What that column actually means depends on the context. If the table is "about" identifications, then we want dwc:identificationID to be a primary key. If the table is "about" occurrences, then we want dwc:identificationID to be a foreign key. That kind of fuzzy definition isn't suitable for use in RDF, so that's why the RDF guide says not to attempt to use the xID terms as object properties to link two classes. The xID terms are intended to solve a table-related problem, not an RDF-related problem.