tdwg / dwc

Darwin Core standard for sharing of information about biological diversity.
https://dwc.tdwg.org
Creative Commons Attribution 4.0 International

Bring metadata in rs.tdwg.org up to date with the "Normative Document" #252

Closed baskaufs closed 4 years ago

baskaufs commented 4 years ago

As of 2020-07-09, the metadata in the master branch of rs.tdwg.org differs from the Normative Document in the following ways:

baskaufs commented 4 years ago

Regarding the technical difference involving http://dublincore.org/usage/terms/history/#rightsT-001: Currently the "Normative Document" says that http://dublincore.org/usage/terms/history/#license-002 replaces http://dublincore.org/usage/terms/history/#license-001. This is correct as far as the assertions made by Dublin Core are concerned. According to the Dublin Core historical record, license-002 replaced license-001 in 2008.

However, the TDWG Executive Decision http://rs.tdwg.org/decisions/decision-2014-11-06_17 deprecated the Dublin Core rights term in favor of the Dublin Core license term in 2014. So from the standpoint of Darwin Core, license-002 replaced rightsT-001, not license-001. (license-001 was never part of DwC.) The authoritative Dublin Core metadata will still make the assertion currently in the "Normative Document", but I think the replacement made by the Exec in 2014 is the one that actually needs to be included in the metadata provided by TDWG. That's consistent with the way we use dcterms:replaces in every other circumstance and would be necessary information for any application that was trying to use machine readable metadata from TDWG to interpret old DwC datasets that used terms that are currently deprecated.

A secondary consequence of this is that http://dublincore.org/usage/terms/history/#rightsT-001 should be included in the "Normative Document" with a status of deprecated. It's already there in the rs.tdwg.org data, and since deprecated versions have no effect on generating the Quick Reference Document, adding it to the "Normative Document" should not cause any problems.
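To illustrate why the replacement recorded by TDWG matters to applications, here is a minimal sketch of how a consumer of machine-readable metadata might follow replacement links from a deprecated term version to its current one. The `REPLACED_BY` dictionary and the `resolve_current_version` function are hypothetical names invented for this example; they are not part of rs.tdwg.org or any TDWG tooling, and only the rightsT-001 to license-002 link is taken from the discussion above.

```python
# Hypothetical replacement map: each key is a term version IRI and each value
# is the version that replaces it (the dcterms:replaces assertion read in the
# "old -> new" direction). Only the rightsT-001 entry comes from this thread;
# the data structure itself is an assumption made for illustration.
DC_HISTORY = "http://dublincore.org/usage/terms/history/"

REPLACED_BY = {
    DC_HISTORY + "#rightsT-001": DC_HISTORY + "#license-002",
}


def resolve_current_version(version_iri, replaced_by):
    """Follow replacement links until a version with no successor is reached."""
    seen = set()
    current = version_iri
    while current in replaced_by:
        if current in seen:
            # Guard against malformed metadata that loops back on itself.
            raise ValueError(f"replacement cycle detected at {current}")
        seen.add(current)
        current = replaced_by[current]
    return current


if __name__ == "__main__":
    print(resolve_current_version(DC_HISTORY + "#rightsT-001", REPLACED_BY))
```

If the metadata carried only the Dublin Core assertion (license-002 replaces license-001), a lookup starting from rightsT-001 would find no successor, and an old dataset using the rights term could not be mapped forward.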

baskaufs commented 4 years ago

Regarding the technical difference involving http://dublincore.org/usage/terms/history/#language-007 vs. http://dublincore.org/usage/terms/history/#languageT-001: http://dublincore.org/usage/terms/history/#languageT-001 is the version for dcterms:language, while http://dublincore.org/usage/terms/history/#language-007 is the most recent version for dc:language.

The current description in the Quick Reference Guide (generated from the Normative Document) provides the following guidance for using dcterms:language: "Recommended best practice is to use RFC 5646 as a controlled vocabulary." with examples "en (for English), es (for Spanish)". However, the examples are inconsistent with usage of the dc: and dcterms: namespaces as outlined in Table 3.3 of the DwC RDF Guide, which specifies that the values of dcterms: terms should be non-literal (i.e. IRI) values and that dc: terms should be used with string literals. If dcterms:language is actually used, its value (as recommended by the RDF guide) should be a MARC ISO 639-2 language IRI. This recommendation is also consistent with the recommendation for use of dcterms:language given by Audubon Core.

The other somewhat problematic issue is that Audubon Core recommends the following for dc:language: "Language(s) of resource itself represented in the ISO639-2 three-letter language code. ISO639-1 two-letter codes are permitted but deprecated." This recommendation to use 3-letter codes is somewhat at odds with the DwC examples showing two-letter codes.

The problem here is that historically, the practice for using several of the dcterms: terms has been incorrect with respect to the recommendations of Dublin Core, i.e. using dcterms: terms whose values are recommended to be non-literal with literal values. How do we deal with this without breaking existing implementations? Also, how do we make usage in Darwin Core consistent with usage in Audubon Core?

The rs.tdwg.org repo metadata currently shows dcterms:language as recommended and dc:language as also recommended but replacing dcterms:language. That's a kind of weird way to indicate that we should stop using dcterms:language in the wrong way, but it's OK to use either of them with the correct kind of value. However, the Normative Document (and therefore the Quick Reference Guide) say nothing at all about dc:language and just tell people to use dcterms:language incorrectly.

I'm not sure how this should be fixed. How often is a value for dcterms:language even provided as part of metadata? What would be broken if we just told people to use dc:language and dcterms:language correctly? In the Quick Reference Guide dc:language could be categorized in the record-level terms section and dcterms:language could be put down in the UseWithIRI section.

baskaufs commented 4 years ago

The technical issue involving http://dublincore.org/usage/terms/history/#type-006 is similar to the issue described in the previous comment. The Normative Document describes dcterms:type (http://dublincore.org/usage/terms/history/#typeT-001) and ignores dc:type (http://dublincore.org/usage/terms/history/#type-006). The examples listed in the QRG are again string literals rather than IRIs - also incorrect usage for a dcterms: term.

In this case, it seems clear to me that this should be changed so that dc:type is prescribed for use with string literals. The reason is that the kinds of resources described by type are media resources, and given that Audubon Core is the media vocabulary of TDWG, its recommendations should take precedence over those of DwC. (The AC recommendations give the correct usage for dc:type and dcterms:type. See the term list for details.)

The rs.tdwg.org repo metadata currently has dcterms:type as deprecated with dc:type as its designated replacement. I think this makes sense because I'm not sure that there is a compelling reason why very many people would want to provide an IRI value for the term when they are probably inclined to use one of the string values. People who are really interested in RDF or linked data and want to use IRI values will probably want to use rdf:type instead of dcterms:type anyway. See Section 3.1 of the RDF guide for more on use of rdf:type.

We should consider a course of action here from the standpoint of stability and not "breaking" applications. How often is dcterms:type used outside the Audubon Core media extension? Does the AC media extension give correct advice on using the two terms? I presume so, and if that's true, then it doesn't seem like fixing this in DwC would have very severe consequences.

peterdesmet commented 4 years ago

Regarding point 4 http://rs.tdwg.org/dwc/terms/attributes/UseWithIRI-2017-10-06: it is used by the build script to group terms following that line under the section https://dwc.tdwg.org/terms/#usewithiri

I agree that it would be better to remove it and adapt the build script (which we might have to do anyway for https://github.com/tdwg/dwc/issues/251). It would be best to record this as a separate issue.

baskaufs commented 4 years ago

If the build script is to be re-done and the term_versions.csv file is to continue to be used for the time being to build the QRG, then http://rs.tdwg.org/dwc/terms/attributes/UseWithIRI-2017-10-06 could just be left. My primary concern is that if people continue to look to the "normative document" (term_versions.csv) to learn precisely what is and is not part of Darwin Core, it should agree exactly with the vocabulary metadata that is found elsewhere. If we stop referring to it as the go-to place for people to learn what is in the standard, then the presence of http://rs.tdwg.org/dwc/terms/attributes/UseWithIRI-2017-10-06 isn't much of a problem.

If you want to see what's in the http://rs.tdwg.org/dwc/terms/attributes/ namespace now, you can browse to its URI. You'll see that they are not only terms used in Darwin Core. There are now a bunch of other "made up" terms that we use in Audubon Core to indicate term properties like whether they are repeatable or not, plus new TDWG-wide terms like tdwgutility:Vocabulary. So the namespace http://rs.tdwg.org/dwc/terms/attributes/ is definitely not really a part of Darwin Core any more, despite the form of its URI. That's why we've been referring to it as the tdwgutility: namespace.

tucotuco commented 4 years ago

> I'm not sure how this should be fixed. How often is a value for dcterms:language even provided as part of metadata? What would be broken if we just told people to use dc:language and dcterms:language correctly? In the Quick Reference Guide dc:language could be categorized in the record-level terms section and dcterms:language could be put down in the UseWithIRI section.

As of the 2020-04-09 snapshot of GBIF, the language term is filled in for 216998286 (15.4% of) Occurrence records, and of those, 138804401 (64%) are two-letter language codes and 1245724 (0.6%) are three-letter language codes. The rest are fully spelled out names of languages or garbage that is not supposed to be in that field. There appear to be no IRIs in that field, and I cannot find any authority that gives IRIs for languages, though I don't have access to the actual standards ISO 639-2 or ISO 639-3. The documentation for dcterms:language even allows for a string literal if it is a language tag. In any case, the intention for every Dublin Core borrowed term was that it be the one for the string literal, and only late in the game was it realized that dcterms: was not it. They really all should be dc:, with corresponding dcterms: versions for the DwC IRI section.

baskaufs commented 4 years ago

One of the issues with the various dcterms: namespace terms is the lack of clear guidance on what authoritative source of IRIs should be used as values. What I suggest in this case is to make the DwC recommendation for dcterms:language consistent with Audubon Core. The recommendation here is to use the LOC IRIs for the ISO 639-2 three-letter codes. The LOC has several sets of controlled vocabularies for languages and I'm not sure why Bob Morris and the AC task group chose that one over the others (I wasn't on that team and as the review manager I didn't have any reason to question their choice). I think it was because the three letter codes allowed for more variants than the two-letter codes that are in more common use. In any case, it's a fait accompli and since we don't have a clear better option, I would suggest that we just be consistent with AC.

Ultimately, it would be good to have some kind of JSON or JSON-LD file that relates those IRIs to the two-letter codes so that the string values could be mapped to the IRIs. That hasn't happened in AC because nobody has yet indicated that they cared enough for us to expend the energy to do it. But for now, just having a vocabulary to recommend is probably good enough.
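As a rough sketch of what such a mapping could look like in use, the following loads a small JSON object relating two-letter codes to three-letter ISO 639-2 codes and builds an IRI from the result. Everything here is an assumption for illustration: the JSON layout, the `language_iri` function name, the two sample entries, and the `http://id.loc.gov/vocabulary/iso639-2/` prefix, which should be checked against the Audubon Core term list before being relied on. No such file currently exists in AC or DwC.

```python
import json

# Assumed IRI prefix for the LOC ISO 639-2 vocabulary; verify before use.
LOC_ISO639_2 = "http://id.loc.gov/vocabulary/iso639-2/"

# Illustrative subset only: two-letter code -> three-letter ISO 639-2 code.
# A real mapping file would cover the whole code list.
MAPPING_JSON = """
{
  "en": "eng",
  "es": "spa"
}
"""


def language_iri(value, two_to_three):
    """Return the language IRI for a two- or three-letter code, else None."""
    code = value.strip().lower()
    three_letter = two_to_three.get(code)
    if three_letter is None and code in two_to_three.values():
        # The value is already a known three-letter code.
        three_letter = code
    if three_letter is None:
        # Spelled-out language names and other strings are not mapped.
        return None
    return LOC_ISO639_2 + three_letter


if __name__ == "__main__":
    mapping = json.loads(MAPPING_JSON)
    for raw in ["en", "ES", "spa", "English"]:
        print(raw, "->", language_iri(raw, mapping))
```

Returning `None` for unrecognized strings, rather than guessing, leaves it to the caller to decide what to do with values that are neither two- nor three-letter codes.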

tucotuco commented 4 years ago

> We should consider a course of action here from the standpoint of stability and not "breaking" applications. How often is dcterms:type used outside the Audubon Core media extension? Does the AC media extension give correct advice on using the two terms? I presume so, and if that's true, then it doesn't seem like fixing this in DwC would have very severe consequences.

As of the 2020-04-09 snapshot of GBIF, the type term is filled in for 228614731 (16.3% of) Occurrence records, and of those, 118583281 (51.9%) are legal unqualified DCMI term names and 1372635 (0.6%) are valid DCMI IRIs. The rest are labels with various kinds of orthography and various languages or garbage.

So, again the better choice based on usage is the original intention, which is dc:type.
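For anyone wanting to run a similar tally on their own data, here is a minimal sketch of one way to sort type values into the three categories mentioned above. It is not the query that was run against the GBIF snapshot; the function names and the exact matching rules (case-sensitive names, the `http://purl.org/dc/dcmitype/` prefix for IRIs) are assumptions made for this example.

```python
from collections import Counter

DCMITYPE_NS = "http://purl.org/dc/dcmitype/"

# Class names of the DCMI Type Vocabulary.
DCMI_TYPE_NAMES = frozenset({
    "Collection", "Dataset", "Event", "Image", "InteractiveResource",
    "MovingImage", "PhysicalObject", "Service", "Software", "Sound",
    "StillImage", "Text",
})


def classify_type_value(value):
    """Classify a raw type value as 'name', 'iri', or 'other'."""
    value = value.strip()
    if value in DCMI_TYPE_NAMES:
        return "name"   # unqualified DCMI term name, e.g. "StillImage"
    if value.startswith(DCMITYPE_NS) and value[len(DCMITYPE_NS):] in DCMI_TYPE_NAMES:
        return "iri"    # full DCMI Type IRI
    return "other"      # variant labels, other languages, garbage


def tally_type_values(values):
    """Count how many non-empty values fall into each category."""
    return Counter(classify_type_value(v) for v in values if v.strip())


if __name__ == "__main__":
    sample = ["StillImage", DCMITYPE_NS + "Sound", "stillimage", "Foto", ""]
    print(tally_type_values(sample))
```

The matching is deliberately strict, so a label such as "stillimage" counts as "other"; a more forgiving normalization step would shift some of those values into the "name" category.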

baskaufs commented 4 years ago

I just noticed an open pull request that brings up the same concern as Item 4 above: https://github.com/tdwg/dwc/pull/172 . It can be deleted if item 4 is resolved here.

tucotuco commented 4 years ago

Closed pull request #172 without merge. Let's resolve item 4 here. I think we have a consensus that UseWithIRI should go under tdwgutility management and be removed from term_versions.csv. However, I think we also have a consensus that term_versions.csv will no longer be necessary once the full scripting pathway is functional.

peterdesmet commented 4 years ago

Agreed on last sentence in John's comment.

qgroom commented 4 years ago

With regard to the comments on the use of language codes: I recently had the example of a DwC file including vernacular names in Icelandic. I recommended to the authors that they tag the language of the whole resource as Icelandic, so that it was clear what language the vernacular names were in. Having said that, everything else in the resource and its metadata is either in English or Latin. You might say that they could have used the vernacular names extension, but if that is the case then DwC files would always have to be a combination of English and Latin, except for the data in the vernacular names extension. The language term is therefore a rather blunt instrument for labeling the resource, and perhaps we should be making Darwin Core more flexible, prescriptive and accurate about the languages that are used.

baskaufs commented 4 years ago

Moved checkbox 4 to a separate issue at https://github.com/tdwg/dwc/issues/266 so that this issue could be closed.

baskaufs commented 4 years ago

All checkboxes ticked