Closed baskaufs closed 4 years ago
Regarding the technical difference involving `http://dublincore.org/usage/terms/history/#rightsT-001`: Currently the "Normative Document" says that `http://dublincore.org/usage/terms/history/#license-002` replaces `http://dublincore.org/usage/terms/history/#license-001`. This is correct as far as the assertions made by Dublin Core are concerned. According to the Dublin Core historical record, `license-002` replaced `license-001` in 2008.
However, the TDWG Executive Decision http://rs.tdwg.org/decisions/decision-2014-11-06_17 deprecated the Dublin Core rights term in favor of the Dublin Core license term in 2014. So from the standpoint of Darwin Core, `license-002` replaced `rightsT-001`, not `license-001`. (`license-001` was never part of DwC.) The authoritative Dublin Core metadata will still make the assertion currently in the "Normative Document", but I think the replacement made by the Exec in 2014 is the one that actually needs to be included in the metadata provided by TDWG. That's consistent with the way we use `dcterms:replaces` in every other circumstance, and it would be necessary information for any application trying to use machine-readable metadata from TDWG to interpret old DwC datasets that used terms that are currently deprecated.
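To make the practical difference concrete, here is a minimal sketch of how an application might follow `dcterms:replaces` assertions to find the current version of a deprecated term. The two dictionaries and the function name `current_version_for` are hypothetical; they only illustrate the two alternative assertions discussed above and are not taken from any TDWG tooling.

```python
# Hypothetical replacement tables. Each key is a term version and each value is
# the version it replaces (the direction of dcterms:replaces).
DC_BASE = "http://dublincore.org/usage/terms/history/#"

# The assertion made by Dublin Core (license-002 replaces license-001).
dublin_core_replaces = {DC_BASE + "license-002": DC_BASE + "license-001"}

# The assertion that would reflect the 2014 TDWG Executive decision.
tdwg_replaces = {DC_BASE + "license-002": DC_BASE + "rightsT-001"}


def current_version_for(old_version, replaces):
    """Follow the replacement chain forward from an old version to the newest one."""
    # Invert the mapping so we can look up "what replaced this version?".
    replaced_by = {old: new for new, old in replaces.items()}
    seen = set()
    version = old_version
    # Stop when nothing replaces the version, or if a cycle is detected.
    while version in replaced_by and version not in seen:
        seen.add(version)
        version = replaced_by[version]
    return version
```

With the Dublin Core table, a dataset that used the old rights term gets no hint that the license term is its successor; with the TDWG table it does.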
A secondary consequence of this is that `http://dublincore.org/usage/terms/history/#rightsT-001` should be included in the "Normative Document" with a status of `deprecated`. It's already there in the `rs.tdwg.org` data, and since deprecated versions have no effect on generating the Quick Reference Document, adding it to the "Normative Document" should not cause any problems.
Regarding the technical difference involving `http://dublincore.org/usage/terms/history/#language-007` vs. `http://dublincore.org/usage/terms/history/#languageT-001`: `http://dublincore.org/usage/terms/history/#languageT-001` is the version for `dcterms:language`, while `http://dublincore.org/usage/terms/history/#language-007` is the most recent version for `dc:language`.
The current description in the Quick Reference Guide (generated from the Normative Document) provides the following guidance for using `dcterms:language`: "Recommended best practice is to use RFC 5646 as a controlled vocabulary." with examples "`en` (for English), `es` (for Spanish)". However, the examples are inconsistent with the usage of the `dc:` and `dcterms:` namespaces outlined in Table 3.3 of the DwC RDF Guide, which specifies that `dcterms:` terms should have non-literal (i.e. IRI) values and that `dc:` terms should be used with string literals. If `dcterms:language` is actually used, its value (as recommended by the RDF guide) should be a MARC ISO 639-2 language IRI. This recommendation is also consistent with the recommendation for use of `dcterms:language` given by Audubon Core.
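The distinction can be sketched as two N-Triples statements, one with a string literal for `dc:language` and one with an IRI for `dcterms:language`. This is only an illustration: the subject IRI is a made-up example, the helper `language_triples` is hypothetical, and the `http://id.loc.gov/vocabulary/iso639-2/` base is my assumption about the form of the LOC language IRIs.

```python
DC = "http://purl.org/dc/elements/1.1/"   # dc: namespace (string literal values)
DCTERMS = "http://purl.org/dc/terms/"     # dcterms: namespace (IRI values)
# Assumed base for the LOC ISO 639-2 language IRIs.
LOC_ISO639_2 = "http://id.loc.gov/vocabulary/iso639-2/"


def language_triples(subject_iri, language_tag, three_letter_code):
    """Return N-Triples lines showing literal vs. non-literal language values."""
    # dc:language takes a plain string literal such as "en".
    literal = f'<{subject_iri}> <{DC}language> "{language_tag}" .'
    # dcterms:language takes an IRI identifying the language.
    non_literal = (
        f"<{subject_iri}> <{DCTERMS}language> "
        f"<{LOC_ISO639_2}{three_letter_code}> ."
    )
    return [literal, non_literal]


for line in language_triples("http://example.org/occurrence/1", "en", "eng"):
    print(line)
```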
The other somewhat problematic issue is that Audubon Core recommends the following for `dc:language`: "Language(s) of resource itself represented in the ISO639-2 three-letter language code. ISO639-1 two-letter codes are permitted but deprecated." This recommendation to use three-letter codes is somewhat at odds with the DwC examples showing two-letter codes.
The problem here is that historically, the practice for using several of the `dcterms:` terms has been incorrect with respect to the recommendations of Dublin Core, i.e. using `dcterms:` terms whose values are recommended to be non-literal with literal values. How do we deal with this without breaking existing implementations? Also, how do we make usage in Darwin Core consistent with usage in Audubon Core?
The rs.tdwg.org repo metadata currently shows `dcterms:language` as `recommended` and `dc:language` as also `recommended` but replacing `dcterms:language`. That's a kind of weird way to indicate that we should stop using `dcterms:language` in the wrong way, but that it's OK to use either of them with the correct kind of value. However, the Normative Document (and therefore the Quick Reference Guide) says nothing at all about `dc:language` and just tells people to use `dcterms:language` incorrectly.
I'm not sure how this should be fixed. How often is a value for `dcterms:language` even provided as part of metadata? What would be broken if we just told people to use `dc:language` and `dcterms:language` correctly? In the Quick Reference Guide, `dc:language` could be categorized in the record-level terms section and `dcterms:language` could be put down in the `UseWithIRI` section.
The technical issue involving `http://dublincore.org/usage/terms/history/#type-006` is similar to the issue described in the previous comment. The Normative Document describes `dcterms:type` (`http://dublincore.org/usage/terms/history/#typeT-001`) and ignores `dc:type` (`http://dublincore.org/usage/terms/history/#type-006`). The examples listed in the QRG are again string literals rather than IRIs, which is also incorrect usage for a `dcterms:` term.
In this case, it seems clear to me that this should be changed so that `dc:type` is prescribed for use with string literals. The reason is that the kinds of resources described by `type` are media resources, and given that Audubon Core is the media vocabulary of TDWG, its recommendations should take precedence over DwC. (The AC recommendations give the correct usage for `dc:type` and `dcterms:type`. See the term list for details.)
The rs.tdwg.org repo metadata currently has `dcterms:type` as deprecated with `dc:type` as its designated replacement. I think this makes sense because I'm not sure that there is a compelling reason why very many people would want to provide an IRI value for the term when they are probably inclined to use one of the string values. People who are really interested in RDF or linked data and want to use IRI values will probably want to use `rdf:type` instead of `dcterms:type` anyway. See Section 3.1 of the RDF guide for more on use of `rdf:type`.
We should consider a course of action here from the standpoint of stability and not "breaking" applications. How often is `dcterms:type` used outside the Audubon Core media extension? Does the AC media extension give correct advice on using the two terms? I presume so, and if that's true, then it doesn't seem like fixing this in DwC would have very severe consequences.
Regarding point 4, `http://rs.tdwg.org/dwc/terms/attributes/UseWithIRI-2017-10-06`: it is used by the build script to group the terms following that line under the section https://dwc.tdwg.org/terms/#usewithiri
I agree that it would be better to remove it and adapt the build script (which we might have to do anyway for https://github.com/tdwg/dwc/issues/251). It would be best to record this as a separate issue.
If the build script is to be redone and the term_versions.csv file is to continue to be used for the time being to build the QRG, then `http://rs.tdwg.org/dwc/terms/attributes/UseWithIRI-2017-10-06` could just be left. My primary concern is that if people continue to look to the "normative document" (term_versions.csv) to learn precisely what is and is not part of Darwin Core, it should agree exactly with the vocabulary metadata that is found elsewhere. If we stop referring to it as the go-to place for people to learn what is in the standard, then the presence of `http://rs.tdwg.org/dwc/terms/attributes/UseWithIRI-2017-10-06` isn't much of a problem.
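A check that the two sources "agree exactly" could look something like the sketch below. It assumes both sources can be exported as CSV text with one row per term version; the column names `iri` and `status` and both function names are assumptions for illustration, not a description of the actual build scripts.

```python
import csv
import io


def version_status_map(csv_text, iri_column="iri", status_column="status"):
    """Map each version IRI to its status (column names are assumed)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[iri_column]: row[status_column] for row in reader}


def compare_sources(normative, metadata):
    """Report versions missing from either source and versions whose status differs."""
    only_in_normative = sorted(set(normative) - set(metadata))
    only_in_metadata = sorted(set(metadata) - set(normative))
    status_mismatch = sorted(
        iri for iri in set(normative) & set(metadata)
        if normative[iri] != metadata[iri]
    )
    return only_in_normative, only_in_metadata, status_mismatch
```

An empty result in all three lists would mean the Normative Document and the vocabulary metadata list the same versions with the same statuses.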
If you want to see what's in the http://rs.tdwg.org/dwc/terms/attributes/ namespace now, you can browse to its URI. You'll see that it contains not only terms used in Darwin Core. There are now a bunch of other "made up" terms that we use in Audubon Core to indicate term properties like whether they are repeatable or not, plus new TDWG-wide terms like `tdwgutility:Vocabulary`. So the namespace `http://rs.tdwg.org/dwc/terms/attributes/` is definitely not really a part of Darwin Core any more, despite the form of its URI. That's why we've been referring to it as the `tdwgutility:` namespace.
> I'm not sure how this should be fixed. How often is a value for `dcterms:language` even provided as part of metadata? What would be broken if we just told people to use `dc:language` and `dcterms:language` correctly? In the Quick Reference Guide `dc:language` could be categorized in the record-level terms section and `dcterms:language` could be put down in the `UseWithIRI` section.
As of the 2020-04-09 snapshot of GBIF, the language term is filled in for 216998286 (15.4% of) Occurrence records, and of those, 138804401 (64%) are two-letter language codes and 1245724 (0.6%) are three-letter language codes. The rest are fully spelled out names of languages or garbage that is not supposed to be in that field. There appear to be no IRIs in that field, and I cannot find any authority that gives IRIs for languages, though I don't have access to the actual standards ISO 639-2 or ISO 639-3. The documentation for dcterms:language even allows for a string literal if it is a language tag. In any case, the intention for every term borrowed from Dublin Core was that it be the one for the string literal, and only late in the game was it realized that dcterms: was not it. They really all should be dc:, with corresponding dcterms: versions for the DwC IRI section.
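For anyone wanting to repeat this kind of tally on another snapshot, a rough sketch follows. It is not the query used for the numbers above; it only sorts values by shape (two letters, three letters, IRI, anything else) and does not check that a code is actually assigned in ISO 639. The function names are hypothetical.

```python
import re
from collections import Counter

TWO_LETTER = re.compile(r"^[A-Za-z]{2}$")
THREE_LETTER = re.compile(r"^[A-Za-z]{3}$")
IRI = re.compile(r"^https?://", re.IGNORECASE)


def classify_language_value(value):
    """Classify a raw language value by its shape only (no code validation)."""
    value = value.strip()
    if IRI.match(value):
        return "iri"
    if TWO_LETTER.match(value):
        return "two_letter"
    if THREE_LETTER.match(value):
        return "three_letter"
    return "other"


def tally(values):
    """Count how many values fall into each shape category."""
    return Counter(classify_language_value(v) for v in values)
```

Dividing each count by the total number of non-empty values gives percentages comparable to the ones quoted above.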
One of the issues with the various `dcterms:` namespace terms is the lack of clear guidance on what authoritative source of IRIs should be used as values. What I suggest in this case is to make the DwC recommendation for `dcterms:language` consistent with Audubon Core. The recommendation there is to use the LOC IRIs for the ISO 639-2 three-letter codes. The LOC has several sets of controlled vocabularies for languages and I'm not sure why Bob Morris and the AC task group chose that one over the others (I wasn't on that team and as the review manager I didn't have any reason to question their choice). I think it was because the three-letter codes allowed for more variants than the two-letter codes that are in more common use. In any case, it's a fait accompli, and since we don't have a clearly better option, I would suggest that we just be consistent with AC.
Ultimately, it would be good to have some kind of JSON or JSON-LD file that related those IRIs to the two-letter codes so that the string values could be mapped to the IRIs. That hasn't happened in AC because nobody has yet indicated that they cared enough for us to expend the energy to do it. But for now, just having a vocabulary to recommend is probably good enough.
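As a rough idea of what such a lookup might look like, here is a sketch with a two-entry JSON object keyed by two-letter code. No such file exists yet as far as I know; the JSON layout, the function names, and the exact form of the LOC IRIs are all assumptions for illustration.

```python
import json

# Hypothetical lookup file content: two-letter code -> assumed LOC ISO 639-2 IRI.
LOOKUP_JSON = """
{
  "en": "http://id.loc.gov/vocabulary/iso639-2/eng",
  "es": "http://id.loc.gov/vocabulary/iso639-2/spa"
}
"""


def load_lookup(text):
    """Parse the JSON lookup and normalize its keys to lower case."""
    return {code.lower(): iri for code, iri in json.loads(text).items()}


def language_iri(code, lookup):
    """Return the IRI for a two-letter code, or None when it is not in the lookup."""
    return lookup.get(code.strip().lower())


lookup = load_lookup(LOOKUP_JSON)
print(language_iri("EN", lookup))
```

Returning `None` for unknown values leaves it to the caller to decide what to do with spelled-out language names and other strings that are not codes.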
> We should consider a course of action here from the standpoint of stability and not "breaking" applications. How often is `dcterms:type` used outside the Audubon Core media extension? Does the AC media extension give correct advice on using the two terms? I presume so, and if that's true, then it doesn't seem like fixing this in DwC would have very severe consequences.
As of the 2020-04-09 snapshot of GBIF, the type term is filled in for 228614731 (16.3% of) Occurrence records, and of those, 118583281 (51.9%) are legal unqualified DCMI term names and 1372635 (0.6%) are valid DCMI IRIs. The rest are labels with various kinds of orthography and in various languages, or garbage.
So, again the better choice based on usage is the original intention, which is dc:type.
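The categories used in that count can be sketched as follows. This is an illustration rather than the query actually run against GBIF: it treats a value as a legal unqualified term name only if it exactly matches one of the twelve DCMI Type Vocabulary class names, and as a valid IRI only if it is one of those names in the `http://purl.org/dc/dcmitype/` namespace. The function name is hypothetical.

```python
DCMITYPE_NS = "http://purl.org/dc/dcmitype/"

# The twelve classes of the DCMI Type Vocabulary.
DCMI_TYPE_NAMES = {
    "Collection", "Dataset", "Event", "Image", "InteractiveResource",
    "MovingImage", "PhysicalObject", "Service", "Software", "Sound",
    "StillImage", "Text",
}


def classify_type_value(value):
    """Classify a raw type value as a DCMI term name, a DCMI IRI, or other."""
    value = value.strip()
    if value in DCMI_TYPE_NAMES:
        return "term_name"
    if value.startswith(DCMITYPE_NS) and value[len(DCMITYPE_NS):] in DCMI_TYPE_NAMES:
        return "dcmi_iri"
    # Misspellings, different capitalization, translations, and everything else.
    return "other"
```

The match is case-sensitive on purpose, so a value such as `stillimage` lands in the "other" group along with the other orthographic variants.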
I just noticed an open pull request that brings up the same concern as Item 4 above: https://github.com/tdwg/dwc/pull/172 . It can be deleted if item 4 is resolved here.
Closed pull request #172 without merge. Let's resolve item 4 here. I think we have a consensus that UseWithIRI should go under tdwgutility management and be removed from term_versions.csv. However, I think we also have a consensus that term_versions.csv will no longer be necessary once the full scripting pathway is functional.
Agreed on last sentence in John's comment.
With regard to the comments on the use of language codes: I recently had the example of a DwC file including vernacular names in Icelandic. I recommended to the authors that they tag the language of the whole resource as Icelandic, so that it was clear what language the vernacular names were in. Having said that, everything else in the resource and its metadata is in either English or Latin. You might say that they could have used the vernacular names extension, but if that is the case then DwC files would always have to be a combination of English and Latin, except for the data in the vernacular names extension. The language term is therefore a rather blunt instrument for labeling the resource, and perhaps we should be making Darwin Core more flexible, prescriptive and accurate about the languages that are used.
Moved checkbox 4 to a separate issue at https://github.com/tdwg/dwc/issues/266 so that this issue could be closed.
All checkboxes ticked
As of 2020-07-09, the metadata in the master branch of rs.tdwg.org differs from the Normative Document in the following ways:
1. `flags` column found in the Normative Document (also discussed in a separate issue). If the 2018-09-06 versions are real, then two versions should be generated. If not, then only the 2017-10-06 version should be generated.
2. `http://dublincore.org/usage/terms/history/#language-007`, `http://dublincore.org/usage/terms/history/#type-006`, and `http://dublincore.org/usage/terms/history/#rightsT-001`. See this replacement table for details. It is not clear whether this difference would have any effect on the generation of the Quick Reference Guide.
3. `recommended` term having version `http://rs.tdwg.org/dwc/terms/version/accordingTo-2009-01-21`. That version is missing completely from the Normative Document. That term was a legacy property that had subproperties of the form `dwc:xAccordingTo`. When semantics were removed from the basic DwC "bag of terms", this term became irrelevant. I don't think it was ever actually used for anything. Nevertheless, I believe that it should be retained in the historical record, but have its status changed from `recommended` to `deprecated`.
4. `http://rs.tdwg.org/dwc/terms/attributes/UseWithIRI` is used in the "Normative Document" to categorize terms that should be grouped together in the UseWithIRI section. However, that class itself is NOT part of Darwin Core (and doesn't appear on the Quick Reference Guide itself). The `tdwgutility:UseWithIRI` term and all the other ones in that namespace were moved out of Darwin Core so they could be managed more nimbly as needed. Can we remove `http://rs.tdwg.org/dwc/terms/attributes/UseWithIRI-2017-10-06` as a row in the "Normative Document" without breaking anything? It doesn't belong there.