tdwg / applecore

Darwin Core guidelines for herbaria

datasetName & datasetID #37

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Initially we thought that datasetID could not be used, as there is no registry 
for this. But just like for the institutionID, we could use the GBRDS, which 
issues a UUID per dataset (called "resource"), e.g. 
http://gbrds.gbif.org/browse/agent?uuid=ada5d0b1-07de-4dc0-83d4-e312f0fb81cb

1. I think it is a suitable value for datasetID. However, one can only provide 
it after the dataset has been published.

2. Should we use the UUID or the full link (see issue 28)?

3. Should the ID provided in datasetID apply to the same unit as the name 
provided in datasetName (just like for collectionCode/ID and 
institutionCode/ID)? I'm asking because we currently advise using datasetName 
to indicate the name of the subcollection (if any), which is definitely NOT the 
same unit as the whole dataset.
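
To make question 2 concrete, here is a minimal Python sketch of converting between the bare UUID and the full link. The URL pattern is copied from the GBRDS example above; the function names `uuid_to_link` and `link_to_uuid` are hypothetical and are not part of any GBIF tool.

```python
import uuid
from urllib.parse import parse_qs, urlsplit

# URL pattern taken from the GBRDS example link in this issue.
GBRDS_AGENT = "http://gbrds.gbif.org/browse/agent?uuid={uuid}"


def uuid_to_link(value: str) -> str:
    """Build the full GBRDS link from a bare UUID (hypothetical helper)."""
    # uuid.UUID() raises ValueError if the string is not a valid UUID.
    return GBRDS_AGENT.format(uuid=uuid.UUID(value))


def link_to_uuid(link: str) -> str:
    """Extract the UUID from a full GBRDS link (hypothetical helper)."""
    values = parse_qs(urlsplit(link).query).get("uuid", [])
    if len(values) != 1:
        raise ValueError(f"expected exactly one uuid parameter in {link!r}")
    return str(uuid.UUID(values[0]))
```

Either form can be derived from the other as long as the link pattern stays stable, so the choice is mostly about which one is more useful to the person or machine reading the record.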

Original issue reported on code.google.com by peter.de...@gmail.com on 12 Nov 2011 at 8:19

GoogleCodeExporter commented 9 years ago
For all Canadensys datasets I have decided to populate datasetName and 
datasetID as follows:

datasetName: "University of British Columbia Herbarium (UBC) - Vascular Plant 
Collection"

1. We populate this term with the title of the dataset, as recorded in eml.xml 
metadata. It's also the name that appears on IPT: 
http://data.canadensys.net/ipt/resource.do?r=ubc-vascular-specimens
2. It follows the definition 
(http://rs.tdwg.org/dwc/terms/index.htm#datasetName): "The name identifying the 
data set from which the record was derived." Once you download the dataset and 
merge it with other records or use it for something, datasetName does indeed 
identify the dataset from which the record was derived: the original published 
dataset on IPT.
3. Two collections already populated this term, to indicate the subcollection. 
For UBC, that subcollection is its own dataset on IPT, so the name still 
indicates the subcollection. For MT, "Marie-Victorin Herbarium (MT)" now 
replaces "vascular plants" & "bryophytes". Those subcollections use the same 
running number, database and dataset, so there is really no point in indicating 
the subcollection.
4. Users can use datasetName to cite the dataset. It will be recommended as 
such in the Canadensys norms.

datasetID: "http://data.canadensys.net/ipt/archive.do?r=ubc-vascular-specimens"

1. We populate this term with the URL of the dataset.
2. It follows the definition 
(http://rs.tdwg.org/dwc/terms/index.htm#datasetID): "An identifier for the set 
of data. May be a global unique identifier or an identifier specific to a 
collection or institution." The URL is indeed an identifier for the dataset, and 
it is even globally unique. You can click the URL and get more information, so it 
is actionable. As for persistence: it will work until the Canadensys domain 
disappears or the publisher decides to remove the dataset. Any link from the 
GBIF registry or a data paper will also be broken when that happens, so we have 
that problem anyway.
3. Right now, this URL is a better alternative than the non-clickable UUID 
issued by GBIF. It can also be provided BEFORE the dataset is published (the 
GBIF UUID is only issued AFTER publication). If at some point GBIF issues DOIs 
for datasets, those can be used, but they will probably point to the same URL.
4. The URL in datasetID and the name in datasetName apply to the same unit (see 
point 3 in the issue description). This is now consistent with 
collectionCode/ID and institutionCode/ID.
5. Users can use the datasetID to cite the dataset. It will be recommended as 
such in the Canadensys norms.
6. Currently, this link is not included in the eml.xml metadata, so the user 
cannot find their way back to the original dataset. This is a bug: 
http://code.google.com/p/gbif-providertoolkit/issues/detail?id=833. Even if it 
is included in the eml.xml, it is still useful to have it for each record, 
especially if those records are merged with other records by an aggregator.
7. I'm wondering if I should point to a) the human-readable page: 
http://data.canadensys.net/ipt/resource.do?r=ubc-vascular-specimens or b) the 
archive itself: 
http://data.canadensys.net/ipt/archive.do?r=ubc-vascular-specimens. a) is 
easier for a human to consume when he/she clicks the link in a citation, but b) 
can be used for datasets not published by IPT, can be consumed by a machine and 
is already used by the GBRDS. Advice would be welcome.
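
As an illustration of the scheme above, here is a minimal Python sketch that stamps every record of one dataset with the same datasetName and datasetID. It is not the actual Canadensys workflow: the helper `add_dataset_terms` and the dict-based record format are assumptions, and only the two constant values come from this comment.

```python
from typing import Dict, Iterable, List

# Example values taken from this comment (UBC vascular plant dataset).
DATASET_NAME = (
    "University of British Columbia Herbarium (UBC) - Vascular Plant Collection"
)
DATASET_ID = "http://data.canadensys.net/ipt/archive.do?r=ubc-vascular-specimens"


def add_dataset_terms(
    records: Iterable[Dict[str, str]], dataset_name: str, dataset_id: str
) -> List[Dict[str, str]]:
    """Return copies of the records with datasetName and datasetID set.

    Hypothetical helper: both terms describe the same unit (the published
    dataset), so every record gets the same pair of values.
    """
    return [
        {**record, "datasetName": dataset_name, "datasetID": dataset_id}
        for record in records
    ]
```

Because the values travel with each record, they survive a merge with records from other datasets, which is the case point 6 is concerned with.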

What do you think? Does all of this make sense?
Should we update the AppleCore guidelines to use this recommendation?

Original comment by peter.de...@gmail.com on 29 Mar 2012 at 2:49

GoogleCodeExporter commented 9 years ago
To elaborate on point 7: our datasets will include the datasetID for all 
records: it is the URL to the dataset and I will advise people to include it in 
their citation. The question is, what flavour do I point to?

A. The resource: http://data.canadensys.net/ipt/resource.do?r=acad-specimens
+ human readable
+ best user experience for someone clicking on the link in the citation
+ crawl-able by a machine
+ includes links to other flavours (eml, archive, rtf)
+ used in data papers
- not immediately readable by a machine
- cannot be used by non-IPT datasets
- not used by the GBRDS

B. The archive: http://data.canadensys.net/ipt/archive.do?r=acad-specimens
+ readable by a machine (e.g. by http://tools.gbif.org/dwca-validator/)
+ includes everything in one package: data & metadata
+ can be used for non-IPT datasets
+ also used in data papers
+ used by the GBRDS
- user probably doesn't expect a file download when clicking a link in a 
citation
- no easy way to consume metadata for a human (is XML)
- cannot be used for metadata only datasets

C. The metadata: http://data.canadensys.net/ipt/eml.do?r=acad-specimens
+ readable by a machine (is XML)
+ can be used for metadata only datasets
+ used by the GBRDS
- no easy way to consume metadata for a human (is XML)
- no link to other flavours (this is a serious issue!)
- cannot be used by non-IPT datasets, unless the publisher publishes that file 
in addition to the archive
- not used in data papers
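
For reference, the three flavours differ only in the action part of the URL. A minimal Python sketch, assuming the URL layout seen in the links above holds for every resource (the names `FLAVOURS` and `flavour_url` are made up for this example):

```python
from urllib.parse import urlencode

# Base URL as it appears in the links in this thread.
IPT_BASE = "http://data.canadensys.net/ipt"

# Action names as they appear in the three example URLs.
FLAVOURS = {
    "resource": "resource.do",  # A: human-readable page
    "archive": "archive.do",    # B: the archive (data and metadata)
    "metadata": "eml.do",       # C: EML metadata (XML)
}


def flavour_url(short_name: str, flavour: str, base: str = IPT_BASE) -> str:
    """Build the URL of one flavour for a resource short name."""
    if flavour not in FLAVOURS:
        raise ValueError(f"unknown flavour: {flavour!r}")
    return f"{base}/{FLAVOURS[flavour]}?{urlencode({'r': short_name})}"
```

Whichever flavour ends up in datasetID, the other two can be derived from it only as long as the dataset stays on an IPT with this layout.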

Original comment by peter.de...@gmail.com on 29 Mar 2012 at 3:56

GoogleCodeExporter commented 9 years ago
C. Apparently there is some sort of a link in the eml, in 
packageId="http://data.canadensys.net/ipt/resource.do?id=acad-specimens/v2". 
There is no direct link to the archive (and data) however.

Original comment by peter.de...@gmail.com on 29 Mar 2012 at 4:02

GoogleCodeExporter commented 9 years ago
Hey Peter, my personal vote would be A. As you point out, it's human readable, 
which for me is a huge plus for non-technical consumers, and it includes 
additional links.

The packageId can be thought of as a versioned identifier for the metadata. 

Original comment by kyle.br...@gmail.com on 30 Mar 2012 at 12:48

GoogleCodeExporter commented 9 years ago
I have been going back and forth between A and B, but now I'm leaning more to 
A. It's the URL I use all the time to link to the archive, because I expect 
much higher human than machine consumption. Also, with A I can track visits 
with Google Analytics.

PS: The packageId doesn't work: 
http://data.canadensys.net/ipt/resource.do?id=acad-specimens/v2 links to the 
latest version of the last visited dataset, not Acadia.

Original comment by peter.de...@gmail.com on 30 Mar 2012 at 12:59

GoogleCodeExporter commented 9 years ago
[D.] A metadata system/protocol with a dataset URI (datasetID) that could do 
content negotiation and lead a machine to a machine readable resource such as C 
(http://data.canadensys.net/ipt/eml.do?r=acad-specimens) [or in the future an 
RDF/XML serialization alternative...?] and a human to a human readable resource 
such as A (http://data.canadensys.net/ipt/resource.do?r=acad-specimens) might 
perhaps be worth contemplating...?

A content negotiation capacity might not need to be very complex. W3C has some 
interesting guidelines for inspiration at: 
http://www.w3.org/TR/2006/WD-swbp-vocab-pub-20060314/

Alternatively, something like RDFa could be embedded into the human readable web 
page at A to give a machine something to read...? [RDFa inside the web page at 
A could perhaps also be a future feature]. So A and [D] might both work 
fine, depending on the human readable response you aim at.

Another issue (at least to think about) is that the IPT might perhaps be 
replaced by another data publishing technology for the same dataset in the 
future - and that providing a more neutral URI as the datasetID might perhaps 
be useful?
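
A rough Python sketch of what such content negotiation could look like, assuming the decision is made on the HTTP Accept header. Everything here is illustrative: the function names, the set of "machine" media types and the fallback to the human page are assumptions, not an existing IPT or GBIF feature.

```python
from typing import List, Tuple

HUMAN_URL = "http://data.canadensys.net/ipt/resource.do?r={name}"  # option A
MACHINE_URL = "http://data.canadensys.net/ipt/eml.do?r={name}"     # option C

# Media types treated as "a machine is asking" (an assumption of this sketch).
MACHINE_TYPES = {"application/xml", "text/xml", "application/rdf+xml"}
HUMAN_TYPES = {"text/html", "application/xhtml+xml"}


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Parse an Accept header into (media type, q) pairs, highest q first."""
    parsed = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            if param.lower().startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0  # treat a malformed q value as "not acceptable"
        parsed.append((media_type, q))
    # sorted() is stable, so equal q values keep the order the client sent.
    return sorted(parsed, key=lambda item: item[1], reverse=True)


def negotiate(accept_header: str, name: str) -> str:
    """Pick the URL to redirect to for a dataset short name."""
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type in MACHINE_TYPES:
            return MACHINE_URL.format(name=name)
        if media_type in HUMAN_TYPES:
            return HUMAN_URL.format(name=name)
    # Nothing recognised (e.g. "*/*" or an empty header): serve the human page.
    return HUMAN_URL.format(name=name)
```

A server behind the neutral URI would then answer with a redirect to the returned URL, so a browser lands on A while a client asking for XML or RDF lands on C.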

Original comment by dag.endresen on 30 Mar 2012 at 1:17

GoogleCodeExporter commented 9 years ago
Hi Dag.

I agree with D and the fact that ideally, I should use a neutral URI (such as a 
DOI), but I am looking for a pragmatic solution I can use right now. In that 
sense A is better than B, because I can always extend it with more 
possibilities (like adding RDFa, as you suggest).

Original comment by peter.de...@gmail.com on 30 Mar 2012 at 1:30

GoogleCodeExporter commented 9 years ago
Hi Peter! Yes, a pragmatic ID to be used now does not need to be the datasetID 
to be used forever. Things evolve and it is a fact that identifiers will also 
evolve. The same dataset can be given a DOI in the future [or a 
content-negotiation-capable URI]. Even this solution could be replaced by a 
more evolved technology further ahead...

So you are probably correct that a pragmatic solution based on what works now 
is the best solution. Then A with the aim to add some RDFa further ahead sounds 
good.

Original comment by dag.endresen on 30 Mar 2012 at 1:53

GoogleCodeExporter commented 9 years ago
Thanks to a suggestion from Tim Robertson, we have chosen a more 
technology-agnostic URL:

http://dataset.canadensys.net/acad-specimens, which redirects to option A: 
http://data.canadensys.net/ipt/resource.do?r=acad-specimens

That way, we have nicer and shorter URLs in the wild and an added layer to 
redirect if we plan to host our datasets somewhere else or if we want to add 
specific redirection (human, machine, etc.). In that sense, these URLs are not 
that different from DOIs. DOIs are still better because you pay a service to 
keep them working indefinitely. We'll try our best to keep ours working as long 
as possible.
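
The redirect layer boils down to a mapping from the short URL to the current hosting location. A minimal Python sketch of that mapping follows; it is not the actual server configuration behind dataset.canadensys.net, and the function name `redirect_target` is made up for this example.

```python
from urllib.parse import urlsplit

DATASET_HOST = "dataset.canadensys.net"
# Current target (option A). Changing this one template is all that is
# needed if the datasets move to another host or publishing tool.
IPT_RESOURCE = "http://data.canadensys.net/ipt/resource.do?r={name}"


def redirect_target(url: str) -> str:
    """Map a technology-agnostic dataset URL to where it currently lives."""
    parts = urlsplit(url)
    name = parts.path.strip("/")
    if parts.hostname != DATASET_HOST or not name or "/" in name:
        raise ValueError(f"not a dataset URL: {url!r}")
    return IPT_RESOURCE.format(name=name)
```

The URL that appears in records and citations never changes; only the template on the right-hand side does.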

Original comment by peter.de...@gmail.com on 30 Mar 2012 at 6:55

GoogleCodeExporter commented 9 years ago
Hi Peter, I think that the redirection from a more technology neutral URI is a 
very good solution.

Original comment by dag.endresen on 31 Mar 2012 at 4:49

GoogleCodeExporter commented 9 years ago
Canadensys now has DOIs for its IPT datasets, e.g. 
http://dx.doi.org/10.5886/g7j6gct1, assigned via DataCite.

Original comment by davidpsh...@gmail.com on 13 Dec 2012 at 4:50

GoogleCodeExporter commented 9 years ago
It is nice to see that you have come to a solution for this for Canadensys. I'm 
curious, though, what you think the identifier actually identifies. To me it 
seems that it does not identify a data set, but rather the access point for 
the latest version of data published from a particular source. The data sets one 
can find at that location are mutable over time, and as such the identification 
of them is a tricky business. I would be careful to be very explicit when 
telling consumers what the DOI really refers to because of this.

Original comment by gtuco.bt...@gmail.com on 13 Dec 2012 at 12:44