Expand `@dataset` or introduce new type to allow for citing of parts/subsets of datasets

The entry type @dataset is already used by repositories such as Zenodo (example), and it seems it will also be adopted by Dataverse (according to this issue).

More information on the entry type @dataset was already given in #880.

However, recommendations on data citation also advise that when only parts of a dataset are used, those parts should be cited specifically:

- Data Citation Synthesis Group: Joint Declaration of Data Citation Principles. (2014). Martone M. (ed.) San Diego CA: FORCE11. DOI: 10.25490/a97f-egyk
7. Specificity and Verifiability
Data citations should facilitate identification of, access to, and verification of the specific data that support a claim. Citations or citation metadata should include information about provenance and fixity sufficient to facilitate verifying that the specific timeslice, version and/or granular portion of data retrieved subsequently is the same as was originally cited.
- Ball, A. & Duke, M. (2015). How to Cite Datasets and Link to Publications. DCC How-to Guides. Edinburgh: Digital Curation Centre. Available online: https://www.dcc.ac.uk/guidance/how-guides/cite-datasets (see the comments under Data citation for authors > Granularity)

Dataverse already provides citation examples for individual files and for the collections/datasets that contain them, e.g. DOI:10.7910/DVN/EDQQ4O/FKJNCC. In the BibTeX export there, the contained file has the entry type @incollection.

While @incollection works, I do not think it is ideal, as many citation styles explicitly want special treatment for datasets. Since subsets of datasets (e.g. a subfolder) are still datasets, representing them with @incollection would hinder that separate treatment.

Thus, I believe that either @dataset should be expanded or a new entry type @indataset, analogous to @incollection, should be introduced.

In the coming days I could provide suggestions for both cases (expanding @dataset and introducing @indataset) if this is something that might be added to BibLaTeX.

I think that this is a good idea, and since @dataset is not so widely used yet, this is a good opportunity to enhance the data model for it. Please do provide suggestions as to what fields would be appropriate.

Apologies for the long silence.

After looking through examples from other repositories and the recommendations listed in #880, I would suggest this data model:

1. For a whole dataset (@dataset), the record can already be compiled with the mandatory and allowed fields, e.g.:

@dataset{<key>,
title = {<title>},
date = {<date, ISO>},
publisher = {<publisher>},
eprint = {<persistent uri>}, (or DOI)
eprinttype = {<persistent eprint/uri type, e.g. hdl>},
url = {<url>},
urldate = {<date of access, ISO>},
editor = {<editors>},
author = {<creators/authors>},
(also editora, editoratype etc. can be used if necessary)
version = {<version>},
language = {<language>},
keywords = {<keywords>},
abstract = {<abstract/description>}
}

The only thing I miss here is a field for a hash such as sha1: or, as stated in #880, UNF:. While this is not something that will be required for citation, it is quite useful for future reference in the bibliography database (similar to keywords, abstract etc.). Currently I am resorting to entering that information into note, but in the long run I think a dedicated field might be better.
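To make the note workaround concrete, here is a minimal Python sketch of how such a prefixed hash could be produced for a file's contents. The helper names `fingerprint` and `note_field` are hypothetical and only for illustration; nothing here is part of biblatex or of any repository's export.

```python
import hashlib


def fingerprint(data: bytes, algorithm: str = "sha1") -> str:
    """Return '<algorithm>:<hex digest>' for the given bytes.

    Hypothetical helper: the 'sha1:' prefix follows the notation used
    in this thread, not an established biblatex convention.
    """
    digest = hashlib.new(algorithm, data).hexdigest()
    return f"{algorithm}:{digest}"


def note_field(data: bytes) -> str:
    """Format the fingerprint as a line for the existing note field."""
    return f"note = {{{fingerprint(data)}}},"


if __name__ == "__main__":
    # In practice the bytes would come from the dataset file itself.
    print(note_field(b"abc"))
```

If a dedicated field were added later, only `note_field` would need to change; the prefixed value itself could stay the same.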

2. For parts of a dataset, @incollection can be used as follows:

@incollection{<key>,
title = {<title>},
date = {<date, ISO>},
publisher = {<publisher>},
eprint = {<persistent uri>}, (or DOI)
eprinttype = {<persistent eprint/uri type, e.g. hdl>},
url = {<url>},
urldate = {<date of access, ISO>},
author = {<creators/authors>},
(also editora, editoratype etc. can be used if necessary)
version = {<version>},
language = {<language>},
booktitle = {<title of containing dataset>},
bookauthor = {<authors of containing dataset>},
editor = {<editors of containing dataset>},
keywords = {<keywords>},
abstract = {<abstract/description>}
}

Again, a field for a hash would be nice. Also, if the subdataset does not have a dedicated PID or URI, a field to indicate how to query for the subset would be required. This is related to R8 and R9 of the recommendations by the Data Citation WG of the RDA (Rauber, A., Asmi, A., Uytvanck, D. van, & Pröll, S. (2015). Data citation of evolving data: Recommendations of the Working Group on Data Citation (WGDC). DOI: 10.15497/RDA00016).

Using booktitle and bookauthor to store the information of the containing dataset feels a bit odd, but I think it should work. What I am not sure about is whether editor, which for @incollection refers to the containing collection/dataset, is also needed for the subdataset. What I know for sure from the use case I am coming from is that we will need a way to denote editora both for a part of a dataset and for the containing dataset. These two might even differ from each other.

I am attaching a file with some examples compiled from the repository I work with: examples.txt

Since we already have @dataset, I think it would be a bit odd to use @incollection (which is really in a @collection) for something that is "in a @dataset". So at first glance I agree that @indataset would make more sense.

But I'm wondering how exactly we should pull this off. You already mentioned that common field names like booktitle don't quite feel right and that we might get in trouble with the role of editor etc. I'm also worried that an @indataset entry does not necessarily have the same straightforward connection to its parent @dataset as, say, an @incollection has to its parent @collection (I'm guessing one could have several 'nesting levels').
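The nesting concern can be illustrated with a small Python sketch. It assumes, purely for illustration, that every part records a link to its immediate container; the `Entry` class, its fields, and the example keys are hypothetical and not part of any existing data model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    key: str
    title: str
    # Hypothetical link to the immediately containing dataset, if any.
    parent: Optional["Entry"] = None


def containers(entry: Entry) -> List[str]:
    """Return the titles of all containing datasets, innermost first."""
    titles = []
    current = entry.parent
    while current is not None:
        titles.append(current.title)
        current = current.parent
    return titles


# A file inside a subfolder inside a dataset: two nesting levels.
root = Entry("ds", "Whole dataset")
folder = Entry("ds-sub", "Subfolder", parent=root)
file_ = Entry("ds-file", "Single file", parent=folder)

if __name__ == "__main__":
    print(containers(file_))
```

With two levels there are already two container titles to report, whereas a single booktitle-like field has room for only one of them.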

We already discussed UNF and friends in #880, and at the time I wasn't too sure how useful and widely used it would be, but if enough people think it is useful we might as well add something like fingerprint, uniqueid or some such to the data model now. (It might be interesting not only for datasets, in the form of UNFs, but also for software, in the form of the Software Heritage ID.) If we implement this like eprint, we could add a fingerprinttype field upon which we could branch the representation if required. [What I would like to avoid, though, here and generally, is adding all sorts of overly specific fields and entry types to the standard data model/styles that are only useful to a very small audience. I realise that with some things we might have a bit of a chicken-or-egg problem: certain things might not be popular yet because they are not properly supported by the software yet.]

Yes, in the long term a dedicated type for subsets of datasets like @indataset might be the way to go. But I think this would also require some new field names.

Also, this does not have to be solved right now. I pointed to this issue over at Dataverse to get more attention and maybe get another 'user' on board already, to overcome the chicken-or-egg issue. The next thing I am going to do to get more attention and input on @indataset is to see what the RDA has to say about it.

Regarding adding a field like fingerprint: I would very much welcome this already. But implementing it with fingerprinttype might be overkill, because a notation like sha1:98e2c729d79c410b8e1bfd8d46517dbf3c2e49ab or UNF:3:DaYlT6QSX9r0D50ye+tXpA== suffices for the purpose of a reference.
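As a rough illustration of why the prefixed notation may be enough, the following Python sketch recovers the scheme from a single value by splitting at the first colon. The function name `split_fingerprint` is hypothetical, and lowercasing the scheme is an assumption made here for convenience, not something either notation prescribes.

```python
from typing import Tuple


def split_fingerprint(value: str) -> Tuple[str, str]:
    """Split '<scheme>:<rest>' at the first colon.

    Hypothetical helper: the scheme is lowercased for comparison,
    the rest is returned untouched (it may contain further colons).
    """
    scheme, sep, rest = value.partition(":")
    if not sep or not scheme or not rest:
        raise ValueError(f"no scheme prefix in {value!r}")
    return scheme.lower(), rest


if __name__ == "__main__":
    print(split_fingerprint("sha1:98e2c729d79c410b8e1bfd8d46517dbf3c2e49ab"))
    print(split_fingerprint("UNF:3:DaYlT6QSX9r0D50ye+tXpA=="))
```

Under this reading, a separate fingerprinttype field would carry the same information as the prefix, so it would mainly matter if styles needed to branch on the type without parsing the value.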

plk / biblatex

Expand `@dataset` or introduce new type to allow for citing of parts/subsets of datasets #1103