Data availability section

JGilbert-eLife commented 6 years ago

Following the JATS4R recommendations (https://jats4r.org/data-availability-statements), eLife includes a Data availability section. This contains a paragraph detailing where the source data for a work can be found. It may in addition contain details of generated datasets (those produced by the work) and/or previously published datasets (those used by the work).

We would need the ability to add data availability sections and add structured generated and previously published datasets to this.

michael commented 6 years ago

Just looked at the JATS4R recommendations. Any chance we could solve this just as free form text? I have a feeling that creating a separate reference list in the data availability section is overkill. It could be solved with links or if a data set is mentioned in the main reference list, just cite that one.

Melissa37 commented 6 years ago

Hi there

They have to be tagged as references in order to conform to the Force11 Data Citation Principles, which eLife has signed up to: https://www.force11.org/datacitationprinciples

The reason we put them here and not in the reference section is so that we can distinguish between datasets previously created and those generated for the study, and this information is important to send on to Crossref: https://www.crossref.org/blog/data-citations-and-the-elife-story-so-far/ This is the easiest way for us to manage the process from author via editorial to production and publication and retain the information we need to.

The JATS4R recommendation for Data Availability Statements is close to being published and that allows for this type of tagging.

Happy to talk further about this, but there is a lot of history and method involved in what we've set up:-)

michael commented 6 years ago

I see. I can't dig into this deeper now, but maintaining a separate reference list is challenging technically and UI-wise so we should expect that this issue will take more than a few days to implement.

@Melissa37 can we make this a stretch goal for 2.0, so we can discuss more before starting to implement this?

michael commented 6 years ago

We were just talking about this more. We see the use case of internal data (data generated for the study), but instead of tracking it in a data availability section, we would suggest allowing Datasets (CSV + metadata) to be added to a publication. We can provide a user interface for that and include the CSV + metadata in the Dar.

For any external data used, regular references in the main reference list would be suitable.

Melissa37 commented 6 years ago

@michael we need to talk about this further and explain.

The separate reference list is pretty much a requirement, sorry! We've implemented this in our current proofing tool. It's a compromise between editorial requirements and display on the one hand, and XML requirements for distribution and re-use on the other. It's based on many workflows and complies with JATS4R.

Data generated for the study is not added to the publication; it's stored in external databases that we link to and reference. It is much better for data reuse if the data is not hidden as files within a research paper but is treated as an entity in its own right in the correct repository. There are many subject-specific and generalist repositories that we encourage authors to use, and authors often already use them of their own accord before sending their paper to us.

Thanks!

michael commented 6 years ago

@Melissa37 @JGilbert-eLife Can you help me answer these questions so I better understand the problem?

1) What is the purpose of a Data Availability statement?
2) How is this information used later (by readers at the display, and by machines for data retrieval)?
3) Wouldn't it be beneficial to have data sets included in the Dar in the future? (we could still support external references but in many cases the data could just sit within the publication)

Melissa37 commented 6 years ago

Hi @michael

I have copied and pasted answers directly from the text available in this article: https://www.nature.com/articles/sdata2018259 for points 1 and 3.

What is the purpose of a Data Availability statement?

DASs provide a statement about where data supporting the results reported in a published article can be found, including, where applicable, unique identifiers linking to publicly archived datasets analyzed or generated during the study. In addition, DASs can increase transparency by providing a reason why data cannot be made immediately available (such as the need for registration, due to ethical or legal restrictions, or because of an embargo period). Some research funders, including Research Councils UK, require data availability statements to be included in publications so it is an important element of a publisher’s data policy. It is recommended that publicly available datasets referred to in DASs are also cited in reference lists.

How is this information used later (by readers at the display, and by machines for data retrieval)?

This blog post should help explain: https://www.crossref.org/blog/why-data-citation-matters-to-publishers-and-data-repositories/

And this one: https://www.crossref.org/blog/data-citation-what-and-how-for-publishers/

Data reuse is an aspiration we are all trying to support, to prevent waste, increase reproducibility, and generally help move the needle on research more. It is happening now, but we want to promote it further. By putting data in reputable repositories that are open and mineable, there is more chance it will be re-used than if it is hidden in a small journal like eLife.

Wouldn't it be beneficial to have data sets included in the Dar in the future? (we could still support external references but in many cases the data could just sit within the publication)

Publishers should provide or point to a list of recommended repositories for data sharing. Many publishers already maintain such a list. The Registry of Research Data Repositories (Re3Data, https://www.re3data.org) is a full-scale resource of registered repositories across subject areas. Re3Data provides information on an array of criteria to help researchers identify the ones most suitable for their needs (licensing, certificates & standards, policy, etc.). A list of recommended repositories is provided by FAIRsharing.org, where some publishers also maintain collections of recommended resources. FAIRsharing started out as a resource within the life sciences but has recently expanded and now includes repositories within all disciplines. Where a suitable repository does not exist for a given discipline or subject area, publishers should provide guidance for the use of a general purpose or institutional repository where these meet the recommendations of the repository roadmap [15] (briefly, by providing authors’ datasets with a globally resolvable unique identifier - ideally a DataCite DOI where possible, or other PID, providing a suitable landing page, using open licenses, and ensuring longevity of the resource). Some research funders may stipulate that data must be deposited in a domain-specific repository where possible, which aligns well with publishers providing lists of recommended repositories.

Basically, the Dar can hold the data for a paper, i.e. the data behind the graphs and tables etc., but not huge datasets that might have been generated as part of the work.

eLife is not against the idea of re-visiting the way we present the references to the datasets. However, we are following JATS4R guidelines, which should be supported by Texture. I have a concern that it would limit Texture's use by a wider audience if JATS4R recommendations are not supported.

Happy for us at eLife to discuss this further and re-visit how we do this, but I am concerned that "maintaining a separate reference list is challenging technically and UI-wise" is not a good enough reason to not support something. Some publishers publish two reference lists (a separate data reference list), and our solution is similar to that.

michael commented 5 years ago

Thanks for the thorough explanations. Let me try to summarise the requirements.

So we have:

publicly available citable datasets
- data citations are used (e.g. a figure caption cites a reference in the main reference list)
new datasets produced by the study
- ability to include data right in the publication (CSV inside the Dar, directly reproducible)
- ability to link to external data repository (large datasets)
- metadata about a dataset must be collected and displayed
  - authors: who created the dataset
  - title of the dataset
  - url to repository
  - license
  - publisher (which data repository)
  - ???
data availability statement: provides a textual summary of where to find the data files for this publication (can reference external datasets and internal datasets)
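
A minimal sketch of how the requirements summarised above could be modelled, assuming Python. The `Dataset` class, the `dar_path` field and the `summarise` helper are hypothetical names used for illustration only; they are not part of Texture or the Dar format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Dataset:
    """Metadata for one dataset, using the fields listed in the summary."""
    title: str
    authors: List[str] = field(default_factory=list)  # who created the dataset
    url: Optional[str] = None        # link to an external repository, if any
    license: Optional[str] = None
    publisher: Optional[str] = None  # which data repository hosts it
    dar_path: Optional[str] = None   # hypothetical: path of a CSV bundled in the Dar

    @property
    def is_internal(self) -> bool:
        # A dataset counts as internal when its data ships inside the Dar.
        return self.dar_path is not None


def summarise(datasets: List[Dataset]) -> str:
    """Build a plain-text summary of where each dataset can be found."""
    lines = []
    for ds in datasets:
        if ds.is_internal:
            where = f"included as {ds.dar_path}"
        else:
            where = f"available at {ds.url}"
        lines.append(f"{ds.title}: {where}")
    return "\n".join(lines)
```

The open "???" item in the list is deliberately not modelled; further fields would be added once the metadata set is agreed.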

@Melissa37 is this correct?

michael commented 5 years ago

I actually like the idea of having datasets listed separately from the other citations. However we eventually tag it, this would allow us to show a (maybe visually distinctive) "Datasets" section. There you'd see a complete list of all datasets relevant to the paper, and some of them may even be accessible directly from the Dar. For others you'd follow a link to a repository.

Melissa37 commented 5 years ago

@michael Thanks for thinking this through.

My summary would be:

Data citations should always be captured within an <element-citation> or <mixed-citation>
These references can either be within the main <ref-list> for the article OR within a <ref-list> directly in the <sec sec-type="data-availability"> element OR as <element-citation> or <mixed-citation> elements directly in the <sec sec-type="data-availability"> OR within a sub-level <ref-list> at the end of the article.
For an eLife MVP we'd require 1) within the main <ref-list> for the article and 2) as <element-citation> elements directly in the <sec sec-type="data-availability">
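
To illustrate the locations described above, here is a rough sketch that collects data citations from an article and records whether each sits in the data availability section or in a reference list. It assumes Python's standard `xml.etree.ElementTree` and an article without XML namespaces on these elements; `find_data_citations` is a hypothetical helper name, not an existing API.

```python
import xml.etree.ElementTree as ET

CITATION_TAGS = ("element-citation", "mixed-citation")


def find_data_citations(article_xml: str) -> dict:
    """Group data citations by where they sit in the article."""
    root = ET.fromstring(article_xml)
    found = {"data-availability": [], "ref-list": []}
    seen = set()  # ids of citations already assigned to a bucket

    def collect(container, bucket):
        for el in container.iter():
            if (el.tag in CITATION_TAGS
                    and el.get("publication-type") == "data"
                    and id(el) not in seen):
                seen.add(id(el))
                found[bucket].append(el)

    # Citations directly in the data availability <sec>, or in a
    # <ref-list> nested inside it, are attributed to that section.
    for sec in root.iter("sec"):
        if sec.get("sec-type") == "data-availability":
            collect(sec, "data-availability")

    # Any remaining data citation inside a <ref-list> is attributed to
    # the reference lists (main or sub-level).
    for ref_list in root.iter("ref-list"):
        collect(ref_list, "ref-list")
    return found
```

For the eLife MVP described above, both buckets would be expected to be populated: the main `<ref-list>` and the `<element-citation>` elements directly in the section.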

metadata about a dataset must be collected and displayed:

specific use information (one of 4 options: supporting; generated; analyzed; non-analyzed)
authors: who created the dataset
title of the dataset
dataset location (repository)
year
identifier
who assigns the identifier
type of identifier (eLife has a controlled list of: doi; archive; accession; other)
url to repository

Example:

<element-citation publication-type="data" specific-use="generated">
<name>
<surname>Read</surname>
<given-names>K</given-names>
</name>
<data-title>Sizing the Problem of Improving Discovery and Access to NIH-funded Data: A Preliminary Study (Datasets)</data-title>
<source>Figshare</source>
<year>2015</year>
<pub-id pub-id-type="doi" assigning-authority="figshare" xlink:href="https://doi.org/10.6084/m9.figshare.1285515">https://doi.org/10.6084/m9.figshare.1285515</pub-id>
</element-citation>

We'll require a visual UI component to check the @specific-use as well as @pub-id-type and @assigning-authority, but these are not display items.
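
As a sketch of the kind of check such a component might run behind the scenes, the following Python function validates the three attributes against the value lists given in this thread. `check_data_citation` is a hypothetical name, and the function assumes the fragment is well-formed XML that declares any namespace prefix it uses (such as `xlink`).

```python
import xml.etree.ElementTree as ET

# Value lists taken from the discussion above.
SPECIFIC_USE = {"supporting", "generated", "analyzed", "non-analyzed"}
PUB_ID_TYPES = {"doi", "archive", "accession", "other"}


def check_data_citation(xml_fragment: str) -> list:
    """Return a list of problems found in one <element-citation>."""
    cit = ET.fromstring(xml_fragment)
    problems = []

    if cit.get("publication-type") != "data":
        problems.append("publication-type is not 'data'")

    if cit.get("specific-use") not in SPECIFIC_USE:
        problems.append(f"unexpected specific-use: {cit.get('specific-use')!r}")

    pub_id = cit.find("pub-id")
    if pub_id is None:
        problems.append("missing <pub-id>")
    else:
        if pub_id.get("pub-id-type") not in PUB_ID_TYPES:
            problems.append(f"unexpected pub-id-type: {pub_id.get('pub-id-type')!r}")
        if not pub_id.get("assigning-authority"):
            problems.append("missing assigning-authority")

    return problems
```

An empty list means the citation passed all three checks; a UI could surface each returned string next to the corresponding field.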

I agree with you that having the Data availability statement detail the data available within DAR will be cool!

Thanks Melissa

michael commented 5 years ago

Thank you. I think I see the picture more clearly now. I also found an example with an extensive Data Availability section here: https://elifesciences.org/articles/36495/figures#data that has separate lists of datasets (generated vs previously published).

Small request: Please let's not discuss requirements in JATS4R-terms (these are tagging recommendations, not requirements).

So regarding handling data citations we need the following:

4 types of data citations (supporting; generated; analyzed; non-analyzed)
data citations referenced in the main text should be displayed in the main reference list
data citations not referenced in the main text should be displayed in the Data Availability Section broken down into categories (supporting, generated, ...)
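
The display rule in the list above could be sketched as follows, assuming Python. `DataCitation` and `place_data_citations` are hypothetical names, and the rule shown is only the one summarised here (a citation appears in one place or the other), which is still open to discussion in this thread.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple


@dataclass
class DataCitation:
    ref_id: str
    title: str
    specific_use: str  # e.g. "supporting" or "generated"


def place_data_citations(
    citations: Iterable[DataCitation],
    cited_in_text: Set[str],
) -> Tuple[List[DataCitation], Dict[str, List[DataCitation]]]:
    """Split citations into the main reference list and the
    Data Availability section, the latter grouped by category."""
    main_list: List[DataCitation] = []
    by_category: Dict[str, List[DataCitation]] = defaultdict(list)
    for cit in citations:
        if cit.ref_id in cited_in_text:
            # Referenced in the main text: shown in the main reference list.
            main_list.append(cit)
        else:
            # Not referenced in the text: shown under its category.
            by_category[cit.specific_use].append(cit)
    return main_list, dict(by_category)
```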

Questions:

Is it true that supporting means that the dataset has been published previously?
What does analyzed, non-analyzed mean?
Is it correct that data citations referenced in the main text do not appear in the data availability section? Or should they be in both places then?

Melissa37 commented 5 years ago

Small request: Please let's not discuss requirements in JATS4R-terms (these are tagging recommendations, not requirements).

Fair enough :-) eLife will conform to JATS4R recommendations though, so it's good context for why we might ask for things as we're going with a community decision and not necessarily what we might have chosen on our own. I'll indicate what eLife will do with respect to those recommendations then to give context - is that OK?

4 types of data citations (supporting; generated; analyzed; non-analyzed)

For eLife, we'll only deal with supporting and generated

What does analyzed, non-analyzed mean?

eLife won't use them, but for reference: analyzed - Supporting data that were analyzed (but not generated) for the study. non-analyzed - Supporting data that were not analyzed (not generated) for the study. Bit like a really seminal paper that everyone's read and adds to their citation list but it's not directly relevant.

Is it correct that data citations referenced in the main text do not appear in the data availability section? Or should they be in both places then?

They could be in both places - they should not really be from a pure XML/semantic point of view. The difficulty is that if the author cites the dataset in the text it will be in the main reference list as well as the data section. The main reference list at the end of the article pulls citations from the text, but the data section (for eLife) will not behave in the same way and is separate and isolated.

michael commented 5 years ago

Tracked in requirements document.

substance / texture

Data availability section #557