This sounds great @noamross. There's a ton of work to highlight, and tradeoffs with each of the choices. I'd be happy to help with this if the timing works out.
@noamross & others -- who do you see as the main audience for this? Researchers who have never deposited data in a DOI-granting repository before? Or people who primarily want to start discovering & consuming data from such repos? Or perhaps those who have some experience with either of these but are trying to go to the next level?
I'm certainly biased here, but I think the usual focus is on the first of these (though for good reason!) with almost nothing on the last of these. Not saying to focus on 'advanced users' necessarily, but just wanted to open it up for discussion.
Excellent point @cboettig. IMO people come to our community calls with a range of novice-to-expert perspectives and expectations.
Maybe we could aim to provide brief intro/novice info and focus the balance on more advanced users. And we provide, prior to the call, a short curated list of resources that already address the intro information.
It's important for us/rOpenSci to recognize the audience sector whose unmet needs intersect with what rOpenSci community expertise can provide that others might not.
I'm a little more inclined to aim this towards advanced users, for the sake of prompting the community to participate in building out more tools. Rather than focus on explaining the general idea of data deposition, I'd rather highlight the differences in features and how those interact with workflows and the state of current tooling.
prompting the community to participate in building out more tools
identifying unmet needs sounds cool
potential speaker suggested by Karthik: Daniella Lowenberg @dlowenberg, Product Manager Dryad, Project Lead Make Data Count
I would be happy to talk about the new Dryad on a community call
Thank you @dlowenberg! I'm currently steeped in organizing a call on "maintaining an R package" and will get in touch once that's settled
Karthik and I have been thinking about this recently and have written a paper on the topic of data sharing (pre-print here)
One of the key issues I find not-yet well articulated is the difference between data sharing and data quality. I think these are two separate, but related things.
An analogy to this is the difference between marketing a product, and the quality of a product.
So for data, I think we need to understand the components of data sharing, how they differ, and how they relate to data quality.
@njtierney :clap: This is such a good point, and I think goes to the heart of many challenges we see in this area.
For instance, it is often suggested that having good/rich/machine-readable metadata is key, and indeed, different metadata models are perhaps the biggest difference between repositories (which are otherwise just URLs where you upload & download data, right?). But while there's obviously overlap, some metadata are much more aimed at supporting the 'marketing'/discoverability tasks, while others are aimed at making the data easier to use.
It's also important to know what discoverability model we have in mind. The data provider might only be interested in making data 'discoverable' to people reading their paper -- you read my paper, you see I say 'data on Zenodo at DOI...', and you get it there. That's often at odds with a vision that says we should be able to discover and re-use (some part of) the data without ever reading (or knowing about?) the paper.
Looking forward to this community call!
I'd be interested in where proposed solutions (e.g. Zenodo) fit into the data size vs. access matrix:
|            | public data | private data |
|------------|-------------|--------------|
| small data |             |              |
| large data |             |              |
large data = millions of data entries
I like the matrix you have there @sinarueeger !
One idea I am interested in is how folks see data documentation -- specifically, is a plain data dictionary different/separate from, or the same as, machine-readable metadata? In my mind they were different, but thinking further, I'm not sure if this is entirely true. Do you have any thoughts on this, @cboettig?
@njtierney I agree they are different, but mainly in terms of use cases -- the content should largely be the same albeit represented differently.
There are certainly both machine-readable and human-readable data dictionaries, and you can create the latter from the former. Metadata languages like EML (EML R package) and ISO 19115-2 (geometa R package) provide a machine-readable structure for describing all of the attributes/variables in a data set, along with structured info on their units, etc. (e.g., from the Arctic Data Center: https://doi.org/10.18739/A2RX93D3S).

In EML we can also tie these to well-known semantic measurement-type vocabularies to help mediate the ambiguities of using natural language to name and define the variables that were measured. Researchers often use variable names like 'soil15' when what they are actually measuring is 'soil temperature' in units of degrees Celsius. We map those to well-defined measurement types. We then use that to create both human-readable displays (e.g., on dataset web pages) and searchable indices that use the vocabularies to support precise searches for specific types of measurements (e.g., find all datasets, and only those datasets, that measure flux of photosynthetically active radiation). We have a very brief writeup of our semantic search system for researchers, or you can try it out at https://search.dataone.org.
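To make the machine-readable side of a data dictionary concrete, here is a rough sketch using the EML R package (not a depiction of any particular repository's workflow). The file name, variable, and values are hypothetical, and a complete record needs further required elements (creator, contact, packageId, etc.) before it would validate.

```r
library(EML)

# Hypothetical data dictionary for one column of a CSV file:
# the cryptic name "soil15" gets a definition, unit, and number type.
attributes <- data.frame(
  attributeName       = "soil15",
  attributeDefinition = "Soil temperature at 15 cm depth",
  unit                = "celsius",
  numberType          = "real",
  stringsAsFactors    = FALSE
)

# Convert the plain table into an EML attributeList
attribute_list <- set_attributes(attributes, col_classes = "numeric")

# Embed it in a (deliberately incomplete) dataset record and write EML XML
eml <- list(
  dataset = list(
    title = "Example soil temperature dataset",
    dataTable = list(
      entityName    = "soil.csv",
      attributeList = attribute_list
    )
  )
)
write_eml(eml, "example-eml.xml")

# eml_validate("example-eml.xml") would flag the required elements
# (creator, contact, packageId, ...) still missing from this sketch.
```

The same attribute table could also be rendered as a human-readable data dictionary, which is the sense in which the two representations carry the same content.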
I think another axis to consider is the richness of the metadata model used by a repository. Lots of repositories, including Zenodo, use abbreviated metadata models, mainly because it makes the data deposit process simpler, with a major tradeoff in reusability/interpretability. Others use much richer metadata models that include search and discovery metadata, data structural information, data dictionaries, methodological information, and more. We've been quantitatively comparing the metadata richness of the ~45 DataONE repositories against the FAIR principles, and the differences in metadata richness are huge. Happy to go into details if you'd like.
Relevant rOpenSci peer-reviewed tools
rdryad is a package to interface with the Dryad data repository. Scott, Karthik, Carl
osfr provides a suite of functions for interacting with the Open Science Framework (OSF). Aaron Wolen
arkdb chunks large data from flat text files into lightweight databases like MonetDB or SQLite without running into memory limitations. Carl Boettiger
piggyback allows uploading and downloading larger data files to and from GitHub releases, making it easy for anyone to access data files wherever the script is being run (see the sketch after this list). Boettiger (2018). Managing Larger Data on a GitHub Repository. Journal of Open Source Software, 3(29), 971. https://doi.org/10.21105/joss.00971
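A rough sketch of how a couple of these could slot into a workflow (the repo name, tag, file names, and chunk size below are hypothetical, and pb_upload() assumes a GITHUB_TOKEN with write access to the repository):

```r
library(arkdb)
library(piggyback)

# arkdb: stream a large flat text file into a local SQLite database in chunks,
# so it never has to fit in memory all at once.
db <- DBI::dbConnect(RSQLite::SQLite(), "local.sqlite")
unark("observations.tsv.bz2", db, lines = 50000)
DBI::dbDisconnect(db)

# piggyback: attach the raw file to a GitHub release so anyone running the
# analysis script can fetch it from wherever they are.
pb_upload("observations.tsv.bz2", repo = "user/example-repo", tag = "v0.0.1")

# ...and later, on another machine:
pb_download("observations.tsv.bz2", repo = "user/example-repo",
            tag = "v0.0.1", dest = "data")
```

osfr and rdryad play the analogous role against the OSF and Dryad APIs.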
Back to Noam's comment, what R tools need to be developed?
I'm a little more inclined to aim this towards advanced users, for the sake of prompting the community to participate in building out more tools. Rather than focus on explaining the general idea of data deposition, I'd rather highlight the differences in features and how those interact with workflows and the state of current tooling.
Some resources
New Dryad is Here - post that inspired suggestion of this topic
A Realistic Guide to Making Data Available Alongside Code to Improve Reproducibility - preprint by Tierney & Ram
Colavizza G, Hrynaszkiewicz I, Staden I, Whitaker K, McGillivray B (2020) The citation advantage of linking publications to research data. PLoS ONE 15(4): e0230416.
Sholler D, Ram K, Boettiger C, Katz DS. Enforcing public data archiving policies in academic publishing: A study of ecology journals. Big Data & Society. January 2019. doi:10.1177/2053951719836258
Community Call happening Wed Dec 16, 10-11am Pacific. Panel: Kara Woo, Daniella Lowenberg, Matt Jones, Carl Boettiger, Karthik Ram
Topics: where to deposit data, challenges in data deposition for reuse, where the tools & documentation gaps are, plus lots of Q&A time.
Details & add to your calendar: https://ropensci.org/commcalls/dec2020-datarepos/ Tweet to share: https://twitter.com/rOpenSci/status/1329092004496748545
potentially relevant rOpenSci packages:
Community Call completed with 163 attendees! Video with subtitles, notes doc, resources all at https://ropensci.org/commcalls/dec2020-datarepos/
Follow-up to some of the discussion: "The Dryad and Zenodo teams are proud to announce the launch of our first formal integration."
Topic
The Wild World of Data Repositories
Who is the audience?
Researchers who need or want to share data via a public data repository or use data in repositories.
Why is this important?
Long-term archival data sharing is essential for open science and should be built into project workflows.
What should be covered?
There is an ever-increasing number of data repositories with varying features, requirements, and topic specificity. Some of the more general ones are OSF, Dryad, Zenodo, and Figshare. There are a number of DataONE repositories, including KNB. What are the comparative advantages and disadvantages of each (private sharing, pre-release, storage size, APIs, metadata, discoverability, service/publisher integrations, etc.)? What are the relevant tools and workflows (including R packages) for preparing, depositing, or pulling data from them?
Suggested speakers or contributors
@mbjones (KNB), @aaronwolen (OSF), @karthik or @cboettig? But maybe someone else would be better for an overview.
Resources you would recommend to the audience
Hmmm, this was inspired by Dryad's post on its updated system. I'll think on this.