This sounds great @noamross. There's a ton of work to highlight, and tradeoffs with each of the choices. I'd be happy to help with this if the timing works out.
@noamross & others -- who do you see as the main audience for this? Researchers who have never deposited data in a DOI-granting repository before? Or people who primarily want to start discovering & consuming data from such repos? Or perhaps those who have some experience with either of these but are trying to go to the next level?
I'm certainly biased here, but I think the usual focus is on the first of these (though for good reason!) with almost nothing on the last of these. Not saying to focus on 'advanced users' necessarily, but just wanted to open it up for discussion.
Excellent point @cboettig. IMO people come to our community calls with a range of novice-to-expert perspectives and expectations.
Maybe we could aim to provide brief intro/novice info and focus the balance on more advanced users. And we provide, prior to the call, a short curated list of resources that already address the intro information.
It's important for us/rOpenSci to recognize the audience sector whose unmet needs intersect with what rOpenSci community expertise can provide that others might not.
I'm a little more inclined to aim this towards advanced users, for the sake of prompting the community to participate in building out more tools. Rather than focus on explaining the general idea of data deposition, I'd rather highlight the differences in features and how those interact with workflows and the state of current tooling.
prompting the community to participate in building out more tools
identifying unmet needs sounds cool
potential speaker suggested by Karthik: Daniella Lowenberg @dlowenberg, Product Manager Dryad, Project Lead Make Data Count
I would be happy to talk about the new Dryad on a community call
Thank you @dlowenberg! I'm currently steeped in organizing a call on "maintaining an R package" and will get in touch once that's settled
Karthik and I have been thinking about this recently and have written a paper on the topic of data sharing (pre-print here)
One of the key issues I find not-yet well articulated is the difference between data sharing and data quality. I think these are two separate, but related things.
An analogy to this is the difference between marketing a product, and the quality of a product.
So for data, I think we need to understand the components of data sharing, how they differ, and how they relate to data quality.
@njtierney :clap: This is such a good point, and I think goes to the heart of many challenges we see in this area.
For instance, it is often suggested that having good/rich/machine-readable metadata is key, and indeed, different metadata models are perhaps the biggest difference between repositories (which are otherwise just URLs where you upload & download data, right?). But while there's obviously overlap, some metadata are much more aimed at supporting the 'marketing'/discoverability tasks, while others are aimed at making the data easier to use.
It's also important to know what discoverability model we have in mind. The data provider might only be interested in making data 'discoverable' to people reading their paper -- you read my paper, you see I say 'data on Zenodo at DOI...', and you get it there. That's often at odds with a vision that says we should be able to discover and re-use (some part of) the data without ever reading (or knowing about?) the paper.
Looking forward to this community call!
I'd be interested in where proposed solutions (e.g. Zenodo) fit into the data size vs. access matrix:
|            | public data | private data |
|------------|-------------|--------------|
| small data |             |              |
| large data |             |              |
large data = millions of data entries
I like the matrix you have there @sinarueeger !
One idea I am interested in is how folks see data documentation -- specifically, is a plain data dictionary different/separate from, or the same as, machine-readable metadata? In my mind they were different, but thinking further, I'm not sure if this is entirely true. Do you have any thoughts on this, @cboettig?
@njtierney I agree they are different, but mainly in terms of use cases -- the content should largely be the same albeit represented differently.
There are certainly both machine-readable and human-readable data dictionaries, and you can create the latter from the former. Metadata languages like EML (EML R package) and ISO 19115-2 (geometa R package) provide a machine-readable structure for describing all of the attributes/variables in a data set, along with structured info on their units, etc. (e.g., from the Arctic Data Center: https://doi.org/10.18739/A2RX93D3S).

In EML we can also tie these to well-known semantic measurement-type vocabularies to help mediate the ambiguities of using natural language to name and define the variables that were measured. Researchers often use variable names like 'soil15' when what they are actually measuring is 'soil temperature' in units of degrees Celsius. We map those to well-defined measurement types. We then use that to create both human-readable displays (e.g., on dataset web pages) and searchable indices that use the vocabularies to support precise searches for specific types of measurements (e.g., find all datasets, and only those datasets, that measure flux of photosynthetically active radiation). We have a very brief writeup of our semantic search system for researchers, or you can try it out at https://search.dataone.org.
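To make the machine-readable side of a data dictionary concrete, here is a rough sketch using the EML R package (not a depiction of any particular repository's workflow). The file name, variable, and values are hypothetical, and a complete record needs further required elements (creator, contact, packageId, etc.) before it would validate.

```r
library(EML)

# Hypothetical data dictionary for one column of a CSV file:
# the cryptic name "soil15" gets a definition, unit, and number type.
attributes <- data.frame(
  attributeName       = "soil15",
  attributeDefinition = "Soil temperature at 15 cm depth",
  unit                = "celsius",
  numberType          = "real",
  stringsAsFactors    = FALSE
)

# Convert the plain table into an EML attributeList
attribute_list <- set_attributes(attributes, col_classes = "numeric")

# Embed it in a (deliberately incomplete) dataset record and write EML XML
eml <- list(
  dataset = list(
    title = "Example soil temperature dataset",
    dataTable = list(
      entityName    = "soil.csv",
      attributeList = attribute_list
    )
  )
)
write_eml(eml, "example-eml.xml")

# eml_validate("example-eml.xml") would flag the required elements
# (creator, contact, packageId, ...) still missing from this sketch.
```

The same attribute table could also be rendered as a human-readable data dictionary, which is the sense in which the two representations carry the same content.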
I think another axis to consider is the richness of the metadata model used by a repository. Lots of repositories, including Zenodo, use abbreviated metadata models, mainly because it makes the data deposit process simpler, with a major tradeoff in reusability/interpretability. Others use much richer metadata models that include search and discovery metadata, data structural information, data dictionaries, methodological information, and more. We've been quantitatively comparing the metadata richness of the ~45 DataONE repositories against the FAIR principles, and the differences in metadata richness are huge. Happy to go into details if you'd like.
Relevant rOpenSci peer-reviewed tools
rdryad is a package to interface with the Dryad data repository. Scott, Karthik, Carl
osfr provides a suite of functions for interacting with the Open Science Framework (OSF). Aaron Wolen
arkdb chunks large data from flat text files into lightweight databases like MonetDB or SQLite without running into memory limitations. Carl Boettiger
piggyback allows uploading and downloading larger data files to and from GitHub releases, making it easy for anyone to access data files wherever the script is being run (see the sketch after this list). Boettiger (2018). Managing Larger Data on a GitHub Repository. Journal of Open Source Software, 3(29), 971. https://doi.org/10.21105/joss.00971
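A rough sketch of how a couple of these could slot into a workflow (the repo name, tag, file names, and chunk size below are hypothetical, and pb_upload() assumes a GITHUB_TOKEN with write access to the repository):

```r
library(arkdb)
library(piggyback)

# arkdb: stream a large flat text file into a local SQLite database in chunks,
# so it never has to fit in memory all at once.
db <- DBI::dbConnect(RSQLite::SQLite(), "local.sqlite")
unark("observations.tsv.bz2", db, lines = 50000)
DBI::dbDisconnect(db)

# piggyback: attach the raw file to a GitHub release so anyone running the
# analysis script can fetch it from wherever they are.
pb_upload("observations.tsv.bz2", repo = "user/example-repo", tag = "v0.0.1")

# ...and later, on another machine:
pb_download("observations.tsv.bz2", repo = "user/example-repo",
            tag = "v0.0.1", dest = "data")
```

osfr and rdryad play the analogous role against the OSF and Dryad APIs.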
Back to Noam's comment, what R tools need to be developed?
I'm a little more inclined to aim this towards advanced users, for the sake of prompting the community to participate in building out more tools. Rather than focus on explaining the general idea of data deposition, I'd rather highlight the differences in features and how those interact with workflows and the state of current tooling.
Some resources
New Dryad is Here - post that inspired suggestion of this topic
A Realistic Guide to Making Data Available Alongside Code to Improve Reproducibility - preprint by Tierney & Ram
Colavizza G, Hrynaszkiewicz I, Staden I, Whitaker K, McGillivray B (2020) The citation advantage of linking publications to research data. PLoS ONE 15(4): e0230416.
Sholler D, Ram K, Boettiger C, Katz DS. Enforcing public data archiving policies in academic publishing: A study of ecology journals. Big Data & Society. January 2019. doi:10.1177/2053951719836258
Community Call happening Wed Dec 16, 10-11am Pacific. Panel: Kara Woo, Daniella Lowenberg, Matt Jones, Carl Boettiger, Karthik Ram
Topics: where to deposit data, challenges in data deposition for reuse, where the tools & documentation gaps are, plus lots of Q&A time.
Details & add to your calendar: https://ropensci.org/commcalls/dec2020-datarepos/ Tweet to share: https://twitter.com/rOpenSci/status/1329092004496748545
potentially relevant rOpenSci packages:
Community Call completed with 163 attendees! Video with subtitles, notes doc, resources all at https://ropensci.org/commcalls/dec2020-datarepos/
Follow-up to some of the discussion: "The Dryad and Zenodo teams are proud to announce the launch of our first formal integration."
Topic
The Wild World of Data Repositories
Who is the audience?
Researchers who need or want to share data via a public data repository or use data in repositories.
Why is this important?
Long-term archival data sharing is essential for open science and should be built into project workflows.
What should be covered?
There is an ever-increasing number of data repositories with varying features, requirements, and topic specificity. Some of the more general ones are OSF, Dryad, Zenodo, and Figshare. There are a number of DataONE repositories, including KNB. What are the comparative advantages and disadvantages of each (private sharing, pre-release, storage size, APIs, metadata, discoverability, service/publisher integrations, etc.)? What are the relevant tools and workflows (including R packages) for preparing, depositing, or pulling data from them?
Suggested speakers or contributors
@mbjones (KNB), @aaronwolen (OSF), @karthik or @cboettig? But maybe someone else would be better for an overview.
Resources you would recommend to the audience
Hmmm, this was inspired by Dryad's post on its updated system. I'll think on this.