Open shntnu opened 1 year ago
This looks promising https://dvc.org/doc/user-guide/data-management/managing-external-data
Or the simpler alternatives suggested on that page:
Related conversations with Erin Chu (June 2022)
Shantanu:
How should we acknowledge RODA in future papers, including the one attached? In the past, we've cited a paper linked to the resource, e.g., we deposited images in IDR for this paper and said "We deposited raw and illumination-corrected images to the Image Data Resource (https://idr.openmicroscopy.org/) under accession number idr0080 (Williams et al., 2017)." Williams et al., 2017 is the paper that announced IDR. We are happy to add you to the acknowledgments of course, but I thought it would be a lot more substantial to cite an actual DOI of some sort associated with RODA. Maybe you could consider adding a CITATION.md file to https://github.com/awslabs/open-data-registry and you will be all set (and cited!). But maybe it's only we academics who care about that sort of stuff :)
Erin Chu:
I LOVE the idea of adding citation information to GH; I do get asked this somewhat commonly. It might become a For your reference, we prefer that people use the Registry URL, for example:
"Data are available at registry.opendata.aws/cell-painting." "The Broad Cell Painting Collection was accessed on January 3rd, 2022 from registry.opendata.aws/cell-painting."
This could change in the future as we're considering adding DOIs to datasets for better citability (what are your thoughts on this?), but in the meantime please use the above language, always citing the Registry URL.
Shantanu:
Thanks for clarifying how to cite RODA. We will go with "Data are available at registry.opendata.aws/cell-painting" in most cases until we've figured out DOIs.
Regarding DOIs: it definitely seems the way to go. I can't speak for all of RODA, but for roda/cell-painting, it would be much more useful if we could have separate DOIs for datasets within roda/cell-painting.
In this context, it's worth considering the "data flow" I had in mind.
For each new Cell Painting dataset that we plan to make public, we will _(Update Feb 2023: our current process is here https://github.com/broadinstitute/cellpainting-gallery/blob/main/.github/ISSUE_TEMPLATE/data-immediately-public.md)_
Ideally, there would be a single DOI that somehow links 2,3,4 but I think that will end up being too complicated. We can instead skip DOIs for the BBBC entry (# 2), have IDR (# 3) generate their DOIs using their own process, and then just create a new process for creating RODA (# 4) DOIs. IDR can then include the RODA DOI as metadata (like they already do for publications – see the panel to the right of the screen on https://doi.org/10.17867/10000153, screenshot below)
Discussion outcome -
We used cpg0016 {Chandrasekaran 2023|ZenodoDOI} hosted at AWS Registry of Open Data.
One additional nice thing with Zenodo: you can add a LARGE list of "alternate identifiers" with an even longer list of "how does that alternate identifier relate to this Zenodo object". So linking the Zenodo record to the paper, to the IDR, to the RODA page, to whatever else should be straightforward.
I added, for example, the bioRxiv DOI to the Zenodo archiving of the Nat Prot paper protocol repo.
Turns out Synapse might be a good option for our needs here https://www.perplexity.ai/page/comparing-synapse-and-zenodo-Yo3npXDzSqSFEOf9Ocln3g
Update July 26, 2024: I nixed this idea because it doesn't have any advantage over Zenodo, given that we plan to use manifest files (see next comment)
@afermg and I discussed that using manifest files to version components of the JUMP dataset is the simplest route.
For example, for the "assembled" data (batch corrected, single large parquet file per modality), we will create a CSV file that points to the version of the data that we currently recommend using; this file will be versioned using Zenodo. A script within the repository will produce the CSV file, and a GitHub Action will automate the process of uploading new versions to Zenodo, which will create human-readable version numbers.
This does make things a bit fragmented and non-uniform because we may end up creating manifests that are not standard across datasets. However, this is exactly how we do it in publications – we version specific data components we care about.
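A manifest of the kind described above could be generated along these lines. This is only a minimal sketch, not the script used in the repository: the column names (`subset`, `url`, `etag`), the bucket `example-bucket`, and the entry values are all hypothetical placeholders.

```python
import csv
import hashlib
import io

# Hypothetical column layout; the real manifest may use different fields.
FIELDS = ["subset", "url", "etag"]


def build_manifest(entries):
    """Serialize entries to CSV text with a fixed column order, sorted by subset."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in sorted(entries, key=lambda r: r["subset"]):
        writer.writerow(row)
    return buf.getvalue()


def manifest_digest(text):
    """SHA-256 of the manifest text, usable to detect whether anything changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Placeholder entries (not real objects in any bucket).
entries = [
    {"subset": "orf", "url": "s3://example-bucket/profiles/orf.parquet", "etag": "etag-orf-v2"},
    {"subset": "compound", "url": "s3://example-bucket/profiles/compound.parquet", "etag": "etag-compound-v1"},
]

manifest = build_manifest(entries)
print(manifest)
print(manifest_digest(manifest))
```

Sorting the rows and fixing the column order keeps the output deterministic, so an automated job could compare digests and upload a new Zenodo version only when the recommended data actually changes.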
Note that because s3://cellpainting-gallery has object-level data versioning enabled, we trivially have access to versioning at that level (per object) of granularity.
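As a rough illustration of per-object versioning, S3 accepts a `versionId` query parameter when fetching an object, so a manifest row could pin an exact version. The helper below is a hypothetical sketch: the key and version ID are placeholders, and whether a given version can be fetched anonymously depends on the bucket's policy.

```python
from urllib.parse import quote, urlencode


def versioned_object_url(bucket, key, version_id):
    """Build a virtual-hosted-style HTTPS URL for one specific version of an S3 object."""
    query = urlencode({"versionId": version_id})
    # quote() keeps "/" intact and percent-encodes characters such as spaces.
    return f"https://{bucket}.s3.amazonaws.com/{quote(key)}?{query}"


# Placeholder key and version ID, for illustration only.
url = versioned_object_url(
    "cellpainting-gallery",
    "some/prefix/file name.parquet",
    "EXAMPLE_VERSION_ID",
)
print(url)
```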
h/t to @jessica-ewald who talked me out of going down the rabbit-hole of minting DOIs for each object.
We can achieve something similar using Quilt packages, but we didn't want to introduce new dependencies given that the solution seems relatively straightforward. Still, we should keep Quilt in mind in case we find ourselves adding more "features" to this system of creating manifests.
I'll add notes here about how we've created a citable DOI for https://github.com/jump-cellpainting/datasets as a whole.
I just wish there were some way to update a record created via this process. E.g., https://zenodo.org/records/12983164 was created when I cut this release: https://github.com/jump-cellpainting/datasets/releases/tag/v0.6.0. I later updated the release notes, but the original notes that were copied over to https://zenodo.org/records/12983164 cannot be edited, IIUC.
Just a quick note: I've been trying to code a way to upload new versions of our profile_index.csv to Zenodo, but since they changed the API I'm unable to create new versions of existing datasets. I'm not the only one, by the looks of this: https://github.com/geneontology/pipeline/issues/345. If we are not going to be producing new versions very often, I'd suggest just uploading them manually.
> I've been trying to code a way to update versions for our profile_index.csv to Zenodo but since they changed the API I'm unable to create new versions of existing datasets.
Oh so you can create new datasets, but not update an existing dataset, using their API? But you can do so manually?
So bizarre
Actually, it's taken some work, but I think I found a way to do it. Some parts of the REST API work fine with curl (bash) and others work fine with Python. Combining both, we can get a functional way to automatically re-upload and re-version things :).
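For context, Zenodo's REST API documentation describes a "new version" action on an existing deposition. The sketch below only constructs that request without sending it. It is not the approach that ended up working in this thread (which mixed curl and Python), and the deposition ID and token are placeholders.

```python
import urllib.request

ZENODO_API = "https://zenodo.org/api"


def new_version_request(deposition_id, token):
    """Build (but do not send) a POST request for Zenodo's documented 'newversion' action."""
    url = f"{ZENODO_API}/deposit/depositions/{deposition_id}/actions/newversion"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )


# Placeholder ID and token; nothing is sent over the network here.
req = new_version_request(1234567, "YOUR_TOKEN")
print(req.get_method(), req.full_url)
```

Actually sending it would take something like `urllib.request.urlopen(req)` with a valid token. Given the API trouble reported above, treat this as a starting point to verify rather than a working recipe.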
For our notes: @afermg has now implemented this versioning strategy: https://github.com/jump-cellpainting/datasets/pull/121
I will keep this issue open for a bit in case we want to discuss this topic further.
We do not plan to version the data for now because we haven't thought it through fully (see "Things to keep in mind")
Strawman plan
Level 3 and above data will be versioned using DVC in https://github.com/jump-cellpainting/datasets
Counter-argument
Things to keep in mind
All the data will be released with CC0 1.0 Universal (CC0 1.0). However, please cite the appropriate resources/publications, listed below, when citing individual datasets. For example,