jump-cellpainting / datasets

Images and other data from the JUMP Cell Painting Consortium
BSD 3-Clause "New" or "Revised" License

How do we version the dataset? #126

Open shntnu opened 1 year ago

shntnu commented 1 year ago

We do not plan to version the data for now because we haven't thought it through fully (see "Things to keep in mind")

Strawman plan

Level 3 and above data will be versioned using DVC in https://github.com/jump-cellpainting/datasets

Counter-argument

Things to keep in mind

All the data will be released with CC0 1.0 Universal (CC0 1.0). However, please cite the appropriate resources/publications, listed below, when citing individual datasets. For example,

We used the dataset cpg0000 (Chandrasekaran et al., 2022), available from the Cell Painting Gallery on the Registry of Open Data on AWS (https://registry.opendata.aws/cellpainting-gallery/).

We used the dataset cpg0004 (Way et al., 2022; Natoli et al., 2021), available from the Cell Painting Gallery on the Registry of Open Data on AWS (https://registry.opendata.aws/cellpainting-gallery/).

shntnu commented 1 year ago

This looks promising https://dvc.org/doc/user-guide/data-management/managing-external-data

Or the simpler alternatives suggested on that page:

shntnu commented 1 year ago

Related conversations with Erin Chu Jun, 2022

Shantanu:

How should we acknowledge RODA in future papers, including the one attached? In the past, we've cited a paper linked to the resource, e.g. we deposited images in IDR for this paper and said "We deposited raw and illumination-corrected images to the Image Data Resource (https://idr.openmicroscopy.org/) under accession number idr0080 (Williams et al., 2017)." Williams et al., 2017 is the paper that announced IDR. We are happy to add you to acknowledgments of course, but I thought it would be a lot more substantial to cite an actual DOI of some sort associated with RODA. Maybe you could consider adding a CITATION.md file to https://github.com/awslabs/open-data-registry and you will be all set (and cited!). But maybe it's only we academics who care about that sort of stuff :)

Erin Chu:

I LOVE the idea of adding citation information to GH; I do get asked this somewhat commonly. It might become a For your reference we prefer that people use the Registry URL, for example:

"Data are available at registry.opendata.aws/cell-painting." "The Broad Cell Painting Collection was accessed on January 3rd, 2022 from registry.opendata.aws/cell-painting."

This could change in the future as we're considering adding DOIs to datasets for better citability (what are your thoughts on this?), but in the meantime please use the above language, always citing the Registry URL.

Shantanu:

Thanks for clarifying how to cite RODA. We will go with "Data are available at registry.opendata.aws/cell-painting" in most cases until we've figured out DOIs.

Regarding DOIs: it definitely seems the way to go. I can't speak for all of RODA, but for roda/cell-painting, it would be much more useful if we could have separate DOIs for datasets within roda/cell-painting.

In this context, it's worth considering the "data flow" I had in mind.

For each new Cell Painting dataset that we plan to make public, we will _(Update Feb 2023: our current process is here https://github.com/broadinstitute/cellpainting-gallery/blob/main/.github/ISSUE_TEMPLATE/data-immediately-public.md)_

  1. add a row to https://broad.io/profiling_dataset, a spreadsheet we've been maintaining (updated sporadically right now, but will be more regular once we have streamlined the whole process)
  2. upload all components of the dataset to s3://cellpainting-gallery (images + processed data)
  3. create a page in BBBC, which will have all the narrative around it (e.g. https://bbbc.broadinstitute.org/BBBC021) [Update Feb 2022: we decided not to include this in our process]
  4. submit the dataset to IDR (e.g. http://idr.openmicroscopy.org/webclient/?show=screen-2001 is the IDR entry corresponding to BBBC021; the metadata is available on GitHub at https://github.com/IDR/idr-metadata/tree/master/idr0035-caie-drugresponse and is then ingested into IDR. They now have a different workflow: they create separate repos for each dataset, e.g. search for "idr0080" on the page https://github.com/IDR/idr-metadata and it will point you to a repo for that dataset. Each IDR dataset now has its own DOI, e.g. https://doi.org/10.17867/10000153)

Ideally, there would be a single DOI that somehow links 2, 3, and 4, but I think that will end up being too complicated. We can instead skip DOIs for the BBBC entry (#3), have IDR (#4) generate their DOIs using their own process, and then just create a new process for creating RODA (#2) DOIs. IDR can then include the RODA DOI as metadata (like they already do for publications; see the panel to the right of the screen on https://doi.org/10.17867/10000153).

bethac07 commented 1 year ago

Discussion outcome -

We used cpg0016 {Chandrasekaran 2023|ZenodoDOI} hosted at AWS Registry of Open Data.

bethac07 commented 1 year ago

One additional nice thing with Zenodo: you can add a LARGE list of "alternate identifiers" with an even longer list of "how does that alternate identifier relate to this Zenodo object". So linking the Zenodo record to the paper, to the IDR entry, to the RODA page, and to whatever else should be straightforward.


I added, for example, the bioRxiv DOI to the Zenodo archiving of the Nat Prot paper protocol repo.

https://zenodo.org/record/7267354#.Y-UhZezMI0Q
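
To make the linking idea concrete, here is a minimal sketch of the metadata fragment involved, assuming Zenodo's deposit metadata field `related_identifiers`, where each entry carries an `identifier` and a `relation`. The helper name, the relation chosen for each link, and the pairing of identifiers are illustrative assumptions only; the IDR DOI and the Registry URL are simply the ones quoted earlier in this thread, not a statement about how any JUMP record is actually linked.

```python
import json


def related_identifiers_metadata(links):
    """Build the `related_identifiers` part of a Zenodo metadata payload.

    `links` is an iterable of (identifier, relation) pairs. The relation
    strings are assumed to come from Zenodo's DataCite-style vocabulary
    (for example "isSupplementTo" or "isIdenticalTo").
    """
    entries = []
    for identifier, relation in links:
        if not identifier or not relation:
            raise ValueError("each link needs an identifier and a relation")
        entries.append({"identifier": identifier, "relation": relation})
    return {"related_identifiers": entries}


if __name__ == "__main__":
    # Illustrative pairings only (assumptions, not the project's real links).
    example = related_identifiers_metadata(
        [
            ("10.17867/10000153", "isIdenticalTo"),
            ("https://registry.opendata.aws/cellpainting-gallery/", "isSupplementTo"),
        ]
    )
    print(json.dumps(example, indent=2))
```

The resulting dictionary would be merged into the rest of a record's metadata before submission; nothing here talks to Zenodo.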

shntnu commented 2 months ago

Turns out Synapse might be a good option for our needs here https://www.perplexity.ai/page/comparing-synapse-and-zenodo-Yo3npXDzSqSFEOf9Ocln3g

Update July 26, 2024: I nixed this idea because it doesn't have any advantage over Zenodo, given that we plan to use manifest files (see next comment)

shntnu commented 2 months ago

@afermg and I discussed that using manifest files to version components of the JUMP dataset is the simplest route.

For example, for the "assembled" data (batch corrected, single large parquet file per modality), we will create a CSV file that points to the version of the data that we currently recommend using; this file will be versioned using Zenodo. A script within the repository will produce the CSV file, and a GitHub Action will automate the process of uploading new versions to Zenodo, which will create human-readable version numbers.
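
As a rough illustration of what such a manifest script could look like (this is a sketch, not the script in the repository), the snippet below writes a small CSV with one row per data component. The column names, modality labels, paths, and ETag values are all hypothetical placeholders.

```python
import csv
import io

# Hypothetical column layout; the real manifest may use different fields.
FIELDS = ("modality", "url", "etag")

# Placeholder rows: these paths and ETags do not refer to real objects.
RECOMMENDED = [
    {"modality": "orf", "url": "s3://cellpainting-gallery/<path>/orf.parquet", "etag": "<etag-2>"},
    {"modality": "compound", "url": "s3://cellpainting-gallery/<path>/compound.parquet", "etag": "<etag-1>"},
]


def write_manifest(rows, handle):
    """Write rows to `handle` as CSV, sorted so successive versions diff cleanly."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in sorted(rows, key=lambda r: r["modality"]):
        writer.writerow(row)


def manifest_text(rows):
    """Return the manifest as a string (handy for tests and for CI logs)."""
    buffer = io.StringIO()
    write_manifest(rows, buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(manifest_text(RECOMMENDED), end="")
```

Sorting the rows keeps the output deterministic, so a new Zenodo version differs from the previous one only where a recommended file actually changed.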

This does make things a bit fragmented and non-uniform because we may end up creating manifests that are not standard across datasets. However, this is exactly how we do it in publications – we version specific data components we care about.

Note that because s3://cellpainting-gallery has object-level data versioning enabled, we trivially have access to versioning at that level (per object) of granularity.
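
A minimal sketch of how a specific object version could be pinned, assuming the standard S3 `versionId` query parameter on a GET request and a virtual-hosted-style URL. The host form, the key, and the version ID below are illustrative assumptions; a real version ID would have to be looked up from the bucket.

```python
from urllib.parse import quote, urlencode


def versioned_url(bucket, key, version_id):
    """Return an HTTPS URL that requests one specific version of an S3 object.

    Assumes virtual-hosted-style addressing (<bucket>.s3.amazonaws.com).
    """
    path = quote(key, safe="/")  # escape spaces etc., keep path separators
    query = urlencode({"versionId": version_id})
    return f"https://{bucket}.s3.amazonaws.com/{path}?{query}"


if __name__ == "__main__":
    # Hypothetical key and version ID, for illustration only.
    print(versioned_url("cellpainting-gallery", "<path>/profiles.parquet", "<version-id>"))
```

A manifest row could store such a URL (or just the version ID) alongside the plain object path, so that a given manifest version always resolves to the same bytes.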

h/t to @jessica-ewald who talked me out of going down the rabbit-hole of minting DOIs for each object.


We can achieve something similar using Quilt packages, but we didn't want to introduce new dependencies given that the solution seems relatively straightforward. Still, we should keep Quilt in mind in case we find ourselves adding more "features" to this system of creating manifests.

shntnu commented 2 months ago

I'll add notes here about how we've created a citable DOI for https://github.com/jump-cellpainting/datasets as a whole.

I just wish there was some method to update a record created via this process. E.g. https://zenodo.org/records/12983164 was created when I cut the release https://github.com/jump-cellpainting/datasets/releases/tag/v0.6.0. I later updated the release notes, but the original release notes that were copied over to https://zenodo.org/records/12983164 cannot be edited, IIUC.

afermg commented 2 months ago

Just a quick note: I've been trying to code a way to update versions for our profile_index.csv to Zenodo but since they changed the API I'm unable to create new versions of existing datasets. I'm not the only one, by the looks of https://github.com/geneontology/pipeline/issues/345. If we are not going to be producing new versions very often, I'd suggest just uploading them manually.

shntnu commented 2 months ago

I've been trying to code a way to update versions for our profile_index.csv to Zenodo but since they changed the API I'm unable to create new versions of existing datasets.

Oh so you can create new datasets, but not update an existing dataset, using their API? But you can do so manually?

So bizarre

afermg commented 2 months ago

Actually, it's taken some work but I think I found a way to do it. Some parts of the REST API work fine with curl (bash) and some others work fine with Python. Combining both, we can get a functional way to automatically re-upload and re-version things :).
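
For readers unfamiliar with the API in question, the sketch below shows the "new version" action as described in Zenodo's deposit API documentation. It is not the curl-plus-Python approach mentioned above, and given the API changes discussed in this thread it may not behave as documented. The deposition ID and token are hypothetical, and the request is only constructed, never sent.

```python
import urllib.request

ZENODO_API = "https://zenodo.org/api"


def new_version_request(deposition_id, token):
    """Build (but do not send) the POST that asks Zenodo for a new draft version.

    Uses the documented `actions/newversion` endpoint and a Bearer token header.
    """
    url = f"{ZENODO_API}/deposit/depositions/{deposition_id}/actions/newversion"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )


if __name__ == "__main__":
    # Hypothetical ID and token, for illustration only.
    request = new_version_request(1234567, "<access-token>")
    print(request.get_method(), request.full_url)
```

Sending the request (for example with `urllib.request.urlopen`) and then uploading files to the resulting draft would be separate steps, which are the parts the thread reports as unreliable.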

shntnu commented 2 months ago

For our notes: @afermg has now implemented this versioning strategy: https://github.com/jump-cellpainting/datasets/pull/121

I will keep this issue open for a bit in case we want to discuss this topic further.