How do we version the dataset?

shntnu commented 1 year ago

We do not plan to version the data for now because we haven't thought it through fully (see "Things to keep in mind").

Strawman plan

Level 3 and above data will be versioned using DVC in https://github.com/jump-cellpainting/datasets

This will allow versioning of profiles. The latest version of the profiles will also be available directly on S3.
We will use Git-based release management + Zenodo / figshare: https://docs.github.com/en/repositories/archiving-a-github-repository/referencing-and-citing-content
We will not version lower than level 3.

Counter-argument

But even just Level 3 and above will be ~100 GB!
- Each Level 3 file is 18 MB and we will have ~2700 such files: this is ~50 GB
- Level 4a and 4b will probably be another 50 GB
- Maybe we should version only Level 4b, and do so as a single collated parquet file
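
A quick check of the arithmetic above, as a minimal sketch. It uses only the figures quoted in this comment and assumes decimal units (1 GB = 1000 MB); the 50 GB for Level 4a/4b is the rough guess from the list, not a measurement.

```python
# Back-of-the-envelope storage estimate using the figures quoted above.
MB_PER_LEVEL3_FILE = 18   # size of one Level 3 file, from the thread
N_LEVEL3_FILES = 2700     # approximate number of such files, from the thread
LEVEL4_GB = 50            # rough guess for Level 4a + 4b, from the thread


def level3_total_gb(mb_per_file: float, n_files: int) -> float:
    """Total size in GB, assuming decimal units (1 GB = 1000 MB)."""
    return mb_per_file * n_files / 1000


level3_gb = level3_total_gb(MB_PER_LEVEL3_FILE, N_LEVEL3_FILES)
total_gb = level3_gb + LEVEL4_GB
print(f"Level 3: {level3_gb:.1f} GB, Level 3 + 4: {total_gb:.1f} GB")
```

So Level 3 alone comes to roughly 49 GB, and Level 3 and above to roughly 100 GB if the Level 4 guess holds.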

Things to keep in mind

Whatever we do here will also apply to the other datasets in cellpainting-gallery, even if we don't implement it immediately for the other datasets
It's unclear whether we should have the https://github.com/jump-cellpainting/datasets be treated like standard data repo generated using the profiling recipe (e.g., https://github.com/jump-cellpainting/pilot-data-public), or not. Ideally, we do want to treat it the same way. This issue https://github.com/cytomining/profiling-handbook/issues/54 discusses what to version in a standard data repo.
If we decide to release datasets using zenodo/figshare, we should be aware that we might create a confusing set of instructions for citing datasets. Our current instructions are here https://github.com/broadinstitute/cellpainting-gallery#citation, copied below.

All the data will be released with CC0 1.0 Universal (CC0 1.0). However, please cite the appropriate resources/publications, listed below, when citing individual datasets. For example,

We used the dataset cpg0000 (Chandrasekaran et al., 2022), available from the Cell Painting Gallery on the Registry of Open Data on AWS (https://registry.opendata.aws/cellpainting-gallery/).

But if we create a Zenodo entry, we will want people to cite that resource instead (below, Natoli et al., 2021), while still citing the corresponding paper (below, Way et al., 2022) and RODA (below, "cpg0004, available from the Cell Painting Gallery on the Registry of Open Data on AWS (https://registry.opendata.aws/cellpainting-gallery/)").

We used the dataset cpg0004 (Way et al., 2022; Natoli et al., 2021), available from the Cell Painting Gallery on the Registry of Open Data on AWS (https://registry.opendata.aws/cellpainting-gallery/).

A simple tack would be to cite only the Zenodo DOI, but that would mean we would no longer cite the Cell Painting Gallery nor the paper, and that's undesirable (we want to credit both)

shntnu commented 1 year ago

This looks promising https://dvc.org/doc/user-guide/data-management/managing-external-data

Or the simpler alternatives suggested on that page:

shntnu commented 1 year ago

Related conversation with Erin Chu, Jun 2022

Shantanu:

How should we acknowledge RODA in future papers, including the one attached? In the past, we've cited a paper linked to the resource, e.g. we deposited images in IDR for this paper and said "We deposited raw and illumination-corrected images to the Image Data Resource (https://idr.openmicroscopy.org/) under accession number idr0080 (Williams et al., 2017)." Williams et al., 2017 is the paper that announced IDR. We are happy to add you to the acknowledgments of course, but I thought it would be a lot more substantial to cite an actual DOI of some sort associated with RODA. Maybe you could consider adding a CITATION.md file to https://github.com/awslabs/open-data-registry and you will be all set (and cited!). But maybe it's only we academics who care about that sort of stuff :)

Erin Chu:

I LOVE the idea of adding citation information to GH; I do get asked this somewhat commonly. It might become a For your reference, we prefer that people use the Registry URL, for example:

"Data are available at registry.opendata.aws/cell-painting." "The Broad Cell Painting Collection was accessed on January 3rd, 2022 from registry.opendata.aws/cell-painting."

This could change in the future as we're considering adding DOIs to datasets for better citability (what are your thoughts on this?), but in the meantime please use the above language, always citing the Registry URL.

Shantanu:

Thanks for clarifying how to cite RODA. We will go with "Data are available at registry.opendata.aws/cell-painting" in most cases until we've figured out DOIs.

Regarding DOIs: it definitely seems the way to go. I can't speak for all of RODA, but for roda/cell-painting, it would be much more useful if we could have separate DOIs for datasets within roda/cell-painting.

In this context, it's worth considering the "data flow" I had in mind.

For each new Cell Painting dataset that we plan to make public, we will _(Update Feb 2023: our current process is here https://github.com/broadinstitute/cellpainting-gallery/blob/main/.github/ISSUE_TEMPLATE/data-immediately-public.md)_

add a row to https://broad.io/profiling_dataset, a spreadsheet we've been maintaining (updated sporadically right now, but will be more regular once we have streamlined the whole process)
upload all components of the dataset to s3://cellpainting-gallery (images + processed data)
create a page in BBBC, which will have all the narrative around it (e.g. https://bbbc.broadinstitute.org/BBBC021) [Update Feb 2022: we decided not to include this in our process]
submit the dataset to IDR (e.g. http://idr.openmicroscopy.org/webclient/?show=screen-2001 is the IDR entry corresponding to BBBC021; the metadata is available on GitHub at https://github.com/IDR/idr-metadata/tree/master/idr0035-caie-drugresponse and is then ingested into IDR. They now have a different workflow: they create a separate repo for each dataset, e.g. search for "idr0080" on the page https://github.com/IDR/idr-metadata and it will point you to a repo for that dataset. Each IDR dataset now has its own DOI, e.g. https://doi.org/10.17867/10000153)

Ideally, there would be a single DOI that somehow links 2,3,4 but I think that will end up being too complicated. We can instead skip DOIs for the BBBC entry (# 2), have IDR (# 3) generate their DOIs using their own process, and then just create a new process for creating RODA (# 4) DOIs. IDR can then include the RODA DOI as metadata (like they already do for publications – see the panel to the right of the screen on https://doi.org/10.17867/10000153, screenshot below)

bethac07 commented 1 year ago

Discussion outcome -

We used cpg0016 {Chandrasekaran 2023|ZenodoDOI} hosted at AWS Registry of Open Data.

bethac07 commented 1 year ago

One additional nice thing with Zenodo: you can add a LARGE list of "alternate identifiers" with an even longer list of "how does that alternate identifier relate to this Zenodo object". So linking the Zenodo record to the paper, to IDR, to the RODA page, to whatever else should be straightforward.

I added, for example, the bioRxiv DOI to the Zenodo archiving of the Nat Prot paper protocol repo.

https://zenodo.org/record/7267354#.Y-UhZezMI0Q
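
For reference, a minimal sketch of what such links look like in Zenodo deposit metadata, where each entry in `related_identifiers` pairs an `identifier` with a `relation`. The two DOIs are placeholders, the title is made up, and the relation chosen for each link is only one plausible option, not something we have decided on.

```python
import json

# Placeholder identifiers (hypothetical): swap in the real paper/preprint DOIs.
PAPER_DOI = "10.0000/placeholder-paper"
PREPRINT_DOI = "10.0000/placeholder-preprint"
# Registry page URL as quoted earlier in this thread.
RODA_URL = "https://registry.opendata.aws/cellpainting-gallery/"


def build_related_identifiers(paper_doi: str, preprint_doi: str, registry_url: str) -> list:
    """Return related-identifier entries linking a record to other resources."""
    return [
        {"identifier": paper_doi, "relation": "isSupplementTo"},
        {"identifier": preprint_doi, "relation": "isSupplementTo"},
        {"identifier": registry_url, "relation": "isDescribedBy"},
    ]


metadata = {
    "title": "Example dataset record",  # hypothetical title
    "upload_type": "dataset",
    "related_identifiers": build_related_identifiers(PAPER_DOI, PREPRINT_DOI, RODA_URL),
}
print(json.dumps(metadata, indent=2))
```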

shntnu commented 2 months ago

Turns out Synapse might be a good option for our needs here https://www.perplexity.ai/page/comparing-synapse-and-zenodo-Yo3npXDzSqSFEOf9Ocln3g

Update July 26, 2024: I nixed this idea because it doesn't have any advantage over Zenodo, given that we plan to use manifest files (see next comment)

shntnu commented 2 months ago

@afermg and I discussed that using manifest files to version components of the JUMP dataset is the simplest route.

For example, for the "assembled" data (batch corrected, single large parquet file per modality), we will create a CSV file that points to the version of the data that we currently recommend using; this file will be versioned using Zenodo. A script within the repository will produce the CSV file, and a GitHub Action will automate the process of uploading new versions to Zenodo, which will create human-readable version numbers.
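
A minimal sketch of what the manifest-producing script could look like. The column names (`subset`, `url`, `version_id`) and all values are hypothetical placeholders, not the actual layout of our manifest.

```python
import csv
import io

# Hypothetical entries: names, URLs, and version ids are placeholders.
ENTRIES = [
    {"subset": "modality_b", "url": "https://example.org/data/modality_b.parquet",
     "version_id": "PLACEHOLDER_2"},
    {"subset": "modality_a", "url": "https://example.org/data/modality_a.parquet",
     "version_id": "PLACEHOLDER_1"},
]

FIELDNAMES = ["subset", "url", "version_id"]


def write_manifest(entries, handle) -> None:
    """Write one row per recommended file, sorted so diffs between versions stay small."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    for row in sorted(entries, key=lambda r: r["subset"]):
        writer.writerow(row)


buffer = io.StringIO()
write_manifest(ENTRIES, buffer)
manifest_text = buffer.getvalue()
print(manifest_text)
```

Sorting the rows keeps the file deterministic, so a new Zenodo version only differs where a recommended file actually changed.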

This does make things a bit fragmented and non-uniform because we may end up creating manifests that are not standard across datasets. However, this is exactly how we do it in publications – we version specific data components we care about.

Note that because s3://cellpainting-gallery has object-level data versioning enabled, we trivially have access to versioning at that level (per object) of granularity.
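
As a sketch of how a manifest row could pin one exact object version: S3 accepts a `versionId` query parameter on object GET requests. The key and version id below are placeholders, and this assumes the bucket permits anonymous reads of specific versions, which I have not checked.

```python
from urllib.parse import quote, urlencode

BUCKET = "cellpainting-gallery"  # bucket named in this thread


def versioned_object_url(bucket: str, key: str, version_id: str) -> str:
    """Build an HTTPS URL that addresses one specific version of an S3 object."""
    path = quote(key, safe="/")  # keep "/" as the path separator
    query = urlencode({"versionId": version_id})
    return f"https://{bucket}.s3.amazonaws.com/{path}?{query}"


# Placeholder key and version id, for illustration only.
url = versioned_object_url(BUCKET, "path/to/profiles.parquet", "EXAMPLE_VERSION_ID")
print(url)
```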

h/t to @jessica-ewald who talked me out of going down the rabbit-hole of minting DOIs for each object.

We can achieve something similar using Quilt packages, but we didn't want to introduce new dependencies given that the solution seems relatively straightforward. Still we should keep Quilt in mind in case we find ourselves adding more "features" to this system of creating manifests.

shntnu commented 2 months ago

I'll add notes here about how we've created a citable DOI for the https://github.com/jump-cellpainting/datasets repo as a whole

We have a Zenodo-minted DOI for the repo https://github.com/jump-cellpainting/datasets/blob/16ee52ae290a88b6beffa2c32b6ffd3b66f132df/README.md?plain=1#L3
We did this here https://github.com/jump-cellpainting/datasets/pull/47/ and here https://github.com/jump-cellpainting/datasets/pull/45
The Zenodo settings page is here https://zenodo.org/account/settings/github/repository/jump-cellpainting/datasets#

I just wish there was some way to update a record created via this process. E.g. https://zenodo.org/records/12983164 was created when I cut this release https://github.com/jump-cellpainting/datasets/releases/tag/v0.6.0. I then updated the release notes, but the original release notes that were copied over to https://zenodo.org/records/12983164 cannot be edited IIUC.

afermg commented 2 months ago

Just a quick note: I've been trying to code a way to upload new versions of our profile_index.csv to Zenodo, but since they changed the API I'm unable to create new versions of existing datasets. I'm not the only one, by the looks of this https://github.com/geneontology/pipeline/issues/345. If we are not going to be producing new versions very often, I'd suggest just uploading them manually.

shntnu commented 2 months ago

> I've been trying to code a way to update versions for our profile_index.csv to Zenodo but since they changed the API I'm unable to create new versions of existing datasets.

Oh so you can create new datasets, but not update an existing dataset, using their API? But you can do so manually?

So bizarre

afermg commented 2 months ago

Actually, it's taken some work but I think I found a way to do it. Some parts of the REST API work fine with curl (bash) and others work fine with Python. Combining both, we get a functional way to automatically re-upload and re-version things :).
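
For context, a rough sketch of the documented Zenodo flow for versioning an existing record: create a new-version draft, upload the file to the draft's bucket, then publish. This only builds the requests without sending them; it is not the curl + Python combination described above, and given the API trouble mentioned in this thread it may not work as-is. The deposition id, token, and bucket URL are placeholders.

```python
import urllib.request

ZENODO_API = "https://zenodo.org/api"


def _auth(token: str) -> dict:
    """Bearer-token header for the Zenodo API."""
    return {"Authorization": f"Bearer {token}"}


def new_version_request(deposition_id: int, token: str) -> urllib.request.Request:
    """Step 1: ask Zenodo to open a new-version draft of a published record."""
    url = f"{ZENODO_API}/deposit/depositions/{deposition_id}/actions/newversion"
    return urllib.request.Request(url, method="POST", headers=_auth(token))


def upload_request(bucket_url: str, filename: str, payload: bytes, token: str) -> urllib.request.Request:
    """Step 2: put the file into the draft's bucket (URL comes from the draft's links)."""
    return urllib.request.Request(
        f"{bucket_url}/{filename}", data=payload, method="PUT", headers=_auth(token)
    )


def publish_request(draft_id: int, token: str) -> urllib.request.Request:
    """Step 3: publish the draft, which mints the new version."""
    url = f"{ZENODO_API}/deposit/depositions/{draft_id}/actions/publish"
    return urllib.request.Request(url, method="POST", headers=_auth(token))


# Placeholder values (hypothetical); nothing is sent over the network here.
steps = [
    new_version_request(1234567, "TOKEN"),
    upload_request("https://zenodo.org/api/files/BUCKET_ID", "profile_index.csv", b"a,b\n", "TOKEN"),
    publish_request(7654321, "TOKEN"),
]
for req in steps:
    print(req.get_method(), req.full_url)
```

Each request would be sent with `urllib.request.urlopen(req)`; the draft id and bucket URL for steps 2 and 3 have to be read from the JSON response to step 1.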

shntnu commented 2 months ago

For our notes: @afermg has now implemented this versioning strategy: https://github.com/jump-cellpainting/datasets/pull/121

I will keep this issue open for a bit in case we want to discuss this topic further.
