mc2-center / mc2-center-dcc

Data coordination resources for CCKP (and MC2 in general)

Design and test process for creating and sharing files + metadata with Synapse Datasets #71

Closed Bankso closed 1 week ago

Bankso commented 2 months ago

Files and metadata stored or linked in Synapse can be organized into Datasets. When selected for download, Datasets should include the entire set of files and metadata necessary for reuse.

The current strategy for building Datasets looks something like:

  1. Create folders to store each type of file + auxiliary files (QC, run config, etc.)
  2. Upload files into Synapse
  3. Apply metadata annotations to files
  4. Pull files and annotations into Datasets
  5. Create Collections of Datasets

Outputs from steps 4 and 5 would be viable for release on the CCKP. Access restrictions associated with each file type are carried into the Dataset.

Related to #54, we should aim to have folder structure requirements that enable automated creation of per-datatype Datasets and corresponding Collections.
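One way automated per-datatype Dataset creation could work from a strict folder convention: derive Dataset membership directly from paths. A minimal sketch, assuming a hypothetical study/datatype/file layout (not an agreed MC2 standard):

```python
# Hypothetical sketch: group files into candidate per-datatype Datasets
# based on a strict "study/datatype/filename" folder convention.
# The layout and names are assumptions for illustration only.
from collections import defaultdict

def group_by_datatype(paths):
    """Map 'study/datatype/filename' paths to {(study, datatype): [files]}."""
    datasets = defaultdict(list)
    for p in paths:
        study, datatype, filename = p.split("/", 2)
        datasets[(study, datatype)].append(filename)
    return dict(datasets)

paths = [
    "studyA/scRNAseq/sample1.fastq.gz",
    "studyA/scRNAseq/sample2.fastq.gz",
    "studyA/imaging/slide1.ome.tiff",
]
# group_by_datatype(paths) yields two candidate Datasets for studyA:
# one for scRNAseq files, one for imaging files.
```

A convention like this only pays off if it is enforced at upload time; otherwise the grouping step needs annotation-based fallbacks.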

Bankso commented 1 month ago

Composing a structured and generalized process for building Synapse Datasets

Basic process: Organize --> Upload/Annotate --> Create Datasets and Collections --> Release

I'm considering options at each step of the process and trying to define where flexibility is needed vs where we can be strict.

I think an ideal process would look something like this (or at least include some of these steps):

  1. Dataset entities are provisioned, according to sharing plan

    a. I imagine the most common organization will be to create Datasets for each study + assay type + processing level + species. Additional stratification could be performed based on tumor type, tissue type, model/individual, biospecimen, timepoint or timeseries, etc.

    b. Dataset Collections will be composed of primary Datasets

    c. Model/individual/biospecimen information stored as schematic CSVs/Synapse Tables will be uploaded to a shared top-level folder. From there, we can 1) pull the specific file version with the corresponding metadata into a Dataset, 2) run a table query on the schematic manifest Table, create a CSV, upload it to Synapse, and use that in the Dataset, or 3) apply metadata pulled from a Table as file annotations, which would allow them to be captured in the Dataset itself

  2. Contributor uses synapse manifest to create folders in their Synapse project and an upload TSV

  3. Apply a set of minimal file annotations at upload, by 1) manually adding columns to the TSV, 2) enabling the creation of specific TSVs through synapse manifest (maybe based on schematic data models?), or 3) providing template TSVs that can be tacked onto the synapse manifest path/parent TSV

    a. Annotations could be any subset of: Study key, Biospecimen key, Individual key, Model key, Dataset key, assay, processing level/file type, file format, species, tumor type, tissue type, DUO code, etc., depending on the assay type and provisioned Datasets

  4. Have a script or set of scripts that 1) identifies files with matching annotations and adds them to a Dataset alongside their annotations, 2) creates Dataset Collections, 3) mints a DOI (is this possible?) using information from the Study entry matching the Study key, and 4) generates a DatasetView manifest pre-populated with annotations for each Dataset and Collection

  5. At any point, additional metadata can be applied to the files through the DCA. This is important, since we won't have all the models ready at the time data is published. New annotations can be added to Datasets and new versions can be created and released.

  6. Have contributors submit their DatasetView manifest through the DCA as the final step before release. Metadata can be on the CCKP immediately, since permissions will prevent anyone from accessing the data until those permissions are modified

  7. DatasetView submission will be the formal flag to the MC2 Center that the contributor is ready to release the Dataset(s)

    a. MC2 Center will confirm which files/folders should be Restricted Access, Open Access, and Anonymous Access

    b. MC2 Center will help assign appropriate access restrictions (w/ ACT)
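The grouping logic in step 4.1 can be sketched with plain dicts standing in for Synapse file entities. The annotation key names here (studyKey, assay, processingLevel, species) are illustrative assumptions, not the agreed schema:

```python
# Sketch of step 4.1: bucket annotated file records into per-Dataset groups.
# Dicts stand in for Synapse entities; key names are placeholder assumptions.
from collections import defaultdict

DATASET_KEYS = ("studyKey", "assay", "processingLevel", "species")

def provision_datasets(files):
    """Group file records by their Dataset-defining annotations.

    Files missing any grouping key are returned separately for follow-up
    rather than silently dropped.
    """
    datasets, unassigned = defaultdict(list), []
    for f in files:
        try:
            key = tuple(f["annotations"][k] for k in DATASET_KEYS)
        except KeyError:
            unassigned.append(f["id"])
            continue
        datasets[key].append(f["id"])
    return dict(datasets), unassigned

files = [
    {"id": "syn1", "annotations": {"studyKey": "S1", "assay": "scRNAseq",
                                   "processingLevel": "level1", "species": "human"}},
    {"id": "syn2", "annotations": {"studyKey": "S1", "assay": "scRNAseq",
                                   "processingLevel": "level1", "species": "human"}},
    {"id": "syn3", "annotations": {"studyKey": "S1"}},  # incomplete annotations
]
```

Surfacing the unassigned list matters for step 5's workflow: files with incomplete annotations become a to-do for the contributor rather than a silent gap in the released Dataset.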

Bankso commented 1 month ago

The tricky part about the process I defined above is how we apply annotations through schematic/DCA.

Schematic is folder-oriented, which makes it difficult to apply annotations unless all the files you want to annotate are in the same folder. Two ways this could be addressed:

  1. Use a script to reorganize files into individual folders that correspond to Datasets, to which annotations can be applied by schematic/DCA. This could be done as top-level folders OR we could switch to using contentType: dataset annotations on all folders, which would allow us to use nested folders. Annotations would then be pulled into Datasets.
  2. Schematic/DCA could support annotating files listed in a Synapse Dataset
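Option 1 could start from a deterministic mapping of each file's annotations to a per-Dataset folder path, so schematic/DCA only ever annotates one folder per Dataset. A hedged sketch, assuming hypothetical annotation key names and a study/assay_level_species path template:

```python
# Sketch of option 1: derive a per-Dataset folder path from a file's
# annotations, so a reorganization script can move files into one folder
# per Dataset. Key names and the path template are assumptions.
def dataset_folder(annotations):
    """Compute the target folder path for a file, given its annotations."""
    return "{}/{}_{}_{}".format(
        annotations["studyKey"],
        annotations["assay"],
        annotations["processingLevel"],
        annotations["species"],
    )

example = {"studyKey": "S1", "assay": "CUTandRUN",
           "processingLevel": "level2", "species": "human"}
# dataset_folder(example) -> "S1/CUTandRUN_level2_human"
```

Because the mapping is a pure function of annotations, the same function can later regenerate or audit the folder layout without any stored state.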

aclayton555 commented 1 week ago

24-7/8 close-out: the key next step is to figure out the actual dataset annotations. Ongoing discussions on this with the data modelling group, but we currently can't use schematic to do this. So: 1) what is the schema; 2) implementation as JSON; 3) exploration and incorporation of automation; and 4) longer term, how this will look on the portal and what the expected user experience will be.
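For point 2 (implementation as JSON), a purely speculative example of what dataset-level annotations might look like once serialized; every field name and value here is a placeholder pending the data modelling group's actual schema:

```python
# Speculative example of dataset-level annotations serialized as JSON.
# All field names and values are placeholders, not the agreed schema.
import json

dataset_annotations = {
    "datasetKey": "S1_scRNAseq_level1_human",
    "studyKey": "S1",
    "assay": "scRNAseq",
    "processingLevel": "level1",
    "species": "Homo sapiens",
    "duoCode": "DUO:0000007",
}
schema_json = json.dumps(dataset_annotations, indent=2)
```

A flat key-value shape like this would keep the annotations compatible with both Synapse entity annotations and a downstream portal index, but that trade-off is exactly what the schema discussion needs to settle.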

Agree to close this out within this sprint as an exploration, and break out the above into individual longer-term work items. Orion to work on this and onboard Aditya in this effort as a complement to ongoing work in the NF space. Also think about the data-type-specific contextualization for datasets (e.g. a seq dataset may look different from an imaging dataset).

Note: Avoid confusion between our existing "Dataset" component, a "Synapse Dataset", and a "Collection"