mc2-center / mc2-center-dcc

Data coordination resources for CCKP (and MC2 in general)

Develop file structures and scripts for automated metadata extraction #54

Open Bankso opened 6 months ago

Bankso commented 6 months ago

This issue will need to be broken down further, but I wanted to write everything down together, so it can be reviewed in context.

Short description: we should create a system that automatically extracts metadata from files uploaded to contributor Synapse projects.

Goals:

Capturing metadata from data or (seemingly random) processing outputs is a non-trivial task that takes significant time and attention, even for someone familiar with the data type and how it was processed. Tools that extract this information and map it into a data model would lower the time, effort, and expertise required. As we move towards working with large, complex data sets (like spatial profiling and multiplexed imaging), metadata requirements will continue to grow. We should address this now, so we can limit the amount of poorly annotated or unannotated data that gets deposited in repositories.

The two main parts would be the file organization/content requirements and the scripts to extract metadata.

File organization and content requirements

Scripts to extract metadata

For some file types with standardized structure (e.g., OME-TIFF, FASTQ, etc.), automated metadata extraction is an established method, so we can adopt those methods where compatible.
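As a rough illustration of the standardized case, here is a minimal sketch that reads run-level metadata from the first read header of a FASTQ file. It assumes the Illumina (Casava 1.8+) header layout, `@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index`; files from other platforms would need their own parser. The function names and the output keys are hypothetical and are not attributes from our data model.

```python
import gzip


def parse_illumina_header(header: str) -> dict:
    """Parse one Illumina-style (Casava 1.8+) FASTQ header line.

    Assumed layout:
    @instrument:run:flowcell:lane:tile:x:y read:filtered:control:index
    """
    if not header.startswith("@"):
        raise ValueError("FASTQ header lines must start with '@'")
    left, _, right = header[1:].strip().partition(" ")
    fields = left.split(":")
    if len(fields) != 7:
        raise ValueError("not an Illumina-style header")
    # Output keys are hypothetical placeholders, not data model attributes.
    meta = {
        "instrument": fields[0],
        "run_number": int(fields[1]),
        "flowcell_id": fields[2],
        "lane": int(fields[3]),
    }
    read_fields = right.split(":") if right else []
    if len(read_fields) >= 4:
        meta["read_number"] = int(read_fields[0])
        meta["index_sequence"] = read_fields[3]
    return meta


def extract_fastq_metadata(path: str) -> dict:
    """Read only the first header line of a plain or gzipped FASTQ file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return parse_illumina_header(handle.readline())


example = parse_illumina_header("@SIM:1:FCX:1:15:6329:1045 1:N:0:ATCCGA")
```

Reading only the first record keeps this cheap for large files, at the cost of assuming every read in the file comes from the same run and lane.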

Extracting metadata from auxiliary files (e.g., sample sheets, quality control reports, etc.) generated by instruments and various R/Python packages is more difficult, since the file formats and data structures are not necessarily consistent between implementations, but I think this is where we can make the most gains.
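One way to cope with inconsistent auxiliary files is an alias table that maps the column names seen in the wild onto data model attributes. The sketch below does this for a CSV sample sheet. The alias table, the attribute names (`specimen_id`, `index_sequence`), and the function names are all hypothetical, and real sample sheets often have extra header sections that would need to be skipped first.

```python
import csv
import io

# Hypothetical alias table: normalized column name -> data model attribute.
COLUMN_ALIASES = {
    "sample_id": "specimen_id",
    "sampleid": "specimen_id",
    "index": "index_sequence",
    "barcode": "index_sequence",
}


def normalize(name: str) -> str:
    """Lower-case a column name and replace spaces with underscores."""
    return name.strip().lower().replace(" ", "_")


def map_sample_sheet(handle) -> list:
    """Map each row of a CSV sample sheet onto known attributes.

    Columns without an alias are ignored rather than guessed at.
    """
    records = []
    for row in csv.DictReader(handle):
        record = {}
        for column, value in row.items():
            if column is None:  # row has more cells than the header
                continue
            target = COLUMN_ALIASES.get(normalize(column))
            if target and value:
                record[target] = value.strip()
        records.append(record)
    return records


sheet = io.StringIO("Sample ID,Barcode,Operator\nS1,ACGT,jd\n")
records = map_sample_sheet(sheet)
```

Keeping the aliases in data rather than in code means a new instrument or package export can be supported by adding entries, without touching the extraction logic.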

Bankso commented 6 months ago

I wanted to note that we should also integrate data structure and content validation, potentially via DCQC. Some areas where this applies:

In some cases, it may be necessary to include metadata components that are strictly associated with validation. I think this is a reasonable use case, but we should ensure that QC-related metrics are easy to obtain before implementing it.
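As an example of a validation-only component that is cheap to obtain, a checksum and file size can be computed with the Python standard library alone. This is only a sketch of the idea: it does not use DCQC, and the assumption that checksums are among the components we would want is mine. The function name and output keys are hypothetical.

```python
import hashlib
from pathlib import Path


def validation_metadata(path, chunk_size: int = 1 << 20) -> dict:
    """Compute an MD5 checksum and size for a file, reading it in chunks."""
    path = Path(path)
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large files are never loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    # Output keys are hypothetical placeholders, not data model attributes.
    return {"md5": digest.hexdigest(), "size_bytes": path.stat().st_size}
```

Metrics that require parsing the file contents (read counts, image dimensions, etc.) would be more expensive and depend on the file type, which is why it seems worth checking how easy each one is to obtain first.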