Develop file structures and scripts for automated metadata extraction

This issue will need to be broken down further, but I wanted to write everything down together, so it can be reviewed in context.

Short description: we should create a system that automatically extracts metadata from files uploaded to contributor Synapse projects.

Goals:

design, write, and test code that extracts metadata from data/auxiliary files and adds it to manifests
design Synapse project folder structures and file content requirements that support automated metadata extraction

Capturing metadata from data or (seemingly random) processing outputs is a non-trivial task that requires significant time and attention, even when someone is familiar with the data type and how it was processed. Tools that extract this information and map it into a data model would lower the time, effort, and expertise required. As we move towards working with large, complex data sets (like spatial profiling and multiplexed imaging), metadata requirements will continue to grow. Addressing this now should limit the amount of poorly annotated or unannotated data that gets deposited in repositories.

The two main parts would be the file organization/content requirements and the scripts to extract metadata.

File organization and content requirements

define folder structures/file relationships for assay data stored in Synapse projects
define supplemental file content/structure requirements, likely tied to a specific data processing pipeline or method, but as generalizable as possible. These could be expanded over time to fit different approaches/protocols
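
As a rough illustration of what a folder-structure requirement could look like in practice, here is a minimal sketch of a layout check. The `<assay>/<dataset>/{raw,processed,aux}/<file>` convention, the `REQUIRED_SUBFOLDERS` set, and the `missing_subfolders` name are all hypothetical placeholders, not a proposed standard. It works on plain path strings, so it assumes the file listing has already been pulled from the Synapse project by some other means.

```python
from pathlib import PurePosixPath

# Hypothetical convention: <assay>/<dataset>/<subfolder>/<file>
REQUIRED_SUBFOLDERS = {"raw", "processed", "aux"}


def missing_subfolders(paths):
    """Map each dataset folder to the required subfolders it lacks."""
    found = {}
    for p in paths:
        parts = PurePosixPath(p).parts
        if len(parts) < 3:
            continue  # not inside a dataset folder under this convention
        dataset = "/".join(parts[:2])
        subfolders = found.setdefault(dataset, set())
        if len(parts) >= 4:
            # parts[2] is the subfolder that directly contains the file tree
            subfolders.add(parts[2])

    report = {}
    for dataset, subfolders in found.items():
        missing = REQUIRED_SUBFOLDERS - subfolders
        if missing:
            report[dataset] = missing
    return report


if __name__ == "__main__":
    listing = [
        "imaging/ds1/raw/a.tiff",
        "imaging/ds1/processed/a.csv",
        "imaging/ds1/aux/sheet.csv",
        "imaging/ds2/raw/b.tiff",
    ]
    print(missing_subfolders(listing))
```

A check like this could run before extraction, so contributors get feedback on layout problems instead of silently incomplete manifests.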

Scripts to extract metadata

generally, this will be a set of "find, clean, copy" functions that take structured input and pull the info requested for the metadata model
in some cases, metadata will correspond to calculated values. These could be calculated on the fly instead of being extracted
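
To make the "find, clean, copy" idea concrete, here is a minimal sketch that uses a CSV sample sheet as the structured input. The column names, the `field_map` argument, the list of "missing" markers, and the `samplesInBatch` attribute are assumptions made up for illustration; the real metadata model would define the targets.

```python
import csv
import io

# Assumed set of strings that should be treated as "no value"
MISSING_MARKERS = {"na", "n/a", "none"}


def find(rows, sample_id, id_column="sample_id"):
    """Find the row for one sample, or None if it is absent."""
    for row in rows:
        if (row.get(id_column) or "").strip() == sample_id:
            return row
    return None


def clean(value):
    """Strip whitespace and turn empty/missing markers into None."""
    value = (value or "").strip()
    if not value or value.lower() in MISSING_MARKERS:
        return None
    return value


def copy(row, field_map):
    """Copy cleaned values into metadata-model attribute names."""
    return {target: clean(row.get(source)) for target, source in field_map.items()}


def extract(sheet_text, sample_id, field_map):
    rows = list(csv.DictReader(io.StringIO(sheet_text)))
    row = find(rows, sample_id)
    if row is None:
        return None
    record = copy(row, field_map)
    # Example of a value calculated on the fly rather than extracted
    record["samplesInBatch"] = len(rows)
    return record


if __name__ == "__main__":
    sheet = "sample_id,tissue,platform\nS1, lung ,NA\nS2,liver,CODEX\n"
    print(extract(sheet, "S1", {"tissueType": "tissue", "platform": "platform"}))
```

Keeping the three steps as separate functions means a new file type mostly needs a new `find` and a new `field_map`, while `clean` and `copy` can be shared.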

For some file types with standardized structure (e.g., OME-TIFF, FASTQ), automated metadata extraction is already well established, so we can adopt those methods where compatible.
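
As one example for a standardized format, here is a minimal sketch that pulls run-level metadata from the first read header of a FASTQ file. It assumes the common Illumina header layout (`@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index`); headers from other instruments or from tools that rewrite read names will not match and raise an error here. The function name and output keys are hypothetical.

```python
def parse_illumina_header(line):
    """Parse an Illumina-style FASTQ read header into a metadata dict."""
    if not line.startswith("@"):
        raise ValueError("not a FASTQ header line")
    name, _, comment = line[1:].strip().partition(" ")
    fields = name.split(":")
    if len(fields) != 7:
        raise ValueError("header does not match the assumed 7-field layout")

    instrument, run, flowcell, lane = fields[:4]
    meta = {
        "instrument": instrument,
        "runNumber": int(run),
        "flowcellId": flowcell,
        "lane": int(lane),
    }

    # The comment part (read:filtered:control:index) is optional
    comment_fields = comment.split(":") if comment else []
    if len(comment_fields) == 4:
        meta["read"] = int(comment_fields[0])
        meta["index"] = comment_fields[3]
    return meta


if __name__ == "__main__":
    header = "@M00123:45:000000000-A1B2C:1:1101:15589:1332 1:N:0:ATCACG"
    print(parse_illumina_header(header))
```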

Extracting metadata from auxiliary files (e.g., sample sheets, quality control reports) generated by instruments and various R/Python packages is more difficult, since the file formats and data structures are not necessarily consistent between implementations, but I think this is where we can make the most gains.
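
One way to cope with inconsistent column names across implementations is an alias table that maps observed headers onto canonical field names. The sketch below is only an illustration: the `ALIASES` entries and canonical names are invented examples, and a real table would be built up per pipeline or instrument as we encounter new files.

```python
import re

# Hypothetical alias table: canonical field -> normalized header variants
ALIASES = {
    "sample_id": {"sample_id", "sampleid", "sample", "sample_name"},
    "read_count": {"read_count", "reads", "total_reads", "n_reads"},
}


def normalize(name):
    """Lowercase a header and collapse non-alphanumerics to underscores."""
    return re.sub(r"[^a-z0-9]+", "_", name.strip().lower()).strip("_")


def harmonize(row):
    """Return (canonical_fields, unmapped_headers) for one input row."""
    mapped = {}
    unmapped = []
    for key, value in row.items():
        norm = normalize(key)
        for canonical, names in ALIASES.items():
            if norm in names:
                mapped[canonical] = value
                break
        else:
            # No alias matched; keep track so the table can be extended
            unmapped.append(key)
    return mapped, unmapped


if __name__ == "__main__":
    print(harmonize({"Sample ID": "S1", "Total Reads": "1000", "Operator": "x"}))
```

Returning the unmapped headers alongside the result makes it easy to see which columns the alias table does not yet cover.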
