naobservatory / mgs-workflow


Standardized metadata #19

Open jeffkaufman opened 1 month ago

jeffkaufman commented 1 month ago

@simonleandergrimm and I were working on getting the output of the pipeline into something that https://github.com/naobservatory/p2ra can read, and one thing we found missing compared to the v1 pipeline is format standardization. For example, some date fields are formatted dd/mm/yyyy and others yyyy-mm-dd, and panel enrichment is sometimes indicated with 1 vs 0 and other times with "enriched" and "unenriched".

It's much more efficient to clean this up once than to have each consumer handle the variety of formats.
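As a concrete sketch of the kind of standardization meant here, assuming hypothetical field names and the two formats mentioned above (`normalize_date` and `normalize_enrichment` are made up for illustration, not part of any pipeline):

```python
from datetime import datetime

def normalize_date(raw: str) -> str:
    """Accept either dd/mm/yyyy or yyyy-mm-dd; always emit ISO yyyy-mm-dd."""
    for fmt in ("%d/%m/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")

def normalize_enrichment(raw) -> bool:
    """Map the encodings seen in practice (1/0, enriched/unenriched) to one boolean."""
    return {"1": True, "0": False,
            "enriched": True, "unenriched": False}[str(raw).strip().lower()]
```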

willbradshaw commented 2 weeks ago

How would you suggest handling this?

Currently the pipeline doesn't do anything with the metadata; it just passes it through, so it's up to the user how they want to specify this. It's easy enough to provide guidelines in the README, but I'm sceptical about trying to explicitly handle this when (a) it doesn't affect anything in the pipeline itself and (b) there are so many possible metadata fields.

mikemc commented 2 weeks ago

I agree that this doesn't seem like something to be handled within this pipeline. Possibly we're seeing a difference in scope for the v1 and v2 pipelines. The v2 pipeline (this one) just analyzes MGS data at the sample level using sample names, while the v1 pipeline also makes additional use of sample metadata for the dashboard. It seems best to me to leave that out of this repo, as Will has currently done, and leave it up to users to come up with their own metadata standards and ways of using them.

We can then separately discuss how we might benefit from standardizing metadata. My guess is that we all have our own opinions on how to do this, so it isn't worth attempting across teams and projects (though it is important within projects) when it's so easy to change variable encodings or column names with a bit of code.
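For instance, consumer-side cleanup of the kind described here might look like the following (column names and encodings are hypothetical, not the pipeline's actual schema):

```python
import pandas as pd

# Hypothetical consumer-side cleanup: rename columns and recode values
# to this project's own conventions.
metadata = pd.read_csv("metadata.tsv", sep="\t")
metadata = metadata.rename(columns={"sample": "sample_id", "enriched": "panel_enriched"})
metadata["panel_enriched"] = metadata["panel_enriched"].map(
    {1: True, 0: False, "enriched": True, "unenriched": False}
)
```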

jeffkaufman commented 2 weeks ago

> Possibly we're seeing a difference in scope for the v1 and v2 pipelines.

I think that's right; thanks for pointing it out! I thought of the scope of the v1 pipeline as:

  1. Import data into S3 (from SRA or our various partners)
  2. Process the data
  3. Emit metadata in a consistent format
  4. Prepare summary files in a format the dashboard can display

I knew the v2 pipeline wasn't trying to do (4), and I've filed #25 to decide about (1). For (3), the subject of this issue, I think it's valuable to have a central location that we update any time we import data, but it doesn't necessarily need to be in this repo or be seen as part of this pipeline.

One option would be to keep using this part of the v1 pipeline?

mikemc commented 1 week ago

In general I agree with Will's approach of leaving it to the user of the pipeline to prepare and format metadata how they wish, and, more generally, of separating the pipeline (this repo) from applications of the pipeline.

My suggestion would be to have a separate (private) repo that holds the metadata (raw as well as cleaned and standardized), the code used to run the pipeline on all of our shared datasets (including info on pipeline version, config files, and workflow files if these are customized), and the S3 output locations. This is also where I'd suggest putting any code that does items 3 and 4 in Jeff's list.

It could make sense to have Nextflow workflows/processes to do dashboard stuff in this repo, leaving these as an optional step in the main workflow; or to have them defined in the above repo or in a repo with dashboard code. Either way, I think we'll want our own private repo with the info needed to reproduce what we did (pipeline versions, customized configs and workflow.nf files, and input including metadata).

willbradshaw commented 1 week ago

> One option would be to keep using this part of the v1 pipeline?

@jeffkaufman how exactly does the v1 pipeline handle this?

jeffkaufman commented 1 week ago

In the v1 pipeline, each project has a metadata.tsv, and a chunk of code in https://github.com/naobservatory/mgs-pipeline/blob/main/dashboard/sample_metadata_classifier.py interprets the metadata TSV, cleans it up, and adds information that doesn't vary between samples in the project. The result then gets written to https://github.com/naobservatory/mgs-pipeline/blob/main/dashboard/metadata_samples.json

(The same pattern is also used for mgs-restricted)
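A rough sketch of that pattern (project names, fields, and cleanup steps here are illustrative assumptions; the actual logic lives in sample_metadata_classifier.py):

```python
import csv
import json

# Hypothetical per-project information that doesn't vary between samples.
PROJECTS = {"example_project": {"county": "Example County"}}

def clean_sample(project_info: dict, row: dict) -> dict:
    # Project-specific cleanup would go here (date formats, encodings, etc.).
    row["county"] = project_info["county"]
    return row

samples = {}
for project, info in PROJECTS.items():
    with open(f"{project}/metadata.tsv") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            samples[row["sample"]] = clean_sample(info, dict(row))

with open("metadata_samples.json", "w") as f:
    json.dump(samples, f, indent=2)
```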

willbradshaw commented 1 week ago

What cleanup does sample_metadata_classifier.py do? On an initial skim, I'm not seeing anything that would prevent me from adding, say, enrichment as a boolean in one dataset and a numeric in another, or dates in two different formats.

(I'd also really like to avoid this workflow having any hardcoded opinions about specific datasets.)

jeffkaufman commented 1 week ago

In https://github.com/naobservatory/mgs-pipeline/blob/main/dashboard/sample_metadata_classifier.py there's a mapping between dataset-specific IDs and counties, plus date-format cleanups and non-machine-readable notes. In https://github.com/naobservatory/mgs-restricted/blob/main/dashboard/sample_metadata_classifier.py there are additional cleanups and a lot of non-machine-readable information about the specific datasets.
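The ID-to-county mapping, for instance, is essentially a lookup table along these lines (IDs and counties here are made up; see sample_metadata_classifier.py for the real values):

```python
# Illustrative only: the real mappings live in sample_metadata_classifier.py.
COUNTY_BY_SITE_ID = {
    "site_01": "Example County A",
    "site_02": "Example County B",
}

def county_for(site_id: str) -> str:
    return COUNTY_BY_SITE_ID[site_id]
```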

It's not important to me that this be in this workflow in particular, but I do want it to be checked into some centralized location.