hubmapconsortium / ingest-pipeline

Data ingest pipeline(s) for QA/metadata etl/post-processing
MIT License

add OME-TIFF validation to ingest process #133

Closed ngehlenborg closed 3 years ago

ngehlenborg commented 4 years ago

https://docs.openmicroscopy.org/bio-formats/6.5.1/users/comlinetools/xml-validation.html

@ilan-gold: is there anything else that you are checking for?

Current status: validation process checks that OME-TIFF is valid XML.

Next update will check OME-TIFF against the schema.

icaoberg commented 4 years ago

@ngehlenborg @ilan-gold xmlvalid only validates the structure of the OME-TIFF header against a schema.

This is the bare minimum. For example, images that were missing channel names still passed the xmlvalid test. There should be a discussion about fields per assay type and how we can use that information at run time to trigger computations within a DAG.
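To illustrate the gap, here is a minimal sketch of a channel-name check that could run on top of the XML check. It assumes the OME-XML header has already been extracted from the file as a string (for example with tiffcomment); the function name `channels_missing_names` and the sample document are hypothetical, not part of the ingest pipeline.

```python
import xml.etree.ElementTree as ET
from typing import List


def channels_missing_names(ome_xml: str) -> List[str]:
    """Return the IDs of Channel elements with no (or an empty) Name attribute."""
    # ET.fromstring raises ET.ParseError if the header is not well-formed XML.
    root = ET.fromstring(ome_xml)
    missing = []
    for elem in root.iter():
        # Compare local names so the check does not depend on the schema version
        # encoded in the namespace.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "Channel" and not elem.get("Name", "").strip():
            missing.append(elem.get("ID", "<no ID>"))
    return missing


# Hypothetical header: well-formed, but the second channel has no name.
SAMPLE = """<OME xmlns="http://www.openmicroscopy.org/Schemas/OME/2016-06">
  <Image ID="Image:0">
    <Pixels ID="Pixels:0" DimensionOrder="XYZCT" Type="uint16"
            SizeX="4" SizeY="4" SizeZ="1" SizeC="2" SizeT="1">
      <Channel ID="Channel:0:0" Name="DAPI"/>
      <Channel ID="Channel:0:1"/>
    </Pixels>
  </Image>
</OME>"""

if __name__ == "__main__":
    print(channels_missing_names(SAMPLE))
```

A non-empty result could be reported to the submitter as a warning or an error, depending on what is decided per assay type.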

ilan-gold commented 4 years ago

I agree with @icaoberg that we should really be doing more.

  1. Informing the submitters of what their channel names will look like (as a check, like Ivan said, especially if there are none)
  2. Informing them if they leave out physical scale units (since we can display that)
  3. Checking for blank images and warning the submitter of this (Nico has some images with blank channels)
  4. If this tifffile issue with multiple tags continues to be a problem (and it's not a bug), we should overwrite the incoming file (or reject it).
  5. If we really wanted to get fancy, we could find (or try to compile) a list of names for fluorescent stains and check the incoming channel names against a reference list. This will make future cross-dataset work much easier.
  6. We can check that the Z, C, and T sizes in the metadata actually line up with what is in the image.
  7. Provide a warning about large 32-bit images that would need to be pyramidal (due to the visual artifacts we have been seeing).

This is what I can come up with as far as the metadata is concerned.
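Items 2 and 6 could be sketched roughly as follows. This is only an illustration under several assumptions: the OME-XML header is available as a string, the file holds a single image with one sample per channel (so the expected plane count is SizeZ × SizeC × SizeT), and the caller obtains the actual plane count from whatever TIFF reader is in use. The names `check_pixels` and `planes_in_file` are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List


def check_pixels(ome_xml: str, planes_in_file: int) -> List[str]:
    """Return warnings about physical sizes and the Z/C/T plane count."""
    root = ET.fromstring(ome_xml)
    # Take the first Pixels element (assumption: one image per file).
    pixels = next(
        (e for e in root.iter() if e.tag.rsplit("}", 1)[-1] == "Pixels"), None
    )
    if pixels is None:
        return ["no Pixels element found"]

    pid = pixels.get("ID", "<no ID>")
    warnings = []
    for axis in ("X", "Y"):
        if pixels.get("PhysicalSize" + axis) is None:
            warnings.append(f"{pid}: PhysicalSize{axis} is missing")
        elif pixels.get(f"PhysicalSize{axis}Unit") is None:
            # The schema supplies a default unit, so this is a warning only.
            warnings.append(f"{pid}: PhysicalSize{axis} has no explicit unit")

    # SizeZ, SizeC and SizeT are required attributes in schema-valid OME-XML.
    expected = (
        int(pixels.attrib["SizeZ"])
        * int(pixels.attrib["SizeC"])
        * int(pixels.attrib["SizeT"])
    )
    if expected != planes_in_file:
        warnings.append(
            f"{pid}: metadata implies {expected} planes, file has {planes_in_file}"
        )
    return warnings


# Hypothetical header: X size without a unit, no Y size, three channels declared.
SAMPLE = """<OME xmlns="http://www.openmicroscopy.org/Schemas/OME/2016-06">
  <Image ID="Image:0">
    <Pixels ID="Pixels:0" DimensionOrder="XYZCT" Type="uint16"
            SizeX="4" SizeY="4" SizeZ="1" SizeC="3" SizeT="1"
            PhysicalSizeX="0.5"/>
  </Image>
</OME>"""

if __name__ == "__main__":
    for line in check_pixels(SAMPLE, planes_in_file=2):
        print(line)
```

Blank-channel detection and the 32-bit pyramid warning need the pixel data or file size rather than just the header, so they are left out of this sketch.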

ilan-gold commented 4 years ago

Another thing we could add is a way to indicate to people that their data is spatially the same, like with seqFISH, which has a bunch of repeat positions and hybridization cycles.

jswelling commented 4 years ago

See also issue #70

ilan-gold commented 3 years ago

Just a thought @jswelling @ngehlenborg @icaoberg, but one option for this, beyond running tiffcomment or the like, could be to tell people to run the bioformats2raw + raw2ometiff pipeline and check that they can "drag and drop" the output onto Avivator. I find myself doing this a lot anyway, and I have started telling people to do it. We could even automate it further to do the following:

  1. Install bioformats2raw + raw2ometiff via conda as mentioned here:
    conda create --name bioformats python=3.8
    conda activate bioformats
    conda install -c ome bioformats2raw raw2ometiff
  2. Run bioformats2raw + raw2ometiff on one of the input files. It doesn't matter which: I think this is just a sanity check, since people tend to process their files uniformly, so one broken file usually indicates that they are all broken.
  3. Start a simple http server on their computer that serves the OME-TIFF output of the pipeline that was run locally
  4. Open a web browser with Avivator pointing at that http server

This should all be doable within the python ecosystem from what I can tell. Thoughts?
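A rough sketch of steps 2 to 4 using only the standard library, assuming bioformats2raw and raw2ometiff are already on the PATH (step 1). The Avivator address, its `image_url` query parameter, the output file name `check.ome.tiff`, and all function names here are assumptions for illustration. One caveat: as far as I know the stock `http.server` handler does not honour HTTP Range requests, which the viewer may need for large files, so a real tool might need a different server.

```python
import functools
import http.server
import subprocess
import webbrowser
from pathlib import Path
from urllib.parse import quote

# Assumed public Avivator instance; adjust if a different deployment is used.
AVIVATOR_URL = "https://avivator.gehlenborglab.org/"
OUTPUT_NAME = "check.ome.tiff"  # hypothetical output file name


def conversion_commands(src, workdir):
    """Build the two conversion commands (input -> raw directory -> OME-TIFF)."""
    raw_dir = Path(workdir) / "raw"  # should not exist before the run
    out_file = Path(workdir) / OUTPUT_NAME
    return [
        ["bioformats2raw", str(src), str(raw_dir)],
        ["raw2ometiff", str(raw_dir), str(out_file)],
    ]


def avivator_link(file_name, port):
    """Build an Avivator URL pointing at a file served from localhost."""
    local = f"http://localhost:{port}/{quote(file_name)}"
    return f"{AVIVATOR_URL}?image_url={quote(local, safe='')}"


class CORSHandler(http.server.SimpleHTTPRequestHandler):
    """Static file handler that lets a page on another origin fetch the file."""

    def end_headers(self):
        self.send_header("Access-Control-Allow-Origin", "*")
        super().end_headers()


def run_check(src, workdir, port=8000):
    """Convert one input file, serve the result, and open it in Avivator."""
    for cmd in conversion_commands(src, workdir):
        subprocess.run(cmd, check=True)  # stop at the first failing step
    handler = functools.partial(CORSHandler, directory=str(workdir))
    with http.server.ThreadingHTTPServer(("localhost", port), handler) as server:
        webbrowser.open(avivator_link(OUTPUT_NAME, port))
        server.serve_forever()  # runs until interrupted with Ctrl+C
```

Calling `run_check("one_input_file.tif", "scratch_dir")` would then be the whole sanity check from the submitter's side; if the conversion fails or the image does not render, the upload is probably not ready.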

jswelling commented 3 years ago

I believe this is resolved by https://github.com/hubmapconsortium/ingest-validation-tests/pull/6