theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0

TheiaValidate: Compare file contents #264

Closed sam-baird closed 7 months ago

sam-baird commented 9 months ago

Closes (Issue moved to theiagen/theiavalidate, issues 1 and 2).

:hammer_and_wrench: Changes Being Made

This pull request is associated with a pull request in theiagen/theiavalidate to add file content comparisons to TheiaValidate. An array of diff files is added as an output to wf_theiavalidate.wdl. docker is also added as an input variable to task_validate.wdl to allow the user to select a different Docker image. The default image is still "us-docker.pkg.dev/general-theiagen/theiagen/theiavalidate:0.0.1".

Impacted Workflows/Tasks

task_validate.wdl wf_theiavalidate.wdl

:brain: Context and Rationale

A requested feature of TheiaValidate is the ability to compare file contents. Currently, TheiaValidate treats gs:// URIs as plain strings and compares them as such, rather than comparing the contents of the referenced files to see whether they match.
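To illustrate the difference between comparing locations and comparing contents, here is a minimal sketch using local files. It is not the code from the theiavalidate pull request: the helper names `file_digest` and `same_contents` and the file names are hypothetical, and it assumes the referenced files have already been localized to disk.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    sha = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest()


def same_contents(path_a, path_b):
    """True when two local files are byte-for-byte identical."""
    return file_digest(path_a) == file_digest(path_b)


# Two copies of the same report stored under different (hypothetical) paths.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "run1_report.tsv"
    second = Path(tmp) / "run2_report.tsv"
    first.write_text("sample\tresult\nA\tPASS\n")
    second.write_text("sample\tresult\nA\tPASS\n")

    # Comparing the locations as strings reports a mismatch...
    paths_equal = str(first) == str(second)
    # ...while comparing what the files hold reports a match.
    contents_equal = same_contents(first, second)
```

Here the string comparison fails even though the files are identical, which is the behavior this pull request is meant to address.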

:clipboard: Workflow/Task Steps

Details of the implemented steps can be found in the associated pull request in the theiavalidate repo.

Inputs

The inputs are the same as before, except the user can specify "EXACT", "IGNORE", or "SET" validation criteria for columns referencing files inside the validation_criteria.tsv. Any "file columns" are detected automatically inside the workflow and do not need to be specified as such by the user.
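One way automatic detection of file columns could work is sketched below. This is an assumption for illustration, not necessarily how theiavalidate does it: a column counts as a file column when every non-empty cell starts with `gs://`. The function name `detect_file_columns`, the column names, and the bucket path are all hypothetical.

```python
import csv
import io


def detect_file_columns(rows):
    """Return column names whose non-empty cells all start with gs://."""
    all_uris = {}
    for row in rows:
        for name, value in row.items():
            # Empty cells carry no information, so skip them.
            if value is None or value.strip() == "":
                continue
            is_uri = value.startswith("gs://")
            all_uris[name] = all_uris.get(name, True) and is_uri
    return sorted(name for name, ok in all_uris.items() if ok)


# Hypothetical table with one file column and one empty file cell.
table_tsv = (
    "sample\tassembly_length\treport_file\n"
    "s1\t4800000\tgs://example-bucket/s1/report.tsv\n"
    "s2\t4750000\t\n"
)
rows = list(csv.DictReader(io.StringIO(table_tsv), delimiter="\t"))
file_columns = detect_file_columns(rows)
```

In this sketch a column that is entirely empty is never flagged as a file column, since there is nothing to base the decision on.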

Outputs

The diff files are an additional output, but the other outputs are essentially unchanged.

Impacted Outputs

File comparison results are added to the report PDF and differences TSVs in the same way as the "regular" results. For example, in the exact_differences.tsv the gs:// URIs are shown for mismatching files, but are not shown for matching files.
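As a rough sketch of how a diff file might be produced for a mismatching pair, the example below uses Python's standard `difflib`. It is illustrative only: the helper names and URIs are hypothetical, and it assumes that "SET" for files means an order-insensitive comparison of lines, which the pull request description does not spell out.

```python
import difflib


def exact_diff(text_a, text_b, name_a, name_b):
    """Return unified-diff text for two file contents; empty string if they match."""
    lines_a = text_a.splitlines(keepends=True)
    lines_b = text_b.splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(lines_a, lines_b, fromfile=name_a, tofile=name_b)
    )


def set_match(text_a, text_b):
    """Order-insensitive comparison of lines (assumed meaning of SET).

    Note that converting to sets also ignores duplicated lines.
    """
    return set(text_a.splitlines()) == set(text_b.splitlines())


# Same genes, different order (hypothetical file contents and URIs).
old = "geneA\ngeneB\n"
new = "geneB\ngeneA\n"

diff_text = exact_diff(
    old, new,
    "gs://example-bucket/run1/genes.txt",
    "gs://example-bucket/run2/genes.txt",
)
matches_as_set = set_match(old, new)
```

With these inputs an exact comparison yields a non-empty diff, while the set comparison treats the two files as matching.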

:test_tube: Testing

Locally

Here are the unit test results for the file comparison code, from running python3 -m unittest, which runs the test suite in tests/test_validator.py. The overall workflow was not run locally because it is intended to work within Terra, but a potential future feature could be to compare local files given local file paths.

(Screenshot: unit test output.)
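For readers unfamiliar with the unittest layout, a test case for file comparison might look roughly like the sketch below. It is not taken from tests/test_validator.py: the class name, test names, and file contents are hypothetical, and the standard library's `filecmp.cmp` stands in for the validator's own comparison logic.

```python
import filecmp
import tempfile
import unittest
from pathlib import Path


class TestFileComparison(unittest.TestCase):
    def setUp(self):
        # Each test gets a fresh temporary directory, removed afterwards.
        self.tmp = tempfile.TemporaryDirectory()
        self.addCleanup(self.tmp.cleanup)
        self.dir = Path(self.tmp.name)

    def _write(self, name, text):
        path = self.dir / name
        path.write_text(text)
        return path

    def test_identical_files_match(self):
        a = self._write("a.txt", "ACGT\n")
        b = self._write("b.txt", "ACGT\n")
        self.assertTrue(filecmp.cmp(a, b, shallow=False))

    def test_different_files_do_not_match(self):
        a = self._write("a.txt", "ACGT\n")
        b = self._write("b.txt", "ACGA\n")
        self.assertFalse(filecmp.cmp(a, b, shallow=False))

    def test_empty_files_match(self):
        a = self._write("a.txt", "")
        b = self._write("b.txt", "")
        self.assertTrue(filecmp.cmp(a, b, shallow=False))


# Run the suite programmatically instead of through the command line.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestFileComparison)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Passing `shallow=False` makes `filecmp.cmp` compare the bytes of the files rather than only their `os.stat` signatures.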

Terra

The inputs/outputs of the Terra tests can be found in examples/file_comparison/ in the pull request in the theiavalidate repo. The docker variable was changed from the default to the other pull request's version. Access can be provided to the actual test Terra workspace if requested.

image

Scenarios for Reviewer to Test

I tried to handle edge cases such as empty cells in the input tables, and an Exception is raised if anything other than "SET", "EXACT", or "IGNORE" is provided for file_columns. I think it would be good to test on real-world and larger files, because the current tests use simple files of only a few bytes.
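The edge-case handling described above could be sketched as follows. This is a hypothetical illustration rather than the actual theiavalidate code: the function names are made up, `ValueError` is used as one possible Exception subclass, and "APPROXIMATE" is an invented criterion used only to trigger the error.

```python
ALLOWED_FILE_CRITERIA = ("EXACT", "IGNORE", "SET")


def normalize_cell(value):
    """Treat None and whitespace-only cells as empty (returned as None)."""
    if value is None:
        return None
    stripped = str(value).strip()
    return stripped or None


def check_file_criterion(column, criterion):
    """Return the criterion unchanged, or raise if it is not supported for files."""
    if criterion not in ALLOWED_FILE_CRITERIA:
        raise ValueError(
            f"Unsupported criterion {criterion!r} for file column {column!r}; "
            f"expected one of {', '.join(ALLOWED_FILE_CRITERIA)}"
        )
    return criterion


# A supported criterion passes through unchanged.
accepted = check_file_criterion("report_file", "SET")

# An unsupported one raises, so the problem surfaces before any files are compared.
try:
    check_file_criterion("report_file", "APPROXIMATE")
    rejected = False
except ValueError:
    rejected = True
```

Validating the criteria up front means a typo in validation_criteria.tsv fails fast instead of partway through a run.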


sage-wright commented 8 months ago

Hey Sam!

Thanks so much for this PR! We probably won't get to merge this into this repository until January because of the holidays, but I'll keep you updated throughout the process. Most of this conversation will probably happen in the theiavalidate repository because that's where the majority of the code is, but I'm so excited to dig into this!

sage-wright commented 7 months ago

Thanks again, @sam-baird! I'm going to merge this into a branch so we can test it out on Terra and confirm everything is working as expected, and then merge into main.