Detect and report missing data in GWO ingest

znatty22 commented 3 years ago

The study creator's GenomicDataLoader currently does not detect any discrepancies between the GWO manifest and S3 or between the GWO manifest and the Dataservice. Detecting these discrepancies is an important part of the analysts' current manual process of loading the harmonized genomic file info into the Dataservice.

Each of the 3 load functions in the GenomicDataLoader should be modified to detect discrepancies and report them through log statements, event firing, or both.

Specifics:

In load_harmonized_genomic_files method:

Detect and report if there is a discrepancy between the files listed in the GWO manifest and the S3 scrape

In load_specimen_harmonized_gf_links method:

Detect and report if there is a discrepancy between the specimens listed in the GWO manifest and the specimens in Dataservice

In load_seq_exp_harmonized_genomic_files method:

Detect and report if any harmonized files were not able to be linked to sequencing experiments (e.g. because the corresponding unharmonized genomic file didn't exist)
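
For the first two checks, a rough sketch of what the comparison could look like, assuming both sides can be reduced to plain collections of identifiers (file paths for the S3 check, specimen IDs for the Dataservice check). The function names, the `label` and `other_name` arguments, and the log wording are made up for illustration and are not part of the existing GenomicDataLoader:

```python
import logging

logger = logging.getLogger(__name__)


def find_discrepancies(manifest_values, other_values):
    """Compare two collections of identifiers.

    Returns a tuple of sorted lists:
    (in the manifest but not in the other source,
     in the other source but not in the manifest).
    """
    manifest_set = set(manifest_values)
    other_set = set(other_values)
    return (
        sorted(manifest_set - other_set),
        sorted(other_set - manifest_set),
    )


def report_discrepancies(label, other_name, manifest_values, other_values):
    """Log a warning for each direction of mismatch and return both lists.

    Hypothetical helper: `label` names what is being compared (e.g. "files")
    and `other_name` names the other source (e.g. "S3 scrape").
    """
    only_manifest, only_other = find_discrepancies(manifest_values, other_values)
    if only_manifest:
        logger.warning(
            "%d %s in GWO manifest but not in %s: %s",
            len(only_manifest), label, other_name, only_manifest,
        )
    if only_other:
        logger.warning(
            "%d %s in %s but not in GWO manifest: %s",
            len(only_other), label, other_name, only_other,
        )
    return only_manifest, only_other
```

Returning the two lists as well as logging them would leave room to fire an event with the same data later, if that turns out to be the preferred way to report.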

gsantia commented 3 years ago

Should any of these three changes lead to a stop in the ingestion process? Or do we just want to report these things?

znatty22 commented 3 years ago

I think just report these things but maybe we should ask @allisonheath

gsantia commented 3 years ago

I've been thinking through the 3rd check here, and it seems to me some parts of it should be done elsewhere. For example, checking whether a harmonized genomic file's corresponding unharmonized genomic file exists is something we can do immediately using just the GWO manifest itself: query the Dataservice for genomic-files which match the source file column entries, and if any are missing then we have a problem.

EDIT: On second thought, it's probably better to do it in the load_seq_exp_harmonized_genomic_files method, because then we don't need to make extraneous queries to the Dataservice.
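
A minimal sketch of that check, assuming the manifest rows are available as dicts and the set of unharmonized genomic files already known to the loader is passed in, so no extra Dataservice query is needed. The column names `source_file` and `harmonized_file` and the function name are placeholders, not the real manifest schema:

```python
def find_unlinked_harmonized_files(manifest_rows, known_source_files):
    """Return harmonized files whose source file is not among the known ones.

    manifest_rows: iterable of dicts with hypothetical keys
        "source_file" and "harmonized_file".
    known_source_files: identifiers of unharmonized genomic files that
        the loader has already seen (assumed to be available in memory).
    """
    known = set(known_source_files)
    unlinked = []
    for row in manifest_rows:
        # A harmonized file cannot be linked to a sequencing experiment
        # if the unharmonized file it was derived from is unknown.
        if row["source_file"] not in known:
            unlinked.append(row["harmonized_file"])
    return unlinked
```

The returned list could then be logged or attached to an event in the same way as the other two checks.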

znatty22 commented 3 years ago

@gsantia Yea the issue I wrote up might not be exactly how it turns out to be implemented. You will prob have a better idea since you're doing the implementation. The important thing is we're able to record and report any missing data which we feel is important for the user to know about

kids-first / kf-api-study-creator

Detect and report missing data in GWO ingest #626