Run new Clinvar summary with each run?

MattWellie commented 1 year ago

Clinvar re-summaries are valuable for re-diagnosis because they take a generous interpretation of the latest Clinvar content.

After https://github.com/populationgenomics/automated-interpretation-pipeline/pull/175, the Clinvar reprocessing is now a trivially fast operation. Other than the relatively small data egress (~200MB copied into temp from NCBI, ~100MB persisted), re-summarising the latest data at runtime would add only a minor cost and runtime overhead (~1 cent, 3 mins).

We could recompute this each time, pulling the NCBI Clinvar data live and saving the Hail Table (HT) of re-calculated values in the project run folder. This would remove another argument from the configuration file, and ensure we always have access to the exact generated data for interrogation in the case of discrepancies/weird results.

This would also remove the current requirement that the Clinvar data be placed in cpg-reference-test/main to be accessible to all projects.

https://stackoverflow.com/questions/11573817/how-to-download-a-file-via-ftp-with-python-ftplib
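Along the lines of the linked answer, pulling a file from NCBI at runtime could look roughly like the sketch below. It uses only the standard-library `ftplib`. The function name `download_via_ftp`, the `ftp_factory` parameter, and the host, directory and file constants are assumptions made for illustration, not something taken from the pipeline; the remote path in particular should be checked against the live NCBI site.

```python
from ftplib import FTP
from pathlib import Path

# Assumed values for illustration only: verify the host, directory and
# file name against the NCBI site before relying on them.
NCBI_HOST = 'ftp.ncbi.nlm.nih.gov'
CLINVAR_DIR = 'pub/clinvar/tab_delimited'
SUBMISSIONS_FILE = 'submission_summary.txt.gz'


def download_via_ftp(host, remote_dir, filename, local_dir, ftp_factory=FTP):
    """Download one file over anonymous FTP and return its local path.

    ftp_factory is a hypothetical hook so the FTP client can be swapped
    for a stub in tests; it defaults to ftplib.FTP.
    """
    local_path = Path(local_dir) / filename
    local_path.parent.mkdir(parents=True, exist_ok=True)

    ftp = ftp_factory(host)  # ftplib.FTP connects when given a host
    try:
        ftp.login()  # no arguments means an anonymous login
        ftp.cwd(remote_dir)
        with open(local_path, 'wb') as handle:
            # retrbinary streams the file in blocks to the callback
            ftp.retrbinary(f'RETR {filename}', handle.write)
    finally:
        ftp.quit()
    return local_path


# Example call (needs network access, so it is left commented out):
# download_via_ftp(NCBI_HOST, CLINVAR_DIR, SUBMISSIONS_FILE, '/tmp/clinvar')
```

Downloading into a temp directory first matches the "copied into temp" figure above; only the re-summarised table would then need to be persisted.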

MattWellie commented 1 year ago

After a brief discussion, we're thinking of the following approach:

Add a new first step to each workflow which checks for a Clinvar table or makes a new one:
When a run starts, make a date string specific to the month (e.g. 01-2023).
Check for a Clinvar summary table (private_clinvar_01-2023.ht) in the dataset bucket.
If it doesn't exist, pull the latest from NCBI, re-summarise the whole thing, and save a dated table in GCP.
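The check-or-create logic could be sketched as below. This is a minimal illustration, not the pipeline's implementation: the function names and the `exists` and `build` callables are hypothetical stand-ins for a cloud-storage existence check and the download plus re-summary step. Only the `private_clinvar_01-2023.ht` naming pattern comes from the proposal.

```python
from datetime import date


def monthly_table_name(run_date, prefix='private_clinvar'):
    """Build a month-specific table name, e.g. private_clinvar_01-2023.ht."""
    return f'{prefix}_{run_date:%m-%Y}.ht'


def get_or_create_clinvar_table(bucket_root, run_date, exists, build):
    """Return this month's table path, building the table only if absent.

    exists(path) -> bool and build(path) -> None are injected callables
    (hypothetical here), so the storage backend stays out of this logic.
    """
    table_path = f"{bucket_root.rstrip('/')}/{monthly_table_name(run_date)}"
    if not exists(table_path):
        # First run of the month: pull from NCBI, re-summarise, and save.
        build(table_path)
    return table_path


# Usage with in-memory stand-ins for the bucket:
stored = set()
path = get_or_create_clinvar_table(
    'gs://my-dataset-bucket',  # placeholder bucket name
    date(2023, 1, 15),
    exists=lambda p: p in stored,
    build=stored.add,
)
print(path)
```

Since a Hail table is a directory rather than a single file, a real `exists` check may need to look for a marker inside the directory instead of the bare path.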

Benefits:

prevents pulling/re-processing every time (monthly update windows)
retains hard copy of each table generated for retrospective check
fast, cheap process, not reliant on manual instigation (I'll forget, 100%)

MattWellie commented 1 year ago

Needs: a new step that copies the files to local, then runs the Python summary script (similar to the UMAP flow). Remove the specific path from the config file.

Run new Clinvar summary with each run? #188