populationgenomics / automated-interpretation-pipeline

Rare Disease variant prioritisation MVP
MIT License

Run new ClinVar summary with each run? #188

Closed MattWellie closed 1 year ago

MattWellie commented 1 year ago

ClinVar re-summaries are valuable for re-diagnosis because they take a generous interpretation of the latest ClinVar content.

After https://github.com/populationgenomics/automated-interpretation-pipeline/pull/175, ClinVar reprocessing is now a trivially fast operation. Other than the relatively small data egress (~200MB copied into temp from NCBI, ~100MB persisted), re-summarising the latest data at runtime would add only minor runtime and cost (~1 cent, 3 mins).

We could recompute this each time, pulling the NCBI ClinVar data live and saving the Hail Table (HT) of re-calculated values in the project run folder. This would remove another argument from the configuration file and ensure we always have access to the exact generated data for interrogation in the case of discrepancies or odd results.

This would also remove the current requirement that the ClinVar data be placed in cpg-reference-test/main to be accessible to all projects.

https://stackoverflow.com/questions/11573817/how-to-download-a-file-via-ftp-with-python-ftplib
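
The linked answer covers the `ftplib` approach. A minimal sketch of pulling one file at runtime might look like the following. The host, the remote directory, the example file name and the helper name `download_ftp_file` are assumptions for illustration, not taken from the pipeline; the `ftp_factory` parameter is only there so the function can be exercised without a network connection.

```python
from ftplib import FTP
from pathlib import Path

# Assumed locations; check against the current NCBI layout before relying on them.
NCBI_HOST = "ftp.ncbi.nlm.nih.gov"
CLINVAR_DIR = "pub/clinvar/tab_delimited"


def download_ftp_file(remote_name, local_dir, host=NCBI_HOST,
                      remote_dir=CLINVAR_DIR, ftp_factory=FTP):
    """Download one file over anonymous FTP and return its local path."""
    local_path = Path(local_dir) / remote_name
    local_path.parent.mkdir(parents=True, exist_ok=True)
    with ftp_factory(host) as ftp:
        ftp.login()          # no arguments means anonymous login
        ftp.cwd(remote_dir)  # move to the directory holding the file
        with open(local_path, "wb") as handle:
            # retrbinary streams the file in chunks to the callback
            ftp.retrbinary(f"RETR {remote_name}", handle.write)
    return local_path


# Example call (hypothetical file name, performs a real download):
# download_ftp_file("variant_summary.txt.gz", "/tmp/clinvar")
```

Writing in chunks through `retrbinary` keeps memory use flat regardless of file size, which matters little at ~200MB but costs nothing.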

MattWellie commented 1 year ago

After a brief discussion we're thinking the following approach:

Benefits:

MattWellie commented 1 year ago

Needs: a new step that copies the files to local and then runs the Python summary script (similar to the UMAP flow). Remove the specific path from the config file.
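
As a rough sketch of how such a step could hang together, assuming a local run folder: every name below (`run_clinvar_step`, the `fetch` and `summarise` callables, the input file names and the `clinvar_summary.ht` output name) is hypothetical and not taken from the pipeline. The point is only the ordering (copy to local, summarise, write into the run folder) and that the output location is derived rather than configured.

```python
import tempfile
from pathlib import Path

# Assumed input file names; the real step may use different ones.
CLINVAR_FILES = ("variant_summary.txt.gz", "submission_summary.txt.gz")


def run_clinvar_step(run_folder, fetch, summarise, files=CLINVAR_FILES):
    """Copy the ClinVar files to a local temp dir, then run the summary.

    fetch(name, local_dir) -> local path of the downloaded file
    summarise(local_paths, output_path) -> writes the summary to output_path
    """
    # The output path is derived from the run folder, so the config file
    # would not need a dedicated ClinVar path entry.
    output_path = Path(run_folder) / "clinvar_summary.ht"
    with tempfile.TemporaryDirectory() as tmp:
        # Raw downloads live only for the duration of the step
        local_paths = [fetch(name, tmp) for name in files]
        summarise(local_paths, output_path)
    return output_path
```

Passing `fetch` and `summarise` in as callables keeps the sketch independent of how the download and the summary script are actually implemented.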