Closed MattWellie closed 1 year ago
After a brief discussion, we're thinking of the following approach:
- Date-stamp each ClinVar summary by month (e.g. `01-2023`)
- Save the summary as a Hail Table (e.g. `private_clinvar_01-2023.ht`) in the dataset bucket

Benefits:
- Remove the specific path from the config file

Needs:
- A new step: copy the files to local, then run the Python summary script (similar to the UMAP flow)
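As a rough illustration of the month-stamped naming, here is a minimal sketch. The helper names (`clinvar_table_name`, `clinvar_table_path`) and the example bucket are hypothetical and not taken from the pipeline. Only the `private_clinvar_01-2023.ht` pattern comes from the discussion above.

```python
from datetime import date
from typing import Optional


def clinvar_table_name(when: Optional[date] = None, prefix: str = "private_clinvar") -> str:
    """Build a month-stamped Hail Table name, e.g. private_clinvar_01-2023.ht."""
    # Default to today's date so each run resolves to the current month's table.
    when = when or date.today()
    return f"{prefix}_{when:%m-%Y}.ht"


def clinvar_table_path(bucket: str, when: Optional[date] = None) -> str:
    """Join a (hypothetical) dataset bucket root with the month-stamped name."""
    return f"{bucket.rstrip('/')}/{clinvar_table_name(when)}"


# Any date in January 2023 maps to the same table name.
example_name = clinvar_table_name(date(2023, 1, 15))
```

Because the name is derived from the run date, the pipeline could check whether this month's table already exists before regenerating it, and the config file would no longer need an explicit path.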
ClinVar re-summaries are valuable in re-diagnosis, because they take a generous interpretation of the latest ClinVar content.
After https://github.com/populationgenomics/automated-interpretation-pipeline/pull/175, ClinVar reprocessing is now a trivially fast operation. Other than the relatively small data egress (~200MB copied into temp from NCBI, ~100MB persisted), re-summarising the latest data at runtime would incur only minor runtime and cost impacts (~1 cent, 3 mins).
We could recompute this each time, pulling the NCBI ClinVar data live and saving the HT of re-calculated values in the project run folder. This would remove another argument from the configuration file, and ensure we always have access to the exact generated data for interrogation in the case of discrepancies or unexpected results.
This would also remove the current requirement that the ClinVar data be placed in `cpg-reference-test/main` to be accessible to all projects.

https://stackoverflow.com/questions/11573817/how-to-download-a-file-via-ftp-with-python-ftplib
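Following the Stack Overflow link, pulling the NCBI files into a temp directory with the standard-library `ftplib` might look roughly like the sketch below. The host, directory and file names are assumptions about where the ClinVar tab-delimited dumps live, not values taken from the pipeline. The `ftp_factory` parameter is a hypothetical hook added only so the function can be exercised without network access.

```python
import os
from ftplib import FTP

# Assumed locations of the ClinVar tab-delimited dumps; check against NCBI before use.
NCBI_HOST = "ftp.ncbi.nlm.nih.gov"
CLINVAR_DIR = "pub/clinvar/tab_delimited"
CLINVAR_FILES = ("submission_summary.txt.gz", "variant_summary.txt.gz")


def download_ftp_file(host, directory, filename, dest_dir, ftp_factory=FTP):
    """Download one file over anonymous FTP and return the local path."""
    dest_path = os.path.join(dest_dir, filename)
    # FTP(host) connects on construction; the context manager closes the session on exit.
    with ftp_factory(host) as ftp:
        ftp.login()  # anonymous login
        ftp.cwd(directory)
        with open(dest_path, "wb") as handle:
            # retrbinary streams the file in blocks, passing each block to the callback.
            ftp.retrbinary(f"RETR {filename}", handle.write)
    return dest_path


def download_clinvar(dest_dir, ftp_factory=FTP):
    """Fetch each assumed ClinVar file into dest_dir and return the local paths."""
    return [
        download_ftp_file(NCBI_HOST, CLINVAR_DIR, name, dest_dir, ftp_factory)
        for name in CLINVAR_FILES
    ]
```

In a pipeline step this would be called as `download_clinvar(tempfile.mkdtemp())`, with the returned paths handed to the summary script. A single FTP session could be reused for both files. Opening one session per file simply keeps the sketch short.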