feature: use diskcache - Githubissues

gks-anvil / vrs_anvil_toolkit

Extract clinical variant interpretations from VCF using GA4GH VRS IDs

MIT License


feature: use diskcache #8

Closed bwalsh closed 8 months ago

bwalsh commented 8 months ago

Use cases

As a vrs-anvil user, when I restart the system, add additional datasets, or run in different processes, the cache that stores vrs-objects or metakb keys should still be available, i.e. I should not have to start from a fresh, empty cache.
As a vrs-anvil user, when the underlying data changes, i.e. a new vrs-python schema or new metakb files, I need to be able to empty the cache.

Potential solutions

https://github.com/grantjenks/python-diskcache
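
A rough sketch of how the two use cases above could fit together. This is an illustration only, not the toolkit's implementation: it uses the standard library's shelve module as a stand-in so it runs without extra installs, and the names open_cache, cached_lookup, fake_translate, SCHEMA_KEY, and the sample variant key are all hypothetical. diskcache's Cache offers a similar dict-like interface and is documented as process-safe, which shelve is not.

```python
import os
import shelve
import tempfile

SCHEMA_KEY = "__schema_version__"  # hypothetical bookkeeping key


def open_cache(path, schema_version):
    """Open a persistent cache, emptying it if the schema version changed."""
    cache = shelve.open(path)
    if cache.get(SCHEMA_KEY) != schema_version:
        cache.clear()  # underlying data changed: drop stale entries
        cache[SCHEMA_KEY] = schema_version
    return cache


def cached_lookup(cache, key, compute):
    """Return cache[key], calling compute(key) only on a miss."""
    if key not in cache:
        cache[key] = compute(key)
    return cache[key]


calls = []  # records every time the expensive function actually runs


def fake_translate(variant):
    """Hypothetical stand-in for an expensive VRS ID computation."""
    calls.append(variant)
    return f"vrs-id-for-{variant}"


cache_path = os.path.join(tempfile.mkdtemp(), "vrs_cache")

# First run: the cache is empty, so the value is computed and stored.
cache = open_cache(cache_path, schema_version="v1")
first = cached_lookup(cache, "chr1-100-A-T", fake_translate)
cache.close()

# "Restart": reopening the same path finds the stored entry.
cache = open_cache(cache_path, schema_version="v1")
second = cached_lookup(cache, "chr1-100-A-T", fake_translate)
cache.close()

# Schema change: the cache is emptied and the value is recomputed.
cache = open_cache(cache_path, schema_version="v2")
third = cached_lookup(cache, "chr1-100-A-T", fake_translate)
cache.close()
```

Storing a version marker inside the cache is one possible way to decide when to empty it. An explicit "clear cache" command would cover the second use case as well.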

quinnwai commented 8 months ago

Testing locally with 1000G patients

Anecdotally, with ~100k variants, there is a significant difference between running pytest test_vcf_to_gnomad.py for the first time (9.5s) and running it a second time with the persistent cache populated (1s) for the same sample (HG00096). Running the same command for a new sample (HG00099) also shows a nontrivial speedup on its first run (4.5s), and a further one on its second run (1.3s).