malariagen / malariagen-data-python

Analyse MalariaGEN data from Python
https://malariagen.github.io/malariagen-data-python/latest/
MIT License

Slow performance saving results to cache #466

Closed alimanfoo closed 9 months ago

alimanfoo commented 9 months ago

When benchmarking haplotype clustering with larger numbers of haplotypes, I've found cases where saving the results of the pairwise distance computation to the results cache takes up to a minute. This appears to be entirely due to the slow performance of NumPy's savez_compressed(). Using zarr's save() instead, which is a drop-in replacement, is much faster, at around 1 s.
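
For anyone wanting to reproduce the comparison, here is a minimal timing sketch. It is not the package's actual caching code: the function name `time_cache_writes`, the file names, and the synthetic distance array are all assumptions made for illustration, and zarr is treated as an optional import. Only `numpy.savez_compressed()` and `zarr.save()` come from the report above.

```python
import tempfile
import time
from pathlib import Path

import numpy as np

try:
    import zarr  # optional; the zarr timing is skipped if it is not installed
except ImportError:
    zarr = None


def time_cache_writes(dist, cache_dir):
    """Write the same array with each writer; return seconds keyed by format.

    Hypothetical helper for illustration only.
    """
    cache_dir = Path(cache_dir)
    timings = {}

    # Writer named in the report as slow: compressed .npz archive.
    start = time.perf_counter()
    np.savez_compressed(str(cache_dir / "results.npz"), dist=dist)
    timings["npz"] = time.perf_counter() - start

    # Proposed alternative: a zarr array stored in a directory.
    if zarr is not None:
        start = time.perf_counter()
        zarr.save(str(cache_dir / "results.zarr"), dist)
        timings["zarr"] = time.perf_counter() - start

    return timings


if __name__ == "__main__":
    # Synthetic stand-in for a condensed pairwise distance vector:
    # n haplotypes give n * (n - 1) // 2 pairs. Values are made up.
    n = 2000
    rng = np.random.default_rng(0)
    dist = rng.integers(0, 500, size=n * (n - 1) // 2).astype("float32")
    with tempfile.TemporaryDirectory() as tmp:
        for fmt, seconds in time_cache_writes(dist, tmp).items():
            print(f"{fmt}: {seconds:.2f} s")
```

Timings depend heavily on array size, dtype, and how compressible the values are, so random data like this will not necessarily reproduce the gap described above.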