Improve performance of haplotype clustering

malariagen / malariagen-data-python

Analyse MalariaGEN data from Python

https://malariagen.github.io/malariagen-data-python/latest/

MIT License

13 stars 23 forks source link

Improve performance of haplotype clustering #450

Closed alimanfoo closed 9 months ago

alimanfoo commented 9 months ago

Resolves #449 via:

[x] Only compute haplotype pairwise distances once.
[x] Copy haplotype data to fortran memory layout.
[x] Cache results from haplotype pairwise distances calculation.
[x] Remove non-segregating sites before pairwise distance calculation.
[x] Try optimising hamming pairwise distance with numba.

I also tried using scikit-learn's implementation of pairwise distances, but for hamming distance it reverts to using scipy anyway so performance is no better.

Also partially addresses #451 by adding render_mode parameter to the plot_haplotype_clustering() function, but I'll leave a full implementation of that for other plotting functions for another PR.

Also manually merges in changes from #441 (add option to use cohorts as colour or symbol) because things have been moved around quite a bit here within the plot_haplotypes_clustering() function.

codecov[bot] commented 9 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (5198e85) 97.51% compared to head (fa63c05) 97.51%.

Additional details and impacted files

```diff @@ Coverage Diff @@ ## master #450 +/- ## ======================================= Coverage 97.51% 97.51% ======================================= Files 26 26 Lines 2090 2092 +2 ======================================= + Hits 2038 2040 +2 Misses 52 52 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

review-notebook-app[bot] commented 9 months ago

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

alimanfoo commented 9 months ago

For haplotype clustering over all samples in Ag3.0 at the Vgsc locus, this implementation runs in 48s whereas the previous implementation runs in 3m 14s.

alimanfoo commented 9 months ago

Down to 28s with a optimised implementation of hamming pairwise distance, of which only 6s is spent in the pairwise distance calculation.

alimanfoo commented 9 months ago

For future reference, there is another possible optimisation which is to first find distinct haplotypes, then compute pairwise distances only between distinct haplotypes. I.e., it would be possible to skip the calculation for pairs of haplotypes that are identical. But that would introduce a fair amount of complexity, so exercise for the reader :)

alimanfoo commented 9 months ago

Will merge here later if CI passes.

sanjaynagi commented 9 months ago

Love it