Closed alimanfoo closed 9 months ago
All modified and coverable lines are covered by tests :white_check_mark:
Comparison is base (5198e85) 97.51% compared to head (fa63c05) 97.51%.
For haplotype clustering over all samples in Ag3.0 at the Vgsc locus, this implementation runs in 48s whereas the previous implementation runs in 3m 14s.
Down to 28s with an optimised implementation of pairwise Hamming distance, of which only 6s is spent in the pairwise distance calculation itself.
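For anyone reading along, here is a minimal sketch of what a pairwise Hamming distance over haplotypes looks like. It is not the code in this PR: the function name `pairwise_hamming_condensed` is made up for illustration, it assumes haplotypes are rows of a 2-D array, and it returns raw mismatch counts in the same condensed ordering that scipy's `pdist` uses (scipy's `hamming` metric reports a proportion rather than a count).

```python
import numpy as np


def pairwise_hamming_condensed(haplotypes):
    """Count mismatching sites for every pair of haplotypes.

    Hypothetical helper for illustration only.
    haplotypes: array of shape (n_haplotypes, n_sites).
    Returns a 1-D array of length n * (n - 1) / 2, ordered as
    (0, 1), (0, 2), ..., (1, 2), ... like scipy's condensed form.
    """
    h = np.asarray(haplotypes)
    n = h.shape[0]
    out = np.empty(n * (n - 1) // 2, dtype=np.int32)
    k = 0
    for i in range(n - 1):
        # Compare haplotype i against all later haplotypes in one step.
        d = np.count_nonzero(h[i + 1:] != h[i], axis=1)
        out[k:k + d.size] = d
        k += d.size
    return out


h = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 1, 0]])
print(pairwise_hamming_condensed(h))
```

Dividing the counts by the number of sites would give the proportion that scipy reports.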
For future reference, there is another possible optimisation which is to first find distinct haplotypes, then compute pairwise distances only between distinct haplotypes. I.e., it would be possible to skip the calculation for pairs of haplotypes that are identical. But that would introduce a fair amount of complexity, so exercise for the reader :)
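A rough sketch of that idea, in case someone picks it up later. This is not implemented in this PR; `hamming_via_distinct` is a hypothetical name, and the sketch assumes haplotypes are rows of a 2-D array. It deduplicates rows with `np.unique`, computes distances only between distinct haplotypes, then expands back to all pairs by indexing.

```python
import numpy as np


def hamming_via_distinct(haplotypes):
    """Square matrix of Hamming mismatch counts, computed via distinct rows.

    Hypothetical helper for illustration only.
    haplotypes: array of shape (n_haplotypes, n_sites).
    """
    h = np.asarray(haplotypes)
    # Find the distinct haplotypes, plus the index of the distinct
    # haplotype that each original row maps to.
    distinct, inverse = np.unique(h, axis=0, return_inverse=True)
    inverse = inverse.ravel()  # guard against shape differences across NumPy versions

    # Distances between distinct haplotypes only.
    m = distinct.shape[0]
    dist_distinct = np.zeros((m, m), dtype=np.int32)
    for i in range(m - 1):
        d = np.count_nonzero(distinct[i + 1:] != distinct[i], axis=1)
        dist_distinct[i, i + 1:] = d
        dist_distinct[i + 1:, i] = d

    # Expand back to every pair of original haplotypes by lookup.
    return dist_distinct[np.ix_(inverse, inverse)]


h = np.array([[0, 1],
              [0, 1],   # duplicate of the first haplotype
              [1, 0],
              [0, 0]])
print(hamming_via_distinct(h))
```

The expansion step still produces a full n-by-n result, so the saving is only in the site-by-site comparisons. Whether that is worth the extra bookkeeping would depend on how many haplotypes are duplicated in the region.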
Will merge here later if CI passes.
Love it
Resolves #449 via:
I also tried using scikit-learn's implementation of pairwise distances, but for hamming distance it reverts to using scipy anyway so performance is no better.
Also partially addresses #451 by adding a `render_mode` parameter to the `plot_haplotype_clustering()` function, but I'll leave a full implementation of that for other plotting functions for another PR.

Also manually merges in changes from #441 (add option to use cohorts as colour or symbol) because things have been moved around quite a bit here within the `plot_haplotype_clustering()` function.