malariagen / malariagen-data-python

Analyse MalariaGEN data from Python
https://malariagen.github.io/malariagen-data-python/latest/
MIT License

Implement diplotype clustering, refactor haplotype clustering, and fix random_region_str() #507

Closed sanjaynagi closed 5 months ago

sanjaynagi commented 6 months ago

In this PR, we implement a plot_diplotype_clustering() function, which works like plot_haplotype_clustering() but clusters diplotypes rather than haplotypes.

This PR works towards #361. We aim to make an 'advanced' version of this plotting function, which plots an amino acid mutation heatmap below the dendrogram, as well as sample heterozygosity and possibly CNV copy number and diplotype cluster assignments. For now, we start with the simple dendrogram itself.

I have added (Alistair's) functions into util.py which calculate pairwise distances for diplotypes. As far as I can tell, these functions were marginally different to the existing biallelic diplotype functions, so I have added new multiallelic ones. This may not be strictly necessary; please check @alimanfoo :)
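
For readers unfamiliar with this kind of calculation, here is a minimal sketch of a pairwise distance over multiallelic diplotypes. It is not the code added to util.py: the function names, the choice of a mean city-block metric, and the encoding (one tuple of allele counts per site, with all zeros meaning a missing call) are assumptions made for illustration.

```python
from itertools import combinations


def mean_cityblock(x, y):
    """Mean city-block distance between two diplotypes.

    x and y are sequences of per-site allele-count tuples (hypothetical
    encoding). Sites where either sample has no called alleles are skipped.
    """
    total = 0
    n_called = 0
    for x_counts, y_counts in zip(x, y):
        if sum(x_counts) == 0 or sum(y_counts) == 0:
            continue  # missing call in at least one sample
        total += sum(abs(a - b) for a, b in zip(x_counts, y_counts))
        n_called += 1
    return total / n_called if n_called else float("nan")


def pairwise_distances(samples):
    """Distances for every pair of samples, in condensed order
    (0,1), (0,2), ..., (1,2), ..."""
    return [
        mean_cityblock(samples[i], samples[j])
        for i, j in combinations(range(len(samples)), 2)
    ]


# Three diploid samples, three sites, up to three alleles per site.
samples = [
    [(2, 0, 0), (1, 1, 0), (0, 0, 0)],  # third site is missing
    [(2, 0, 0), (0, 1, 1), (2, 0, 0)],
    [(0, 2, 0), (1, 1, 0), (1, 0, 1)],
]
distances = pairwise_distances(samples)
print(distances)
```

A condensed list like this is the usual input to hierarchical clustering routines, which is how distances of this kind would feed a dendrogram.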

codecov[bot] commented 6 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 98.87%. Comparing base (6aef71a) to head (114923c). Report is 46 commits behind head on master.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #507      +/-   ##
==========================================
+ Coverage   98.79%   98.87%   +0.08%
==========================================
  Files          35       38       +3
  Lines        3474     3660     +186
==========================================
+ Hits         3432     3619     +187
+ Misses         42       41       -1
```

sanjaynagi commented 6 months ago

Codecov is failing because I have only added a notebook and no tests. This is the same way plot_haplotype_clustering() is currently handled.

Should I write tests...? I guess we only started using Codecov after the haplotype clustering functions were implemented.

sanjaynagi commented 5 months ago

Hey guys @ahernank @leehart @alimanfoo, I just saw the discussion in #506 around the problem in random_region_str() with regions that exceed the max contig size. At some point I had the same issue, so just to say that I have already fixed it in this PR.

Changing title of PR accordingly.
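
As a rough illustration of the kind of fix described, the sketch below clamps the end of a randomly generated region to the contig size. It is not the code from this PR: the signature, the contig sizes and the `contig:start-end` output format are assumptions made for the example.

```python
import random

# Hypothetical contig sizes, for illustration only.
CONTIG_SIZES = {"2L": 1_000, "3R": 2_500, "X": 400}


def random_region_str(contig_sizes, region_size=None, rng=random):
    """Return a random region string that never runs past the contig end."""
    contig = rng.choice(sorted(contig_sizes))
    contig_size = contig_sizes[contig]
    start = rng.randint(1, contig_size)
    if region_size is None:
        # No size requested: pick any end between start and the contig end.
        end = rng.randint(start, contig_size)
    else:
        # Clamp so that start + region_size cannot exceed the contig size.
        end = min(start + region_size, contig_size)
    return f"{contig}:{start}-{end}"


print(random_region_str(CONTIG_SIZES, region_size=500))
```

Without the `min()` clamp, a start position near the end of a contig would produce an end coordinate beyond the contig, which is the failure mode discussed in #506.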

leehart commented 5 months ago

Cool, thanks @sanjaynagi 👍

Resolves #509

alimanfoo commented 5 months ago

Beautiful, thanks @sanjaynagi :) Merging...