malariagen / malariagen-data-python

Analyse MalariaGEN data from Python
https://malariagen.github.io/malariagen-data-python/latest/
MIT License
13 stars 23 forks source link

Advanced diplotype clustering #537

Closed sanjaynagi closed 1 month ago

sanjaynagi commented 1 month ago

Builds on #507 and partially addresses #361.

In this PR, I add an advanced diplotype clustering function which performs diplotype clustering on a given region, and plots alongside it sample heterozygosity, SNP genotypes at interesting SNP variants (e.g., amino acid change substitutions), and copy number variation data. This is particularly useful for investigating potential driver resistance mutations at IR loci.

A new function plot_diplotype_clustering_advanced() is added to the Ag3 and Af1 APIs.

review-notebook-app[bot] commented 1 month ago

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

sanjaynagi commented 1 month ago

ahhh, it seems I forgot I was working on a fork, as such, the google credentials dont work. Whats the best approach in future? @ahernank @alimanfoo

ahernank commented 1 month ago

Hi @sanjaynagi, a short-term solution, but you have permissions to write directly on the repo rather than using branches.

sanjaynagi commented 1 month ago

I've added CNV support, though currently the bar is copy number, but potentially it should be # extra copies. Here is how it currently looks (no CNVs here) image

alimanfoo commented 1 month ago

Thanks @sanjaynagi. Just to mention I updated this branch to bring in some unrelated test fixes from master.

alimanfoo commented 1 month ago

Very exciting, I'll give this a proper look asap!

codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 12.19512% with 108 lines in your changes missing coverage. Please review.

Project coverage is 95.82%. Comparing base (dc89c9c) to head (5048263). Report is 9 commits behind head on master.

:exclamation: Current head 5048263 differs from pull request most recent head ac84ac0

Please upload reports for the commit ac84ac0 to get more accurate results.

Files Patch % Lines
malariagen_data/anoph/dipclust.py 11.86% 104 Missing :warning:
malariagen_data/anoph/snp_frq.py 20.00% 4 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #537 +/- ## ========================================== - Coverage 98.61% 95.82% -2.80% ========================================== Files 38 38 Lines 3690 3810 +120 ========================================== + Hits 3639 3651 +12 - Misses 51 159 +108 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

sanjaynagi commented 1 month ago

This PR is almost ready, but im finding it really difficult to map copy numbers in plot_dendro_cnv_bar() to specific colours, within px.imshow.

Each CNV plot obviously uses its own min and max to specify the color range, but finding it strangely difficult to dynamically update the thresholds depending on what copy numbers are present in that plot. has this been done before in malariagen_data? please help @alimanfoo I am tired

Also the plot should be marginally zoomed in, I dont know why its not (not the end of the world).

alimanfoo commented 1 month ago

Hi @sanjaynagi, some investigation into the CNV plotting issue here: https://github.com/sanjaynagi/malariagen-data-python/pull/1

sanjaynagi commented 1 month ago

Some updates:

thanks to @alimanfoo for sorting the CNV colorscale problem.

Ive completely removed any usage of px.imshow() now and replaced with go.Heatmap() for heterozygosity and amino acid heatmaps as well as CNVs. This also simplifies the code further, as it seems that we now don't need to pass the x_range through various functions which we did previously.

I've added a check to see if there are actually any amino acids above the MAF threshold, if there are not, an amino acid heatmap will not be created.

Think we just need to maybe write some tests, and a final check from @alimanfoo, then we may be good to go! yeehaw!! Ill have a go at zooming in the x-axis aswell (currently its slightly zoomed out).

image

alimanfoo commented 1 month ago

Hi @sanjaynagi, some comments and suggestions over in https://github.com/sanjaynagi/malariagen-data-python/pull/2

sanjaynagi commented 1 month ago

@alimanfoo I've had a go at all the comments except the removing colorbars earlier one, cos I tried to do that in the past and it took me a while and I couldn't :)

alimanfoo commented 1 month ago

Have pushed a few suggested changes directly to this branch, hope that's OK. Still reviewing so may do a bit more.

alimanfoo commented 1 month ago

Hi @sanjaynagi, I pushed a few more changes, includes:

alimanfoo commented 1 month ago

Hi @sanjaynagi, I think this is basically good to merge. Because this is a PR from a fork the CI tests won't run, so I'm thinking to merge this PR then open up a follow on PR from a branch within the main repo to confirm tests all run. How does that sound? Anything else you'd like to do here before I merge?

sanjaynagi commented 1 month ago

Hey @alimanfoo . Sounds good to me. Nothing else to add currently.