malariagen / malariagen-data-python

Analyse MalariaGEN data from Python
https://malariagen.github.io/malariagen-data-python/latest/
MIT License

Investigate mutations on haplotypes #361

Open alimanfoo opened 1 year ago

alimanfoo commented 1 year ago

We often want to investigate which haplotypes carry which mutations, particularly non-synonymous SNPs and CNVs. This is usually in the context of investigating a locus under recent selection, where typically you want to...

(1) Discover a locus under selection
(2) Investigate how many distinct sweeps there are at the locus, and whether sweeps are shared between species and/or countries
(3) Investigate any mutations on the swept haplotypes which might be driving selection

Previously we have done point 3 of this workflow in two different ways...

(A) Add information about non-synonymous SNPs as an additional sub-plot on a haplotype clustering dendrogram. E.g.:

image

(B) Add information about non-synonymous SNPs onto a haplotype network by labelling the edges. E.g.:

image

We might ultimately want to split this issue into two issues, one covering haplotype dendrograms and one covering haplotype networks, but I'm raising this umbrella issue initially to discuss some common points.

alimanfoo commented 1 year ago

Previously @sanjaynagi and I have had lots of discussion about this, will try to surface some of that here :)

alimanfoo commented 1 year ago

A key issue here is that the phased haplotypes available from the main haplotypes data releases do not contain some SNPs that you may be interested in. For example, the haplotypes only contain biallelic SNPs, and so any multiallelic SNPs (such as Vgsc V402L) are not included. Also, any SNPs failing site filters are not included, but the site filters are very conservative and may filter out some SNPs you are interested in.
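
To make the gap concrete, here is a minimal sketch using made-up toy arrays (the names `n_alt` and `filter_pass` are hypothetical, not part of the malariagen_data API). It assumes a site counts as biallelic when exactly one alternate allele is observed, and flags the sites that would be absent from the phased haplotypes:

```python
import numpy as np

# Hypothetical toy data, one entry per SNP site.
# n_alt[i] = number of distinct alternate alleles observed at site i.
n_alt = np.array([1, 2, 1, 1, 3])
# filter_pass[i] = whether site i passes the (conservative) site filters.
filter_pass = np.array([True, True, False, True, True])

# Assumption: a site is in the phased haplotypes only if it is
# biallelic AND passes the site filters.
is_biallelic = n_alt == 1
in_haplotypes = is_biallelic & filter_pass

# Indices of sites that would be missing from the haplotype data.
missing_sites = np.flatnonzero(~in_haplotypes)
print(missing_sites.tolist())
```

In this toy example, sites 1 and 4 are dropped for being multiallelic and site 2 for failing the filters, even though any of them could be a SNP of interest.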

Previously we resolved this by performing some additional phasing of extra SNPs of interest onto the haplotype scaffold, using mvncall. However, this is awkward and painful to do, and not possible to support within the malariagen_data package in a general way.

This remains a difficult issue, and I think requires some rethinking of the approach.

DeribaAbera1234 commented 1 year ago

Dear Alistair, Thank you so much for the clarifications

sanjaynagi commented 1 year ago

Thanks Alistair for the summary, and great timing as I had also started to think about trying to implement something for the Af1000 project analyses, in which we will need to do this.

As we've spoken about in the past, perhaps it's time for us to try and retrieve amino acid mutations from diplotype clusters instead :)

alimanfoo commented 9 months ago

Hi @sanjaynagi, based on recent experience with haplotype clustering, it's likely the pairwise distance calculation will be the main performance bottleneck here, especially as the number of diplotypes increases.

I did a bit of exploration to see if we can improve performance of the calculation of pairwise distance between diplotypes, notebook here:

https://colab.research.google.com/drive/1v1R6j1RrywKmIgnpBuInNyeIueGlRGB1?usp=sharing
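
For readers without access to the notebook, here is a rough sketch of what a pairwise diplotype distance calculation can look like. This is not the notebook's code: the function name `diplotype_distances` is hypothetical, and it assumes biallelic genotype calls coded as 0 (ref) and 1 (alt), summarised as alt allele counts per sample and compared with SciPy's `pdist`.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def diplotype_distances(gt, metric="cityblock"):
    """Condensed pairwise distances between diplotypes (hypothetical helper).

    gt : int array of shape (n_sites, n_samples, 2) holding allele
         indices (0 = ref, >0 = alt). Missing calls coded as -1 are
         treated as ref here, which is a simplification.
    """
    # Alt allele count per genotype call: shape (n_sites, n_samples), values 0..2.
    ac = (gt > 0).sum(axis=2)
    # pdist expects one row per observation, so put samples on rows.
    return pdist(ac.T, metric=metric)


# Toy genotypes: 3 sites x 3 samples x ploidy 2.
gt = np.array([
    [[0, 0], [0, 1], [1, 1]],
    [[0, 0], [0, 0], [1, 1]],
    [[0, 1], [0, 1], [0, 0]],
])
dist = diplotype_distances(gt)   # condensed form, pairs (0,1), (0,2), (1,2)
square = squareform(dist)        # full symmetric (n_samples, n_samples) matrix
print(dist.tolist())
```

The cost grows with the square of the number of diplotypes, which is why this step tends to dominate as sample sets get larger.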

sanjaynagi commented 9 months ago

Awesome, looks to be quite a lot quicker than the older code I was working with...

image

sanjaynagi commented 9 months ago

So, for when I go to implement the functions in malariagen_data, here are some things that we discussed. This is a big chunk of work, so it will be best to break it up into two or more PRs.

  1. Simple diplotype clustering dendrogram on its own (euclidean + cityblock as metric options)

  2. Plotly functions to plot tracks. Each probably needs to take a pandas DataFrame with two columns, sample_id and value, optionally re-order it by a given list of sample_ids, and plot the values as a track, returning a plotly figure.

    • heterozygosity function
    • cut the dendrogram (obtain clusters) function
    • CNV function
  3. Amino acid variant x samples table.
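
Items 1 and 2 above could be prototyped roughly as follows. This is only a sketch under stated assumptions, not the planned malariagen_data implementation: the toy data, the helper `order_track`, complete linkage, and the definition of heterozygosity as the fraction of sites with exactly one alt allele are all illustrative choices. Plotting is left out; the re-ordered DataFrame is the kind of input a plotly track function could take.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, leaves_list, linkage
from scipy.spatial.distance import pdist

# Hypothetical toy input: alt allele counts, shape (n_samples, n_sites).
sample_ids = ["S1", "S2", "S3", "S4"]
ac = np.array([
    [0, 0, 1, 0],
    [0, 0, 1, 1],
    [2, 2, 0, 2],
    [2, 2, 0, 1],
])

# Item 1: diplotype clustering. metric could be "cityblock" or "euclidean".
dist = pdist(ac, metric="cityblock")
Z = linkage(dist, method="complete")  # linkage method is an assumption

# "Cut the dendrogram": assign each sample to one of at most 2 clusters.
clusters = fcluster(Z, t=2, criterion="maxclust")

# Left-to-right order of samples along the dendrogram leaves.
leaf_order = [sample_ids[i] for i in leaves_list(Z)]

# Item 2: a per-sample track as a two-column DataFrame (sample_id, value).
# Here: heterozygosity as the fraction of sites with exactly one alt allele.
df_het = pd.DataFrame({
    "sample_id": sample_ids,
    "value": (ac == 1).mean(axis=1),
})


def order_track(df, sample_order):
    """Re-order a (sample_id, value) track to follow a given sample_id list."""
    return df.set_index("sample_id").loc[sample_order].reset_index()


df_track = order_track(df_het, leaf_order)
print(df_track)
```

Keeping every track in the same `sample_id`/`value` shape means the heterozygosity, cluster assignment and CNV tracks could all share one plotting function and one ordering step.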