Open alimanfoo opened 1 year ago
Previously @sanjaynagi and I have had lots of discussion about this, will try to surface some of that here :)
A key issue here is that the phased haplotypes available from the main haplotypes data releases do not contain some SNPs that you may be interested in. For example, the haplotypes only contain biallelic SNPs, and so any multiallelic SNPs (such as Vgsc V402L) are not included. Also, any SNPs failing site filters are not included, but the site filters are very conservative and may filter out some SNPs you are interested in.
Previously we resolved this by performing some additional phasing of extra SNPs of interest onto the haplotype scaffold, using mvncall. However, this is awkward and painful to do, and not possible to support within the malariagen_data package in a general way.
This remains a difficult issue, and I think requires some rethinking of the approach.
Dear Alistair, Thank you so much for the clarifications
Thanks Alistair for the summary, and great timing as I had also started to think about trying to implement something for the Af1000 project analyses, in which we will need to do this.
As we've spoken about in the past, perhaps it's time for us to try and retrieve amino acid mutations from diplotype clusters instead :)
Hi @sanjaynagi, based on recent experience with haplotype clustering, it's likely the pairwise distance calculation will be the main performance bottleneck here, especially as the number of diplotypes increases.
I did a bit of exploration to see if we can improve performance of the calculation of pairwise distance between diplotypes, notebook here:
https://colab.research.google.com/drive/1v1R6j1RrywKmIgnpBuInNyeIueGlRGB1?usp=sharing
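For readers without access to the notebook, here is a minimal sketch of what a pairwise diplotype distance calculation could look like. It is not the code from the notebook: the function name `diplotype_pairwise_distance` and the toy genotype array are made up for illustration, and it assumes diplotypes are encoded as counts of non-reference alleles per site, with missing calls simply counted as zero.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def diplotype_pairwise_distance(gt, metric="cityblock"):
    """Pairwise distance between diplotypes (hypothetical helper).

    gt : array of shape (n_variants, n_samples, 2) holding allele indices,
         where 0 is the reference allele and negative values are missing.
    Returns the condensed distance vector produced by scipy's pdist.
    """
    gt = np.asarray(gt)
    # Count non-reference alleles per diplotype (0, 1 or 2 at each site).
    # Assumption: missing calls (negative) count as zero, a simplification.
    alt_counts = (gt > 0).sum(axis=2)
    # pdist expects one row per observation, so transpose to
    # (n_samples, n_variants) before computing distances.
    return pdist(alt_counts.T.astype("f8"), metric=metric)


# Toy data: 3 variants x 3 samples x ploidy 2.
gt = np.array([
    [[0, 0], [0, 1], [1, 1]],
    [[0, 0], [1, 1], [1, 1]],
    [[0, 1], [0, 1], [0, 0]],
])
dist = diplotype_pairwise_distance(gt, metric="cityblock")
dist_square = squareform(dist)  # full symmetric (n_samples, n_samples) matrix
```

The condensed form avoids storing both halves of the symmetric matrix, which matters as the number of diplotypes grows.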
Awesome, looks to be quite a lot quicker than the older code I was working with...
So, for when I go to implement the functions in malariagen_data, here are some things that we discussed. This is a big chunk of work, so it will be best to break it up into two or more PRs.
Simple diplotype clustering dendrogram on its own (euclidean + cityblock as metric options)
Plotly functions to plot tracks. Probably needs to take a pandas dataframe with two columns, sample_id and value, potentially re-order by given sample_id list, and plot values in track, returning a plotly fig.
Amino acid variant x samples table.
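To make the first two items a bit more concrete, here is a rough sketch of how the dendrogram leaf order could drive the ordering of a track. None of this is existing malariagen_data API: `dendrogram_sample_order` and `order_track` are hypothetical names, the input is assumed to be a samples x variants array of non-reference allele counts, and the linkage method is an arbitrary choice. The plotly plotting itself is left out.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import pdist


def dendrogram_sample_order(alt_counts, sample_ids, metric="cityblock",
                            method="complete"):
    """Return sample IDs in dendrogram leaf order (hypothetical helper).

    alt_counts : array of shape (n_samples, n_variants) with allele counts.
    metric     : "cityblock" or "euclidean", passed through to pdist.
    """
    dist = pdist(np.asarray(alt_counts, dtype="f8"), metric=metric)
    tree = linkage(dist, method=method)
    return [sample_ids[i] for i in leaves_list(tree)]


def order_track(df, sample_order):
    """Re-order a track dataframe (columns: sample_id, value) to match
    the given list of sample IDs (hypothetical helper)."""
    return df.set_index("sample_id").loc[sample_order].reset_index()


# Toy data: A and C are near-identical, as are B and D.
sample_ids = ["A", "B", "C", "D"]
alt_counts = np.array([
    [0, 0, 0],
    [2, 2, 2],
    [0, 0, 1],
    [2, 2, 1],
])
track = pd.DataFrame({"sample_id": sample_ids, "value": [0, 1, 0, 1]})

order = dendrogram_sample_order(alt_counts, sample_ids)
ordered_track = order_track(track, order)
```

Keeping the re-ordering separate from the plotting means the same track function could be reused under a haplotype or a diplotype dendrogram.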
We commonly would like to be able to investigate which haplotypes carry which mutations, particularly non-synonymous SNPs and CNVs. This is usually in the context of investigating a locus under recent selection, where typically you want to...
(1) Discover a locus under selection
(2) Investigate how many distinct sweeps there are at the locus, and whether sweeps are shared between species and/or countries
(3) Investigate any mutations on the swept haplotypes which might be driving selection
Previously we have done point 3 of this workflow in two different ways...
(A) Add information about non-synonymous SNPs as an additional sub-plot on a haplotype clustering dendrogram. E.g.:
(B) Add information about non-synonymous SNPs onto a haplotype network by labelling the edges. E.g.:
We might ultimately want to split this issue into two issues, one covering haplotype dendrograms and one covering haplotype networks, but I'm raising this umbrella issue initially to discuss some common points.