malariagen / malariagen-data-python

Analyse MalariaGEN data from Python
https://malariagen.github.io/malariagen-data-python/latest/
MIT License

Add option to filter samples by max heterozygosity in advanced diplotype clustering #557

Open alimanfoo opened 2 weeks ago

alimanfoo commented 2 weeks ago

When doing diplotype clustering, we are often only interested in clusters with low heterozygosity, because these are representative of swept haplotypes. So it might be useful to have a parameter in the advanced diplotype clustering function to remove samples above a certain heterozygosity threshold.

sanjaynagi commented 2 weeks ago

How would you like to approach this?

  1. Run the heterozygosity trace function before anything else, returning a list of samples that pass the threshold.
  2. Pass those samples to the other functions using sample_query?

It's a bit tricky because the het_bar function needs the dendrogram sample order, so we can't run it before the dendrogram is built, but we also don't want to run it twice.
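One possible way around the ordering problem, as a rough sketch only: compute heterozygosity once per sample, keyed by sample ID rather than by dendrogram position, then reuse those values both for filtering and for the bar once the leaf order is known. Everything below is hypothetical and not part of the malariagen_data API: the function names, the `max_heterozygosity` parameter, the toy genotypes, and the assumption that `sample_query` accepts a pandas-style query on a `sample_id` column.

```python
def sample_heterozygosity(genotypes):
    """Fraction of called sites that are heterozygous for one diploid sample.

    `genotypes` is a list of (allele1, allele2) tuples; a negative allele
    marks a missing call and is excluded from the denominator.
    """
    called = [(a, b) for a, b in genotypes if a >= 0 and b >= 0]
    if not called:
        return float("nan")
    return sum(a != b for a, b in called) / len(called)


def filter_by_max_heterozygosity(genotypes_by_sample, max_heterozygosity):
    """Compute heterozygosity once per sample and list the samples passing the threshold."""
    het = {s: sample_heterozygosity(g) for s, g in genotypes_by_sample.items()}
    # NaN compares False, so samples with no called sites are dropped too.
    passing = [s for s, h in het.items() if h <= max_heterozygosity]
    return het, passing


def order_by_leaves(het, leaf_order):
    """Reorder the precomputed values to match the dendrogram leaf order."""
    return [het[s] for s in leaf_order]


# Toy data: three samples, four sites each (hypothetical).
genotypes_by_sample = {
    "S1": [(0, 0), (1, 1), (0, 0), (0, 0)],
    "S2": [(0, 1), (0, 1), (1, 1), (0, 0)],
    "S3": [(0, 0), (0, 1), (0, 0), (-1, -1)],
}

het, passing = filter_by_max_heterozygosity(genotypes_by_sample, max_heterozygosity=0.4)

# Assumed query format; the real sample_query syntax may differ.
sample_query = f"sample_id in {passing}"

# Later, once the dendrogram exists, look up values instead of recomputing.
leaf_order = ["S3", "S1"]  # stand-in for the dendrogram's leaf order
het_bar_values = order_by_leaves(het, leaf_order)
```

Because the values are stored by sample ID, the heterozygosity pass runs once and the bar only needs a lookup after clustering. Whether this fits the existing het_bar code path is an open question.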

alimanfoo commented 2 weeks ago

Hm, I forgot about the dendrogram. Maybe this is a bit harder than I imagined. Definitely would be better to avoid running anything twice if possible. Will have a think...