hyanwong / giglib

MIT License

Definition of genetic diversity in a GIG #1

Open hyanwong opened 1 year ago

hyanwong commented 1 year ago

From @petrelharp:

I think the next thing to do is to figure out what we want to use it for: what things do we want to compute? Then we can see whether they can be computed easily this way. For instance, "how to summarize genetic diversity with lots of rearrangements/indels" is an unsolved question; maybe this suggests a natural way to do it.

racalzadilla commented 11 months ago

Following up on the comment posted, I'd like to understand what scope you have in mind for this structure. I assume you're probably most interested in the genealogical (as opposed to phylogenetic) level? Would you also include the next level down (ontogenic)? Defining the scope in terms of the scale of evolution, but also in terms of taxa (or rather, taxonomic boundaries), would help define the kind of variation considered.

hyanwong commented 11 months ago

Hi @racalzadilla - glad you are interested in this topic! I think there should be "branch length" versions of genetic diversity measures which apply to a GIG. The difficulty is that a GIG should sometimes be seen as a set of "local graphs" rather than "local trees", so calculations may be more complicated.

@duncanMR: this might be a fruitful avenue to pursue in the quest for a balance between using a tree structure with duplicated nodes and a graph structure. It may be that calculations of diversity are better addressed using one rather than the other. Certainly worth chewing over, and also consulting Peter Ralph, I think.
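
To make the "branch length" idea concrete, here is a minimal sketch of branch-length diversity computed over a set of local trees, i.e. the simple case without structural variation. It is not giglib code: the `(span, parent)` representation, the `time` mapping, and both function names are assumptions made for illustration. Handling "local graphs" would need a different traversal.

```python
from itertools import combinations


def mrca_time(parent, time, a, b):
    """Time of the most recent common ancestor of a and b in one local tree."""
    ancestors = set()
    u = a
    while u is not None:
        ancestors.add(u)
        u = parent.get(u)
    u = b
    while u is not None and u not in ancestors:
        u = parent.get(u)
    if u is None:
        raise ValueError("no common ancestor in this local tree")
    return time[u]


def branch_diversity(local_trees, time, samples):
    """Span-weighted mean pairwise branch length separating the samples.

    local_trees: list of (span, parent) pairs, where parent maps child -> parent
                 (hypothetical representation, not a giglib structure).
    time:        maps node id -> node time, with samples typically at time 0.
    """
    pairs = list(combinations(samples, 2))
    total_span = sum(span for span, _ in local_trees)
    weighted = 0.0
    for span, parent in local_trees:
        # Path length between a and b through their MRCA in this tree.
        path_sum = sum(
            2 * mrca_time(parent, time, a, b) - time[a] - time[b]
            for a, b in pairs
        )
        weighted += span * path_sum / len(pairs)
    return weighted / total_span


# Toy example: three samples (0, 1, 2) and two local trees.
time = {0: 0.0, 1: 0.0, 2: 0.0, 3: 1.0, 4: 3.0, 5: 2.0}
tree_a = (6, {0: 3, 1: 3, 3: 4, 2: 4})  # ((0,1),2), root at time 3
tree_b = (4, {1: 3, 2: 3, 3: 5, 0: 5})  # (0,(1,2)), root at time 2
diversity = branch_diversity([tree_a, tree_b], time, [0, 1, 2])
```

With no SVs, a quantity like this should be comparable to a branch-mode diversity statistic on the equivalent tree sequence.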

duncanMR commented 11 months ago

@racalzadilla Thanks for the question! We are initially focusing on developing the algorithms and theory of the GIG as a tool for genealogical popgen. This will involve building a GIG simulator that can account for genomes of varying lengths, with an innate alignment based on the genealogy. If we can achieve that, we think that there is exciting potential to resolve phylogenies, since we would not be limited by a fixed genome length as we are in the case of ARGs. I don't think we've discussed possibilities of using GIGs at the ontogenic level, but @hyanwong correct me if I'm wrong? We have considered simulating the progression of a cancer tumour with a GIG, since we can account for the structural variation without worrying about recombination.

hyanwong commented 11 months ago

I think the easiest thing to do as a starter is to consider the role of population genetic statistics in telling us about (a) population structure (e.g. via definitions of Ne / random/nonrandom mating) and (b) selection. I think we should be able to look at the effects both of these processes have on a GIG, and derive some sensible ways of measuring features of the GIG (e.g. branch lengths / coalescent densities) that inform these quantities.

hyanwong commented 8 months ago

We can use the MRCA-finder to define a branch-length distance between any two genomes, measured from the point of view of their shared ancestral segments. Because it is calculated from the PoV of the MRCA, it should be symmetric (if there are duplications between the MRCA and one of the samples, we can average the branch lengths or mutational distances). I think this should make it a valid metric.
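
As a rough sketch of the distance described above: assume the MRCA-finder has already been run and its output reduced to a list of shared ancestral segments, each with the branch-length paths leading down to the two genomes. The `SharedSegment` class and `branch_distance` function are hypothetical names for illustration, not part of giglib. Duplications show up as more than one path to the same genome, and are averaged.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class SharedSegment:
    """One MRCA segment shared by genomes u and v (hypothetical structure)."""
    span: float          # length of the segment in the MRCA's coordinates
    lengths_to_u: tuple  # branch length of each path from the MRCA down to u
    lengths_to_v: tuple  # more than one entry if the segment was duplicated


def branch_distance(segments):
    """Span-weighted mean of (MRCA -> u) + (MRCA -> v) over shared segments."""
    total_span = sum(s.span for s in segments)
    if total_span == 0:
        raise ValueError("the two genomes share no ancestral material")
    weighted = sum(
        s.span * (fmean(s.lengths_to_u) + fmean(s.lengths_to_v))
        for s in segments
    )
    return weighted / total_span


def swapped(segments):
    """The same segments with the roles of u and v exchanged."""
    return [SharedSegment(s.span, s.lengths_to_v, s.lengths_to_u) for s in segments]


segments = [
    SharedSegment(span=100, lengths_to_u=(2.0,), lengths_to_v=(2.0,)),
    # This segment was duplicated on the way to u, so two paths are averaged.
    SharedSegment(span=50, lengths_to_u=(3.0, 5.0), lengths_to_v=(4.0,)),
]
d_uv = branch_distance(segments)
d_vu = branch_distance(swapped(segments))
```

Weighting by span in the MRCA's coordinates is what makes the value independent of which genome is treated as `u`. Whether the triangle inequality also holds once duplications are averaged is something this sketch does not check.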

Probably the first thing to do here is to create a function that averages this metric between two nodes (which may be complicated if each node has multiple chromosomes). We can then test that various properties hold when the GIG has no SVs (comparing against the equivalent tree sequence), and see how they change as we introduce SVs.
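
One possible shape for that averaging function, under the assumption that a per-chromosome-pair distance and the amount of shared material are already available: weight each chromosome pair by its shared span, so that pairs sharing no ancestral material contribute nothing. The function name and the input dictionary are hypothetical, and span-weighting is just one choice among several.

```python
def node_distance(per_chromosome_pair):
    """Span-weighted average distance between two multi-chromosome nodes.

    per_chromosome_pair maps (chromosome_of_u, chromosome_of_v) to a
    (shared_span, distance) tuple. This layout is an assumption made for
    illustration, not a giglib structure.
    """
    total_span = sum(span for span, _ in per_chromosome_pair.values())
    if total_span == 0:
        raise ValueError("the two nodes share no ancestral material")
    weighted = sum(span * dist for span, dist in per_chromosome_pair.values())
    return weighted / total_span


pairs = {
    ("u1", "v1"): (100, 4.0),
    ("u1", "v2"): (0, 0.0),    # no shared material, so no contribution
    ("u2", "v2"): (300, 8.0),
}
d_nodes = node_distance(pairs)
```

For a GIG with no SVs and one chromosome per node, this reduces to the single-pair distance, which is the case to compare against a tree sequence.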