Use-case: whole genome duplication / polyploidy

hyanwong commented 9 months ago

@szhan suggests another use-case for a GIG: to simulate whole genome duplication (WGD) followed by gene loss, so that we can look at the concept of orthology / paralogy. This would require some element of selection to remove duplicates at random from one of the duplicated genomes. I can imagine a simple simulator, with only one (haploid) chromosome, initialised using an msprime simulation (see #14), where the chromosome gets duplicated somehow. The simulator would then gradually delete regions at random. Crossovers would occur between non-deleted regions.
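To make the "duplicate, then gradually delete regions at random" idea concrete, here is a minimal pure-Python sketch. It is only a toy under stated assumptions: it does not use msprime or the GIG library, every name in it (`duplicate`, `delete_region`, `covers`, `simulate_loss`) is hypothetical, and "selection" is reduced to one crude rule, namely that a region may be deleted from one copy only if the other copy still carries it in full. Crossovers are not modelled.

```python
import random


def duplicate(genome_length):
    """Return two full-length copies, each a list of retained half-open intervals."""
    return {"copy_A": [(0, genome_length)], "copy_B": [(0, genome_length)]}


def delete_region(intervals, left, right):
    """Remove [left, right) from a sorted list of half-open (start, end) intervals."""
    kept = []
    for a, b in intervals:
        if b <= left or a >= right:
            kept.append((a, b))  # no overlap with the deletion: keep as is
            continue
        if a < left:
            kept.append((a, left))  # piece surviving to the left of the deletion
        if b > right:
            kept.append((right, b))  # piece surviving to the right of the deletion
    return kept


def covers(intervals, left, right):
    """True if a single retained interval spans all of [left, right)."""
    return any(a <= left and right <= b for a, b in intervals)


def simulate_loss(genome_length, n_attempts, max_deletion, seed=None):
    """Attempt n_attempts random deletions after a whole genome duplication.

    Assumed toy rule (not from the GIG library): a deletion is rejected
    unless the other copy still retains the whole region.
    """
    rng = random.Random(seed)
    copies = duplicate(genome_length)
    for _ in range(n_attempts):
        target, other = rng.sample(sorted(copies), 2)
        size = rng.randint(1, max_deletion)
        left = rng.randrange(0, genome_length - size + 1)
        if covers(copies[other], left, left + size):
            copies[target] = delete_region(copies[target], left, left + size)
    return copies


def retained_length(intervals):
    """Total number of positions still present in one copy."""
    return sum(b - a for a, b in intervals)
```

For example, `simulate_loss(1000, 50, 40, seed=1)` returns the retained intervals for each copy. Regions left in only one copy would be the candidates for single-copy (post-loss) genes.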

A more sophisticated simulation would reduce crossover between chromosomes that became "too different" somehow, either due to accumulation of mutations or due to synteny being lost. This could be the hardest part of a simulation to implement.

The "classic" scenario, according to @szhan, is to "have diploids and tetraploids with gene flow between them via triploids", i.e. "a single species with multiple cytotypes segregating".

It is unclear to me when it would become necessary to identify the four chromosomes in a tetraploid as consisting of two separate pairs, or what this means for using chromosome IDs in the GIG format (see #11).

racalzadilla commented 6 months ago

@hyanwong Cases in which it is relevant to "identify the four chromosomes in a tetraploid as consisting of two separate pairs" are in plant genetics. For allotetraploidy, which is the case you're considering, you'd have homeologous pairs (the e is not a typo), something like AABB. Their segregation, assortment and crossover propensities have added complexities in which the homeology matters. You could think of it as more elaborate phases within a genome.

These ploidy shenanigans you describe in this thread are rampant amongst plants; often (iterated) hybridization is the culprit. For some reason botanists don't follow Mayr's species concept. Here's a fresh case I just found: https://www.nature.com/articles/s41467-023-38829-3. In my experience, cases with exotic ploidy are never what they seem, i.e. a hexaploid is never really 6n, but something crazy like AABBDD (this is the case in one kind of wheat): https://www.nature.com/articles/nature11997

Hope that's of some help!

hyanwong commented 3 months ago

Once we fix #103, I think we should be fine to simulate evolution under different ploidies. For each node we will need to define (potentially different) chromosome IDs, but I think that's fine. The chromosome IDs need not correspond to the chromosome "numbers" used by cytologists. This would be fine to model allopolyploidy, and also for tackling things like #12, where all the chromosomes for a given "node" (i.e. that originate from one of the gametes, e.g. the "paternal chromosomes") are different from each other, and do not pair with each other in meiosis, but only pair with genetic material that came from the "other" gamete (i.e. the "maternal chromosomes").
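As a rough illustration of the pairing rule described above (chromosomes from one gamete pair only with the matching chromosome from the other gamete), here is a small Python sketch. The dictionary layout and the `meiotic_pairs` helper are hypothetical and not part of the GIG library. The labels "A" and "B" stand for the two subgenomes of an AABB allotetraploid.

```python
def meiotic_pairs(maternal, paternal):
    """Pair chromosomes that share a chromosome ID across the two gamete nodes.

    Chromosomes within the same node never pair with each other here,
    which is the allopolyploid-style assumption made in this sketch.
    """
    shared = sorted(set(maternal["chromosomes"]) & set(paternal["chromosomes"]))
    return [((maternal["node"], c), (paternal["node"], c)) for c in shared]


# Hypothetical AABB individual: each gamete node carries one A and one B chromosome
maternal = {"node": 0, "chromosomes": ["A", "B"]}
paternal = {"node": 1, "chromosomes": ["A", "B"]}

pairs = meiotic_pairs(maternal, paternal)
# pairs == [((0, "A"), (1, "A")), ((0, "B"), (1, "B"))]
```

Because the IDs are arbitrary labels, nothing here requires them to match cytological chromosome numbers.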

What we might struggle to do, in the current framework, is allow recombination between two chromosomes associated with the same node (i.e. two duplicated maternal chromosomes). This is the sort of thing that can happen in autopolyploidy. Reworking the simulation framework to allow for this would require some messing around with the find_mrca_regions() routine. In particular, we would need to pass in chromosomes to find_mrca_regions(), rather than node IDs. This could be a hassle because we don't really want to try every combination of chromosome pairs to find the best match.
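To show why trying every chromosome pair is unattractive, the sketch below simply enumerates the candidate pairs that a chromosome-level search would have to consider. It does not call or reproduce find_mrca_regions(). The `candidate_pairs` helper and the (node ID, chromosome ID) tuples are assumptions made for illustration only.

```python
from itertools import combinations, product


def candidate_pairs(chroms_u, chroms_v, allow_within_node=False):
    """List chromosome pairs that could be compared for shared ancestry.

    chroms_u and chroms_v are lists of (node_id, chromosome_id) tuples.
    With allow_within_node=True, pairs from the same node are included too,
    as would be needed for autopolyploid-style recombination.
    """
    pairs = list(product(chroms_u, chroms_v))  # one chromosome from each node
    if allow_within_node:
        pairs += list(combinations(chroms_u, 2))  # both from node u
        pairs += list(combinations(chroms_v, 2))  # both from node v
    return pairs


# Hypothetical autotetraploid: each gamete node carries two copies of one chromosome
node_u = [(0, "1a"), (0, "1b")]
node_v = [(1, "1a"), (1, "1b")]

across_only = candidate_pairs(node_u, node_v)
with_within = candidate_pairs(node_u, node_v, allow_within_node=True)
```

With k chromosomes per node, the cross-node pairs number k * k, and allowing within-node pairs adds another k * (k - 1), so the count grows quadratically with ploidy.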

hyanwong / GeneticInheritanceGraphLibrary

Use-case: whole genome duplication / polyploidy #15