tskit-dev / msprime

Simulate genealogical trees and genomic sequence data using population genetic models
GNU General Public License v3.0

Pedigree simulation support for maintaining all nodes (even if unary) #1898

Open hyanwong opened 2 years ago

hyanwong commented 2 years ago

The pedigree simulation code does not seem to produce "unary" nodes for the intermediate nodes in the tree sequence, so even if a lineage passes through a known node, this information is lost. For instance, consider the following simulation:

import msprime

# Create a WF pedigree of 5 individuals for 4 generations
N = 5
tables = msprime.pedigrees.sim_pedigree(
    population_size=N, random_seed=1234, end_time=3
)

tables.sequence_length = 100

# Simulate on top of the pedigree
ped_ts = msprime.sim_ancestry(
    initial_state=tables,
    model="wf_ped",
    recombination_rate=1,
    random_seed=12345,
)
print(ped_ts.first().draw_text())

This gives:

  0       7        8      3  
  ┃       ┃        ┃      ┃  
  ┃      10        ┃     14  
  ┃    ┏━━╋━━━┓    ┃    ┏━┻━┓
 27    ┃  ┃  26    ┃   21   ┃
 ┏┻━┓  ┃  ┃  ┏┻━┓  ┃  ┏━┻┓  ┃
30 35 31 32 34 38 33 36 39 37

If all nodes on a lineage were retained, we should have unary nodes at the same timepoints as nodes 10 and 14 on each of the other lineages. Indeed, these unary nodes, which appear to have been simplified away, might even be coalescent nodes elsewhere in the simulation. I suspect this would be useful extra information to be able to retain in some circumstances.
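To make the "simplified away" point concrete, here is a minimal plain-Python sketch of what removing unary nodes does to a single tree. It is a toy illustration only, not msprime's or tskit's implementation: the `collapse_unary` helper, the child-to-parent dictionary representation, and the node names (`u1`, `u2`, and so on) are all hypothetical.

```python
def collapse_unary(parent):
    """Return a copy of a child -> parent map with unary nodes removed.

    A node is "unary" here if it has exactly one child in the tree.
    Toy helper for illustration; names and representation are assumptions.
    """
    # Build the list of children for every node that appears as a parent.
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)

    def is_unary(node):
        return len(children.get(node, [])) == 1

    collapsed = {}
    for node, par in parent.items():
        if is_unary(node):
            continue  # pass-through node: dropped from the output
        # Climb past any unary ancestors to the first non-unary one.
        while par is not None and is_unary(par):
            par = parent.get(par)
        if par is not None:
            collapsed[node] = par
    return collapsed


# Hypothetical tree: u1 and u2 are nodes that lineages merely pass through.
full = {"a": "c", "b": "c", "c": "u1", "u1": "r", "d": "u2", "u2": "r"}
simplified = collapse_unary(full)
print(simplified)
```

After collapsing, `c` and `d` attach directly to `r`, and nothing in the output records that the lineages ever passed through `u1` or `u2`. That is the information this issue asks to be able to keep.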

jeromekelleher commented 2 years ago

I guess this is the record_full_arg=True option for the FixedPedigree model?

hyanwong commented 2 years ago

> I guess this is the record_full_arg=True option for the FixedPedigree model?

I'm not sure that's quite the same. For instance, imagine if there is no recombination. We might still want all the intermediate nodes on all lineages present.
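As a rough sketch of the no-recombination case, the toy below follows one lineage up a fixed pedigree and records every individual it passes through. Everything in it is an assumption made for illustration: the `trace_lineage` helper, the individual names, and the simplification of treating each individual as a single node (real pedigree simulations track two genomes per diploid individual).

```python
import random


def trace_lineage(pedigree, start, rng):
    """Follow one lineage upward, recording every individual it visits.

    `pedigree` maps an individual to a tuple of its parents; founders
    have no entry. Toy helper for illustration only.
    """
    path = [start]
    current = start
    while pedigree.get(current):  # stop at founders
        # With no recombination the whole genome comes from one parent.
        current = rng.choice(pedigree[current])
        path.append(current)
    return path


# Hypothetical three-generation pedigree: individual -> (parent, parent)
pedigree = {
    "g0_a": ("g1_a", "g1_b"),
    "g1_a": ("g2_a", "g2_b"),
    "g1_b": ("g2_b", "g2_c"),
}
path = trace_lineage(pedigree, "g0_a", random.Random(1))
print(path)
```

The path always contains one individual per generation, even though no recombination or coalescence happens along it. Retaining all nodes would mean keeping every entry of such a path, not just the points where lineages split or merge.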