hyanwong / giglib

MIT License

Should GIGs be allowed to have a node at time=infinity? #95

Closed hyanwong closed 8 months ago

hyanwong commented 8 months ago

At the moment we allow a GIG to have a "grand MRCA node" at time=infinity, which lets us find MRCA regions even if there are multiple roots. In the existing simulator code, this infinite-time MRCA is deliberately removed from the GIG on output, but it is still present when we turn the simulation-stored tables directly into a GIG.

Infinite node times aren't allowed in tskit. Since the find_mrca function works on tables, we could simply ban infinite-time nodes from being frozen into a GIG and expect the function to be run only on tables, not on the frozen GIG. Alternatively, we could raise an error when we export such a GIG to a tree sequence, or we could add an argument to the to_tree_sequence method that removes infinite-time nodes.

Whichever option we pick, we should probably make an active decision about whether we deviate from the tskit paradigm here.
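
To make the export options concrete, here is a minimal sketch of the "raise an error" and "remove infinite-time nodes" behaviours. It works on plain NumPy arrays and (parent, child) tuples rather than on giglib or tskit tables. The function name `drop_infinite_time_nodes` and the `on_infinite` argument are hypothetical, invented for illustration, and are not part of the giglib API.

```python
import numpy as np


def drop_infinite_time_nodes(node_times, edges, on_infinite="remove"):
    """
    Hypothetical helper, not part of giglib.

    node_times: sequence of floats, one per node (may contain inf).
    edges: iterable of (parent_id, child_id) tuples.
    on_infinite: "remove" drops infinite-time nodes and their edges,
        "error" raises if any infinite-time node is present.
    Returns (new_node_times, new_edges) with node IDs renumbered.
    """
    node_times = np.asarray(node_times, dtype=float)
    edges = list(edges)
    is_inf = np.isinf(node_times)

    if on_infinite == "error":
        if is_inf.any():
            raise ValueError("Infinite-time nodes cannot be exported")
        return node_times, edges
    if on_infinite != "remove":
        raise ValueError("on_infinite must be 'remove' or 'error'")

    keep = ~is_inf
    # Map old node IDs to new ones; removed nodes map to -1.
    new_id = np.full(len(node_times), -1, dtype=int)
    new_id[keep] = np.arange(keep.sum())
    # Drop any edge that touches a removed node, renumber the rest.
    new_edges = [
        (int(new_id[p]), int(new_id[c]))
        for p, c in edges
        if keep[p] and keep[c]
    ]
    return node_times[keep], new_edges
```

Removing the grand MRCA node this way simply leaves its former children as separate roots, which is what the simulator output does at the moment.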

hyanwong commented 8 months ago

One implication of this concerns what we do when comparing the average genomic distance between two genomes, one of which has material not present in the other (e.g. because there has been a deletion). Such a region (which may not have any MRCA, and for which we could return NaN) will differ from a region where the MRCA is at time=infinity.
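
As a rough illustration of that distinction, the sketch below averages MRCA times over regions, weighting by region length. It assumes one possible convention (not necessarily what giglib does or should do): regions with no MRCA are passed as `None` and skipped, while a region whose MRCA is at time=infinity makes the whole average infinite. The function name and the input format are made up for this example, and region lengths are assumed to be positive.

```python
import math


def weighted_mean_mrca_time(regions):
    """
    Hypothetical helper, not part of giglib.

    regions: list of (length, mrca_time) pairs, where mrca_time is
        None if the two genomes share no MRCA in that region (e.g.
        the material is deleted in one of them), and may be math.inf
        if the MRCA is the infinite-time "grand MRCA".
    Returns NaN if no region has an MRCA at all.
    """
    total_length = 0
    weighted_sum = 0.0
    for length, mrca_time in regions:
        if mrca_time is None:
            continue  # no shared material: contributes nothing
        weighted_sum += length * mrca_time
        total_length += length
    if total_length == 0:
        return math.nan
    return weighted_sum / total_length
```

Under this convention, `[(10, 2.0), (5, None)]` averages to a finite value, `[(10, 2.0), (5, math.inf)]` gives infinity, and `[(5, None)]` gives NaN.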

hyanwong commented 8 months ago

@duncanMR and I decided that we should allow nodes at time=infinity, but that we should check that a node at time=inf is never the parent of another node at time=inf.
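
A minimal sketch of what such a check might look like, again on plain arrays and (parent, child) tuples rather than on the real tables. The function name is hypothetical and where the check would live in giglib is left open.

```python
import numpy as np


def check_no_infinite_parent_of_infinite_child(node_times, edges):
    """
    Hypothetical validation helper, not part of giglib.

    Raises ValueError if any edge joins a parent at time=inf to a
    child that is also at time=inf. Finite-time children of an
    infinite-time parent are allowed.
    """
    node_times = np.asarray(node_times, dtype=float)
    for parent, child in edges:
        if np.isinf(node_times[parent]) and np.isinf(node_times[child]):
            raise ValueError(
                f"Node {parent} at time=inf is the parent of "
                f"node {child}, which is also at time=inf"
            )
```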