hyanwong / giglib

MIT License
4 stars 2 forks source link

The Genetic Inheritance Graph Library (giglib)

A basic repo for kicking around ideas for the "(Generalised) Genetic Inheritance Graph" structure, which should be able to capture genetic inheritance with genomic rearrangements, and hence describe inherited structural variation. This is not meant to be stable software, and the API is subject to change at any time, but for the moment you can try it out by converting a succinct tree sequence to a GIG (although such a gig will not contain structural variation)

import msprime
import giglib as gigl  # 😁

ts = msprime.sim_ancestry(
    4, sequence_length=100, recombination_rate=0.01, random_seed=1
)
gig = gigl.Graph.from_tree_sequence(ts)
print(len(gig.nodes), "nodes in this GIG")

Basic idea

In tskit we use edge annotations to describe which pieces of DNA are inherited in terms of a left and right coordinate. It should be possible to extend this to track the L & R in the edge child, and the L & R in the edge parent separately. The left and right values in each case refer to the coordinate system of the child and parent respectively.

I'm calling an extended tree-sequence-like structure such as this, a GIG ("Generalised (or Genetic) Inheritance Graph"). For terminological clarity, we switch to using the term interval-edge (iedge) to refer to what is normally called an edge in a tskit Tree Sequence.

Note that separating child from parent coordinates brings a host of extra complexities, and it’s unclear if the efficiency of the tskit approach, with its edge indexing etc, will port in any meaningful way to this new structure. Nevertheless, basic operations such as simplification and finding MRCA genomes have already been (inefficently) implemented.

Inversions

The easiest example is an inversion. This would be an iedge like

{parent: P, child: C, child_left: 100, child_right: 200, parent_left: 200, parent_right: 100}

There is a subtle gotcha here, because intervals in a GIG, as in tskit, are treated as half-closed (i.e. do not include the position given by the right coordinate). When we invert an interval, it therefore does not include the left parent coordinate, but does include the right parent coordinate. Any transformed position is thus out by one. Or to put it another way, an inversion specified by child_left=0, child_right=3, parent_left=3, parent_right=0 transforms the points 0, 1, 2 to 2, 1, 0: although the interval 0, 3 is transformed to 0, 3., the point 0 is transformed to position 2, not position 3. See here for more discussion.

Duplications

A tandem duplication is represented by two iedges, one for each duplicated region:

{parent: P, child: C, child_left: 100, child_right: 200, parent_left: 100, parent_right: 200}
{parent: P, child: C, child_left: 200, child_right: 300, parent_left: 100, parent_right: 200}

Or one of the iedges could represent a non-adjacent duplication (e.g. corresponding to a transposable element):

{parent: P, child: C, child_left: 250, child_right: 350, parent_left: 100, parent_right: 200}

Deletions

A deletion simply occurs when no material from the parent is transmitted to any of its children (and the coordinate system is shrunk)

# Deletion of parental region from 200-300
{parent: P, child: C, child_left: 100, child_right: 200, parent_left: 100, parent_right: 200}
{parent: P, child: C, child_left: 200, child_right: 300, parent_left: 300, parent_right: 400}

Viz

The graphical interpretation is something like this, elegantly drawn by @duncanMR:

GIG

Meaning of trees

The interpretation of local trees may be rather different from that in tskit. In particular, if there is a duplication, it could be interpreted as creating two tree tips within a single sample. This would mean that there is no longer a 1-to-1 mapping from samples to tree tips, which breaks one of the fundamental principles behind ARGs in general and the tskit representation in particular.

API differences

The fundamental data structures and API provided by the Genetic Inheritance Graph Library as defined in this GitHub repository intentionally mirror that of tskit (which makes it easy e.g. to initialise a GIG from an msprime-simulated tree sequence 🎉). Nevertheless, there are a number of terminological and implementation differences, an incomplete list of which are below:

Examples

As with a tskit tree sequence, a GIG is immmutable, so to create one from scratch you need to make a set of Tables and freeze them using the .graph() method (which gives opportunity to cache and index important stuff):

import giglib as gigl

tables = gigl.Tables()
tables.nodes.add_row(0, flags=gigl.NODE_IS_SAMPLE)
tables.nodes.add_row(1, flags=0)
# Add an inversion
tables.iedges.add_row(
    parent=1, child=0, child_left=0, child_right=1, parent_left=1, parent_right=0
)
gig = tables.graph()
assert len(gig.nodes) == 2
assert len(gig.iedges) == 1
assert len(gig.samples) == 1
assert gig.iedges[0].is_inversion()

Simulations

Examples of code that runs simulations are provided in the test suite. Currently they are based on a relatively simple extension of the boilerplate forward-simulation code in https://tskit.dev/tutorials/forward_sims.html. The important difference is that the find_mrca_regions method is used to locate breakpoints for recombination.

For more details the sim.py file is a convenience wrapper that links out to other files containing simulation code. Currently only forward simulation code is provided. Nevertheless, it is conceptually general enough to form the basis of a pangenome simulator (this would however require substantial effort in fixing reasonable parameters before realistic-looking pangenomes could be created).

For a hacky example of how to use the simulation code directly from the test suite, see https://github.com/hyanwong/giglib/issues/86#issuecomment-1970029273 or https://github.com/hyanwong/giglib/issues/82#issuecomment-1972153550.

Documentation

Preliminary documentation is at https://hyanwong.github.io/giglib/ (partially autogenerated from the codebase). Note that the API is not stable and may change at any time.