hyanwong / giglib

MIT License
4 stars 2 forks source link

Define a "TreeSequence"-like class #44

Closed hyanwong closed 10 months ago

hyanwong commented 11 months ago

We are starting to get to the stage where it might be useful to have a "immutable" GIG. I.e. like in tskit, we turn the tables representation of a GIG into an immutable object, with defined properties (e.g. an index, so we can iterate over edges by child ID, an "edge" class so that we can do edge.is_inversion(), and so on.

I have some code to start doing this. I suggest something like:

import GeneticInheritanceGraph as gig
import msprime

ts = msprime.sim_ancestry(5)
tables = gig.Tables.from_tree_sequence(ts)
gig = tables.graph()  # This should check that we have a valid GIG defined, e.g. abs(child_right - child_left) ==  abs(parent_right - parent_left)

print(gig.edges[0])
print(gig.edges.child_left[0])
...
hyanwong commented 11 months ago

When calling tables.graph(), we can also populate the (new) edge column of the iedges table.

hyanwong commented 11 months ago

Another really useful function (which I will need) is a conversion function to take a location between child_left and child_riught, and convert it to a parent position. I'm not sure of the best name, perhaps edge.parent_position(child_position) (and `edge.child_position(parent_position))? Suggestions @duncanMR ?

duncanMR commented 11 months ago

We are starting to get to the stage where it might be useful to have a "immutable" GIG. I.e. like in tskit, we turn the tables representation of a GIG into an immutable object, with defined properties (e.g. an index, so we can iterate over edges by child ID, an "edge" class so that we can do edge.is_inversion(), and so on.

I have some code to start doing this. I suggest something like:

import GeneticInheritanceGraph as gig
import msprime

ts = msprime.sim_ancestry(5)
tables = gig.Tables.from_tree_sequence(ts)
gig = tables.graph()  # This should check that we have a valid GIG defined, e.g. abs(child_right - child_left) ==  abs(parent_right - parent_left)

print(gig.edges[0])
print(gig.edges.child_left[0])
...

I've been working on something like this for visualisation experiments, based on Jerome's code (see a snippet below). At the moment I have it in a file called graph.py. It's not ready for merging due to its reliance on Pandas and inconsistent naming, but do you think the overall organisation makes sense? I will rename every instance of edge to iedge given our discussion earlier.

I notice an error in your code that I also kept making: naming an instance of a GIG object gig, when you have already imported the package using that name. I've switched to using import GeneticInheritanceGraph as gig_lib to avoid this. I'm not convinced gig_lib is the best name to use, but I do think we should reserve the name gig for class instances just like ts in tskit.

Another really useful function (which I will need) is a conversion function to take a location between child_left and child_right, and convert it to a parent position. I'm not sure of the best name, perhaps edge.parent_position(child_position) (and `edge.child_position(parent_position))? Suggestions @duncanMR ?

I've implemented this below in the translate_coordinate_along_edge helper function, which is for a particular iedge; it is used in the GIG class method translate_coordinate to find the set of all x values in a parent from an input x in the child's genome, and similarly for an input x in the parent's genome. I made direction an argument to avoid repetition, but it's probably wise to at least have aliases for each direction. I'm happy with your names for such aliases; do you mean that you want to make them methods for a particular iedge, or for the IEdges class?

def translate_coordinate_along_edge(edge, x, direction):
    if direction == "up":
        if edge.child_left <= x < edge.child_right:
            parent_length = edge.parent_right - edge.parent_left
            child_length = edge.child_right - edge.child_left
            if parent_length == -child_length:
                # inversion
                return edge.parent_left - (x - edge.child_left)
            elif edge.child_left != edge.parent_left:
                # shift
                return x - (edge.child_right - edge.parent_right)
            else:
                # no change
                return x

    elif direction == "down":
        if (
            edge.parent_left <= x < edge.parent_right
            or edge.parent_left > x >= edge.parent_right
        ):
            parent_length = edge.parent_right - edge.parent_left
            child_length = edge.child_right - edge.child_left
            if parent_length == -child_length:
                # inversion
                return edge.child_left + (edge.parent_left - x)
            elif edge.child_left != edge.parent_left:
                # shift
                return x + (edge.child_right - edge.parent_right)
            else:
                # no change
                return x
    else:
        raise ValueError(f"Invalid direction {direction}")

class GIG:
    """
    A GIG is an ARG in which edges are annotated with coordinate
    transform functions. Nodes in a GIG are integers, and represent
    specific ancestral genomes (identical to an ARG). Edges
    represent a genetic inheritance relationship between two
    genomes. Finally, coordinate transforms define what parts
    of a genome were inherited and the effects of any structural
    variants that may have occured.
    """

    def __init__(self, table_group):
        self.intervals = table_group.intervals._df()
        self.nodes = table_group.nodes._df()
        self.node_explorer = NodeExplorer(self, 0)
        pass

    @property
    def num_nodes(self):
        """
        Return the number of nodes in the GIG
        """
        return len(self.nodes)

    def parents(self, u):
        """
        Return the parents of node u in the graph.
        """
        all_parents = self.intervals[self.intervals.child == u]["parent"]
        return all_parents.drop_duplicates()

    def children(self, u):
        """
        Return the children of node u in the graph.
        """
        all_children = self.intervals[self.intervals.parent == u]["child"]
        return all_children.drop_duplicates()

    def transate_coordinate(self, p, c, x_in, direction, capture=False):
        """
        Transforms a coordinate in one node to a (possibly empty) set of
        coordinates in an adjacent node. If direction is "up", then the 
        x value is assumed to be in the coordinate space of the child node;
        if "down", it is assumed to be in the coordinate space of the parent.

        The NodeExplorer class records the inputs and outputs of the 
        transformation as a tree is constructed from the graph;
        if capture is True, then the edge between p and c is recorded
        in the edges table of the tree sequence being constructed.
        """
        edge_family = self.intervals.query("parent == @p and child == @c")
        X = []
        for _, edge in edge_family.iterrows():
            x_out = translate_coordinate_along_edge(edge, x_in, direction)
            if x_out is not None:
                self.node_explorer.update(
                    edge=edge,
                    direction=direction,
                    x_in=x_in,
                    x_out=x_out,
                    capture=capture,
                )
                X.append(x_out)

        return set(X)

...
hyanwong commented 11 months ago

This looks good, thanks @duncanMR . I think we should be using "span" rather than "length". In fact, in my version (also called graph.py!), I have something like this:

@property
        def child_span(self):
            return self.child_right - self.child_left

        @property
        def parent_span(self):
            return self.parent_right - self.parent_left

        @property
        def span(self):
            return self.child_span if child_span >= 0 else self.parent_span

        @property
        def is_inversion(self):
            return self.child_span < 0 or self.parent_span < 0
hyanwong commented 11 months ago

I could push my graph.py file in a PR so you can see the sort of thing I'm thinking about. We could have "to_df" methods on the various tables if you need conversion to a pandas data frame.

duncanMR commented 11 months ago

This looks good, thanks @duncanMR . I think we should be using "span" rather than "length". In fact, in my version (also called graph.py!), I have something like this:

@property
        def child_span(self):
            return self.child_right - self.child_left

        @property
        def parent_span(self):
            return self.parent_right - self.parent_left

        @property
        def span(self):
            return self.child_span if child_span >= 0 else self.parent_span

        @property
        def is_inversion(self):
            return self.child_span < 0 or self.parent_span < 0

That's a good point; I'll use span in future.

I could push my graph.py file in a PR so you can see the sort of thing I'm thinking about. We could have "to_df" methods on the various tables if you need conversion to a pandas data frame.

That would be great! I wasn't planning on pushing any of my code soon, since it is focused on making visualisations of simple examples.

hyanwong commented 11 months ago

I noticed an error in your code that I also kept making: naming an instance of a GIG object gig, when you have already imported the package using that name.

Yes, good spot, and a telling mistake. We could, I suppose, use g as the name of the graph variable, but gig is more expressive. As you say, gig_lib or gigl (😆 ) might work. Or genigraph? I'm open to other suggestions. We don't use tskit that often in raw code: mainly for reading in files and for constants.

duncanMR commented 10 months ago

I think import GeneticInheritanceGraphLibrary as gigl is a sensible and fun choice! The length isn't ideal, but I haven't thought of a shorter name that works.

hyanwong commented 10 months ago

Great, thanks. I agree. I will return to this after PopGroup. gigl also rolls off the tongue nicely.

hyanwong commented 10 months ago

Closing this as it's now done.