
Can I get a list of descendants of a branch (edge)? #2882

Open hanbin973 opened 7 months ago

hanbin973 commented 7 months ago

Hi there, thank you for developing a terrific library. I'm trying to perform certain matrix operations using tskit.

Suppose that A is a matrix in which the rows (N) are individuals and the columns (E) are edges (branches). An entry is 1 if the node is a descendant of the edge (not only a direct descendant but any descendant) and 0 otherwise. A is the same as the genotype matrix, given one mutation per branch.

I'd like to multiply this matrix A by a phenotype matrix X of individuals (X is N × P, where P is the number of phenotypes) to get A^T X. Is this possible within the statistics functionality of tskit?

jeromekelleher commented 7 months ago

Hi @hanbin973, welcome :wave:

Here's an inefficient (and totally untested!) first pass:

# E[e] is the set of samples that descend from edge e.
E = [set() for _ in range(ts.num_edges)]
for tree in ts.trees():
    for u in tree.nodes():
        if tree.edge(u) != -1:
            E[tree.edge(u)].update(tree.samples(u))

A = np.zeros((ts.num_samples, ts.num_edges))
for e in range(ts.num_edges):
    for u in E[e]:
        # WARNING! This assumes that samples are 0, ..., n-1, which isn't true in general
        A[u, e] = 1
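
To drop that assumption, here's an (equally untested) variant that maps sample IDs to rows explicitly; the sample_row dict is introduced purely for illustration:

# Map each sample node ID to a row of A, so the code also works when
# sample IDs are not the consecutive integers 0, ..., n-1.
sample_row = {s: i for i, s in enumerate(ts.samples())}
A = np.zeros((ts.num_samples, ts.num_edges))
for e in range(ts.num_edges):
    for u in E[e]:
        A[sample_row[u], e] = 1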

Is this the object you want to compute? If so, we can think about how to make it faster using approaches discussed in #2869.

hanbin973 commented 7 months ago

Great! It's exactly what I was looking for. I'm trying to do some sort of complex trait analysis, so I'll return after reading #2869.

hanbin973 commented 6 months ago

The following is an adaptation from the docs. It performs weight propagation and stores the weights, similar to the mode="node" statistics option. It does not build a large matrix like A explicitly, but carries out the additions defined by the product of a phenotype vector and A.

import tskit
import numpy as np
import msprime
import itertools
import operator

def edges_by_parent_timeasc(ts):
    return itertools.groupby(ts.edges(), operator.attrgetter("parent"))

parameters = {
    "samples": 3, # Three diploid individuals == six sample genomes
    "sequence_length": 1e4,
    "recombination_rate": 1e-7,
    "population_size": 1e3,
    "random_seed": 333,
}

ts = msprime.sim_ancestry(**parameters, discrete_genome=False)

wt_samples = np.ones(ts.num_samples) # define sample weights
wt_nodes = np.empty(ts.num_nodes) 
wt_nodes[:ts.num_samples] = wt_samples

for parent_id, edges in edges_by_parent_timeasc(ts):
    t = ts.node(parent_id).time
    children = list({e.child for e in edges})
    wt_nodes[parent_id] = np.sum(wt_nodes[children])
    print(f"Node {parent_id} at time {t} has these child node IDs: {children}, weights: {wt_nodes[parent_id]}")

print('\n')

for e in ts.edges():
    print(f"Edge {e.id} has weight: {wt_nodes[e.child]}")

Result:

Node 6 at time 56.0454802286223 has these child node IDs: [1, 3], weights: 2.0
Node 7 at time 120.7850821708138 has these child node IDs: [5, 6], weights: 3.0
Node 8 at time 142.86643656492956 has these child node IDs: [4, 5], weights: 2.0
Node 9 at time 236.7237255340037 has these child node IDs: [0, 2], weights: 2.0
Node 10 at time 272.05118690557765 has these child node IDs: [9, 6, 7], weights: 7.0
Node 11 at time 610.912738605865 has these child node IDs: [8, 10, 4], weights: 10.0
Node 12 at time 1382.4658244607965 has these child node IDs: [8, 10], weights: 9.0

Edge 0 has weight: 1.0
Edge 1 has weight: 1.0
Edge 2 has weight: 1.0
Edge 3 has weight: 2.0
Edge 4 has weight: 1.0
Edge 5 has weight: 1.0
Edge 6 has weight: 1.0
Edge 7 has weight: 1.0
Edge 8 has weight: 2.0
Edge 9 has weight: 3.0
Edge 10 has weight: 2.0
Edge 11 has weight: 1.0
Edge 12 has weight: 2.0
Edge 13 has weight: 7.0
Edge 14 has weight: 2.0
Edge 15 has weight: 7.0

If I'm correct, the takeaway is that edge weights (here, the sum of phenotypes over all direct and indirect descendant nodes) can be stored more compactly at nodes, because the number of edges far exceeds the number of nodes.
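
For instance, once wt_nodes is filled in, the per-edge weights printed above can be read off in a single vectorized step instead of a Python loop:

# Each edge's weight is just the weight of its child node.
edge_weights = wt_nodes[ts.edges_child]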

jeromekelleher commented 6 months ago

That's very nice @hanbin973! Very clean way of doing it.

Here's an implementation using numba, which should be quite fast:

import numba

@numba.njit
def _num_graph_descendant_samples(edges_parent, edges_child, W):
    last_parent = -1
    children = set()
    for e in range(edges_parent.shape[0]):
        if edges_parent[e] != last_parent:
            if last_parent != -1:
                for c in children:
                    W[last_parent] += W[c]
            children.clear()
        children.add(edges_child[e])
        last_parent = edges_parent[e]
    if last_parent != -1:
        for c in children:
            W[last_parent] += W[c]

def num_graph_descendant_samples(ts):
    """
    Returns the number of samples that are reachable from each node in the graph,
    i.e., descend from a particular node in at least one local tree.
    """
    W = np.zeros(ts.num_nodes)
    W[ts.samples()] = 1
    _num_graph_descendant_samples(ts.edges_parent, ts.edges_child, W)
    return W
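
A quick usage sketch, assuming the ts from the msprime simulation above is in scope:

W = num_graph_descendant_samples(ts)
print(W)  # per-node counts of the samples reachable below each node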

hanbin973 commented 6 months ago

Thank you for the nice refactoring. The next step is to find a way to multiply A from the edge side. If A is an N (sample nodes) × E (edges) matrix as before, how should we compute Ab, where b is an E × 1 vector?

This expression appears frequently in regression contexts like E(y|A) = Ab (A is the explanatory variable and b is the vector of edge coefficients).

My idea is that traversing the DAG upward from each sample node should work.

I guess there's something we can borrow from tstrait?

hanbin973 commented 6 months ago

From a quick look, the implementation of tstrait won't be able to handle this situation, as it iterates over causal sites. I assume it materializes an intermediate genotype vector?

Let N be the number of sample nodes, E be the number of edges (as before), and I be the number of all nodes, including the internal ones.

Let B be an I × E matrix whose (i, e) entry is 1 if the i-th node is the direct child of the e-th edge and 0 otherwise. This matrix is sparse, with only E non-zero elements, and can be built directly from ts.edges_child. I'll choose to store it as a CSC (column-first) sparse matrix because it's easier to define.

import scipy.sparse as sparse
B = sparse.csc_matrix(
    (np.ones(ts.num_edges), ts.edges_child, np.arange(ts.num_edges+1)), 
    shape=(ts.num_nodes, ts.num_edges)
)

Or we could just loop over the edges to save memory, as sketched below.
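
A minimal sketch of that loop, assuming b is a length-E numpy vector; it computes the same product as B.dot(b) without materializing B:

# Accumulate b[e] onto the child node of each edge.
Bb = np.zeros(ts.num_nodes)
for e in range(ts.num_edges):
    Bb[ts.edges_child[e]] += b[e]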

Let C be an N × I matrix that is 1 if the n-th sample is a descendant of the i-th node (including indirect relationships). An important observation is that the A matrix (the N × E matrix defined previously) factorizes as A = CB. Hence, Ab = (CB)b = C(Bb) by associativity. The sparse matrix-vector product can be done efficiently.

Bb = B.dot(b)

The remaining step is, for each sample, to traverse upward and add the element Bb[i] for each node i it meets. I think a depth-first search is appropriate here, but I'm not yet sure how this could be done efficiently on tree sequences.

hanbin973 commented 6 months ago

This is actual code that works. I hadn't noticed that @jeromekelleher had already written a depth-first search on ARGs. The following is a minor modification of his code in #2869.

def graph_mv_nodes(ts, arg, vec): # a better name for the function? graph + mv (matrix-vector product) + nodes (side of the multiplication)
    """
    Perform a matrix-vector multiplication through a depth-first search
    vec: length is number of nodes
    out: length is number of samples
    """

    out = np.zeros(ts.num_samples, dtype=np.float32) # declare output vector

    for s in ts.samples():
        u = s
        out[s] += vec[s] 
        is_ancestor = np.zeros(ts.num_nodes, dtype=bool)
        is_ancestor[u] = True
        stack = [u]
        while len(stack) > 0:
            u = stack.pop()
            for j in range(*arg.parent_range[u]):
                e = arg.parent_index[j]
                e_parent = ts.edges_parent[e]
                if not is_ancestor[e_parent]:
                    # Note: setting is_ancestor here because we can
                    # push the same node on the stack multiple times otherwise
                    is_ancestor[e_parent] = True
                    stack.append(e_parent)
                    # update out vector
                    out[s] += vec[e_parent]

    return out

def graph_mv_samples(ts, arg, vec): # a better name for the function? graph + mv (matrix-vector product) + samples (side of the multiplication)
    """
    Perform a matrix-vector multiplication through a depth-first search
    vec: length is number of samples
    out: length is number of nodes
    """

    out = np.zeros(ts.num_nodes, dtype=np.float32) # declare output vector

    for s in ts.samples():
        u = s
        is_ancestor = np.zeros(ts.num_nodes, dtype=bool)
        is_ancestor[u] = True
        stack = [u]
        while len(stack) > 0:
            u = stack.pop()
            for j in range(*arg.parent_range[u]):
                e = arg.parent_index[j]
                e_parent = ts.edges_parent[e]
                if not is_ancestor[e_parent]:
                    # Note: setting is_ancestor here because we can
                    # push the same node on the stack multiple times otherwise
                    is_ancestor[e_parent] = True
                    stack.append(e_parent)

        out += is_ancestor * vec[s]

    return out

Execution code

ts = msprime.sim_ancestry(
    8,
    ploidy=1,
    recombination_rate=0.1,
    sequence_length=10,
    random_seed=1234,
    record_full_arg=False,
)

arg = make_arg(ts)  # make_arg() and graph_ancestors() below are from #2869

print("========= matrix * vector from node side ========")

# create dense C
Cdense = np.zeros((ts.num_samples, ts.num_nodes))
for s in ts.samples():
    Cdense[s,:] = graph_ancestors(ts, arg, s)

# define vector
beta = np.random.normal(size=ts.num_nodes)

# naive product
out_naive = Cdense @ beta
print(out_naive)

# graph product
out_graph = graph_mv_nodes(ts, arg, beta)
print(out_graph)

# check result
print('The result is identical: %s' % np.allclose(out_naive, out_graph))

print("========= matrix * vector from sample side ========")

# define vector
beta = np.random.normal(size=ts.num_samples)

# naive product
out_naive = beta @ Cdense
print(out_naive)

# graph product
out_graph = graph_mv_samples(ts, arg, beta)
print(out_graph)

# check result
print('The result is identical: %s' % np.allclose(out_naive, out_graph))

Result:

========= matrix * vector from node side ========
[ 0.75155855 -0.14588287 -0.32459718 -0.55193155  2.78580209 -2.29843114
 -0.29839819 -0.74230104]
[ 0.7515586  -0.14588289 -0.3245972  -0.55193144  2.7858024  -2.2984312
 -0.29839817 -0.74230105]
The result is identical: True
========= matrix * vector from sample side ========
[-0.52319413  0.23759605 -0.22814008  0.40245844 -1.30527328  1.05274019
 -0.07652085 -0.00678606  0.64005448 -1.53341336  0.63326842 -2.05660749
 -1.42333907  0.97621934 -0.44711973]
[-0.52319413  0.23759605 -0.22814009  0.40245843 -1.3052733   1.0527402
 -0.07652085 -0.00678606  0.64005446 -1.5334134   0.6332684  -2.0566075
 -1.4233391   0.97621936 -0.44711977]
The result is identical: True
petrelharp commented 6 months ago

I've not sat down to think carefully about the comparison, but here's an algorithm that does something equivalent: https://github.com/tskit-dev/tskit/issues/1547#issuecomment-882235129 (it's looking at mutations rather than edges, but the basic idea is the same).

hanbin973 commented 6 months ago

Thank you for the suggestion @petrelharp! It's a bit difficult for me to understand your algorithm. Could you expand?

jeromekelleher commented 6 months ago

Very nice @hanbin973! It should be fairly straightforward to accelerate your algorithms above using numba, and you should then be able to use them on pretty large tree sequences. Let me know if you'd like some help with this.

There's probably a bit of improvement to be made by doing the upward traversals for all samples at the same time (rather than sample-by-sample), but I haven't thought much about this.

hanbin973 commented 6 months ago

Agreed @jeromekelleher. I think it's possible to propagate mutations "downwards" somehow, rather than climbing upwards. A simultaneous downward approach could remove redundant calculations: e.g., some mutations share descendants, so the shared path is traversed multiple times if the samples are not handled simultaneously.

@petrelharp's algorithm seems to remove these redundancies. I tried to reproduce his idea without really understanding it, and the result doesn't seem to be correct.

For example, for each edge, climb up to the root and record the intermediate nodes on a stack. This takes at most "depth of the tree" steps, so O(log N). Next, pop the elements from the stack and add x[v] to x[u]. This climbs back down the path, so it is also O(log N). Repeating over all edges is therefore O(num edges × log N). I think some duplicate additions are being made here, and my rough guess is that I'm not following how the values of x are updated when the descendant samples change across trees. Such an incremental algorithm would definitely be desirable, but I really need concrete examples to implement it fully myself.
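
For reference, a minimal sketch of one naive downward propagation (a single reverse sweep over the edges, assuming they are sorted by nondecreasing parent time, as tskit requires). It accumulates over every path and every edge, so an ancestor reachable via multiple paths is counted multiple times, which may be exactly the duplicate additions above:

# Push each node's accumulated value down through every edge, visiting
# parents in time-descending order so a parent's value is final before
# it is propagated to its children.
acc = vec.copy()  # vec: one entry per node, as in graph_mv_nodes above
for e in range(ts.num_edges - 1, -1, -1):
    acc[ts.edges_child[e]] += acc[ts.edges_parent[e]]
# acc[s] now sums vec over all *paths* from sample s up the ARG, not over
# distinct ancestors, so it can differ from graph_mv_nodes whenever a node
# has multiple parent edges or diamond-shaped ancestry.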