Plot the immediate ancestors of a sample across the genome

duncanMR commented 1 year ago

Given the overwhelming complexity of big inferred tree sequences, I find it difficult to visualise their fine-grain structure without simplifying them. One approach I've developed with @jeromekelleher is to plot parts of the edges table. We pick a sample node, then for each tree in the ts we check what the parent of the sample node is, on what the parent of that parent is. We record the parent and grandparent nodes in an array, along with their times. Here is the code I wrote to do that:

    nodes_time = ts.nodes_time
    mutations = ts.tables.mutations
    sites = ts.tables.sites
    parent = np.zeros(ts.num_trees, dtype=int)
    parent_time = np.zeros(ts.num_trees, dtype=float)
    grandparent = np.zeros(ts.num_trees, dtype=int)
    grandparent_time = np.zeros(ts.num_trees, dtype=float)

    for tree in ts.trees():
        n = tree.index
        parent[n] = tree.parent(sample)

        if parent[n] == -1:
            parent_time[n] = np.nan
            grandparent_time[sample] = np.nan
            grandparent[n] = -1
        else:
            parent_time[n] = nodes_time[n]
            grandparent[n] = tree.parent(parent[n])

            if grandparent[n] == -1:
                grandparent_time[n] = np.nan
            else:
                grandparent_time[n] = nodes_time[grandparent[n]]

    parent_df = make_genomic_intervals_from_nodes_array(parent, parent_time, ts).dropna()
    grandparent_df = make_genomic_intervals_from_nodes_array(grandparent, grandparent_time, ts).dropna()
    sample_mutations = get_mutation_sites([sample], mutations, sites)
    parent_mutations = get_mutation_sites(parent_df.parent, mutations, sites)

The function that turns the numpy array into a dataframe of intervals is as follows, along with the mutation function:

def make_genomic_intervals_from_nodes_array(nodes, times, ts, log_time_offset=0.01):
    """
    Given an array of node_ids of length ts.num_trees, where each node id is
    a parent of some node, this function returns a dataframe with genomic intervals for
    each parent in the array. For example, if the array is [2,2,3] and the breakpoints
     of the ts are [0, 10, 30, 50] then
    the function returns a dataframe with rows {parent: 2, left: 0, right: 30} and
    {parent: 3, left: 30, right: 50}.
    """

    # Prepend an extra element that's different to make first interval from 0
    nodes_extended = np.insert(nodes, 0, np.min(nodes) - 1)
    changes = np.diff(nodes_extended) != 0
    left_indices = np.where(changes)[0]
    # Append an extra element to make last interval to the end of the genome
    right_indices = np.append(left_indices[1:], ts.num_trees)

    breakpoints = ts.breakpoints(as_array=True)
    parent_node = nodes[left_indices]
    parent_time = times[left_indices]
    parent_flag = ts.nodes_flags[parent_node]
    parent_type = vectorized_describe_flag(parent_flag)

    df = pd.DataFrame(
        {
            "parent": nodes[left_indices],
            "left": breakpoints[left_indices],
            "right": breakpoints[right_indices],
            "time": parent_time,
            "log_time": np.log10(parent_time + log_time_offset),
            "node_type": parent_type,
        }
    )

    return df

def get_mutation_sites(nodes, mutations, sites):
    """
    Given a list of nodes, return the positions of mutations that are on those nodes.
    """
    node_mutations = mutations[np.isin(mutations.node, nodes)]
    node_mutations_position = sites[node_mutations.site].position
    return node_mutations_position

Here is an example of a plot, using the same data as #23:

In this case, the sample is part of a trio, so there are only two immediate ancestors (one for each of the parent's recombinant chromosomes). The plots from a sample of a real trio (taken from the 100 000 Genomes Project) are much more fragmented:

I'm not sure if the algorithm is efficient enough to be included in the QC notebook. I do think the plots could be improved, e.g. I'd rather use mbp units on the x axis, and the text summary strings and titles might be too much information. Any feedback would be appreciated!

jeromekelleher commented 1 year ago

I think this is something we'll want to include as a "debugging tool", which helps people track down dodgy samples. We won't run it automatically, but give some guidance in a tutorial about how you would use it diagnose problems.

So, we don't need to worry about it being particularly efficient.

petrelharp commented 1 year ago

Nice!!

tskit-dev / tsbrowse

Plot the immediate ancestors of a sample across the genome #25