tskit-dev / tsinfer

Infer a tree sequence from genetic variation data.
GNU General Public License v3.0
56 stars 13 forks source link

Too many breakpoints even in simple cases #155

Open hyanwong opened 5 years ago

hyanwong commented 5 years ago

Tsinfer often creates too many breakpoints, even in genomic regions where there should be no conflict. A simple test case can be generated like this:

ts = msprime.simulate(
    10, mutation_rate=0.35, recombination_rate=1e-3, length=1e3, 
    random_seed=12345678).slice(80, 220)
print("# trees =", ts.num_trees, "; # mutations =", ts.num_mutations)
samples = tsinfer.SampleData.from_tree_sequence(ts)
ts_inferred = tsinfer.infer(samples, path_compression=False)
print("Inferred: # trees =", ts_inferred.num_trees)

Here's what the trees look like, true trees on the left, inferred on the right:

Screenshot 2019-06-17 at 16 50 52

I'm using this issue to collect thoughts & discussion on how to solve the problem.

hyanwong commented 5 years ago

The main reason for these excess breakpoints seems to be that, when matching ancestors, around a breakpoint there are many places where the L&S process can switch with similar probabilities from copying one ancestor to copying another (the regions where the 2 potential ancestors are identical).

The current algorithm picks one of these places to switch (presumably usually the furthest apart). But the leftmost likely place for an ancestor high up in the TS may be different from one low down. Here's a slimmed down picture of the example above, with a slightly modified ancestor building algorithm that breaks ancestors building on any incompatibility in the tracked samples.

The true breakpoint is in red, but you can see that there is a switch in ancestor 6 from copying from ancestor 4 to copying from ancestor 3 about 5 variant sites to the left of where it should occur. This sort of thing, happening differently for different ancestors, is what is causing us to surround each true breakpoint with a series of tree changes. Screenshot 2019-06-17 at 12 00 21

hyanwong commented 5 years ago

Gil has suggested a couple of ways forward:

  1. Look for recombination breakpoints in the sample data matrix using the Hudson & Kaplan algorithm, and bias switching in the L&S algorithm to those breakpoint regions.

  2. Look for likely breakpoints in the generated ancestors, either during ancestor building or just after (but before matching). a. (Simpler) - generate all ancestors, look for possible breakpoints across all of them (e.g. where haplotypes end, or where conflicts occur), assign probabilities to these points for each ancestor, and combine the probabilities across ancestors, creating a single probability between each variant site. Then feed this to the L&S as recombination probabilities b. "Look down" after creating ancestors, including only recombination probabilities from potential descendants of the current ancestor. This could be different to the left and the right of the breakpoint. It requires some working out to find how we might identify descendants without actually doing the matching process, but it could rely on the nested property of mutations under the perfect phylogeny criterion.

With (2), we can also take into account the fact that during the ancestor building process we have information (i) about likely breakpoints (each time we detect a conflict and evict a sample from the building algorithm) and (ii) about the likelihood of this being a real ancestral fragment from which copying could occur (this probability will extend over longer regions, rather than being localised to a single recombination hotspot, and should be is lower to the extreme ends of the fragment)

Both of these rely on https://github.com/tskit-dev/tsinfer/issues/154

hyanwong commented 5 years ago

Just to document here that the current L&S algorithm doesn't take account of anything apart from haplotype compatibility - not even length between variant sites.

hyanwong commented 5 years ago

For reference, one implementation of the HK algorithm is at https://github.com/RILAB/rmin_cut

hyanwong commented 4 years ago

Here's an updated way to draw this pathological case in a Jupyter notebook using the new SVG drawing routines:

import msprime
import tsinfer
from IPython.display import SVG

ts = msprime.simulate(
    10, mutation_rate=0.35, recombination_rate=1e-3, length=1e3, 
    random_seed=12345678).keep_intervals([[80, 220]]).trim()
print("# trees =", ts.num_trees, "; # mutations =", ts.num_mutations)
samples = tsinfer.SampleData.from_tree_sequence(ts)
ts_inferred = tsinfer.infer(samples, path_compression=False)
print("Inferred: # trees =", ts_inferred.num_trees)

width = 1200
height = 800

SVG(f'<svg height="{height}" width="{width}">' +
     ts.draw_svg(size=(width, height/2)) +
     ts_inferred.draw_svg(
         size=(width, height/2), root_svg_attributes={'id': 'inverted_bg', 'y':height/2}, x_label="",
         style="""
             #inverted_bg .background {transform: translate(0, %spx) scale(1, -1)}
             #inverted_bg .trees {transform: translate(0, 40px)}
             #inverted_bg .x-axis {transform: scale(1, -1) translate(0, -%spx)}
             .x-axis .site {display: none}
             #inverted_bg .x-axis text {transform: scale(1, -1)}""" % (height/2, height/2)) +
    "</svg>")

giving something like the following:

Screenshot 2021-04-21 at 11 16 20

hyanwong commented 4 years ago

I think this paper might show how to use a pBWT approach to identify the possible breakpoint regions, which we could feed to the recombination map: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7008532/

hyanwong commented 1 year ago

Perhaps a better approach here would be to record both the leftmost and rightmost possible breakpoint positions from the HMM. We could commit to using a particular breakpoint for the purposes of creating a tree sequence for compressing the likelihoods, but gradually narrow down the breakpoint if we get an overlapping one with identical parents within the same region.

This is a fairly complex algorithm, and it would be good to sketch it out both on a board and possibly templated with pseudocode or e.g. using Duncan's lshmm.