tskit-dev / tsinfer

Infer a tree sequence from genetic variation data.
GNU General Public License v3.0
56 stars 13 forks source link

Add augment_samples() functionality #189

Open hyanwong opened 5 years ago

hyanwong commented 5 years ago

There are many cases where we might want to throw in a few extra samples to match after we have inferred a full tree sequence, but without going through the whole inference pathway again. It's also quite possible that the samples we add will only contain some of the inference sites used in the original inference (and may contain some that were not in the originally inferred one).

This is the case, for instance, when @awohns wants to add a few fragmentary ancient samples into the 1000G tree sequence. It's also similar (possibly identical) to what we did when matching the UKB samples against the ancestors tree sequence generated from the 1000G. And it's what external users might want to do when getting e.g. their own genome, and matching it against our 1000G inferred sequence.

~Most of the grunt work for doing this is in the code posted in https://github.com/tskit-dev/tsinfer/issues/187#issuecomment-529871334,~ although we could improve it by adding new-sample-specific-sites to the existing inferred ts using parsimony, if we felt the need.

The question for @jeromekelleher is whether we want to expose this functionality in a new API call, perhaps something like:

new_ts = tsinfer.augment_samples(old_ts, old_ancestors_ts, new_sample_data)
awohns commented 5 years ago

Yan and I just discussed this further and came to an agreement on three points:

  1. Four parameters seem necessary: (1) original tree sequence, (2) ancestors_ts, (3) original sample_data, (4) sample_data to augment
  2. An error should be thrown if the sample_data file to be used for augmenting is incompatible with the full matched tree sequence (i.e. samples have a variant which is younger than it is inferred to be in the original tree sequence)
  3. There might be a need for some checking of the meaning of 'time' in the two files: i.e. that node times refer to generations and not frequency
hyanwong commented 5 years ago
  1. There might be a need for some checking of the meaning of 'time' in the two files: i.e. that node times refer to generations and not frequency

see https://github.com/tskit-dev/tskit/issues/375

hyanwong commented 2 years ago

This would be useful for @szhan.

I can see a couple of ways to do this. You could match the (single) sample against the ancestors_ts then union the resulting TS with the original TS. Or you could take the original TS and somehow mask off the samples, turn it into an ancestors_ts, and match against that. The advantage to the second method is that you would be able to match against the synthetic ancestors produced from match_samples. The disadvantage is that it might be tricky to know which sites to match.

hyanwong commented 2 years ago

Here's some code that I think does the trick (untested, perhaps @szhan could take a look and try it out / write some tests?). It would be useful to label the nodes with metadata, I think, as all those changing numbers get very confusing.

import tsinfer
import tskit
import numpy as np

with tsinfer.SampleData(sequence_length=6) as sample_data:
    sample_data.add_site(0, [0, 1, 0, 0, 0], ["A", "T"])
    sample_data.add_site(1, [0, 0, 0, 1, 1], ["G", "C"])
    sample_data.add_site(2, [0, 1, 1, 0, 0], ["C", "A"])
    sample_data.add_site(3, [0, 1, 1, 0, 0], ["G", "C"])
    sample_data.add_site(4, [0, 0, 0, 1, 1], ["A", "C"])
    sample_data.add_site(5, [0, 1, 2, 0, 0], ["T", "G", "C"])

ancestors = tsinfer.generate_ancestors(sample_data)
ancestors_ts = tsinfer.match_ancestors(sample_data, ancestors)
inferred_ts = tsinfer.match_samples(sample_data, ancestors_ts, simplify=False)

print(inferred_ts.num_nodes)
print(inferred_ts.draw_text())

## augment_samples main code starts below

# check all the samples are at the end of the list of nodes (this should be the case for match_samples simplify=False)
assert np.all(inferred_ts.samples() == np.arange(inferred_ts.num_nodes-inferred_ts.num_samples, inferred_ts.num_nodes))

tables = inferred_ts.dump_tables()
tables.edges.clear()
tables.nodes.clear()

samples = set(inferred_ts.samples())
for n in inferred_ts.nodes():
    # remove the sample nodes
    # since these are all at the end of the table, the edge parent & child ids shouldn't be displaced
    if not n.is_sample():
        tables.nodes.append(n)
for e in inferred_ts.edges():
    assert e.parent not in samples
    # remove the edges to samples
    if e.child not in samples:
        tables.edges.append(e)

# TODO - what about mutations above the removed sample nodes? Seemingly tsinfer doesn't mind
# about those, but I guess we should remove them?

# All nodes in an ancestors ts are samples
tables.nodes.flags |= tskit.NODE_IS_SAMPLE

# ancestors_ts only contains inference sites
tables.delete_sites(
    np.where(
        np.logical_not(
            np.isin(
                inferred_ts.tables.sites.position,
                ancestors_ts.tables.sites.position
            )
        )
    )[0]
)

# this should be basically the same as the ancestors_ts above, but might include extra path compressed nodes
anc_ts = tables.tree_sequence()

# Make some new samples
with tsinfer.SampleData(sequence_length=6) as new_samples:
    new_samples.add_site(0, [1, 1], ["A", "T"])
    new_samples.add_site(1, [1, 0], ["G", "C"])
    new_samples.add_site(2, [0, 1], ["C", "A"])
    new_samples.add_site(3, [1, 0], ["G", "C"])
    new_samples.add_site(4, [1, 0], ["A", "C"])
    new_samples.add_site(5, [0, 1], ["T", "G", "C"])

# TODO - we should check that the allele definitions in anc_ts and new_samples are the same

new_ts = tsinfer.match_samples(new_samples, anc_ts, simplify=False, path_compression=False)
print(new_ts.draw_text())

tables = inferred_ts.dump_tables()
new_tables = new_ts.tables  # can be read-only
# Add new_ts samples back in as new nodes: these are all at the end of the node list
assert np.all(new_ts.samples() == np.arange(new_ts.num_nodes-new_ts.num_samples, new_ts.num_nodes))
samples = set(new_ts.samples())
node_map = np.arange(new_ts.num_nodes, dtype=tables.edges.child.dtype)
individuals_map = np.arange(new_ts.num_individuals, dtype=tables.nodes.individual.dtype)

new_individuals = set()
for s in new_ts.samples():
    row = new_tables.nodes[s]
    if row.individual != tskit.NULL and row.individual not in new_individuals:
        individual_object = new_ts.individual(row.individual)
        individuals_map[row.individual] = tables.individuals.append(individual_object)
        new_individuals.add(row.individual)
    node_map[s] = tables.nodes.append(
        row.replace(
            individual=(tskit.NULL if row.individual==tskit.NULL else individuals_map[row.individual]))
    )

for e in new_tables.edges:
    assert e.parent not in samples
    if e.child in samples:
        tables.edges.append(e.replace(child=node_map[e.child], parent=node_map[e.parent]))

## include mutations above the added-in samples
# Check the site IDs refer to the same sites
assert np.all(new_tables.sites.position == tables.sites.position)
for m in new_tables.mutations:
    if m.node in samples:
        tables.mutations.append(m.replace(node=node_map[m.node], parent=tskit.NULL))
tables.sort()
tables.build_index()
tables.compute_mutation_parents()
unified_ts = tables.tree_sequence()
print(unified_ts.draw_text())
hyanwong commented 2 years ago

I've also though of another (possibly better) way to do this, which needs discussion with @jeromekelleher . At the moment, an "ancestors tree sequence" is like a normal tree sequence except that all the internal nodes are marked as samples, and it only has the sites used for inference. I think we might be able to institute a convention that any nodes that are not marked as samples in an ancestors tree sequence should not be matched against (i.e. in this case tskit.NODE_IS_SAMPLE means "use for matching"). However, such nodes, and the edges that involve them, are inserted into the final output tree sequence, after all the "normal" sample matching has occurred. I think we can do this pretty simply, by only adding the nodes marked with tskit.NODE_IS_SAMPLE to the tree sequence builder here:

https://github.com/tskit-dev/tsinfer/blob/9f94dbc9d82393b0189e6c62853ee6b5da0038cc/tsinfer/inference.py#L1510

Before adding these in, it would probably be helpful to re-order the nodes so that the tskit.NODE_IS_SAMPLE ones are at the top of the node table (this means fixing https://github.com/tskit-dev/tskit/issues/2060), and their associated individuals are at the top of the individuals table too.

In the next line, we would only add edges in which both the child and parent point to nodes we have just added. We would also need to check that the resulting object is internally consistent (i.e. the paths up from every added node through the added edges all end up in the grand MRCA)

Then when we export to the final TS, we can add the non sample nodes (and the edges that involve them) back in to the final TS. When we initialise the nodes here:

https://github.com/tskit-dev/tsinfer/blob/9f94dbc9d82393b0189e6c62853ee6b5da0038cc/tsinfer/inference.py#L1678

We only add in the nodes we used for matching (i.e. the ones we have reordered to be at the start of the node table, with the tskit.NODE_IS_SAMPLE flag). Then, round about here:

https://github.com/tskit-dev/tsinfer/blob/9f94dbc9d82393b0189e6c62853ee6b5da0038cc/tsinfer/inference.py#L1745

we run through the ancestors_ts nodes and add to the tables all the nodes that are not flagged as tskit.NODE_IS_SAMPLE, keeping a map of the old node IDs to the newly inserted ones. We then run through the edges in the ancestors_ts and add those ones that involve the nodes in the map, changing the ids where necessaryby using the map (note: this assumes that the IDs of the original nodes don't change, which I think is guaranteed by the re-ordering above).

This gives us the possibility to use the NODE_IS_SAMPLE flag to limit matching agains certain ancestors, which is something we have wanted to do for a while (e.g. for matching ancient samples into a tree sequence, when we don't want to match against younger nodes).

In the case of matching new sample into an existing TS, we would use this facility to turn a "normal" tree sequence into an "fake" ancestors TS by simply toggling the NODE_IS_ANCESTOR flag, so that none of the original samples are matched against, but are included in right position in the final TS. We can then run the normal match_samples routine against this "fake" ancestors TS, using the new sample for matching, which will produce a TS with the new sample flagged as a sample and the old samples stuck in as non-sample nodes. All we then have to do is to (somehow) toggle the original sample nodes back to samples again, and we've created an augmented TS. The nice thing is that it uses the existing match_samples function, rather than creating another method. In other words, the pipeline would go:

inferred_ts = tsinfer.infer(samples, simplify=False)
ancestors_ts = tsinfer.create_ancestors_ts_from_ts(inferred_ts)  # <- this toggles the NODE_IS_SAMPLE flag
augmented_ts = tsinfer.match_samples(ancestors_ts, a_few_extra_samples)
augmented_ts = tsinfer.reflag_samples(augmented_ts)  # <- I'm not sure about this: perhaps match_samples could do it using metadata somehow?

As part of this, we will probably want to combine the metadata from the ancestors tree sequence and the new sample data file, which is something we want to consider for https://github.com/tskit-dev/tsinfer/issues/604#issuecomment-969099154, I think (i.e transfer metadata stored in an ancestors ts to the final ts).

There are a lot of intricacies to be worked out here (e.g what about inference vs non-inference sites), but I think it is worth considering because of the flexibility it gives us.

hyanwong commented 2 years ago

As well as removing e.g. non biallelic sites from the original tree sequence when making it into an ancestors tree sequence, there's a slight extra issue to think about here. It's well illustrated by thinking about singletons, although it's actually a bit more general. If the original tree sequence has a mutation above one of the samples, this generates a singleton site. But it is quite possible that the new samples (which are being matched into this tree sequence) have the derived variant at that site too. In this case, we might want to use the site for matching, even though none of the ancestors have that variant.

I think that principled way to do this is to add some new ancestors into the tree sequence, depending on the variation in the new samples and in the old tree sequence. For instance, if the new samples have a variable site that was a singleton in the original tree sequence, we should make a new ancestor and do some ancestor matching before we do the sample matching. This makes the whole pathway quite a lot more complicated, urgh. But it does allow us to consider resolving topology within the new samples (for example, imagine adding a pair of siblings as new samples: with the pathway outlined previously, we wouldn't recognise them as sibs: that requires us to add a new parent node to the ancestors tree sequence, deduced on the basis of shared mutations or breakpoints between the siblings)

Adding new ancestral nodes to the ancestors tree sequence, on the basis of any new added samples, is extra functionality which I think we should eventually implement, but perhaps not at first.

hyanwong commented 2 years ago

This needs to be thought about in the context of https://github.com/tskit-dev/tsinfer/issues/729, and the idea of adding to an existing tree sequence.