tskit-dev / tskit

Population-scale genomics
MIT License
151 stars 70 forks source link

Support missing data in alignments #1896

Open jeromekelleher opened 2 years ago

jeromekelleher commented 2 years ago

1825 added support for alignments, which passed through arguments for supporting missing data to the haplotypes method. However, the initial implementation was wrong. Consider the following example:

    # 2.00┊   4     ┊
    #     ┊ ┏━┻┓    ┊
    # 1.00┊ ┃  3    ┊
    #     ┊ ┃ ┏┻┓   ┊
    # 0.00┊ 0 1 2 5 ┊
    #     0        10
    #      |      |
    #  pos 2      9
    #  anc A      T
    @tests.cached_example
    def ts(self):
        ts = tskit.Tree.generate_balanced(3, span=10).tree_sequence
        tables = ts.dump_tables()
        tables.nodes.add_row(flags=tskit.NODE_IS_SAMPLE, time=0)
        tables.sites.add_row(2, ancestral_state="A")
        tables.sites.add_row(9, ancestral_state="T")
        tables.mutations.add_row(site=0, node=0, derived_state="G")
        tables.mutations.add_row(site=1, node=3, derived_state="C")
        return tables.tree_sequence()

Then if we call ts.as_fasta(reference_sequence="0123456789") we get

>n0
01G345678T
>n1
01A345678C
>n2
01A345678C
>n5
01N345678N

I think this is wrong because we're filling in the sites with Ns for n5, but we're returning the reference for all other sites, despite the topology stating that n5 is unknown for the full sequence length. So, I think n5 should be all "N"s here.

In general, I think we should return missing data for any part of the sequence where a node is isolated (there's a corner case here to consider regarding mutations above those nodes at sites in the missing region, though). This won't be possible without implementing alignments in C. If we're doing that, we should also probably think about how to implement indels etc (see #1775), which are now logically possible within the discrete-genome alignments framework.

benjeffery commented 2 years ago

If n5 was connected to the topology at site 9 then would you use the reference directly after site 2 or only start at 9? Not sure I'm clear on how we know where on the reference the Ns begin and end.

jeromekelleher commented 2 years ago

I think it's straightforward enough logically, we just treat the reference sequence like imaginary entries in the Site table which have no mutations. So, we should get the same result from a call to ts.haplotypes in which we've added extra sites for every position in the genome with ancestral state equal to the reference base as we do from ts.alignments() (without these extra sites). That fully defines the missing data behaviour then in terms of existing machinery.

benjeffery commented 2 years ago

Ahh, ok. I was thinking about this incorrectly.

hyanwong commented 1 year ago

NB: now that tsinfer sets the start and end of the genome to missing, this issue means you can't normally output alignments for tsinferred tree sequences, which is a bit of a shame.

Jazpy commented 1 month ago

Hi all! I believe that the alignment function is not properly handling single-node tree sequences, specifically the _has_isolated_samples function in tskit. The following code reproduces the crash when simulating a sequence for a single node:

import msprime
import tskit

# Build demography object
dem = msprime.Demography()
dem.add_population(name='A', initial_size=1000)
sample = msprime.SampleSet(1, population='A', ploidy=1)

# Run coalescent simulation
trees = msprime.sim_ancestry([sample], sequence_length=10000, demography=dem)

# Get alignment for a single node
anc_seq = tskit.random_nucleotides(10000)
seqs = trees.alignments(reference_sequence=anc_seq, samples=[0])

# Crashes
next(seqs)

I am unsure if development of the C implementation is ongoing, or if it'd be worth it to patch the current python implementation of this function to return False if the root is also the only node in the tree. Thank you for reading!

jeromekelleher commented 1 month ago

What we get is:

Traceback (most recent call last):
  File "/home/jk/tmp/bug.py", line 17, in <module>
    next(seqs)
  File "/home/jk/.local/lib/python3.10/site-packages/tskit/trees.py", line 5572, in alignments
    raise ValueError(
ValueError: Missing data not currently supported in alignments; see https://github.com/tskit-dev/tskit/issues/1896 for details.The current implementation may also incorrectly identify an input tree sequence has having missing data.

So yeah, this a corner case. I guess given that this has a single node and no edges or mutations, then the alignment is unambiguously the ancestral state at each site and there's no problem.

How did you come across this issue @Jazpy? Is it worth putting in a special case here for the one-node, zero edges, zero mutations corner case?