Open jeromekelleher opened 3 years ago
If n5
was connected to the topology at site 9 then would you use the reference directly after site 2 or only start at 9? Not sure I'm clear on how we know where on the reference the N
s begin and end.
I think it's straightforward enough logically, we just treat the reference sequence like imaginary entries in the Site table which have no mutations. So, we should get the same result from a call to ts.haplotypes
in which we've added extra sites for every position in the genome with ancestral state equal to the reference base as we do from ts.alignments()
(without these extra sites). That fully defines the missing data behaviour then in terms of existing machinery.
Ahh, ok. I was thinking about this incorrectly.
NB: now that tsinfer
sets the start and end of the genome to missing, this issue means you can't normally output alignments for tsinferred tree sequences, which is a bit of a shame.
Hi all! I believe that the alignment function is not properly handling single-node tree sequences, specifically the _has_isolated_samples
function in tskit
. The following code reproduces the crash when simulating a sequence for a single node:
import msprime
import tskit
# Build demography object
dem = msprime.Demography()
dem.add_population(name='A', initial_size=1000)
sample = msprime.SampleSet(1, population='A', ploidy=1)
# Run coalescent simulation
trees = msprime.sim_ancestry([sample], sequence_length=10000, demography=dem)
# Get alignment for a single node
anc_seq = tskit.random_nucleotides(10000)
seqs = trees.alignments(reference_sequence=anc_seq, samples=[0])
# Crashes
next(seqs)
I am unsure if development of the C implementation is ongoing, or if it'd be worth it to patch the current python implementation of this function to return False
if the root is also the only node in the tree. Thank you for reading!
What we get is:
Traceback (most recent call last):
File "/home/jk/tmp/bug.py", line 17, in <module>
next(seqs)
File "/home/jk/.local/lib/python3.10/site-packages/tskit/trees.py", line 5572, in alignments
raise ValueError(
ValueError: Missing data not currently supported in alignments; see https://github.com/tskit-dev/tskit/issues/1896 for details.The current implementation may also incorrectly identify an input tree sequence has having missing data.
So yeah, this a corner case. I guess given that this has a single node and no edges or mutations, then the alignment is unambiguously the ancestral state at each site and there's no problem.
How did you come across this issue @Jazpy? Is it worth putting in a special case here for the one-node, zero edges, zero mutations corner case?
1825 added support for alignments, which passed through arguments for supporting missing data to the haplotypes method. However, the initial implementation was wrong. Consider the following example:
Then if we call
ts.as_fasta(reference_sequence="0123456789")
we getI think this is wrong because we're filling in the sites with Ns for
n5
, but we're returning the reference for all other sites, despite the topology stating thatn5
is unknown for the full sequence length. So, I thinkn5
should be all "N"s here.In general, I think we should return missing data for any part of the sequence where a node is isolated (there's a corner case here to consider regarding mutations above those nodes at sites in the missing region, though). This won't be possible without implementing
alignments
in C. If we're doing that, we should also probably think about how to implement indels etc (see #1775), which are now logically possible within the discrete-genome alignments framework.