Open nspope opened 7 months ago
Here's a MWE:
import tskit
tables = tskit.TableCollection(sequence_length=1)
tables.nodes.add_row(flags=1, time=0)
tables.nodes.add_row(time=1)
tables.edges.add_row(parent=1, child=0, left=0, right=1)
for _ in range(2):
i = tables.nodes.add_row(flags=1, time=0)
tables.edges.add_row(parent=1, child=i, left=0, right=1)
tables.sites.add_row(position=0, ancestral_state='a')
tables.mutations.add_row(site=0, node=0, derived_state='b', time=0)
tables.mutations.add_row(site=0, node=0, derived_state='b', time=0)
tables.sort()
ts = tables.tree_sequence()
ts.allele_frequency_spectrum(sample_sets=[[0]], polarised=True)
The issue here is that all the mutations to the same allele (here b
) get counted here; so the algorithm thinks "okay here's an allele that's been inherited by 2 out of these 1 samples", whoops.
The mutation.parent is what's letting us properly count mutations that are nested in others; Even in cases without the bug assert, afs is reporting the wrong thing (heck, probably all the stats are).
Thanks Peter -- so, is the tskit_bug_assert the right exit point here, or should it raise a more specific error? If the former I'll go ahead and close this.
Don't close it, I think? It would be nice to give a more informative error. The problem is definitely not with allele_frequency_spectrum
, since having correct mutation parents is something we assume in a bunch of places. The right thing to do would be to check mutation parents when loading the tree sequence - and the way to do this is to re-compute them and see if we get the same answer. IIRC we discussed this, but decided it was kinda costly, so didn't want to do it. A more informative error here wouldn't fix the problem because other operations would also turn up wrong info. So, maybe we just need to check this on load, maybe with a flag to disable the check? Either that or decide not to fix it?
For context, I think we decided previously that "gee there aren't very many softwares producing tree sequences; they'll just all need to learn to use compute_mutation_parents
". Which is I guess what's happening.
There's a general issue here that we've been too lax with checking mutation parents. We'll probably just have to figure out how to detect whether mutation parents are set properly at load time, and start raising errors if not. Otherwise these types of error will start creeping in more and more.
We'll probably just have to figure out how to detect whether mutation parents are set properly at load time
I'm pretty sure the only reasonable way to do that is to just run compute_mutation_parents
again: to check mutation parents are right, we've got to build and traverse the trees; and that's what compute_mutation_parents
does. I don't see any simpler way to check for correctness, other than e.g. "if there's only one mutation per site then it's fine".
Your probably right. We could check for cheaper things like if we have two mutations on a branch, the parent must be set, but that's a fairly limited check.
So, I guess the proposal would be to run compute_mutation_parents
if we see more than one mutation at any site, and check?
with: afs_error.trees.gz
This example has one tree and one site. The segfault seems to be linked to the fact that there are multiple mutations at the site, but
mutations.parent
is tskit.NULL even when mutations are beneath another-- that is, if I dump tables and correct the mutations.parent column to reflect what mutation replaced what, then everything works.This is output from SINGER, so I'll let them know that mutations.parent should be set during conversion to a tree sequence. But, it'd be nice to have the above example run through, or at least throw an informative error (note that ts.diversity works, although it returns a negative number for the "sample_sets" case above).