tskit-dev / tskit

Population-scale genomics
MIT License
147 stars 69 forks source link

Bug assertion in `ts.allele_frequency_spectrum` #2923

Open nspope opened 2 months ago

nspope commented 2 months ago

with: afs_error.trees.gz

import tskit
print(tskit.__version__) # '0.5.7.dev0'
ts = tskit.load("afs_error.trees")
ts.allele_frequency_spectrum() # works
ts.allele_frequency_spectrum(sample_sets=[[x for x in range(64)]]) # segfaults
# Bug detected in lib/tskit/trees.c at line 2721. If you are using tskit directly please open an issue on 
# GitHub, ideally with a reproducible example. (https://github.com/tskit-dev/tskit/issues) 
# Aborted (core dumped)

This example has one tree and one site. The segfault seems to be linked to the fact that there are multiple mutations at the site, but mutations.parent is tskit.NULL even when mutations are beneath another-- that is, if I dump tables and correct the mutations.parent column to reflect what mutation replaced what, then everything works.

This is output from SINGER, so I'll let them know that mutations.parent should be set during conversion to a tree sequence. But, it'd be nice to have the above example run through, or at least throw an informative error (note that ts.diversity works, although it returns a negative number for the "sample_sets" case above).

petrelharp commented 2 months ago

That's here. That code is only called if it's unpolarized, however, doing it polarized hits essentially the same bug assert here.

You can just tell them to run tables.compute_mutation_parents(); that will do it.

petrelharp commented 2 months ago

Here's a MWE:

import tskit

tables = tskit.TableCollection(sequence_length=1)
tables.nodes.add_row(flags=1, time=0)
tables.nodes.add_row(time=1)
tables.edges.add_row(parent=1, child=0, left=0, right=1)
for _ in range(2):
    i = tables.nodes.add_row(flags=1, time=0)
    tables.edges.add_row(parent=1, child=i, left=0, right=1)

tables.sites.add_row(position=0, ancestral_state='a')
tables.mutations.add_row(site=0, node=0, derived_state='b', time=0)
tables.mutations.add_row(site=0, node=0, derived_state='b', time=0)

tables.sort()
ts = tables.tree_sequence()
ts.allele_frequency_spectrum(sample_sets=[[0]], polarised=True)

The issue here is that all the mutations to the same allele (here b) get counted here; so the algorithm thinks "okay here's an allele that's been inherited by 2 out of these 1 samples", whoops.

The mutation.parent is what's letting us properly count mutations that are nested in others; Even in cases without the bug assert, afs is reporting the wrong thing (heck, probably all the stats are).

nspope commented 2 months ago

Thanks Peter -- so, is the tskit_bug_assert the right exit point here, or should it raise a more specific error? If the former I'll go ahead and close this.

petrelharp commented 2 months ago

Don't close it, I think? It would be nice to give a more informative error. The problem is definitely not with allele_frequency_spectrum, since having correct mutation parents is something we assume in a bunch of places. The right thing to do would be to check mutation parents when loading the tree sequence - and the way to do this is to re-compute them and see if we get the same answer. IIRC we discussed this, but decided it was kinda costly, so didn't want to do it. A more informative error here wouldn't fix the problem because other operations would also turn up wrong info. So, maybe we just need to check this on load, maybe with a flag to disable the check? Either that or decide not to fix it?

petrelharp commented 2 months ago

For context, I think we decided previously that "gee there aren't very many softwares producing tree sequences; they'll just all need to learn to use compute_mutation_parents". Which is I guess what's happening.

jeromekelleher commented 2 months ago

There's a general issue here that we've been too lax with checking mutation parents. We'll probably just have to figure out how to detect whether mutation parents are set properly at load time, and start raising errors if not. Otherwise these types of error will start creeping in more and more.

petrelharp commented 2 months ago

We'll probably just have to figure out how to detect whether mutation parents are set properly at load time

I'm pretty sure the only reasonable way to do that is to just run compute_mutation_parents again: to check mutation parents are right, we've got to build and traverse the trees; and that's what compute_mutation_parents does. I don't see any simpler way to check for correctness, other than e.g. "if there's only one mutation per site then it's fine".

jeromekelleher commented 2 months ago

Your probably right. We could check for cheaper things like if we have two mutations on a branch, the parent must be set, but that's a fairly limited check.

So, I guess the proposal would be to run compute_mutation_parents if we see more than one mutation at any site, and check?