Some questions about missing data

petrelharp commented 4 years ago

Sorry to bring this up again, but #714 made @bhaller ask some good questions. Two observations:

It is surprising that simplify() can make a formerly non-missing genotype become missing.
An isolated node is considered to have a missing genotype even if it has a mutation above it, which is also surprising.

The first thing is defensible, because we treat the state of every root as the "ancestral state" even though there's got to be some element of unknown-ness. But it is certainly surprising, and makes me wonder if we can modify this behavior. The easiest way I could think of to modify it would be to have some way to mark an isolated node as "not actually missing".

The second thing does not feel right.

Here's a demonstration of (2), btw:

tables = tskit.TableCollection(sequence_length=1.0)
tables.nodes.add_row(time=0, flags=1)
tables.nodes.add_row(time=0, flags=1)
tables.nodes.add_row(time=0, flags=1)
tables.nodes.add_row(time=1, flags=0)
tables.edges.add_row(parent=3, child=1, left=0, right=1)
tables.edges.add_row(parent=3, child=2, left=0, right=1)
tables.sites.add_row(position=0.5, ancestral_state='x')
tables.mutations.add_row(site=0, node=0, derived_state='y')
ts = tables.tree_sequence()
ts.genotype_matrix()
# array([[-1,  0,  0]], dtype=int8)

I don't remember discussion of this point when working out missing-ness, but maybe I'm forgetting it?

jeromekelleher commented 4 years ago

@hyanwong has brought up the (2) before as a weirdness. It does seem strange all right - IIRC I think the main reason for it is that's how genotypes are generated. Definitely open to suggestions on how we can make it better.

For (1), weird things happen with simplify, but if someone wants to think through the details I'm definitely open to ideas. We haven't fully documented missing data yet, so we are free to make some changes I think.

petrelharp commented 4 years ago

Is there any problem with saying that it only counts as missing data if the nice is isolation and has no mutation above it? That way we would at least be able to represent that "this node isn't related to anyone else, but has allele 'A'".

jeromekelleher commented 4 years ago

Is there any problem with saying that it only counts as missing data if the nice is isolation and has no mutation above it? That way we would at least be able to represent that "this node isn't related to anyone else, but has allele 'A'".

I don't know, tbh. Missing data is hard, and I'd like to put fully nailing it down off until the current batch of releases have been done. Is this urgent?

jeromekelleher commented 4 years ago

I've created a "missing data" project to keep track of stuff.

petrelharp commented 4 years ago

Missing data is hard, and I'd like to put fully nailing it down off until the current batch of releases have been done. Is this urgent?

Not at all, although if we think we might change it, we should put a note in the documentation.

hyanwong commented 4 years ago

The genotypes thing may be being reworked by @mufernando in https://github.com/tskit-dev/tskit/issues/678 so implementing Peter's suggestion could be made part of that?

petrelharp commented 4 years ago

I realized that things would be a lot less surprising if we set the default of impute_missing_data to True (and, in retrospect it would have been nice to call that option isolated_samples_are_missing or something). But, it's probably too late to make that switch?

jeromekelleher commented 4 years ago

I think it's too late for 0.3 anyway @petrelharp - we can pick this up afterwards. We do need to sort out missing data soon, but lets try to ship this big update first.

bhaller commented 4 years ago

Putting it off until after 0.3 seems OK as long as there will not then be a problem of "Well, we can't handle this the way we would like to, because we don't want to break backward compatibility with 0.3". Are we locking ourselves in to what we may conclude later is a bad policy?

jeromekelleher commented 4 years ago

The way things are done now has been this way since 0.2.0 @bhaller, so if we change we're going to have to consider how to manage breakage in either case. There's plenty of numbers, so if we have to go to 0.4 immediately after 0.3 then that doesn't worry me - but we really do need to ship 0.3 ASAP.

bhaller commented 4 years ago

The way things are done now has been this way since 0.2.0 @bhaller, so if we change we're going to have to consider how to manage breakage in either case.

Hmm, is that entirely true? I see that @petrelharp has had to make some code changes in SLiM to avoid new problems with "missing data" that were not an issue previously, so it seems like something has changed with 0.3.

There's plenty of numbers, so if we have to go to 0.4 immediately after 0.3 then that doesn't worry me - but we really do need to ship 0.3 ASAP.

OK.

petrelharp commented 4 years ago

Maybe I hadn't updated the tskit code since before 0.2?

petrelharp commented 4 years ago

@bhaller has some important context about how this looks to SLiM in a different thread: https://github.com/MesserLab/SLiM/pull/101#issuecomment-670562632

hyanwong commented 4 years ago

Here's another (probably silly) suggestion, which may or may not help. It's always bugged me that a site has an "ancestral state", as it's only ancestral relative to the current samples. It becomes impossible to extend the simulation backwards to a time when the ancestral state was something different. It also means that a union of two tree sequences with different ancestral states isn't easy.

Now that we can have multiple mutations at a site, I think it is more logical to imagine the ancestral state as a mutation above the (current) root. In this case, we could default to the ancestral state being missing (tskit.NULL), with a mutation above the root switching that state to whatever we think the ancestral state for the current samples is.

If all sites have this property, then any node that is isolated in a tree would automatically take the ancestral state and therefore be marked as missing, with no further action needed on our part. Nodes that had a non-missing state but whose ancestry was unclear would then have to be marked with a mutation above them.

hyanwong commented 4 years ago

As mentioned in an aside at https://github.com/MesserLab/SLiM/pull/101#issuecomment-670570008, I think that internal non-sample nodes in a tree sequence, whether produced forwards or backwards in time, can be logically thought of as having "missing information" for some of the genome. In other words, if an internal node appears in some trees but not others, it is indicative of the fact that, in the parts of the genome where it does not appear, we no longer have enough information to reconstruct the ancestral haplotype.

The idea of marking some regions of a sample as missing by pruning away the edges that attach them to the rest of the tree in that location is a logical extension of this way of looking at it.

petrelharp commented 4 years ago

Hi, folks - I want to make a plug for changing the name of impute_missing_data before the release - maybe it doesn't matter, since we'll have to keep the old name around anyhow? - but anyhow, here goes:

What "impute missing data" is doing is not really imputing at all, I'd argue. "Impute", for genomes, usually means "fancy guessing based on other nearby genotypes". What the option here is doing is just assigning the ancestral value. That is a natural guess for missing data, so that does count as "imputing", but it's still kinda surprising as at first guess people would assume that tskit would 'impute' using the trees somehow.

And, more seriously, there are other situations where isolated samples are not actually supposed to represent missing data. SLiM simulations, for instance. To get things to work out correctly for those, we need to set impute_missing_data=True, when really what we mean is isolated_samples_are_missing=False. So, we can certainly communicate that this is what everyone needs to do, but I anticipate a fair amount of confusion.

So: my proposal is to rename impute_missing_data=True to isolated_samples_are_missing=False or perhaps even just missing_data=False.

bhaller commented 4 years ago

So: my proposal is to rename impute_missing_data=True to isolated_samples_are_missing=False or perhaps even just missing_data=False.

I completely agree. Passing TSK_IMPUTE_MISSING_DATA feels almost like I'm asking tskit to do the opposite of what I want, in SLiM; I want isolated samples never to be considered "missing", and yet I'm asking tskit to "impute missing data". Very misleading/confusing.

hyanwong commented 4 years ago

Or more specifically, missing_is_ancestral ?

jeromekelleher commented 4 years ago

OK, looks like we need to push the release back then. I've tagged this with 0.3.0. I think we need to have a discussion about this as a group, so let's try and make a time for next week. I'll start a thread on the slack.

jeromekelleher commented 4 years ago

@benjeffery, @petrelharp, @bhaller and I had a meeting to discuss this today. Here's our conclusions:

The current names for controlling the behaviour are confusing (TSK_IMPUTE_MISSING_DATA), so we need an alternative. I like @hyanwong's idea of TSK_MISSING_AS_ANCESTRAL, but open to ideas here.
An isolated node that has a mutation above it shouldn't be considered missing. We should try to fix this if at all possible.
Most of the reason that missing data is problematic currently in SLiM is because of the current behaviour of simplify, where we remove unary branches from the MRCA of samples back to more ancient nodes. We should change the default behaviour of simplify to keep these edges, because there's important information in knowing how long back your simulation actually started.

I'll take a look at the feasibility of 1 and 2 here and report back.

Any thoughts/objections?

bhaller commented 4 years ago

(Moved my comment to be below yours, @jeromekelleher , and added a bit.) So, we didn't actually choose a name for the flag in our chat today. On the C side, I'd propose TSK_DISABLE_MISSINGNESS. (A bit unfortunate that it's a negative, but since the default is for missingness to be on, I guess that's how it goes.) I'm not sure I'm a fan of TSK_MISSING_AS_ANCESTRAL. To me that doesn't convey what I think of as the essential function of this flag: turning off tskit's heuristics about what is "missing" and what is "not missing" entirely, and marking the tree sequence as not involving "missingness" at all.

hyanwong commented 4 years ago

In TSK_MISSING_AS_ANCESTRAL, you could think of the "MISSING" as meaning that the ancestry is missing from that sample, not (just) that this is "missing data".

benjeffery commented 4 years ago

Maybe something more explicit like TSK_ISOLATED_AS_ANCESTRAL?

petrelharp commented 4 years ago

Thinking this through... for the name of the flag/option, the two options are essentially either (a) "do we intend isolated samples to be missing data" or (b) "let isolated samples take the ancestral state". The first is more general, so we could re-use the name of the option in more other methods. But the second is more precise, and says what the option is actually making the method do.

Thinking ahead to statistics: what we'll want to do is either treat isolated samples as ancestral, or remove them because they are missing (as with na.rm=TRUE in R). Either of the options above works, but I think option (a) better connotes what we will do if there is missing data? What about isolated_samples_missing, defaulting to True, and a C flag of TSK_ISOLATED_SAMPLES_NOT_MISSING?

jeromekelleher commented 4 years ago

This sounds very sensible @petrelharp - my only reservations is that isolated_samples_missing is quite a long name. I'm tempted to have something like model_missing=True, or missingness=True, or allow_missing, or enable_missing or even missing_data (that's too terse an uninterpretable, though). Really I think we want a flag (like @bhaller says) which just says allow for missingness in the data (modelled by isolated samples) or just assume that everyone gets the ancestral state.

petrelharp commented 4 years ago

Gee, missing_data doesn't sound too cryptic to me?

For the stats, remove_missing would be more descriptive, but that doesn't work for e.g. .variants(). So, I'd vote for missing_data ... and maybe TSK_NO_MISSING_DATA? What do you all think?

hyanwong commented 4 years ago

isolated_as_missing is a bit shorter than isolated_samples_missing, and we don't need the work "samples", since the word "isolated" is only ever applied to samples, ISTR.

hyanwong commented 4 years ago

By the way, my (slight) reservation about using the term "missing" in the description is that implies that this flag is slightly exceptional. But I suspect that as tree sequences start to be used for real life, rather than simple simulated datasets, missing data in samples will become the norm, rather than the exception. Or to put it another way, @petrelharp said "option (a) better connotes what we will do if there is missing data", but I think that the default mindset of most users will be to expect tree sequences to deal naturally with missing data. Therefore it might be better to describe the exceptional case: what we do if instead of allowing missing data, we substitute something else in there.

But this probably reflects my biases coming from the world of tsinfer & real sequence data, rather than SLiM + simulation.

bhaller commented 4 years ago

isolated_as_missing seems pretty good to me; maybe isolated_are_missing would be even better? And on the C side, TSK_ISOLATED_ARE_NOT_MISSING or (less verbosely, and it still seems clear) TSK_ISOLATED_NOT_MISSING? missing_data is cryptic to me, I agree, because it sounds like you could be saying, by passing false, "I guarantee that missing data is not present" rather than "Do not consider any data to be missing". And I like "isolated" being in there because it really makes clear what's being talked about in practice – what the flag will actually govern.

benjeffery commented 4 years ago

Hi all, 0.3.0 is very close now thanks to #782. The name changes discussed above are the only remaining item.

isolated_are_missing seems to have most support if I've read the thread correctly. On the C side there is less consensus, but TSK_ISOLATED_NOT_MISSING seems to mirror the python name.

Unless there is any issue shall we go with these?

hyanwong commented 4 years ago

I agree with @bhaller that the word "isolated" in there is very useful. I marginally prefer isolated_as_ancestral default False (or s/as/are/, whichever sounds nicer), but that's because I think of this as the unusual case, which needs a description to explain it, whereas from my POV it's clear that an isolated node should be thought of as unknown (=missing) and doesn't require a description. But I'm probably unusual in my thinking here, so I'd be happy to be outvoted.

[edit - I guess that would mean we could have an exact equivalent C name, TSK_ISOLATED_ARE_ANCESTRAL, which could be an advantage?]

jeromekelleher commented 4 years ago

I'm coming around to isolated_as_missing=True - this seems to convey the most information, and since people will need to use the option very rarely (missing data will be handled correctly, by default), it's OK that it's a little on the long side.

petrelharp commented 4 years ago

I'm with @jeromekelleher, and ISOLATED_NOT_MISSING on the C side.

bhaller commented 4 years ago

I've probably already made my position clear, but I'll say it again for the record. :-> I vote for isolated_as_missing or isolated_are_missing, and with TSK_ISOLATED_NOT_MISSING. Not a fan of the ancestral variants.

benjeffery commented 4 years ago

A draft PR for these changes is at https://github.com/tskit-dev/tskit/pull/794

tskit-dev / tskit

Some questions about missing data #716