Closed petrelharp closed 4 years ago
@hyanwong has brought up the (2) before as a weirdness. It does seem strange all right - IIRC I think the main reason for it is that's how genotypes are generated. Definitely open to suggestions on how we can make it better.
For (1), weird things happen with simplify, but if someone wants to think through the details I'm definitely open to ideas. We haven't fully documented missing data yet, so we are free to make some changes I think.
Is there any problem with saying that it only counts as missing data if the nice is isolation and has no mutation above it? That way we would at least be able to represent that "this node isn't related to anyone else, but has allele 'A'".
Is there any problem with saying that it only counts as missing data if the nice is isolation and has no mutation above it? That way we would at least be able to represent that "this node isn't related to anyone else, but has allele 'A'".
I don't know, tbh. Missing data is hard, and I'd like to put fully nailing it down off until the current batch of releases have been done. Is this urgent?
I've created a "missing data" project to keep track of stuff.
Missing data is hard, and I'd like to put fully nailing it down off until the current batch of releases have been done. Is this urgent?
Not at all, although if we think we might change it, we should put a note in the documentation.
The genotypes thing may be being reworked by @mufernando in https://github.com/tskit-dev/tskit/issues/678 so implementing Peter's suggestion could be made part of that?
I realized that things would be a lot less surprising if we set the default of impute_missing_data
to True (and, in retrospect it would have been nice to call that option isolated_samples_are_missing
or something). But, it's probably too late to make that switch?
I think it's too late for 0.3 anyway @petrelharp - we can pick this up afterwards. We do need to sort out missing data soon, but lets try to ship this big update first.
Putting it off until after 0.3 seems OK as long as there will not then be a problem of "Well, we can't handle this the way we would like to, because we don't want to break backward compatibility with 0.3". Are we locking ourselves in to what we may conclude later is a bad policy?
The way things are done now has been this way since 0.2.0 @bhaller, so if we change we're going to have to consider how to manage breakage in either case. There's plenty of numbers, so if we have to go to 0.4 immediately after 0.3 then that doesn't worry me - but we really do need to ship 0.3 ASAP.
The way things are done now has been this way since 0.2.0 @bhaller, so if we change we're going to have to consider how to manage breakage in either case.
Hmm, is that entirely true? I see that @petrelharp has had to make some code changes in SLiM to avoid new problems with "missing data" that were not an issue previously, so it seems like something has changed with 0.3.
There's plenty of numbers, so if we have to go to 0.4 immediately after 0.3 then that doesn't worry me - but we really do need to ship 0.3 ASAP.
OK.
Maybe I hadn't updated the tskit code since before 0.2?
@bhaller has some important context about how this looks to SLiM in a different thread: https://github.com/MesserLab/SLiM/pull/101#issuecomment-670562632
Here's another (probably silly) suggestion, which may or may not help. It's always bugged me that a site has an "ancestral state", as it's only ancestral relative to the current samples. It becomes impossible to extend the simulation backwards to a time when the ancestral state was something different. It also means that a union
of two tree sequences with different ancestral states isn't easy.
Now that we can have multiple mutations at a site, I think it is more logical to imagine the ancestral state as a mutation above the (current) root. In this case, we could default to the ancestral state being missing (tskit.NULL
), with a mutation above the root switching that state to whatever we think the ancestral state for the current samples is.
If all sites have this property, then any node that is isolated in a tree would automatically take the ancestral state and therefore be marked as missing, with no further action needed on our part. Nodes that had a non-missing state but whose ancestry was unclear would then have to be marked with a mutation above them.
As mentioned in an aside at https://github.com/MesserLab/SLiM/pull/101#issuecomment-670570008, I think that internal non-sample nodes in a tree sequence, whether produced forwards or backwards in time, can be logically thought of as having "missing information" for some of the genome. In other words, if an internal node appears in some trees but not others, it is indicative of the fact that, in the parts of the genome where it does not appear, we no longer have enough information to reconstruct the ancestral haplotype.
The idea of marking some regions of a sample as missing by pruning away the edges that attach them to the rest of the tree in that location is a logical extension of this way of looking at it.
Hi, folks - I want to make a plug for changing the name of impute_missing_data
before the release - maybe it doesn't matter, since we'll have to keep the old name around anyhow? - but anyhow, here goes:
What "impute missing data" is doing is not really imputing at all, I'd argue. "Impute", for genomes, usually means "fancy guessing based on other nearby genotypes". What the option here is doing is just assigning the ancestral value. That is a natural guess for missing data, so that does count as "imputing", but it's still kinda surprising as at first guess people would assume that tskit would 'impute' using the trees somehow.
And, more seriously, there are other situations where isolated samples are not actually supposed to represent missing data. SLiM simulations, for instance. To get things to work out correctly for those, we need to set impute_missing_data=True
, when really what we mean is isolated_samples_are_missing=False
. So, we can certainly communicate that this is what everyone needs to do, but I anticipate a fair amount of confusion.
So: my proposal is to rename impute_missing_data=True
to isolated_samples_are_missing=False
or perhaps even just missing_data=False
.
So: my proposal is to rename
impute_missing_data=True
toisolated_samples_are_missing=False
or perhaps even justmissing_data=False
.
I completely agree. Passing TSK_IMPUTE_MISSING_DATA
feels almost like I'm asking tskit to do the opposite of what I want, in SLiM; I want isolated samples never to be considered "missing", and yet I'm asking tskit to "impute missing data". Very misleading/confusing.
Or more specifically, missing_is_ancestral
?
OK, looks like we need to push the release back then. I've tagged this with 0.3.0. I think we need to have a discussion about this as a group, so let's try and make a time for next week. I'll start a thread on the slack.
@benjeffery, @petrelharp, @bhaller and I had a meeting to discuss this today. Here's our conclusions:
I'll take a look at the feasibility of 1 and 2 here and report back.
Any thoughts/objections?
(Moved my comment to be below yours, @jeromekelleher , and added a bit.) So, we didn't actually choose a name for the flag in our chat today. On the C side, I'd propose TSK_DISABLE_MISSINGNESS. (A bit unfortunate that it's a negative, but since the default is for missingness to be on, I guess that's how it goes.) I'm not sure I'm a fan of TSK_MISSING_AS_ANCESTRAL. To me that doesn't convey what I think of as the essential function of this flag: turning off tskit's heuristics about what is "missing" and what is "not missing" entirely, and marking the tree sequence as not involving "missingness" at all.
In TSK_MISSING_AS_ANCESTRAL
, you could think of the "MISSING" as meaning that the ancestry is missing from that sample, not (just) that this is "missing data".
Maybe something more explicit like TSK_ISOLATED_AS_ANCESTRAL
?
Thinking this through... for the name of the flag/option, the two options are essentially either (a) "do we intend isolated samples to be missing data" or (b) "let isolated samples take the ancestral state". The first is more general, so we could re-use the name of the option in more other methods. But the second is more precise, and says what the option is actually making the method do.
Thinking ahead to statistics: what we'll want to do is either treat isolated samples as ancestral, or remove them because they are missing (as with na.rm=TRUE
in R). Either of the options above works, but I think option (a) better connotes what we will do if there is missing data? What about isolated_samples_missing
, defaulting to True
, and a C flag of TSK_ISOLATED_SAMPLES_NOT_MISSING?
This sounds very sensible @petrelharp - my only reservations is that isolated_samples_missing
is quite a long name. I'm tempted to have something like model_missing=True
, or missingness=True
, or allow_missing
, or enable_missing
or even missing_data
(that's too terse an uninterpretable, though). Really I think we want a flag (like @bhaller says) which just says allow for missingness in the data (modelled by isolated samples) or just assume that everyone gets the ancestral state.
Gee, missing_data
doesn't sound too cryptic to me?
For the stats, remove_missing
would be more descriptive, but that doesn't work for e.g. .variants()
. So, I'd vote for missing_data
... and maybe TSK_NO_MISSING_DATA
? What do you all think?
isolated_as_missing
is a bit shorter than isolated_samples_missing
, and we don't need the work "samples", since the word "isolated" is only ever applied to samples, ISTR.
By the way, my (slight) reservation about using the term "missing" in the description is that implies that this flag is slightly exceptional. But I suspect that as tree sequences start to be used for real life, rather than simple simulated datasets, missing data in samples will become the norm, rather than the exception. Or to put it another way, @petrelharp said "option (a) better connotes what we will do if there is missing data", but I think that the default mindset of most users will be to expect tree sequences to deal naturally with missing data. Therefore it might be better to describe the exceptional case: what we do if instead of allowing missing data, we substitute something else in there.
But this probably reflects my biases coming from the world of tsinfer & real sequence data, rather than SLiM + simulation.
isolated_as_missing
seems pretty good to me; maybe isolated_are_missing
would be even better? And on the C side, TSK_ISOLATED_ARE_NOT_MISSING
or (less verbosely, and it still seems clear) TSK_ISOLATED_NOT_MISSING
? missing_data
is cryptic to me, I agree, because it sounds like you could be saying, by passing false, "I guarantee that missing data is not present" rather than "Do not consider any data to be missing". And I like "isolated" being in there because it really makes clear what's being talked about in practice – what the flag will actually govern.
Hi all, 0.3.0 is very close now thanks to #782. The name changes discussed above are the only remaining item.
isolated_are_missing
seems to have most support if I've read the thread correctly. On the C side there is less consensus, but TSK_ISOLATED_NOT_MISSING
seems to mirror the python name.
Unless there is any issue shall we go with these?
I agree with @bhaller that the word "isolated" in there is very useful. I marginally prefer isolated_as_ancestral
default False (or s/as/are/
, whichever sounds nicer), but that's because I think of this as the unusual case, which needs a description to explain it, whereas from my POV it's clear that an isolated node should be thought of as unknown (=missing) and doesn't require a description. But I'm probably unusual in my thinking here, so I'd be happy to be outvoted.
[edit - I guess that would mean we could have an exact equivalent C name, TSK_ISOLATED_ARE_ANCESTRAL
, which could be an advantage?]
I'm coming around to isolated_as_missing=True
- this seems to convey the most information, and since people will need to use the option very rarely (missing data will be handled correctly, by default), it's OK that it's a little on the long side.
I'm with @jeromekelleher, and ISOLATED_NOT_MISSING
on the C side.
I've probably already made my position clear, but I'll say it again for the record. :-> I vote for isolated_as_missing
or isolated_are_missing
, and with TSK_ISOLATED_NOT_MISSING
. Not a fan of the ancestral
variants.
A draft PR for these changes is at https://github.com/tskit-dev/tskit/pull/794
Sorry to bring this up again, but #714 made @bhaller ask some good questions. Two observations:
simplify()
can make a formerly non-missing genotype become missing.The first thing is defensible, because we treat the state of every root as the "ancestral state" even though there's got to be some element of unknown-ness. But it is certainly surprising, and makes me wonder if we can modify this behavior. The easiest way I could think of to modify it would be to have some way to mark an isolated node as "not actually missing".
The second thing does not feel right.
Here's a demonstration of (2), btw:
I don't remember discussion of this point when working out missing-ness, but maybe I'm forgetting it?