tskit-dev / msprime

Simulate genealogical trees and genomic sequence data using population genetic models
GNU General Public License v3.0

define individual flags #482

Closed: petrelharp closed this issue 6 years ago

petrelharp commented 6 years ago

There are a few flags for the individual tables in #472 that should be standard, at least (I propose):

MSP_INDIVIDUAL_SEX_FEMALE 1
MSP_INDIVIDUAL_SEX_MALE 2

If these are each flags then absence of both flags, or presence of both, could mean appropriate things in certain systems.
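To make the four possible states concrete, here is a minimal sketch in C, assuming the two values proposed above. The helper `tally_sex_flags` and the way it indexes its output are illustrative assumptions, not part of msprime.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values proposed in this thread; not an existing msprime API. */
#define MSP_INDIVIDUAL_SEX_FEMALE 1u
#define MSP_INDIVIDUAL_SEX_MALE 2u

/* Hypothetical helper: tally the four states of the two sex bits.
 * counts[0] = neither bit set, counts[1] = female only,
 * counts[2] = male only, counts[3] = both bits set. */
void tally_sex_flags(const uint32_t *flags, size_t num_individuals, size_t counts[4])
{
    size_t j;
    const uint32_t sex_bits = MSP_INDIVIDUAL_SEX_FEMALE | MSP_INDIVIDUAL_SEX_MALE;

    assert(counts != NULL);
    assert(flags != NULL || num_individuals == 0);
    for (j = 0; j < 4; j++) {
        counts[j] = 0;
    }
    for (j = 0; j < num_individuals; j++) {
        /* Masking discards any unrelated bits in the flags word. */
        counts[flags[j] & sex_bits]++;
    }
}
```

What "neither" and "both" mean biologically is left to the system being modelled; the code only separates the states.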

An alternative to this scheme would be to have _XX, _XY, _WW, _WZ, _U, _V, _UV, and something for haplodiploids. This doesn't seem right, though, as we won't always even have the sex chromosomes. And what about platypuses?

We could also have flags for ploidy... but how many possibilities? Perhaps that's better left to metadata.

Any other standard flags in SLiM, @bhaller?

jeromekelleher commented 6 years ago

> MSP_INDIVIDUAL_SEX_FEMALE 1
> MSP_INDIVIDUAL_SEX_MALE 2

sounds good to me, and should cover a large number of biologically relevant situations.

I don't think we need to worry about ploidy at the moment, as we can determine it by looking at the number of nodes associated with each individual. If this becomes awkward for whatever reason, we can always add another column to the individual table.
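As a rough sketch of that counting, assume each node row stores the ID of the individual it belongs to, with -1 meaning "no individual". The array layout and the function name `count_ploidy` are assumptions made for illustration, not the msprime API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: node_individual[u] is the ID of the individual that node u
 * belongs to, or -1 if node u is not attached to any individual. */
void count_ploidy(const int32_t *node_individual, size_t num_nodes,
    uint32_t *ploidy, size_t num_individuals)
{
    size_t u, j;

    assert(ploidy != NULL || num_individuals == 0);
    assert(node_individual != NULL || num_nodes == 0);
    for (j = 0; j < num_individuals; j++) {
        ploidy[j] = 0;
    }
    for (u = 0; u < num_nodes; u++) {
        int32_t ind = node_individual[u];
        if (ind >= 0) {
            assert((size_t) ind < num_individuals);
            /* One more node (genome copy) for this individual. */
            ploidy[ind]++;
        }
    }
}
```

With nodes mapped to individuals as `{0, 0, 1, -1, 2, 2, 2}`, the resulting ploidies are 2, 1 and 3, and the unattached node is ignored.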

petrelharp commented 6 years ago

I've put this into #472 (but not devised any tests... I suppose we could implement a two-sex Wright-Fisher model to test it with, somehow...).

bhaller commented 6 years ago

OK. SLiM uses a separate value for hermaphrodites; looks like you've decided to use (MSP_INDIVIDUAL_SEX_FEMALE | MSP_INDIVIDUAL_SEX_MALE) to represent hermaphrodites instead, which is also reasonable, but will require translation back and forth. I see your commit is doing that, so that's fine. But do we really not want a separate MSP_INDIVIDUAL_HERMAPHRODITE value instead?
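The translation in question might look like the following sketch. The `sim_sex_t` enumeration and its numeric values are hypothetical stand-ins for a simulator-side representation; they are not SLiM's actual constants, and `SEX_BITS` is a name invented here.

```c
#include <assert.h>
#include <stdint.h>

/* Flag values proposed in this thread. */
#define MSP_INDIVIDUAL_SEX_FEMALE 1u
#define MSP_INDIVIDUAL_SEX_MALE 2u
/* Hypothetical convenience name for both bits together. */
#define SEX_BITS (MSP_INDIVIDUAL_SEX_FEMALE | MSP_INDIVIDUAL_SEX_MALE)

/* Hypothetical simulator-side enumeration with a separate hermaphrodite
 * value. The numbers are illustrative only. */
typedef enum {
    SIM_SEX_UNSPECIFIED = 0,
    SIM_SEX_HERMAPHRODITE = 1,
    SIM_SEX_FEMALE = 2,
    SIM_SEX_MALE = 3
} sim_sex_t;

/* Simulator value -> OR'd flag bits. */
uint32_t sim_sex_to_flags(sim_sex_t sex)
{
    switch (sex) {
        case SIM_SEX_FEMALE:
            return MSP_INDIVIDUAL_SEX_FEMALE;
        case SIM_SEX_MALE:
            return MSP_INDIVIDUAL_SEX_MALE;
        case SIM_SEX_HERMAPHRODITE:
            return SEX_BITS;
        case SIM_SEX_UNSPECIFIED:
            return 0;
    }
    assert(0 && "unknown sim_sex_t value");
    return 0;
}

/* OR'd flag bits -> simulator value; unrelated bits are ignored. */
sim_sex_t flags_to_sim_sex(uint32_t flags)
{
    switch (flags & SEX_BITS) {
        case SEX_BITS:
            return SIM_SEX_HERMAPHRODITE;
        case MSP_INDIVIDUAL_SEX_FEMALE:
            return SIM_SEX_FEMALE;
        case MSP_INDIVIDUAL_SEX_MALE:
            return SIM_SEX_MALE;
        default:
            return SIM_SEX_UNSPECIFIED;
    }
}
```

The round trip is lossless for these four values, which is why the translation is workable, if slightly tedious.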

A question: will a regular msprime coalescent run write out an individuals table? If so, what will that table say? Will it describe the individuals as being haploid, with one node per individual? If so, should there be a value for that, like MSP_INDIVIDUAL_HAPLOID or something? Or will it group pairs of nodes into diploid hermaphrodite individuals? Or will there be no individuals table, in that case? Or what?

X, Y, W, Z, etc., seem like they would be attributes on nodes, not on individuals; a given chromosome is an X or a Y etc. And in any case ploidy, sex chromosomes, etc. get quite complex, since we can imagine a .trees file containing information about multiple chromosomes; you might want to model three autosomes plus an X and a Y, and throw the results all into one .trees file, no? We could devise a complicated scheme that would allow us to represent that sort of thing, but that feels like overkill right now. Right now, though, SLiM can already model separate sexes or hermaphrodites (we have that covered with the proposed flags), and if separate sexes are modeled then it can model either an autosome or an X/Y system (we do not have that covered yet). So the question is, do we want a standard way for SLiM to communicate what it has done, or do we just want a way for SLiM to "talk to itself" in private metadata so that it can load in files that it has itself saved?

bhaller commented 6 years ago

(To be clear, when I say above "MSP_INDIVIDUAL_HAPLOID", my question is really: what value will msprime write out for the sex field in the individuals table, and should there be a defined constant that describes the meaning of the value that msprime writes out (which will presumably not be MSP_INDIVIDUAL_SEX_FEMALE or MSP_INDIVIDUAL_SEX_MALE)? If it writes out zero, then it seems like there should be a defined constant that says what zero means.)

petrelharp commented 6 years ago

> do we really not want a separate MSP_INDIVIDUAL_HERMAPHRODITE value instead

I don't think so. Because: then what would MALE | FEMALE mean? Best to keep things unambiguous. Ah, but your point is, how's it different from 0? I'd say that 0 means "we haven't specified", like SLiM's Unspecified; which I think is conceptually different to hermaphrodite. (at least in a coalescent simulation, when it could make sense to say "our model says this individual has a sex but we don't know what it is")

> X, Y, W, Z, etc., seem like they would be attributes on nodes, not on individuals; ... an X/Y system (we do not have that covered yet)

Ah, good idea! We could use up 6 flags on XY, ZW and UV sex determination systems; but since no-one is going to use more than one at once, we could just define:

MSP_NODE_HAS_SEX_CHROMOSOME_X 2
MSP_NODE_HAS_SEX_CHROMOSOME_Z 2
MSP_NODE_HAS_SEX_CHROMOSOME_U 2
MSP_NODE_HAS_SEX_CHROMOSOME_Y 4
MSP_NODE_HAS_SEX_CHROMOSOME_W 4
MSP_NODE_HAS_SEX_CHROMOSOME_V 4

I'd say we should use these flags; the question is whether to have them standard in tskit.

petrelharp commented 6 years ago

> A question: will a regular msprime coalescent run write out an individuals table?

Oh, right: regular msprime does haploid individuals without defined sex. So, it won't write out an individuals table, although we need to write a function that will make one up. I don't think we need a HAPLOID flag, although someone studying a haplodiploid system could clearly use one.

bhaller commented 6 years ago

> I don't think so. Because: then what would MALE | FEMALE mean?

Well, I guess I'm proposing that they would be like an enumeration, not flags that would be OR'd together. You could have hermaphrodite, male, female, and then (potentially) other values for odd situations like haplodiploidy etc. MALE | FEMALE wouldn't mean anything, any more than ORing together any other enumeration values means anything. :->

> I'd say we should use these flags; the question is whether to have them standard in tskit.

My opinion is that the more things are made explicit and standard in tskit, the simpler it will be to interchange data between different programs. If each program uses different specialized metadata schemes to represent concepts that are actually universal, it will make interchange much harder. However, things like ploidy and sex chromosomes and such are complicated enough that there is also a danger of over-standardizing, if we fail to anticipate the ways in which people might want to use this stuff. Still, I lean in the direction of explicitly declared interchange standards.

jeromekelleher commented 6 years ago

> My opinion is that the more things are made explicit and standard in tskit, the simpler it will be to interchange data between different programs. If each program uses different specialized metadata schemes to represent concepts that are actually universal, it will make interchange much harder. However, things like ploidy and sex chromosomes and such are complicated enough that there is also a danger of over-standardizing, if we fail to anticipate the ways in which people might want to use this stuff. Still, I lean in the direction of explicitly declared interchange standards.

Having spent several years in the GA4GH working on data standards I have a keen appreciation for the dangers of both under- and over-standardising. Biology is complicated, so coming up with a standard, neat way of encoding every possibility is basically impossible I think. My take would be to define some simple flags that capture the common cases that we often deal with well, and try to leave the door open for further extensions later. I think it's inevitable that there will be some overlaps and inconsistencies in the concepts, because basically that's how biology works. We're not just dealing with simulations here; we do hope to be able to represent arbitrary real data in this form some day.

There is one concrete lesson that I learned from GA4GH though: do not standardise something unless you have an immediate use for it. So, I'm happy to add a bunch of flags here, but only if they are of immediate use to one (ideally, more) programs.

@molpopgen, we should have your thoughts on this too. What's your take on defining individual attributes like sex and so on?

molpopgen commented 6 years ago

I think sex flags quickly become a can of worms. Defining two as @petrelharp suggested doesn't go far enough: there are systems with many more than two mating types. Same issues for sex chromosomes. This stuff is an internal book-keeping detail for a simulator, IMO, and I think it falls under the category of metadata.

Overall, I think I'd prefer to see tskit do as little as possible, meaning that it should handle the tables necessary to represent the genealogies. My opinion is that it is reasonable to expect forward sim authors to book-keep their own stuff specific to how they choose to model the world. Sex is a book-keeping detail affecting how the edges are generated. Same with sex chromosomes. Thus, tskit can, and perhaps should, be agnostic to all that stuff, but allow for it to be written as metadata.

molpopgen commented 6 years ago

> Still, I lean in the direction of explicitly declared interchange standards.

I don't think we need to specify all this stuff to have good interchange standards. If we require a bunch of stuff now, that simply places constraints on future authors. The forward simulation authors should be required to document their metadata. If done properly, that data could be read from anywhere. But I'm not sure that's really needed: I don't think we're that interested in a world where the output of forward simulator A can be read by simulator B via a file containing tree sequences.

bhaller commented 6 years ago

> I don't think we're that interested in a world where the output of forward simulator A can be read by simulator B via a file containing tree sequences.

No, probably not; but we probably are interested in a world where forward simulators can read stuff generated by msprime, and where msprime can read (and understand!) stuff generated by the forward simulators. Easy interchange with msprime is my primary concern.

To be clear, I am not advocating for trying to define some comprehensive set of values for every possible sexual system. I'm only advocating for (a) having values for male and female, (b) having a value for "hermaphrodite" that is a separate value, rather than (male | female), since I think these are better as an enumeration than as flags that get OR'd together, and (c) having some kind of reserved space in the value range that forward simulators can define their own values within, without fear of collision with later tskit changes.

If msprime will never attach meaning to male/female/hermaphrodite, will never want to treat those differently in any way, and will never want to know which is the case for data imported into it, then I'm OK with all of the values, including male/female/hermaphrodite, being metadata. But is that really the case?

jeromekelleher commented 6 years ago

> To be clear, I am not advocating for trying to define some comprehensive set of values for every possible sexual system. I'm only advocating for (a) having values for male and female, (b) having a value for "hermaphrodite" that is a separate value, rather than (male | female), since I think these are better as an enumeration than as flags that get OR'd together, and (c) having some kind of reserved space in the value range that forward simulators can define their own values within, without fear of collision with later tskit changes.

@bhaller's proposal seems sensible to me. Hermaphrodite has a clear meaning in plant breeding systems, right?

I can see uses for splitting by sex in tskit, since we'll be computing statistics and there are surely statistics in which we need to treat males and females differently. So, my take would be that these three flags are good.

The idea of reserving some of the flag bits for application use is an excellent one, and we should do it I think.

I agree with your perspective @molpopgen, and we should keep the amount of stuff defined by tskit to a minimum. But I think this minimum probably does include some basic handling of sex --- we're talking about real data as well here, not just simulations.

petrelharp commented 6 years ago

OK, let's put in

MSP_INDIVIDUAL_SEX_HERMAPHRODITE 4

Mating types is a whole different thing, and I agree, we can't generalize.

bhaller commented 6 years ago

MSP_INDIVIDUAL_SEX_HERMAPHRODITE 3, not 4, no? As I've advocated above, this should be an enumeration, not OR'd flags; if third party software like SLiM adds private values for other mating types (haplodiploidy etc.), those would be additional enumeration values. I would suggest that tskit should define a starting value for such private use, like:

#define MSP_INDIVIDUAL_SEX_FIRST_PRIVATE 1000

or some such, and then private use would count up as 1000, 1001, 1002...
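A minimal sketch of how such a private range could be used, assuming the enumeration values suggested in this comment. The `CLIENT_SEX_*` names and the helper `sex_value_is_private` are hypothetical and only illustrate the idea.

```c
#include <assert.h>
#include <stdint.h>

/* Enumeration values as suggested in this thread (not an existing API). */
#define MSP_INDIVIDUAL_SEX_FEMALE 1
#define MSP_INDIVIDUAL_SEX_MALE 2
#define MSP_INDIVIDUAL_SEX_HERMAPHRODITE 3
#define MSP_INDIVIDUAL_SEX_FIRST_PRIVATE 1000

/* Hypothetical client-defined values, counting up from the private range. */
enum {
    CLIENT_SEX_HAPLOID_MALE = MSP_INDIVIDUAL_SEX_FIRST_PRIVATE, /* 1000 */
    CLIENT_SEX_DIPLOID_FEMALE,                                  /* 1001 */
    CLIENT_SEX_STERILE_WORKER                                   /* 1002 */
};

/* Returns nonzero if the value lies in the client-private range. */
int sex_value_is_private(int32_t value)
{
    assert(value >= 0);
    return value >= MSP_INDIVIDUAL_SEX_FIRST_PRIVATE;
}
```

Any later standard values would be added below 1000, so they could not collide with values a client has already written to files.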

@petrelharp, if you let me know when these changes have trickled down to SLiM's copy of tskit, I'll be happy to make the corresponding changes in the SLiM code, since I know you're busy lately!

petrelharp commented 6 years ago

> this should be an enumeration, not OR'd flags

Huh. But, what if someone doing GWAS wants to add a flag for affected; then how do you flag "affected and male"?
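With independent flag bits, that combination is a single bitwise test, as in this sketch. The `GWAS_INDIVIDUAL_AFFECTED` bit and its position are hypothetical; only the two sex values come from this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sex flag values proposed in this thread. */
#define MSP_INDIVIDUAL_SEX_FEMALE 1u
#define MSP_INDIVIDUAL_SEX_MALE 2u
/* Hypothetical application-defined flag; the bit position is arbitrary. */
#define GWAS_INDIVIDUAL_AFFECTED (1u << 8)

/* Count individuals that carry both the MALE and the AFFECTED bit. */
size_t count_affected_males(const uint32_t *flags, size_t num_individuals)
{
    size_t j, count = 0;
    const uint32_t wanted = MSP_INDIVIDUAL_SEX_MALE | GWAS_INDIVIDUAL_AFFECTED;

    assert(flags != NULL || num_individuals == 0);
    for (j = 0; j < num_individuals; j++) {
        if ((flags[j] & wanted) == wanted) {
            count++;
        }
    }
    return count;
}
```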

bhaller commented 6 years ago

> Huh. But, what if someone doing GWAS wants to add a flag for affected; then how do you flag "affected and male"?

Oh, I think I'm advocating that "sex" be a separate enumeration column, not part of flags any more. Why it ought to be enumeration is, I think, clear if you contemplate further sex types. If you had worker, queen, etc., I don't think it makes any sense for each of those to be represented as a distinct flag bit (and if you tried to do that you would risk running out of bits, if you really wanted to model a diversity of mating systems!), nor would it really be coherent to try to represent them with OR'd flags, such that a queen is (female | non-sterile | diploid | dominant) or some crazy scheme. An enumeration seems much more general and expandable to me. (If you're not convinced yet, contemplate systems with regular males versus sneaker males; systems where individuals change sex over the span of their life; systems where sex is determined by natal environment; systems where there is more than one type of hermaphrodite; and so forth. An enumeration just works way better for this than binary flags.)

If we didn't want to add the column, then we could conceivably carve out a sub-section of the flags column (bits 0 through 7, say) that is reserved for enumerated (i.e. non-flag) sex values; but that seems quite ugly to me. I think a new column is in order. Sorry, I should have been clearer about that before.
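For reference, the carved-out sub-field alternative would look roughly like this sketch, assuming bits 0 through 7 hold an enumerated sex value and the remaining bits stay ordinary flags. The mask name and both helpers are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mask: bits 0 through 7 hold an enumerated sex value. */
#define SEX_FIELD_MASK 0x000000FFu

/* Store an enumerated sex value without disturbing the other flag bits. */
uint32_t set_sex_value(uint32_t flags, uint32_t sex)
{
    assert(sex <= SEX_FIELD_MASK); /* value must fit in eight bits */
    return (flags & ~SEX_FIELD_MASK) | sex;
}

/* Read the enumerated sex value back out of the flags word. */
uint32_t get_sex_value(uint32_t flags)
{
    return flags & SEX_FIELD_MASK;
}
```

This allows 256 enumerated values while leaving 24 bits for true flags, at the cost of mixing two encodings in one column.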

petrelharp commented 6 years ago

> I'm advocating that "sex" be a separate enumeration column

Oh, I see. I think that if we don't use flags for sex, then why do we even have flags? I am also by this point pretty averse to more adding/removing of columns (it's pretty annoying, and I thought it was settled).

> you would risk running out of bits

We have 32 bits, and we're only proposing to use a few of these. It's hard for me to imagine a situation where you'd want anything approaching 20 different flags. (Some systems have this many mating types, but these are encoded in loci, so really this should be genotype-dependent mating...) We are not proposing to write down an enumeration list of all possible assignments all at once; so the number of flags has to be sufficient for any one situation, not all possible situations. That's why I want to stick with just MALE and FEMALE (and HERMAPHRODITE, why not).

> An enumeration seems much more general and expandable to me

True, if you're talking about one characteristic; but what we want to use flags for is a single column that we can pack up to thirty-two binary characteristics into all at once, to allow for future flexibility. I'd say that if you want to do some enumeration scheme, stick it in metadata.

bhaller commented 6 years ago

OK, the timing of this is not great, agreed; I wish I had thought to bring this up prior to the big individual table overhaul. I think the right design would entail a separate column for sex / mating type; if you have a concept – sex / mating-type – that isn't well-represented by binary flags, and that has a large number of possible values across biology, then it just seems unwise to put that concept into a binary flags field. But given that we're trying to freeze things, OK. I'll concede.

So, if we accept that mating-system / sex information beyond male/female/hermaphrodite will be client-dependent and go into metadata, not into flags, then I would propose using two flag bits for sex in tskit, and defining:

#define MSP_INDIVIDUAL_SEX_UNDEFINED 0
#define MSP_INDIVIDUAL_SEX_FEMALE 1
#define MSP_INDIVIDUAL_SEX_MALE 2
#define MSP_INDIVIDUAL_SEX_HERMAPHRODITE 3

Which is what we have now, plus the hermaphrodite value and an undefined value that would be used in all cases when sex / mating-type is either unknown or defined in metadata.

I would also propose defining a mask, in tskit, that sets out the bits that are reserved by tskit for future use (the rest of the bits then being open to use by clients of tskit); something like:

#define MSP_INDIVIDUAL_RESERVED_BITS 0x0000FFFF

Something similar for the flags field in nodes would also be a good idea, I think. Sound good?
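A sketch of how a client could respect that mask, assuming the definitions proposed in this comment. The helpers `client_flag_is_allowed`, `add_client_flag` and `sex_from_flags`, and the name `SEX_MASK`, are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Definitions as proposed in this thread (not an existing API). */
#define MSP_INDIVIDUAL_SEX_UNDEFINED 0
#define MSP_INDIVIDUAL_SEX_FEMALE 1
#define MSP_INDIVIDUAL_SEX_MALE 2
#define MSP_INDIVIDUAL_SEX_HERMAPHRODITE 3
#define MSP_INDIVIDUAL_RESERVED_BITS 0x0000FFFFu
/* Hypothetical name for the two low bits that hold the sex value. */
#define SEX_MASK 0x3u

/* A client flag is usable only if it avoids every reserved bit. */
int client_flag_is_allowed(uint32_t client_flag)
{
    return client_flag != 0 && (client_flag & MSP_INDIVIDUAL_RESERVED_BITS) == 0;
}

/* Set a client-defined flag, refusing ones that touch reserved bits. */
uint32_t add_client_flag(uint32_t flags, uint32_t client_flag)
{
    assert(client_flag_is_allowed(client_flag));
    return flags | client_flag;
}

/* Extract the two-bit sex value from a flags word. */
uint32_t sex_from_flags(uint32_t flags)
{
    return flags & SEX_MASK;
}
```

Under this mask the low 16 bits belong to tskit and the high 16 bits are free for clients, so `1u << 16` is allowed while `1u << 3` is not.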

petrelharp commented 6 years ago

It appears that our two distinct proposals have converged. =)

I agree, except to remove the UNDEFINED one.

#define MSP_INDIVIDUAL_SEX_FEMALE 1
#define MSP_INDIVIDUAL_SEX_MALE 2
#define MSP_INDIVIDUAL_SEX_HERMAPHRODITE 3
#define MSP_INDIVIDUAL_RESERVED_BITS 0x0000FFFF

jeromekelleher commented 6 years ago

We haven't actually merged the individual table stuff yet, so if an enum really is the right way to model this, then now is the time to change it.

Before we go any further though, let's be clear on why we're standardising stuff in the first place: so that client code can communicate important information to tskit. Anything that doesn't need to be communicated to tskit can be put into metadata. Why would tskit need to know about stuff? So that it can, e.g., split things into males and females when computing statistics. Unless there are statistics that depend critically on idiosyncratic mating systems and so on, tskit doesn't need to know about it. And, the more I think about it, the less convinced I am that tskit needs to know about males and females either: stats will be most generally defined using numpy arrays of individual or node IDs, and this provides a completely general way of splitting your samples whatever way you want.
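The same idea can be sketched in C with plain arrays of node IDs standing in for the numpy arrays. The statistic here (a mean of a per-node value) and the function name are placeholders chosen for illustration, not tskit functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder statistic: the mean of a per-node value over one sample set.
 * The sample set is just an array of node IDs; how the caller built it
 * (by sex, by population, from metadata) is invisible to this function. */
double mean_over_sample_set(const double *node_value, size_t num_nodes,
    const int32_t *sample_set, size_t set_size)
{
    size_t j;
    double total = 0.0;

    assert(node_value != NULL && sample_set != NULL && set_size > 0);
    for (j = 0; j < set_size; j++) {
        assert(sample_set[j] >= 0 && (size_t) sample_set[j] < num_nodes);
        total += node_value[sample_set[j]];
    }
    return total / (double) set_size;
}
```

A caller that wants a male/female split would decode sex from its own metadata, build two ID arrays, and call the function once per array; the library never needs a sex flag.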

So, my vote would be to forget about defining these flags altogether for now until we have a use for them. We can keep the flags column since it seems like a handy place to store data in the future that tskit might need to know about, but in general client code should be using metadata to store information that users need to know (as opposed to tskit).

petrelharp commented 6 years ago

> my vote would be to forget about defining these flags altogether for now until we have a use for them.

This makes sense. The motivation for defining them here was to standardize across applications, because it's predictable that many applications will have a need for recording sex. Nonetheless, I support this proposal.

molpopgen commented 6 years ago

> my vote would be to forget about defining these flags altogether for now until we have a use for them.

Agreed. Once you're interacting with the data in msprime, it is all about sets of nodes, and the metadata seems like a good way of letting clients specify anything they want.

jeromekelleher commented 6 years ago

@bhaller, what's your take here? Flags strictly for internal use in tskit OK with you?

bhaller commented 6 years ago

If tskit has no interest in the sex of individuals (which surprises me, but OK), then it makes sense for it to be metadata, yes. And if we want to say that all non-tskit-defined flags should go into metadata as well, that is also OK with me.

jeromekelleher commented 6 years ago

OK, well, let's close this for now then. If we do start caring about individual properties within tskit, then we can reopen and take it up again.