Add pedigree data to individual table.

jeromekelleher commented 4 years ago

Working on the (not ready for use) pedigree simulation in msprime, I realised that adding pedigree data into the tskit data model would be pretty straightforward and may be of more general use as we get into more detailed datasets. Sometimes we do know who the mother and father of individuals are, and we should be able to record this in the data model. I think the minimum we need to do this is to introduce a new ragged parents column, which contains the ID(s) of the individual's parents. Maybe we should also add a sex column, but that's not so clear, as we might just as well encode that in the flags column. Unless we actually need it, perhaps we're better off leaving that can of worms for now.

I don't think there are any backwards compatibility issues with adding this new column - for old tree sequences, we just regard the parents list as empty for all individuals.
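To make the "ragged column" idea concrete, here is a minimal pure-Python sketch of how a per-individual list of parent IDs could be stored as a flattened array plus an offset array, which is the general layout the tskit data model documentation describes for ragged columns. The helper names (`pack_ragged`, `unpack_row`, `empty_column`) are hypothetical and are not tskit functions; the last helper illustrates the backwards-compatibility point that an old table simply has an empty parent list for every individual.

```python
# Hypothetical helpers illustrating a ragged "parents" column.
# These are NOT tskit API calls; they only model the storage layout.


def pack_ragged(rows):
    """Flatten per-individual parent lists into (flat, offset) arrays."""
    flat, offset = [], [0]
    for row in rows:
        flat.extend(row)
        offset.append(len(flat))
    return flat, offset


def unpack_row(flat, offset, j):
    """Return the parent IDs stored for individual j."""
    return flat[offset[j]:offset[j + 1]]


def empty_column(num_individuals):
    """Column for an old table: every individual has an empty parent list."""
    return [], [0] * (num_individuals + 1)


# Individuals 0 and 1 have no recorded parents; 2 and 3 are their
# children; 4 is the child of 2 and 3.
parents_by_individual = [[], [], [0, 1], [0, 1], [2, 3]]
flat, offset = pack_ragged(parents_by_individual)
```

Because each row is delimited only by consecutive offsets, individuals with zero, one, two or more parents can be mixed freely in the same column.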

Should this be data or metadata? I think it's reasonable to assume that we'll have methods in tskit that combine (or at least check the consistency of) information from the pedigree and trees, so it seems like it should be something that we have within tskit's data model.

Beyond enabling msprime's pedigree simulation, I can see this being useful for other types of simulation where perhaps we want to record the individual relationships as well as the genetic.

Pinging @petrelharp @bhaller @molpopgen @DomNelson @ivan-krukov for thoughts/opinions.

gregorgorjanc commented 4 years ago

Individual relationships (pedigree) are one of the basic concepts in genetics so I definitely vote for this!

hyanwong commented 4 years ago

Seems like a nice idea, but might require a good deal of thought. In particular, how does this work in non-diploid species? Or weirdo ones like ferns or haplodiploids.

hyanwong commented 4 years ago

I agree about parent being data not metadata. I also agree that the sex of the individual is a useful thing, and could probably be encoded as a flag. However, for hermaphrodites, this may not be enough to identify which is mother and which is father.

gregorgorjanc commented 4 years ago

Seems like a nice idea, but might require a good deal of thought. In particular, how does this work in non-diploid species? Or weirdo ones like ferns or haplodiploids.

Not sure about ferns. In haplodiploids, males have a haploid genome that is recombined from mum's genomes (a honey bee drone is effectively a flying sperm!). To handle such cases we would potentially need ploidy info down the line (data or metadata?). Not sure we are there yet. Poly-ploids would follow diploids, I think.

hyanwong commented 4 years ago

Seems like a nice idea, but might require a good deal of thought. In particular, how does this work in non-diploid species? Or weirdo ones like ferns or haplodiploids.

Not sure about ferns. In haplodiploids, males have a haploid genome that is recombined from mum's genomes (a honey bee drone is effectively a flying sperm!). To handle such cases we would potentially need ploidy info down the line (data or metadata?). Not sure we are there yet. Poly-ploids would follow diploids, I think.

Ploidies are present because you can simply count the number of nodes associated with an individual (== ploidy). For polyploids I imagine that most of the time there is a single mother and father, and you get 2 essentially identical chromosomes from each. I can't think of any cases where you have more than 2 parents combining to make some form of polyploid, although that would, in fact, be allowed in Jerome's scheme, as the parent column can contain any number of parents. I assume normally it would contain just 1 or 2 individuals. I don't know if it is worth having an "unknown" marker (presumably tskit.NULL == -1) so we can distinguish between an individual that only has a single parent (e.g. the individual is a clone, or perhaps a fern gametophyte), and an individual that has 2 parents, of which only one is known (or present in the tree sequence).

The case of no parents is special, I guess, and simply means that none are known (or none present in the tree sequence). It is presumably impossible for an individual to actually have no parents.
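As a rough illustration of "count the number of nodes associated with an individual", the sketch below derives a ploidy per individual from a plain list that mimics the node table's individual column, where -1 means the node belongs to no individual. The function name `ploidy_per_individual` is hypothetical, and the input is an ordinary Python list rather than a real tskit table.

```python
from collections import Counter

TSK_NULL = -1  # conventional "no ID" marker


def ploidy_per_individual(node_individual, num_individuals):
    """Count how many nodes reference each individual (assumed == ploidy)."""
    counts = Counter(i for i in node_individual if i != TSK_NULL)
    return [counts.get(j, 0) for j in range(num_individuals)]


# Six nodes: individual 0 is diploid, 1 is haploid, 2 is diploid,
# and the last node is not attached to any individual.
node_individual = [0, 0, 1, 2, 2, TSK_NULL]
ploidies = ploidy_per_individual(node_individual, 3)
```

This only reflects the nodes actually present in the tables, so an individual whose nodes were removed (for example by simplification) would be reported with a lower count than its biological ploidy.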

gregorgorjanc commented 4 years ago

I agree about parent being data not metadata. I also agree that the sex of the individual is a useful thing, and could probably be encoded as a flag. However, for hermaphrodites, this may not be enough to identify which is mother and which is father.

This is why both pedigree and genetic info come in handy. Note also that there are different sex mechanisms out there, but storing gender can be detached from this for quite some time.

jeromekelleher commented 4 years ago

Arbitrary ploidies are fine: the proposed column parents is ragged (just like location: https://tskit.readthedocs.io/en/latest/data-model.html#individual-table).

So, you can have any number of parents on each individual and mix numbers of parents arbitrarily within the table.

petrelharp commented 4 years ago

It's a nice idea! And, it'd make the significant pain I went through to write a method that figured that out from the nodes in pyslim obsolete. If it would really help along the pedigree sims (which it seems like it would?) then I say go for it.

jeromekelleher commented 4 years ago

Would it be useful/used within SLiM do you think @petrelharp? One can imagine keeping track of all of this during a forward simulation, but then we'd have to update simplify to weed out the unused individuals (I think?). We probably wouldn't ever want to keep the full pedigree of a forward sim, I'd imagine.

molpopgen commented 4 years ago

This is stuff I already track as part of individual metadata. It'd be easy to put it into extra columns, too.

petrelharp commented 4 years ago

We probably wouldn't ever want to keep the full pedigree of a forward sim, I'd imagine.

Oh, I do this all the time. Although it's not a common use case.

Would it be useful/used within SLiM do you think @petrelharp?

Yes - there's a "record pedigrees" option in SLiM already. Edit: this is information that SLiM, optionally, already keeps track of, and so if people are wanting to keep track of it, presumably they are also interested to write it out. However, it only makes sense when multiple, subsequent generations are remembered in the tree sequence, as you say.

petrelharp commented 4 years ago

Statistical methods that make use of trio data would like this information somehow.

jeromekelleher commented 4 years ago

OK, this seems to be getting an enthusiastic response so I think it's a good thing to line up for the next set of releases. Thanks for the input!

gregorgorjanc commented 4 years ago

One can imagine keeping track of all of this during a forward simulation, but then we'd have to update simplify to weed out the unused individuals (I think?). We probably wouldn't ever want to keep the full pedigree of a forward sim, I'd imagine.

If we would attach some phenotypes to these dangling nodes then they can contribute information by back-propagation to their ancestor(s) that have other non-dangling descendants

bhaller commented 4 years ago

We probably wouldn't ever want to keep the full pedigree of a forward sim, I'd imagine.

Oh, I do this all the time. Although it's not a common use case.

Would it be useful/used within SLiM do you think @petrelharp?

Yes - there's a "record pedigrees" option in SLiM already. Edit: this is information that SLiM, optionally, already keeps track of, and so if people are wanting to keep track of it, presumably they are also interested to write it out. However, it only makes sense when multiple, subsequent generations are remembered in the tree sequence, as you say.

I'm late to this discussion, sorry. :-> Pedigree recording is always on in SLiM now, as of version 3.5 that will soon be released; before that it was turned on with a flag. SLiM just keeps the unique IDs of each individual's parents and grandparents in the Individual object, and makes them accessible to the user in their script. We have examples in the manual of using this to write out a full pedigree of every individual to a file, and then force that same pedigree to be followed, with arranged matings, in a subsequent run. Definitely a useful feature, for a minority of users. Makes sense to me to include it in the tree sequence as optional data.

benjeffery commented 4 years ago

If we have concrete plans for methods within tskit that use pedigree info then it should be added into the data model. What would be suitable for a first motivating example that we could add to the API?

gregorgorjanc commented 4 years ago

Seems like a nice idea, but might require a good deal of thought. In particular, how does this work in non-diploid species? Or weirdo ones like ferns or haplodiploids.

Not sure about ferns. In haplodiploids, males have a haploid genome that is recombined from mum's genomes (a honey bee drone is effectively a flying sperm!). To handle such cases we would potentially need ploidy info down the line (data or metadata?). Not sure we are there yet. Poly-ploids would follow diploids, I think.

Ploidies are present because you can simply count the number of nodes associated with an individual (== ploidy). For polyploids I imagine that most of the time there is a single mother and father, and you get 2 essentially identical chromosomes from each. I can't think of any cases where you have more than 2 parents combining to make some form of polyploid, although that would, in fact, be allowed in Jerome's scheme, as the parent column can contain any number of parents. I assume normally it would contain just 1 or 2 individuals. I don't know if it is worth having an "unknown" marker (presumably tskit.NULL == -1) so we can distinguish between an individual that only has a single parent (e.g. the individual is a clone, or perhaps a fern gametophyte), and an individual that has 2 parents, of which only one is known (or present in the tree sequence).

For polyploids I imagine that most of the time there is a single mother and father, and you get 2 essentially identical chromosomes from each.

Yes, they have two parents, but progeny get recombined, not identical, parent chromosomes.

jeromekelleher commented 4 years ago

If we have concrete plans for methods within tskit that use pedigree info then it should be added into the data model. What would be suitable for a first motivating example that we could add to the API?

Here's something that would be useful: given a pedigree described by the individual table, give me the "top and bottom" nodes (i.e., the samples and the founders). This should be enough to get us started on the pedigree APIs. (We'd probably want a "pedigree_t" class that was derived from a tree sequence, I guess.)

ivan-krukov commented 4 years ago

Trying to add an implementation of this. Going with ragged arrays for now, not relying on ploidy being specified.

Missing parents: I think those should be represented as TSK_NULL. Using empty parent arrays is possibly space-saving here, but will probably be confusing later (if we want to model clones?).
Should the parent entries be represented as tsk_id_t internally? I don't think there are other examples that have ragged arrays refer to internal IDs.

API:

Founder individuals - individuals without parents
Sample (proband) individuals - all individuals without children. This will require searching through the parents arrays for something like set(individuals) - set(parents).
Validation. Since we are dealing with a directed network, this opens a whole can of worms.
- An individual can not have the same parent twice
- Cannot have self as parent
- Cannot have loops (!)

This last point is potentially time-consuming. I think that moving that to a .validate() method would be warranted here.
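The founder and proband queries listed above can be sketched in a few lines of plain Python. This is only an illustration on a list-of-lists pedigree, not the implementation on the branch: the function names are hypothetical, and treating an individual whose parent entries are all -1 as a founder is an assumption made here for the example.

```python
TSK_NULL = -1  # marker for a missing/unknown parent


def founders(parents):
    """Individuals with no known parents (empty list or all TSK_NULL)."""
    return [i for i, ps in enumerate(parents) if all(p == TSK_NULL for p in ps)]


def probands(parents):
    """Individuals that never appear as anyone's parent."""
    all_parents = {p for ps in parents for p in ps if p != TSK_NULL}
    return sorted(set(range(len(parents))) - all_parents)


# parents[i] lists the parent IDs of individual i.
parents = [
    [TSK_NULL, TSK_NULL],  # 0: founder with two unknown parents
    [TSK_NULL, TSK_NULL],  # 1: founder with two unknown parents
    [0, 1],                # 2
    [0, 1],                # 3
    [2, 3],                # 4: has no children
]
founder_ids = founders(parents)
proband_ids = probands(parents)
```

The proband query is the `set(individuals) - set(parents)` idea from the list, and both functions run in time linear in the total number of parent entries.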

jeromekelleher commented 4 years ago

Sounds great, thanks for picking this up @ivan-krukov!

Missing parents: I think those should be represented as TSK_NULL.

Agreed. An empty parent array means zero parents, and an array [-1, -1] means two missing parents. (-1 == TSK_NULL)
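A small sketch of how that convention separates the cases: the hypothetical helper `count_parents` below returns the number of known and missing parents for one individual, so "one parent" (a clone, say) and "two parents, one unknown" are no longer confused.

```python
TSK_NULL = -1  # missing parent marker, as in the convention above


def count_parents(parent_ids):
    """Return (known, missing) parent counts for one individual."""
    known = sum(1 for p in parent_ids if p != TSK_NULL)
    missing = len(parent_ids) - known
    return known, missing


no_parents = count_parents([])                        # zero parents recorded
two_missing = count_parents([TSK_NULL, TSK_NULL])     # two parents, both unknown
one_of_two_known = count_parents([5, TSK_NULL])       # two parents, one unknown
single_parent = count_parents([5])                    # exactly one parent
```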

Should the parent entries be represented as tsk_id_t internally?

Yes.

For the API, I think just getting the founders would be a great place to start. You're right that validation should have its own function. We can think about what that might look like once we have the basic infrastructure of the actual parents column in there. I think we should probably get the simple stuff done first, and then register some issues to follow up with as it becomes clearer what we want the tskit pedigree API to look like.

gregorgorjanc commented 4 years ago

An individual can not have the same parent twice

Possible in some species (plants, some animals = anything that can self), but a sensible default in most cases. Best to allow for special cases with an argument.

gregorgorjanc commented 4 years ago

Cannot have loops (!)

I think some topological sort implementations can detect loops.
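For instance, Kahn's topological sort detects loops as a side effect: if some individuals can never be output because they are still waiting on a parent, the pedigree contains a cycle. The sketch below applies it to a plain list-of-lists pedigree; the function name `has_loop` and the input layout are assumptions for illustration, not part of tskit.

```python
from collections import deque

TSK_NULL = -1  # missing parent marker


def has_loop(parents):
    """Return True if the parent/child graph contains a cycle (Kahn's algorithm)."""
    n = len(parents)
    children = [[] for _ in range(n)]
    pending = [0] * n  # known parent entries not yet processed, per individual
    for child, ps in enumerate(parents):
        for p in ps:
            if p != TSK_NULL:
                children[p].append(child)
                pending[child] += 1

    # Start from individuals with no known parents and work downwards.
    queue = deque(i for i in range(n) if pending[i] == 0)
    visited = 0
    while queue:
        u = queue.popleft()
        visited += 1
        for c in children[u]:
            pending[c] -= 1
            if pending[c] == 0:
                queue.append(c)

    # Anyone never visited is on, or downstream of, a cycle.
    return visited < n
```

Note that a selfed individual such as `[2, 2]` is not reported as a loop, since the same parent listed twice is still an acyclic relationship, whereas an individual listed as its own parent is.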

petrelharp commented 4 years ago

An individual can not have the same parent twice

Possible in some species (plants, some animals = anything that can self), but a sensible default in most cases. Best to allow for special cases with an argument.

A large proportion of plants can self: I would argue against calling plants a "special case", and don't think we should have this as part of a standard requirement or validation check.

Cannot have loops

This is taken care of for nodes anyhow by the requirement that a parent node must have greater time than the child. But, I guess I'm not sure what the scope is here: are you working on something that works with only individuals? Like, pedigrees of individuals with no associated nodes? I guess that's what we'd want to be able to simulate within a pedigree.... but to do that, you'd need times associated with the individuals, to be able to put times on the nodes? One option is to say that individuals get their time from their nodes, and for methods that need individual times but don't already have nodes associated with individuals, the times are passed in as an extra argument. I think I like this option best... with the caveat that we really should have an easy way to get the time attribute of an individual.

hyanwong commented 4 years ago

Did we not have a related discussion where people didn't want an individual to have a time, but wanted to allow the 2 nodes associated with an individual to potentially have different times? Or at least, not to enforce this check? I can't find the thread now though.

gregorgorjanc commented 4 years ago

Did we not have a related discussion where people didn't want an individual to have a time, but wanted to allow the 2 nodes associated with an individual to potentially have different times? Or at least, not to enforce this check? I can't find the thread now though.

Don't the nodes of an individual together constitute that individual, and hence shouldn't they all have the same time? Or should there be room for different times between gametes and a zygote?

petrelharp commented 4 years ago

Did we not have a related discussion where people didn't want an individual to have a time, but wanted to allow the 2 nodes associated with an individual to potentially have different times? Or at least, not to enforce this check?

oh! yes. yes we did. the reason being that maybe the gametes would have associated with them the time they were produced, i think, and so are not necessarily the same.

jeromekelleher commented 4 years ago

I'm imagining having a pedigree fully specified via the node and individual tables, where an individual's time and population are specified through its node(s). Yes, there is some awkwardness and redundancy there, I agree, but it'll work. We will probably enforce equal-times-of-nodes-for-pedigree-individual at some layer, but how we do that is TBD. We'll add some functions to make it easy to get an individual's time/population too.
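A minimal sketch of what "get an individual's time from its nodes" might look like, with the equal-times check applied at the same layer. It works on plain lists that mimic the node table's time and individual columns; the function name `individual_times` and the choice to raise `ValueError` on disagreement are assumptions, since how the check is enforced is still TBD.

```python
TSK_NULL = -1  # node not attached to any individual


def individual_times(node_time, node_individual, num_individuals):
    """Derive each individual's time from its nodes.

    Returns None for individuals with no nodes, and raises ValueError
    if two nodes of the same individual disagree on time.
    """
    times = [None] * num_individuals
    for t, ind in zip(node_time, node_individual):
        if ind == TSK_NULL:
            continue
        if times[ind] is None:
            times[ind] = t
        elif times[ind] != t:
            raise ValueError(f"nodes of individual {ind} have different times")
    return times


# Nodes 0 and 1 belong to individual 2, node 2 to individual 0,
# node 3 to individual 1, and node 4 to no individual.
node_time = [0.0, 0.0, 1.0, 1.0, 2.0]
node_individual = [2, 2, 0, 1, TSK_NULL]
times = individual_times(node_time, node_individual, 3)
```

The same pattern would work for population, and a caller that wants to permit gametes with differing times could skip the check rather than raise.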

Possibly what we'll end up doing is having a tsk_pedigree_t class which derives its state from a tsk_table_collection_t. This class will then know about the pedigree individuals and will have all the methods for checking for loops, finding the founders, etc etc.

ivan-krukov commented 4 years ago

Update: I've implemented the first step of adding the extra column on this branch. This only adds the parent column; the time is, as discussed, attached to the nodes.

Now come the tests. This is still some way from being mergeable.

jeromekelleher commented 4 years ago

Excellent! It'd be good to open a PR marked as "draft" if you don't mind @ivan-krukov - it just makes it a bit easier to see how things look. We won't review until you either ping us explicitly or mark it as "ready for review".

tskit-dev / tskit

Add pedigree data to individual table. #852