tskit-dev / tskit

Population-scale genomics
MIT License
151 stars 70 forks source link

Multiple chromosomes #176

Open jeromekelleher opened 5 years ago

jeromekelleher commented 5 years ago

@DomNelson and I were just discussing the possibility of adding support for multiple chromosomes to tskit. One possibility which seems like it might be a reasonably smooth path forward is the following:

  1. Add a chromosome table with an ID, name, length and metadata.
  2. Add a chromosome column to the edge table. Within a chromosome, coordinates must be from 0 to the chromosome's length.
  3. Either deprecate the sequence_length property (probably best), or make it equal to the sum of the length of all chromosomes.

For things like trees() we could add an optional chromosome argument. If the tree sequence contains multiple chromosomes which would raise an error if it's not specified. For tree sequences with a single chromosome, things would continue to work as now.

Any thoughts @petrelharp, @hyanwong, @bhaller, @molpopgen?

molpopgen commented 5 years ago

I have implemented discrete chromosomes in fwdpy11. Length refers to the total genome length, and the recombination functions work to ensure correct recording of transmitted edges. In practice, keeping the individual chromosome lengths as a global variable in a python script has been sufficient for any downstream analysis. I'm in favor of not adding more tables, although I do see the utility for considering "real" data.

On Tue, Apr 16, 2019, 1:21 PM Jerome Kelleher notifications@github.com wrote:

@DomNelson https://github.com/DomNelson and I were just discussing the possibility of adding support for multiple chromosomes to tskit. One possibility which seems like it might be a reasonably smooth path forward is the following:

  1. Add a chromosome table with an ID, name, length and metadata.
  2. Add a chromosome column to the edge table. Within a chromosome, coordinates must be from 0 to the chromosome's length.
  3. Either deprecate the sequence_length property (probably best), or make it equal to the sum of the length of all chromosomes.

For things like trees() we could add an optional chromosome argument. If the tree sequence contains multiple chromosomes which would raise an error if it's not specified. For tree sequences with a single chromosome, things would continue to work as now.

Any thoughts @petrelharp https://github.com/petrelharp, @hyanwong https://github.com/hyanwong, @bhaller https://github.com/bhaller, @molpopgen https://github.com/molpopgen?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/tskit-dev/tskit/issues/176, or mute the thread https://github.com/notifications/unsubscribe-auth/AGHnHyBN448cJH8LuU53ehq2rUFjU1-yks5vhjCqgaJpZM4czoSO .

molpopgen commented 5 years ago

One question: if you add chromosome labels, and edges on chromosome i are in [0,i-1), then do you plan to simplify separately by chromosome or change how edges are sorted to include the chromosome label?

I've let edges cross "chromosome" boundaries, which occurs at 1 minus the crossover rates between chromosomes. The nice feature of this approach is that none of the sorting and simplifying code needed touching.

On Tue, Apr 16, 2019, 1:21 PM Jerome Kelleher notifications@github.com wrote:

@DomNelson https://github.com/DomNelson and I were just discussing the possibility of adding support for multiple chromosomes to tskit. One possibility which seems like it might be a reasonably smooth path forward is the following:

  1. Add a chromosome table with an ID, name, length and metadata.
  2. Add a chromosome column to the edge table. Within a chromosome, coordinates must be from 0 to the chromosome's length.
  3. Either deprecate the sequence_length property (probably best), or make it equal to the sum of the length of all chromosomes.

For things like trees() we could add an optional chromosome argument. If the tree sequence contains multiple chromosomes which would raise an error if it's not specified. For tree sequences with a single chromosome, things would continue to work as now.

Any thoughts @petrelharp https://github.com/petrelharp, @hyanwong https://github.com/hyanwong, @bhaller https://github.com/bhaller, @molpopgen https://github.com/molpopgen?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/tskit-dev/tskit/issues/176, or mute the thread https://github.com/notifications/unsubscribe-auth/AGHnHyBN448cJH8LuU53ehq2rUFjU1-yks5vhjCqgaJpZM4czoSO .

bhaller commented 5 years ago

SLiM allows simulation of multiple chromosomes, in practice, by letting the user define a recombination map with a recombination rate of 0.5 between specific pairs of bases that constitute the end of one chromosome and the start of the next. With this method you can simulate as many chromosomes as you want and it works quite nicely, but it's not terribly aesthetically pleasing. We may put a layer on top of this machinery that hides this implementation detail from the user and instead lets them formally define multiple chromosomes; but I imagine it will keep working the same way under the hood, more or less. So the upshot is: like fwdpy, we're already doing this and haven't seen any particular need for a new chromosome table, etc. But if those features were to appear in tskit, we could certainly adapt to using them.

My main question is: would this just be essentially semantic information, or would some real, substantial benefit accrue as a result? If it's just semantics, then so far SLiM users don't seem to care, and I would lean toward "why complicate our lives further?" But if there were a substantial benefit, then that's another matter. Apropos of @molpopgen's question about simplification, for example: if knowing about chromosome structure would allow a 5-chromosome model to simplify much faster, or with a much lower peak memory usage, than a 1-chromosome model of the same total genome length, that would suddenly look very interesting indeed, I think.

molpopgen commented 5 years ago

I'd probably populate the tables when "dumping" to tskit's format. But I'm practice, we keep everything in the fwdpp format as the metadata are much smaller.

I may be thinking sloppily, but I'm writing from a courthouse awaiting a jury call, but simplifying separately may "get weird", unless you track separate node tables per chromosome? If you fully simplify chromosome zero, and then chromosome 1 has a node specific to it and more recent than most ancient node on chr0, you have a book keeping violation. If you build the "global" node table to avoid this, then you've basically run the entire algorithm, right?

On Tue, Apr 16, 2019, 1:56 PM Ben Haller notifications@github.com wrote:

SLiM allows simulation of multiple chromosomes, in practice, by letting the user define a recombination map with a recombination rate of 0.5 between specific pairs of bases that constitute the end of one chromosome and the start of the next. With this method you can simulate as many chromosomes as you want and it works quite nicely, but it's not terribly aesthetically pleasing. We may put a layer on top of this machinery that hides this implementation detail from the user and instead lets them formally define multiple chromosomes; but I imagine it will keep working the same way under the hood, more or less. So the upshot is: like fwdpy, we're already doing this and haven't seen any particular need for a new chromosome table, etc. But if those features were to appear in tskit, we could certainly adapt to using them.

My main question is: would this just be essentially semantic information, or would some real, substantial benefit accrue as a result? If it's just semantics, then so far SLiM users don't seem to care, and I would lean toward "why complicate our lives further?" But if there were a substantial benefit, then that's another matter. Apropos of @molpopgen https://github.com/molpopgen's question about simplification, for example: if knowing about chromosome structure would allow a 5-chromosome model to simplify much faster, or with a much lower peak memory usage, than a 1-chromosome model of the same total genome length, that would suddenly look very interesting indeed, I think.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/tskit-dev/tskit/issues/176#issuecomment-483839707, or mute the thread https://github.com/notifications/unsubscribe-auth/AGHnH94UubwWDbz9_XM_Km-PzCQtDUiMks5vhjjCgaJpZM4czoSO .

hyanwong commented 5 years ago

A trivial comment, but as per https://github.com/tskit-dev/tskit/issues/29 I would like to deprecate the word "length" to refer to genome length, and replace it with "span", or something less easily confused with branch lengths. So I would vote for deprecating sequence_length (and if you go down the route of a new chromosome table, I would have ID, name, span and metadata as the columns).

I like the original idea proposed by @jeromekelleher, and don't see it as problematic to create a new table, after all, if @molpopgen recommends...

keeping the individual chromosome lengths as a global variable in a python script

...then that's essentially the same as a small table, but the table idea has the advantage of keeping names and other metadata associated with the chromosomes. The burden would be in the extra column in the edge table.

Also, there is a major evolutionary distinction between the autosomes and the X chromosome, which makes me think that simply keeping track of boundaries along a continuous span of all chromosomes concatenated together might not be a very good idea. As far as I know, all organisms (including bacteria) have genomes organised into chromosomes, so the structure would be pretty general - although it might be worth a flag in the chromosomes table to keep track of whether the chromosome in question is circular.

Much down the line, I wonder if it would ever be possible for sites to move between chromosomes via translocation (but this is a graph-genomes question, and way off-topic)

molpopgen commented 5 years ago

I like the original idea proposed by @jeromekelleher, and don't see it as problematic to create a new table, after all, if @molpopgen recommends...

keeping the individual chromosome lengths as a global variable in a python script

Kind of -- it is really just the genetic map. It is needed to parameterize the forward simulation anyways.

...then that's essentially the same as a small table, but the table idea has the advantage of keeping names and other metadata associated with the chromosomes. The burden would be in the extra column in the edge table.

The burden is that the existing algorithm breaks. If an individual inherits the following two edges, which are (chrom, left, right) tuples, and L_i is the length of chromosome i:

0 L_0 - x L_0 1 0 y,

then the edges are "sorted", but not in a way that is supported by the current code. Further, the notion of scanning with respect to the "current left position" breaks, and must be replaced by the tuple (chrom, left), with comparison operators written for nested ordering ,etc..

So, if chromosome tables are going to be optional, then you end up with two code paths for sorting and simplification. Alternately, you need to manually add a dummy chromosome id for the case where no chromosome table is present.

Also, there is a major evolutionary distinction between the autosomes and the X chromosome, which makes me think that simply keeping track of boundaries along a continuous span of all chromosomes concatenated together might not be a very good idea.

I've mocked this up in Python for sex chromosomes. It isn't that hard--there is some extra external book-keeping, but that's it. The bigger problem is downstream, running a "variant iterator" over an individual that is heterogametic.

As far as I know, all organisms (including bacteria) have genomes organised into chromosomes, so the structure would be pretty general - although it might be worth a flag in the chromosomes table to keep track of whether the chromosome in question is circular.

Circular chromosomes should be easy for forward sims--it is the responsibility of the simulator to generate proper edges when a GC event spans the "0" point of a circle.

Much down the line, I wonder if it would ever be possible for sites to move between chromosomes via translocation (but this is a graph-genomes question, and way off-topic)

jeromekelleher commented 5 years ago

One question: if you add chromosome labels, and edges on chromosome i are in [0,i-1), then do you plan to simplify separately by chromosome or change how edges are sorted to include the chromosome label?

I haven't thought this through in detail, but I imagine that we have another int32_t column on the edge table, which is -1 by default (== no chromosome), similar to the population and individual references in the Node table. We would then add this chromosome ID as the first key in the sorting order for edges, so that all edges for a given chromosome are adjacent. If there are no chromosomes, then nothing changes. The site table would also need an extra chromosome column.

In terms of simplification then, each chromosome would be simplified separately, but with respect to the same set of nodes. I'm not certain of this, but I think this should lead to some performance gains/memory reduction over having all the chromosomes in the same coordinate space.

I completely agree with you here @bhaller and @molpopgen that there's no point in doing this unless there's real gains in expressivity (i.e, being able to encode real biology that otherwise very awkward/impossible) or performance gains (e.g., being able to split up simplify). This came out of discussions with @DomNelson, who is interested in these whole-genome simulations, which are currently pretty ugly in msprime.

@hyanwong, can you think of examples in tsinfer where we'd want to be able to do this? For example, could we infer a recent ancestor across multiple chromosomes, in principle?

molpopgen commented 5 years ago

From a forward simulation perspective, an alternative that seems more natural is to keep an array of table collections, and simplify each independently. Further, I then know immediately how to parallelize the simplification of each chromosome, and it would be lock-free.

This doesn't solve the problem for a coalescent simulation, though.

On Wed, Apr 17, 2019, 7:57 AM Jerome Kelleher notifications@github.com wrote:

One question: if you add chromosome labels, and edges on chromosome i are in [0,i-1), then do you plan to simplify separately by chromosome or change how edges are sorted to include the chromosome label?

I haven't thought this through in detail, but I imagine that we have another int32_t column on the edge table, which is -1 by default (== no chromosome), similar to the population and individual references in the Node table. We would then add this chromosome ID as the first key in the sorting order for edges, so that all edges for a given chromosome are adjacent. If there are no chromosomes, then nothing changes. The site table would also need an extra chromosome column.

In terms of simplification then, each chromosome would be simplified separately, but with respect to the same set of nodes. I'm not certain of this, but I think this should lead to some performance gains/memory reduction over having all the chromosomes in the same coordinate space.

I completely agree with you here @bhaller https://github.com/bhaller and @molpopgen https://github.com/molpopgen that there's no point in doing this unless there's real gains in expressivity (i.e, being able to encode real biology that otherwise very awkward/impossible) or performance gains (e.g., being able to split up simplify). This came out of discussions with @DomNelson https://github.com/DomNelson, who is interested in these whole-genome simulations, which are currently pretty ugly in msprime.

@hyanwong https://github.com/hyanwong, can you think of examples in tsinfer where we'd want to be able to do this? For example, could we infer a recent ancestor across multiple chromosomes, in principle?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/tskit-dev/tskit/issues/176#issuecomment-484125979, or mute the thread https://github.com/notifications/unsubscribe-auth/AGHnHx0zzrSIAaI6j2IssKAPYuognQPJks5vhzYkgaJpZM4czoSO .

hyanwong commented 5 years ago

@hyanwong, can you think of examples in tsinfer where we'd want to be able to do this? For example, could we infer a recent ancestor across multiple chromosomes, in principle?

Certainly I think that we want to be able to restrict recent events such that multiple discontiguous regions (on different chromosomes) pass through a single individual ancestor. Essentially this would appear as "trapped material" if we treat all chromosomes as one long concatenated sequence. But I don't see how this is solved by creating a chromosomes table?

jeromekelleher commented 5 years ago

From a forward simulation perspective, an alternative that seems more natural is to keep an array of table collections, and simplify each independently. Further, I then know immediately how to parallelize the simplification of each chromosome, and it would be lock-free.

It's definitely attractive (and where we started when talking about this), but we'd have a lot of duplication for, say, Individuals and it would get tricky to keep the different table collection in sync. Since the whole point is to allow us to track where different chromosomes go through the same individual, I think it would cause problems...

jeromekelleher commented 5 years ago

Certainly I think that we want to be able to restrict recent events such that multiple discontiguous regions (on different chromosomes) pass through a single individual ancestor. Essentially this would appear as "trapped material" if we treat all chromosomes as one long concatenated sequence. But I don't see how this is solved by creating a chromosomes table?

Well, I guess it's not so much 'solving' as giving a framework where this can be expressed naturally and the results interpreted in a user-friendly way. Certainly it seems more user-friendly to have something like

for tree in ts.trees(chromosome=22):
     for site in tree.sites():
          # Do something with site which has coordinate == standard reference

vs

for tree in ts.trees():
     for site in tree.sites():
          if chrs[22].start <= site.position < chrs[22].end:
               pos = site.pos - chrs[22].start
               # Pos should now be mapped into standard coordinates

Plus, the first is clearly going to be much more efficient.

jeromekelleher commented 5 years ago

Hmm, last point was wrong. The alternative would be:

for tree in ts.trees(start=chrs[22].start, end=chrs[22].end):
     for site in tree.sites():        
           pos = site.pos - chrs[22].start
           # Pos should now be mapped into standard coordinates

so the efficiency point doesn't matter. Still, seems easier/less error prone than having to subtract offsets from coordinates, doesn't it?

molpopgen commented 5 years ago

From a forward simulation perspective, an alternative that seems more natural is to keep an array of table collections, and simplify each independently. Further, I then know immediately how to parallelize the simplification of each chromosome, and it would be lock-free.

It's definitely attractive (and where we started when talking about this), but we'd have a lot of duplication for, say, Individuals and it would get tricky to keep the different table collection in sync. Since the whole point is to allow us to track where different chromosomes go through the same individual, I think it would cause problems...

I do individuals rather differently, so there is less overhead. For me, a bigger problem with this approach is how to deal with mutations, etc., as this would break existing data models.

I do like the idea of simplifying in parallel, though!

molpopgen commented 5 years ago

I haven't thought this through in detail, but I imagine that we have another int32_t column on the edge table, which is -1 by default (== no chromosome), similar to the population and individual references in the Node table. We would then add this chromosome ID as the first key in the sorting order for edges, so that all edges for a given chromosome are adjacent. If there are no chromosomes, then nothing changes. The site table would also need an extra chromosome column.

In terms of simplification then, each chromosome would be simplified separately, but with respect to the same set of nodes. I'm not certain of this, but I think this should lead to some performance gains/memory reduction over having all the chromosomes in the same coordinate space.

In essence, this changes the definition of a position from a double to a tuple/struct, which is relatively easy to accommodate.

jeromekelleher commented 5 years ago

I do like the idea of simplifying in parallel, though!

Nothing stopping you doing it within fwdpp! Whichever way we end up encoding things for the canonical representation, internally it could make a lot of sense to have an array of table collections and do a bit of bookkeeping at export time.

molpopgen commented 5 years ago

Nothing stopping you doing it within fwdpp! Whichever way we end up encoding things for the canonical representation, internally it could make a lot of sense to have an array of table collections and do a bit of bookkeeping at export time.

It has been a TODO item on the kind of TODO list that you secretly never want to get around to.

molpopgen commented 5 years ago

In terms of simplification then, each chromosome would be simplified separately, but with respect to the same set of nodes. I'm not certain of this, but I think this should lead to some performance gains/memory reduction over having all the chromosomes in the same coordinate space.

Just thought a bit more about this: if we are just redefining what a position means, then simplification proceeds exactly as it does now, just with a slightly more fiddly comparison function. Thus, the only way savings can come about is if we simplify into multiple table collections one chromosome at a time, and the savings must at most be len(total number of edges)/(num chromosomes)*sizeof(edge) in terms of extra RAM needed to simplify. Does that sound right?

jeromekelleher commented 5 years ago

I was thinking of doing it as a for-loop over chromosomes, but maybe this wouldn't work at all actually and we do need to process all edges for a given parent over all chromosomes at the same time. In either case, there's not going to be any real performance difference, so I take that one back.

molpopgen commented 5 years ago

I think that this is right, at least with the way that we have been thinking about the problem.

EDIT: if you give up some things, specifically non-sample nodes meaning the same thing across chromosomes, then you can do it per chromosome, I think, but you won't know until you try!

bhaller commented 5 years ago

A question regarding this comment from @jeromekelleher:

I haven't thought this through in detail, but I imagine that we have another int32_t column on the edge table, which is -1 by default (== no chromosome), similar to the population and individual references in the Node table. We would then add this chromosome ID as the first key in the sorting order for edges, so that all edges for a given chromosome are adjacent. If there are no chromosomes, then nothing changes. The site table would also need an extra chromosome column.

The process of simplification seems, in SLiM, to spend a great deal of time sorting; I think it's usually around 50% of the total simplification time, pre-sorting the tables, IIRC. If we add another key to the sorting comparison function, and it's the first key compared, that will presumably add overhead, and it'll be completely wasted overhead for models that involve just a single chromosome. And presumably quite a bit of the time that is spent sorting is spent in the comparison function. So do we have a sense of how much this would be expected to slow down the base case of simulating a single chromosome? I would hate to see tree-seq recording get slower for all users because of a "fringe" feature. Would it be easy to use different comparison functions depending upon whether multiple chromosomes are being simulated or not?

As this discussion continues, I personally feel like I'm leaning further toward "no thanks", or at least "not now, and not until all the details have been sorted out quite concretely". Is that how others are feeling, or not? To me it just feels like a bunch of work and disruption (a change to the table formats and thus a change to the file format, which is a headache, to begin with), and possible negative performance implications for most simulations, for unclear/fringe benefit. I would also like to hear what @petrelharp thinks.

molpopgen commented 5 years ago

@bhaller -- can you distinguish time in qsort from time in sort_tables? The latter is actually a copy-qsort-copy.

bhaller commented 5 years ago

I believe the "around 50% of total" is in qsort(), yes. I don't think the copies take long. I can take a profile if this seems surprising, it's been a long time since I looked.

molpopgen commented 5 years ago

no, that makes sense. thanks.

molpopgen commented 5 years ago

Different comparison functions are probably not too hard to do in the sorting. One challenge is that you'd need to introduce another one for all the left/right tracking business. That would add a fair bit of additional function pointer de-referencing which is part of what bogs qsort down (compared to a typical application of std::sort).

jeromekelleher commented 5 years ago

I'm with @bhaller here on the 'not right now, thanks' assessment. The idea came up during a discussion and I wanted to get some feedback on what would be involved and how much hassle it would be. This isn't a small change, so we definitely wouldn't make it lightly, without considering performance implications and taking the impact on downstream users into account.

Let's keep the issue open for discussion anyway though.

petrelharp commented 5 years ago

In terms of simplification then, each chromosome would be simplified separately,

I think you all are thinking about this, but if you simplify each chromosome separately, we'll have to do an extra "merge" step at the end to ensure that the node tables agree across chromosomes. It would be nice to be able to parallelize, though!

I also lean towards 'no thanks'. An extra benefit of sticking all chromosomes end-to-end is that it makes plotting easier.

molpopgen commented 5 years ago

I think you all are thinking about this, but if you simplify each chromosome separately, we'll have to do an extra "merge" step at the end to ensure that the node tables agree across chromosomes.

I think this is one of the stickier bits, actually.

petrelharp commented 5 years ago

I think you all are thinking about this, but if you simplify each chromosome separately, we'll have to do an extra "merge" step at the end to ensure that the node tables agree across chromosomes.
I think this is one of the stickier bits, actually.

Not that bad, I think, since we would have the (old nodes) -> (new nodes) map for each chromosome.

molpopgen commented 5 years ago

I think you all are thinking about this, but if you simplify each chromosome separately, we'll have to do an extra "merge" step at the end to ensure that the node tables agree across chromosomes. I think this is one of the stickier bits, actually.

Not that bad, I think, since we would have the (old nodes) -> (new nodes) map for each chromosome.

I think its a little fiddly when one node has different output IDs on different chromosomes?

petrelharp commented 5 years ago

I think its a little fiddly when one node has different output IDs on different chromosomes?

Fiddly, maybe - you'd need a "merge" step where you step through all the node tables and their translation tables, but it'd just take one pass by sortedness.

benjeffery commented 2 years ago

@hyanwong and I were just talking about this issue and came up with two proposals where the chromosomes are stored as one continuous tree sequence, but where there is API help to access what you need. Both require storing the chromosome mapping, but this can just be a simple array and doesn't have to be a full table:

  1. Add a position mapper that is used wherever you want to translate coords - e.g. tree = ts.at(position=ts.chromosome_map(chromosome=3)(postion=12345))

or

  1. Add a method to TreeSequence that makes a new ts where all positions are translated on API calls: e.g.:
    ts_chrom3 = ts.chromosome(chromosome=3)
    tree = ts_chrom3.at(position=12345)

I don't see us getting to this feature soon, but wanted to record the discussion.

hyanwong commented 2 years ago

I like 2. and with the new zero copy stuff this should be low cost, I think.