tskit-dev / tskit

Population-scale genomics
MIT License
155 stars 73 forks source link

Support genetic coordinates #347

Closed apragsdale closed 5 years ago

apragsdale commented 5 years ago

Is there currently a nice way to record the genetic positions (instead of only physical positions (bp)) in tree sequences? Suppose I simulate with some recombination map (or have data encoded as a tree sequence, and want to store genetic positions of sites or tree intervals using some external map) - what is the best way to have this information accessible within the tree sequence object?

For example, if I iterate through either mutations variants on the tree, I can get the physical position right away by m.position or v.site.position, or the interval of a tree in bp by tree.interval. If I want to also be able to get, say, m.genetic_position or tree.genetic_interval, I imagine this might not be supported. Would I instead need to copy each table, add this information to the metadata of tables that record coordinates, and create a new tree sequence from those tables?

jeromekelleher commented 5 years ago

The only way to do this at the moment would be to translate the coordinates on-demand @apragsdale. This should be straightforward enough to do though:

recomb_map = msprime.RecombinationMap() # load this from whereever
for var in ts.variants():
     genetic_pos = recomb_map.physical_to_genetic(var.site.position)

It might not be the most efficient, but I doubt it would be much of a time-sink in practise.

We don't have any concept of genetic coordinates within the tree sequence at the moment, just physical. I guess we could add support in two different ways:

  1. Allow the user to set a recombination_map instance to a TreeSequence class, which if present, would allow us to support things like site.genetic_position and tree.genetic_interval.
  2. Store the actual genetic map within the trees file and support these attributes directly.

I'm not sure it's a common enough requirement to justify this complexity though --- it doesn't seem that hard to translate the coordinates on the fly?

apragsdale commented 5 years ago

Thanks Jerome! That makes sense - admittedly this is a pretty niche requirement, so it's certainly not high on the priority list. What I have in mind is computing various LD statistics, which are shaped by genetic instead of physical distances between loci, so if I want to get LD as a function of genetic distance over all pairs of SNPs, I need to check a lot of pairwise distances. Repeatedly calling the physical to genetic coordinate function can be avoided by using recomb_map.physical_to_genetic, like you said, and then caching and saving/loading those genetic positions separately for future use.

I guess it's only really useful if you're looking at LD or IBD, which might not justify adding this support at the moment. I'll close this unless you think it would be a useful feature.

jeromekelleher commented 5 years ago

It's a good idea @apragsdale, but not pressing enough to follow up on right now I think.