popsim-consortium / stdpopsim

A library of standard population genetic models
GNU General Public License v3.0
122 stars 86 forks source link

recombination map liftOver thread #571

Open jradrion opened 4 years ago

jradrion commented 4 years ago

I'm working on a liftOver pipeline (using crossmap https://crossmap.readthedocs.io/en/latest/) to bring all of our maps over to their respective Ensembl annotation assemblies.

I've just started playing around with chr22 from the human map, and I've noticed that some of the regions not lifting over correspond to Hapmap entries that I assume are themselves the result of previous liftOver anomalies (based on positional weirdness), for examples this block:

chr22 22394144 32.912633 16.741969 chr22 22394227 32.912633 16.744701 chr22 22394241 32.912633 16.745162 chr22 22394743 32.912633 16.761684 chr22 22394749 32.912633 16.761881

Since the HapmapII map that we use (hg19) was itself lifted over from hg17, I'm wondering if it wouldn't make sense to go directly from hg17 to hg38?

I suppose the issues with doing this would be tracking down the original LDhat data in hg17, and the fact that the previous liftOver slightly modified some intervals as described in the README.

Once the liftOver was completed, the map was inspected for regions in which the genome assembly had be rearranged. Such rearrangements result in a dip in the lifted genetic map (i.e. a negative recombination rate). These regions were removed by setting the recombination to zero within, and for 50 SNPs either side of, such regions. These regions are fairly rare, and this method removed a total of 2013 SNPs from the following chromosomes.

Do people think we should just stick with lifting over directly from the Hapmap files we've been using?

Also, there were some suggestions made about way to deal with intervals that don't map using liftOver, or overlapping intervals in the new positions. I think it would helpful to brainstorm some criteria/rules to use in this thread.

ndukler commented 4 years ago

I would agree that it makes the most sense just to liftOver directly hg17 --> hg38 but if it's too much of a pain we can just stick to what we have. For the missing bits we can use either chromosomal averages or some moving average.

jradrion commented 4 years ago

One issue that I've seen pop up more than once is that old intervals may be broken into new intervals with a single bp not being included, e.g.

chr22 15918273 15919124 5.689620 chr22 15919125 15919155 5.689620 chr22 15919156 15921153 5.689620

Having a rule that joins these intervals would appear to make more sense than inserting a new 1bp interval with the chromosome average recombination rate.

IIRC @ndukler suggested something like using a weighted average to convert these new coordinates to intervals of a consistent length? That could work too. Any thoughts on what we might like that interval to be?

Is there a better way to do this?

UPDATE: The issue with the coordinates above appears to be an artifact of using crossmap, and is not present when using the UCSC liftOver tool.

ndukler commented 4 years ago

Why not just try a few imputation schemes on extant data and see which works best? I would imagine some sort of weighted average based on distance from focal bin and a prior based on the chromosomal average would do reasonably well. Or, just try fitting a gaussian process model?

ndukler commented 4 years ago

Have to handle the blockyness of the data somehow but it should be doable?

apragsdale commented 4 years ago

The Beagle software has available a lifted over recombination map in build 38 - in the readme from the lifted recombination maps (found here http://bochet.gcc.biostat.washington.edu/beagle/genetic_maps/), they explain their approach to dealing with inversions or discrepancies between genome builds, if that's helpful.

DeCode also has lifted over maps, and I think Jeff Spence inferred maps directly on hg38 (https://github.com/popgenmethods/pyrho), if we want to include those resources instead or in addition.

grahamgower commented 4 years ago

chr22 15918273 15919124 5.689620 chr22 15919125 15919155 5.689620 chr22 15919156 15921153 5.689620

Is it possible that closed intervals have snuck in somehow?

jradrion commented 4 years ago

Is it possible that closed intervals have snuck in somehow?

It's certainly possible. I'll try lifting over with the UCSC tool to see if I get the same coordinates.

Update: OK, not going to be using crossmap anymore, that's for sure! Back to the UCSC tool.

jeffspence commented 4 years ago

Hi everyone! I'd be happy to contribute the maps I inferred here: https://advances.sciencemag.org/content/5/10/eaaw9206. I have 26 (one for each population in 1000 Genomes). They were inferred on hg19 and hg38. I'd be happy to submit a PR to include them if there's interest and if somebody would let me know what modifications to the code and documentation are needed to get them included.

andrewkern commented 4 years ago

Awesome! A decent place to start might be to have a look at PR #363 along with the developer docs

andrewkern commented 4 years ago

Also how are we doing uploads to AWS moving forward? @ndukler are you in charge of that? Should we distribute that responsibility?

ndukler commented 4 years ago

We probably should distribute that responsibility going forward but for the time being I'm happy to continue handling the uploads. For specific instructions on adding a genetic map see here. Right now we would add the hg19 version but we will be migrating to hg38 at some point soon so it would need a minor update then.

jradrion commented 4 years ago

Here's a look at my first attempt at remapping the human map from hg19 to hg38. It looks decent, but I'm sure it could be improved.

It is a simple liftOver of the HapMapII_GRCh37 coordinates to hg38, where any gaps between new intervals in hg38 are filled with the chromosome average rate and all overlapping intervals in hg38 are left unaltered. Next, a weighted average is taken using non-overlapping 1kb bins, whereby any intervals that overlapped as a result of the liftOver are now averaged and leading zeros are inserted until the start of the first remapped coordinate (since msprime inserts them anyhow).

Here is chr1 plot_map_chr1.pdf and here is chr22 plot_map_chr22.pdf The original HapMapII_GRCh37 is shown in grey and the new hg38 map is shown in red.

jradrion commented 4 years ago

One thing that stands out is that coordinates around the centromere in chr1 are unmapped in hg38, and since I'm filling gaps with the chromosome average rate, those positions are remapped as having non-zero rates. I suppose one way to get around this would be to have a simple gap threshold where gaps > ~1Mb are filled with zeros instead of the chromosome average?

Another thing to point out is because the resulting map is a weighted average using non-overlapping 1kb windows, the magnitude of some of the hotspots is diminished.

jradrion commented 4 years ago

OK, here are the maps using a gap threshold of 1Mb for setting the rate to zero, along with using 100bp windows for the weighted average and joining adjacent windows with identical rates (when rounded to 2 decimal places). The resulting maps look much closer to the originals. plot_map_chr1.pdf plot_map_chr22.pdf

jeromekelleher commented 4 years ago

Thanks for all the hard work @jradrion, this is vital stuff!

jradrion commented 3 years ago

I added some new validation and plotting code to my liftOver PR. I'm now automatically lifting back from the new map to the original map, and then calculating the deviation in rates along the chromosomes as a validation/sanity check. The resulting validation for humans is as follows all_chromosome_deviations

If you want a detailed look at each chromosome, running the liftOver code will generate a validation directory with the following two figures for each chromosome: 1) The original assembly vs new assembly coordinates/rates chr22

2) A map of the unnormalized differences in rates along the chromosome (rates from the back-lifted original assembly - original rates), along with violin and boxplots showing this distribution. chr22_validation

jradrion commented 3 years ago

If people would like to eyeball the other chromosomes for humans, they can be found here liftOver_validation.zip

I will start working on the other species and add them to this thread as I generate them.

jradrion commented 3 years ago

Also, in case anyone wants to run the code for themselves, or try it out on another species, the command used to generate the liftOver/figs above was the following:

python ./stdpopsim/maintenance/liftOver_assemblies.py \
--species HomSap \
--map HapMapII_GRCh37 \
--chainFile ./liftOver_chains/hg19ToHg38.over.chain.gz \
--validationChain ./liftOver_chains/hg38ToHg19.over.chain.gz \
--winLen 100 \
--retainIntermediates \
--joinThresh 1 \
--gapThresh 1000000
grahamgower commented 3 years ago

The plots look really good @jradrion! I just have one question: in your 1. The original vs new map coordinates/rates plot are the red and gray lines deliberately offset horizontally? I guess you did this to so the lines are visible?

jradrion commented 3 years ago

@grahamgower that's just showing the actual change in coordinates between the two assemblies. The offset differs by chromosome and can also differ along the chromosomes. For example, here's chr9. chr9

grahamgower commented 3 years ago

Ah, I see! Thanks.

grahamgower commented 3 years ago

Now that #574 has been merged, could we please have a status update on what still needs to be done here? I understand that for some recombination maps, we still have a discrepancy between the genome assembly versions used for our genome definitions (ie. chromosome lengths), and the assembly version that our recombination maps correspond with. I'm guessing some of the recently lifted over maps can now be uploaded? Ping @jradrion, @andrewkern, @ndukler.

Oh, and we definitely need the genome assembly name/version in the file names on AWS (to not conflict with existing files), and we definitely need to keep the old files there (so the current stdpopsim release keeps working)!

grahamgower commented 3 years ago

Did the lifted-over maps get uploaded? I think the PonAbe #544, HomSap #543 and AraTha #539 species QC issues are still open because they need liftover recombination maps uploaded?

andrewkern commented 3 years ago

where are the lifted-over maps? @jradrion are you holding those?

andrewkern commented 2 years ago

hey @jradrion @grahamgower -- just resurrecting this thread as it's time to capitalize on all this nice code / analysis @jradrion has put in place.

I wonder if you guys and others might have a list of species that we need liftovers from? cc: @izabelcavassim

grahamgower commented 2 years ago

HomSap and AraTha are the two early species that still need liftover attention. I don't know about more recently added species.

jradrion commented 2 years ago

Sounds good @andrewkern . I should have some time to take a look this weekend, but what @grahamgower said rings true to me atm.