popsim-consortium / stdpopsim

A library of standard population genetic models
GNU General Public License v3.0
121 stars 86 forks source link

Human genetic map not working as intended #663

Open Rosemeis opened 3 years ago

Rosemeis commented 3 years ago

Using HapMap genetic map is not giving correct results: I simulated 500 individuals of each of the four populations of the demography model, AmericanAdmixture_4B11, with and without using the provided genetic map (HapMapII_GRCh37) on chromosome 22. Checked the results by performing PCA and saw that something was completely off for the scenario of using the provided genetic map. Similar results when using different demography model.

How to reproduce the problem: Used the following commands using stdpopsim (v.0.1.2): With genetic map stdpopsim HomSap 500 500 500 500 -d AmericanAdmixture_4B11 -g HapMapII_GRCh37 -c chr22 -o amr.adm.hapmap.trees

With uniform map stdpopsim HomSap 500 500 500 500 -d AmericanAdmixture_4B11 -c chr22 -o amr.adm.uniform.trees

I obtained the haplotype matrices of the chromosomes using tskit in Python (ts.genotype_matrix()). I performed standard PCA directly on standardized haplotype matrices and have attached the plot for each of the scenarios.

Cheers, Jonas

amr adm hapmap amr adm uniform

andrewkern commented 3 years ago

hi there-- thanks for checking out the project.

i see that you are simulating chromosome 22. on that chromosome the first ~16Mb are nonrecombining-- did you remove variation from that portion of the chromosome before performing PCA?

the uniform map will not have such large spans of the chromosome represented by one marginal tree, so i would expect to see differences here.

Rosemeis commented 3 years ago

No, I was simply using all the common variants for PCA (maf > 0.05). But I see the same pattern for other chromosomes, so I guess it is due to the tails of the genetic map.

andrewkern commented 3 years ago

for the analysis portion of the stdpopsim paper we use these mask files to remove regions which we considered to be low recombination, by some heuristic rule. This could be a good way to get started.

apragsdale commented 3 years ago

Agree that you'll definitely want to mask out those regions before running any analysis. For human data, you could also use the 1000 genomes strict accessibility mask to filter out SNPs from those regions, since recombination rates are not expected to be inferable in those inaccessible regions of the genome.

Just skimming the documentation now, and I don't see anywhere that we discuss masking out problematic regions (maybe I missed it?). Without mentioning this in the docs, I think it's an issue that could trip up a lot of users.

petrelharp commented 3 years ago

This exact thing happens in real human data too, FYI. So, maybe it's a feature, not a bug. =)

Agreed we should discuss this. And maybe even this precise example deserves a writeup in the docs?