IBD calling is heavily biased in two deme simulations

Intro to issue

In simulations of a single deme with constant population size, IBD calling works well: the number of IBD blocks longer than a given threshold called by RefinedIBD agrees closely with the analytical value derived from coalescent theory. In a two-deme model with constant population size, however, there is a strong bias: many more IBD segments than expected are called within populations, and far fewer than expected between populations.

Details

To illustrate, I ran simulations with ms under a symmetric two-deme model with the following parameters:

m = 0.008 (migration rate)
2N = 15,000 (diploid population size)
n = 200 (haploids sampled)
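As a rough sketch of how these parameters might map onto an ms command line: ms takes the migration rate in scaled form (4*N0*m) through its `-I` option. The sketch below assumes that 2N = 15,000 means N0 = 7,500 diploids per deme and that the 200 haplotypes are split evenly between the two demes. The `theta`, `rho` and `nsites` values are hypothetical placeholders (they are not given above), and the helper names are made up for illustration.

```python
def scaled_migration(n_diploid, m):
    """Scaled migration rate 4*N0*m, the form ms expects after -I."""
    return 4 * n_diploid * m


def ms_two_deme_args(n_total, theta, rho, nsites, big_m, nreps=1):
    """Build an ms argument list for a symmetric two-deme model.

    Assumes the sample is split evenly between the two demes.
    """
    half = n_total // 2
    return [
        "ms", str(n_total), str(nreps),
        "-t", f"{theta:g}",                 # scaled mutation rate
        "-r", f"{rho:g}", str(nsites),      # scaled recombination rate, number of sites
        "-I", "2", str(half), str(half), f"{big_m:g}",  # two demes plus 4*N0*m
    ]


# ASSUMPTION: N0 = 7,500 diploids per deme (one reading of 2N = 15,000).
M = scaled_migration(n_diploid=7500, m=0.008)

# ASSUMPTION: theta, rho and nsites are placeholders, not the values actually used.
args = ms_two_deme_args(200, theta=1000.0, rho=1000.0, nsites=1000000, big_m=M)
print(" ".join(args))
```

If 2N = 15,000 instead refers to the total across both demes, the scaled migration rate would be halved, so that reading is worth checking against the actual command used.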

I then thinned the SNPs so that their density is similar to that of array data, and called IBD segments using RefinedIBD with default settings.

Here I plot the number of IBD segments shared between pairs of individuals, with the length threshold set to 4 cM.

[Figure: pairwiseibd]

Let lambda_w denote the mean number of IBD segments shared within a population, and lambda_b the mean number shared between populations.

Empirical: lambda_w = 0.5655556, lambda_b = 0

Analytical: lambda_w = 0.2072319, lambda_b = 0.05777754

RefinedIBD calls more segments than expected within populations (roughly 0.57 versus 0.21) and fewer segments than expected between populations (0 versus roughly 0.06). This holds regardless of the threshold used.
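For reference, the empirical within- and between-population means can be computed from a list of called segments along these lines. This is a minimal sketch, not the script used for the numbers above: it assumes the calls have already been parsed into `(sample_a, sample_b, length_cM)` tuples, that the mean is taken over all pairs (including pairs sharing no segment), and the function and variable names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def mean_ibd_sharing(segments, pop_of, min_cm=4.0):
    """Return (mean_within, mean_between): mean number of IBD segments
    longer than min_cm per pair of samples, averaged over all pairs."""
    # Count qualifying segments for each unordered pair of samples.
    counts = Counter()
    for a, b, length_cm in segments:
        if length_cm > min_cm:
            counts[frozenset((a, b))] += 1

    totals = {"within": 0, "between": 0}
    n_pairs = {"within": 0, "between": 0}
    # Iterate over every pair so that pairs with zero segments are included.
    for a, b in combinations(sorted(pop_of), 2):
        kind = "within" if pop_of[a] == pop_of[b] else "between"
        totals[kind] += counts[frozenset((a, b))]
        n_pairs[kind] += 1

    def mean(kind):
        return totals[kind] / n_pairs[kind] if n_pairs[kind] else float("nan")

    return mean("within"), mean("between")


# Toy data (made up): two samples in each of two populations.
pop_of = {"A1": "A", "A2": "A", "B1": "B", "B2": "B"}
segments = [("A1", "A2", 5.2), ("A1", "A2", 4.5), ("A1", "B1", 3.0), ("B1", "B2", 6.0)]
within, between = mean_ibd_sharing(segments, pop_of)
print(within, between)
```

In the toy data the only between-population segment is shorter than 4 cM, so the between-population mean is 0 while the within-population mean is 1.5.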

Initial guess at the problem

Under the two-deme model, most SNPs will be fixed in one population and absent in the other. This makes the data look as if there are long stretches of homozygosity among individuals of the same population and very few between populations. Hence, more IBD segments will be called within populations and fewer between populations.

How to test my guess

Filter SNPs so that we only keep SNPs that are polymorphic in each population.
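A minimal sketch of such a filter, assuming the haplotypes are available as rows of 0/1 alleles (as in ms output) together with a population label per haplotype. The function name and input layout are hypothetical.

```python
def polymorphic_in_all_pops(haplotypes, pop_labels):
    """Return one boolean per SNP: True if the SNP is segregating
    (neither fixed nor absent) in every population."""
    n_snps = len(haplotypes[0])
    keep = [True] * n_snps
    for pop in set(pop_labels):
        rows = [h for h, p in zip(haplotypes, pop_labels) if p == pop]
        for j in range(n_snps):
            derived = sum(row[j] for row in rows)
            # Drop SNPs that are absent (0 copies) or fixed (all copies) in this population.
            if derived == 0 or derived == len(rows):
                keep[j] = False
    return keep


# Toy data (made up): 4 haplotypes x 3 SNPs, two haplotypes per population.
haps = [
    [0, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
]
pops = ["pop1", "pop1", "pop2", "pop2"]
keep = polymorphic_in_all_pops(haps, pops)
print(keep)
```

Here the middle SNP is fixed in pop1 and absent in pop2, so it is the one that gets dropped. IBD calling would then be rerun on the retained SNPs only.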

Ramifications to real data and other studies

It's not clear how this issue will affect real data. Most SNPs appear to be polymorphic in all European populations because of the recent expansion, and Ralph & Coop (2013) did not seem to consider this a problem for the POPRES dataset.

halasadi / MAPS
