stschiff / msmc2

GNU General Public License v3.0
53 stars 9 forks source link

Ne explodes close to the present #30

Closed evo-eco-gen closed 3 years ago

evo-eco-gen commented 3 years ago

Hello,

I have a weird problem with MSMC2. I use it to study demography of a non-model species, with populations diverged over the last 100,000-1M years (the times are deeper than for the usual human work).

  1. In almost all cases, MSMC based on SHAPEIT-phased haplotypes recovers ridiculous increases in Ne close to the present, like an order of magnitude boom in the last 10Kyr (last 4-5 slice of time, corresponding to the first term in -p). This seems very unlikely and does not correlate with differences between the populations (plus the inferred Ne values are way to high for this animal). The middle of the curves looks reasonable, and consistent with PSMC on unphased data. [However: PSMC with the same data before phasing also often yields explosive Ne] The only solution I have for this is to tie the most recent slices together, so the Ne stays flat: '-p 1x4+25x1+1x2+1x5' [using x instead of * because o github's formatting]
  2. Similar weird thing happens in the deep past - the inferred Ne for the first time slice is ridiculously high.

What could cause these behaviours? I don't care about the very recent or the very deep past, but why is this happening?... I would like to think the input data, masks etc are of good quality...

Cheers, evoecogen

stschiff commented 3 years ago

I think this can happen due to overfitting, and there isn't a really good way to control for that, other than - as you already did - to reduce the number of parameters by gluing together time segments.

evo-eco-gen commented 3 years ago

Thanks! Could this be related to imperfections of statistical phasing? The dataset is 200 individuals only, we used phase-informative reads and genetic maps, but still not nearly as good as current human data.

evo-eco-gen commented 3 years ago

Also, is there a clear reference on how to choose the SMC parameters? I really struggle to understand why the numbers are what they are, both in the defaults and in various papers using SMC (people tend o change them without much explanation).

stschiff commented 3 years ago

Sure, phasing can also play a role. Hard to test, though. You could run simulations and artificially introduce phasing errors to see how bad the situation can get. What do you mean by "SMC parameters"?

evo-eco-gen commented 3 years ago

Yes, a simulation will be useful. I meant -p, so how to choose the number of time intervals and how they are bound together.

stschiff commented 3 years ago

It's mostly experienced based. I have run simulations and tried different combinations to get some idea about this. It's hard to come up with a fully objective way to optimise this. I think my approach is similar to Li and Durbin's approach in their PSMC paper.

charlesfeigin commented 1 year ago

Hi, I had a related question. Do you think that overfitting driving extremely high Ne only at the most ancient time segments invalidates the trajectory of Ne at more recent periods, or is the whole plot likely to be unreliable? Thank you.

stschiff commented 1 year ago

Impossible to say in general, I am afraid. Try to reduce the number of parameters by gluing together time segments in both ends, to see whether the increase gets mitigated, suggesting overfitting.