gvbarroso / iSMC

The integrated Sequentially Markovian Coalescent
GNU General Public License v3.0

Scaling in demography.txt #13

Closed andreaswallberg closed 1 year ago

andreaswallberg commented 1 year ago

Dear developers,

I wonder how the three columns:

lower_bound upper_bound scaled_coalescence_rate

in demography.txt are scaled, and how I go from those to Ne over time (assuming I know the mutation rate and generation time).

Thanks!

andreaswallberg commented 1 year ago

I also wonder how rho is scaled in estimates.txt.

Is it the regular rho per base, ρ = 4 * N_e * r, such that r = ρ / (4 * N_e)?

How are per-base estimates of, for example, rho corrected for masked/missing data?

gvbarroso commented 1 year ago

Hi Andreas,

Our model is based on PSMC and MSMC. As such, it scales parameters from their physical units to coalescent units based on N_0 (see supplement of Li and Durbin, 2011).

Thus the first two columns of demography.txt are time boundaries in such coalescent units, and the scaled coalescence rates are also relative to N_0 (in that regard our output differs from MSMC, for example). To convert back to physical units you would first get N_0 by dividing theta by 4 * mu (your mutation rate per site per generation). Then the first and second columns of the demography output are converted to units of 'number of generations' by simply multiplying them by N_0. Taking the inverse of the third column and then multiplying by N_0 would give estimates of effective population sizes (it is a scaled relative coalescence rate, after all).
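To make that recipe concrete, here is a minimal sketch (not iSMC code) of the conversion, assuming demography.txt is whitespace-separated with a single header line; the mu and theta values are placeholders you would replace with your own:

```python
# Convert demography.txt from coalescent units to physical units,
# following the steps described above.

mu = 1e-8         # mutation rate per site per generation (placeholder)
theta = 0.004     # theta reported by iSMC (placeholder)
N0 = theta / (4.0 * mu)

with open("demography.txt") as handle:
    next(handle)  # skip header: lower_bound upper_bound scaled_coalescence_rate
    for line in handle:
        if not line.strip():
            continue
        lower, upper, coal_rate = map(float, line.split()[:3])
        t_lower = lower * N0       # time boundary in generations
        t_upper = upper * N0       # time boundary in generations
        Ne = N0 / coal_rate        # effective population size in that interval
        print(f"{t_lower:.1f}\t{t_upper:.1f}\t{Ne:.1f}")
```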

Likewise, rho is also scaled by N_0 in estimates.txt and is estimated per site, as you guessed. Such an estimate is not explicitly "corrected" for the presence of missing data, but it is naturally made worse by it. Our HMM, like PSMC/MSMC, "skips over" missing sites by letting the emission probability of missing data equal 1.0 for every hidden state. The extent of the error introduced by varying proportions of missing data has not been benchmarked, but as a guideline we recommend not running the model on datasets with > 15% missing data. In those cases, we recommend filtering the data a priori by removing segments containing a lot of missing data.
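As a minimal sketch of that scaling (not iSMC code), assuming rho in estimates.txt is per site and scaled by N_0 as described, with placeholder values:

```python
# Convert the N0-scaled, per-site rho into a per-site, per-generation
# recombination rate r, using rho = 4 * N0 * r.

mu = 1e-8        # mutation rate per site per generation (placeholder)
theta = 0.004    # theta reported by iSMC (placeholder)
rho = 0.002      # scaled rho per site from estimates.txt (placeholder)

N0 = theta / (4.0 * mu)
r = rho / (4.0 * N0)   # recombination rate per site per generation
print(N0, r)
```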

Hopefully this helps. Cheers, Gustavo

jydu commented 1 year ago

Just a small note to add to Gustavo's reply: there is a small difference between iSMC and MSMC here. MSMC outputs coalescent rates that have already been multiplied by theta (= 4 * N_0 * mu), so that population sizes can be obtained by dividing by the mutation rate directly. For iSMC, you have to multiply by theta first.
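In other words (a sketch with placeholder values, not MSMC or iSMC code), Julien's recipe for iSMC reduces to the same expression as the conversion above:

```python
# lam is a scaled coalescence rate from iSMC's demography.txt;
# multiplying its inverse by theta and dividing by 4 * mu gives N0 / lam.
theta, mu, lam = 0.004, 1e-8, 1.2        # placeholder values
Ne = (1.0 / lam) * theta / (4.0 * mu)    # equivalent to N0 / lam
```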

Julien.

andreaswallberg commented 1 year ago

Many thanks for the feedback!

andreaswallberg commented 1 year ago

Just a quick follow-up here. So instead of just setting the sequence to "N" or specifying high-resolution coordinates in a BAM file, you suggest cutting regions with a lot of missing data completely from the dataset?

In my case, the main aim is to infer an approximate overall recombination rate for my non-model species, rather than localized high-resolution recombination rates. Is the tolerance for missing data higher when aiming for a chromosome- or genome-scale average, or will the error still be substantial with more missing data?

My scaffolds range anywhere between 10% and 50% missing data, but theta is high at over 1%, so there are many SNPs even across shorter sub-sequences with little missing data, compared to humans for example. Is there any rule of thumb for a lower bound on the number of SNPs per block?

gvbarroso commented 1 year ago

It is unclear to me whether the tolerance for missing data is higher when one wants an estimate of genome-wide average rho. But as I see it, this (not focusing on spatial recombination maps) is one more reason to a priori exclude regions with high missing data. Especially in species where theta is high, you will probably be left with enough information in the data to infer rho.

It's of course hard to give any rule of thumb. In general you want reasonably long blocks to infer parameters well (you don't want to reset the Markov chain every 100 kb, for example). How does one define "reasonably long"? It has to do with the number of mutation and recombination events that have happened within the history of such block(s). This means that more diversity will help, because high theta usually means high Ne, and therefore higher rho and more ancestral events per block. Naturally, not all of your blocks need to have the same length (just like in a real genome, chromosomes vary in length as well). Given your goal, what is the distribution of block lengths if you only keep those with < 20% missing data? Maybe try a handful of combinations and see how your estimates vary?
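A hypothetical helper (not part of iSMC) for exploring that trade-off, i.e. how many blocks and how much total sequence survive different missing-data thresholds, assuming you have already tabulated the missing-data fraction and length of each scaffold or block:

```python
def surviving_blocks(missing_fractions, lengths, max_missing):
    """Return the lengths of blocks whose missing-data fraction is below the threshold."""
    return [length for frac, length in zip(missing_fractions, lengths)
            if frac < max_missing]

# Made-up scaffold statistics for illustration:
fractions = [0.12, 0.35, 0.18, 0.48]
lengths = [2_000_000, 5_000_000, 1_200_000, 800_000]

for threshold in (0.15, 0.20, 0.30):
    kept = surviving_blocks(fractions, lengths, threshold)
    print(threshold, len(kept), sum(kept))   # threshold, blocks kept, total bp kept
```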

I could spend many years working on iSMC-related questions, but time is a limiting factor and at this point I prefer to diversify my interests a bit. These are just general guidelines. Hopefully you find them useful.

andreaswallberg commented 1 year ago

Great answers! Thanks again!

I have another question. The reported theta is a bit lower than the theta I compute myself while correcting for accessible (i.e. unmasked) sites. Which theta do you suggest I infer Ne from in order to rescale rho into r per base?

Cheers!

gvbarroso commented 1 year ago

That's odd, because theta is not included in the numerical optimisation of parameters but instead is fixed to the average number of pairwise differences between pairs of genomes. This accounts for missing data, but maybe there's a small bug in there...? What are the values you are getting with both methods, and how are you computing theta by yourself?

andreaswallberg commented 1 year ago

Hi!

For a single diploid genome, I just derive (Watterson's) theta as the number of heterozygous genotypes divided by the number of accessible sites (across the genome or region). For a large population sample, I correct for sample size as per the standard approach: https://en.wikipedia.org/wiki/Watterson_estimator

If I recompute theta as the number of heterozygous genotypes divided by the full length of the sequence, my theta comes very close to yours. This makes me think iSMC may not actually subtract the masked Ns from the computation of theta.
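For reference, a short sketch of the calculation described above (the standard Watterson estimator, not iSMC code), with placeholder counts:

```python
def watterson_theta(segregating_sites, accessible_sites, n_sequences):
    # Harmonic number a_n = sum_{i=1}^{n-1} 1/i; a_n = 1 for n = 2
    a_n = sum(1.0 / i for i in range(1, n_sequences))
    return segregating_sites / (a_n * accessible_sites)

# Single diploid genome (n = 2): heterozygous genotypes / accessible sites
print(watterson_theta(segregating_sites=40_000,
                      accessible_sites=4_000_000,
                      n_sequences=2))   # 0.01
```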

gvbarroso commented 1 year ago

We use Tajima's estimator (average number of pairwise differences), but the two are identical for n = 2. Indeed, iSMC should be taking the number of heterozygous sites and dividing by the number of accessible sites. I will check that, but meanwhile, could you please report the numbers that you are getting?