Is that a way to compare different models?

oushujun commented 6 years ago

In a two-ancestry scenario, one of the ancestral population (-p 0) may contribute more than one pulse at different times. To estimate this, I specified multiple pulse introgressions from that population, and joint estimate both times and pulse sizes.

One pulse model, joint estimate the pulse time of both ancestral populations: ancestry_hmm -i inupt.panel -s input.sample -a 2 0.44 0.56 -p 1 -10000 0.56 -p 0 -8000 0.44 -g -e 1e-4 --ne 1e4 -v --tmax 100000 > one-pulse.out

Result:

type time proportion 1 100000 0.56 1 100000 0.56 0 100000 0.44 0 100000 0.44

Two pulse model, treat -p 1 as the ancestral population, estimate two pulses from -p 0 for both pulse size and time: ancestry_hmm -i inupt.panel -s input.sample -a 2 0.44 0.56 -p 1 1000000 0.56 -p 0 -8000 -0.22 -p 0 -4000 -0.22 -g -e 1e-4 --ne 1e4 -v --tmax 100000 > two-pulse.out

Result:

type time proportion 1 1e+06 0.56 1 1e+06 0.56 0 499.575 0.109798 0 499.575 0.109798 0 100000 0.330202 0 100000 0.330202

Three-pulse model, similar to the last model, but add one more -p 0 pulse and time for estimation: ancestry_hmm -i inupt.panel -s input.sample -a 2 0.44 0.56 -p 1 1000000 0.56 -p 0 -8000 -0.22 -p 0 -4000 -0.11 -p 0 -1000 -0.11 -g -e 1e-4 --ne 1e4 -v --tmax 100000 > three-pulse.out

Results are pending for the three-pulse model.

I also used bootstraping to find the confidence intervals for these estimations. Is that a way to compare the fitting of these different models?

Thanks, Shujun

russcd commented 6 years ago

Hi Shujun,

In general when one of the ancestral pulse times spikes up to the maximum, which is what's happening with your pulse two, it suggests that there is a problem with the model fit.

One cause could be that you have a good amount of allele frequency change between your putative ancestral population and the ancestry components in your admixed population.

Another can be significant ancestry LD. Did you prune sites in strong LD in the two ancestral populations?

On Fri, Oct 5, 2018 at 8:11 AM Shujun Ou notifications@github.com wrote:

In a two-ancestry scenario, one of the ancestral population (-p 0) may contribute more than one pulse at different times. To estimate this, I specified multiple pulse introgressions from that population, and joint estimate both times and pulse sizes.

One pulse model, joint estimate the pulse time of both ancestral populations: ancestry_hmm -i inupt.panel -s input.sample -a 2 0.44 0.56 -p 1 -10000 0.56 -p 0 -8000 0.44 -g -e 1e-4 --ne 1e4 -v --tmax 100000 > one-pulse.out

Result:
type    time    proportion
1       100000  0.56
1       100000  0.56
0       100000  0.44
0       100000  0.44
Two pulse model, treat -p 1 as the ancestral population, estimate two pulses from -p 0 for both pulse size and time: ancestry_hmm -i inupt.panel -s input.sample -a 2 0.44 0.56 -p 1 1000000 0.56 -p 0 -8000 -0.22 -p 0 -4000 -0.22 -g -e 1e-4 --ne 1e4 -v --tmax 100000

two-pulse.out

Result:
type    time    proportion
1       1e+06   0.56
1       1e+06   0.56
0       499.575 0.109798
0       499.575 0.109798
0       100000  0.330202
0       100000  0.330202
Three-pulse model, similar to the last model, but add one more -p 0 pulse and time for estimation: ancestry_hmm -i inupt.panel -s input.sample -a 2 0.44 0.56 -p 1 1000000 0.56 -p 0 -8000 -0.22 -p 0 -4000 -0.11 -p 0 -1000 -0.11 -g -e 1e-4 --ne 1e4 -v --tmax 100000 > three-pulse.out

Results are pending for the three-pulse model.

I also used bootstraping to find the confidence intervals for these estimations. Is that a way to compare the fitting of these different models?

Thanks, Shujun

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/russcd/Ancestry_HMM/issues/7, or mute the thread https://github.com/notifications/unsubscribe-auth/AJmQMbi0rhTLDwKMvDYGsQVQ0Z4Wgo1Hks5uh3amgaJpZM4XKanM .

-- Russ Corbett-Detig Assistant Professor Department of Biomolecular Engineering University of California, Santa Cruz

oushujun commented 6 years ago

Hi Russ,

Thanks for the insights.

One cause could be that you have a good amount of allele frequency change between your putative ancestral population and the ancestry components in your admixed population.

Did you mean that allele frequencies between -p 0 and the admixed population are too different that -p 0 may not be a proper potential ancestral population? Is that a way to fix this?

For LD pruning, I pruned sites in the admix population (r2<0.1) instead of the two ancestral populations. Could this be the problem?

Thanks, Shujun

oushujun commented 6 years ago

I am trying to LD prune the two reference populations instead of the admix population. If I apply LD prune of r2=0.1 for both populations, then there would be less then 5k SNPs for the whole genome. If r2=0.2, then 10 K; and 17K SNPs for r2=0.3. Is there a recommended balance between LD and SNP number for ancestry_hmm?

russcd commented 6 years ago

Hi Shujun,

Unfortunately, there really is no easy answer for how much LD pruning is the right amount. We're still working on this and it definitely depends on a few things.

The program requires that your ancestral population be pretty close to the actual donor population. It is robust to some difference, but if the two are very distant, this will not work well.

One other question is, when you give the program recombination rates, does this include the decrease in rate due to selfing? I have not worke diwth partial selfing yet, but my suspicion is that you'll want to turn down the recombination rate to include this.

oushujun commented 6 years ago

Hi Russ,

Thanks for the insights. I used R2<0.3 to prune both reference panels and retaining informative sites of 11 Kb, which is comparable to the example dataset.

I then rerun the one-pulse, two-pulse, three-pulse, and four-pulse models, the results seem to be more reasonable:

One pulse model with bootstrap:

type time proportion 0 23922 (+-1093.6) 0.44 1 25060 (+-1100.2) 0.56

Although not reaching the --tmax 1000000, these time estimations are too old than expected, which "should be" within 10000 generations.

Two pulse model:

type time proportion 1 1e+06 0.56 1 1e+06 0.56 0 2661.47 0.241936 0 2661.47 0.241936 0 148.71 0.198064 0 148.71 0.198064

Three pulse model:

type time proportion 1 1e+06 0.56 1 1e+06 0.56 0 1115.86 0.230936 0 1115.86 0.230936 0 90.7466 0.138542 0 90.7466 0.138542 0 26701.1 0.0705214 0 26701.1 0.0705214

An ancient pulse is again observed.

The four pulse model takes too long to run (7 parameters to be estimated).

Of these three estimations, the two-pulse model makes more sense. But are there any criteria or rules of thrumb to decide which model fits better?

Thanks! Shujun

oushujun commented 6 years ago

Oh for the recombination rate, I used a fine-scale recombination map estimated from a QTL population. I realize the concept of "effective recombination rate", which is to use the inbreeding coefficient to adjust for the observed recombination rate. Is it necessary to do so for ancestry_hmm estimations?

Thanks again, Shujun

russcd / Ancestry_HMM

Is that a way to compare different models? #7