popgenmethods / smcpp

SMC++ infers population history from whole-genome sequence data.
GNU General Public License v3.0

Output shows same upper limit on Ne across populations & bootstrap replicates #115

Open twooldridge opened 5 years ago

twooldridge commented 5 years ago

Hi,

I'm currently running SMC++ on data from four populations (each run independently), as well as bootstrap replicates for each population. The tool is working very well (thank you), but one odd thing I've noticed is that the size estimates never exceed the exact same value, 9433962.264150944, even across different populations and bootstrap replicates. This is readily apparent when I plot the size history for all populations, as seen below:

[image: size-history plot for all populations, showing the shared upper limit on Ne]
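(For anyone wanting to reproduce this kind of combined plot: something along the lines of the command below should work, assuming the default model.final.json output names under each -o directory; this is a sketch rather than my exact command.)

smc++ plot all_pops.pdf estimates/*/model.final.json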

Here are example commands I run that produce this effect. I use the cv command, run each fold independently, and then average across folds once all of the fold jobs are done (a fuller sketch of this workflow follows the commands below):

"BK" pop:

# Example to run fold 1
smc++ cv 5.3e-9 smc_input/new/BK.chr*.masked.smc.gz  -o estimates/BK/ --cores 8 --folds 4 --timepoints 1e3 5e7 --fold 1
# Average across folds
smc++ cv 5.3e-9 smc_input/new/BK.chr*.masked.smc.gz  -o estimates/BK/ --cores 8 --folds 4 --timepoints 1e3 5e7
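A minimal sketch of that full workflow (fold indices are assumed to run 0-3, so double-check against smc++ cv -h; in practice each fold run is submitted as a separate job):

# run all folds (fold numbering is an assumption; adjust if smc++ expects 1-4)
for FOLD in 0 1 2 3; do
  smc++ cv 5.3e-9 smc_input/new/BK.chr*.masked.smc.gz -o estimates/BK/ --cores 8 --folds 4 --timepoints 1e3 5e7 --fold ${FOLD} &
done
wait
# then average across folds once all fold runs have finished
smc++ cv 5.3e-9 smc_input/new/BK.chr*.masked.smc.gz -o estimates/BK/ --cores 8 --folds 4 --timepoints 1e3 5e7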

I've attached the "debug" files here for runs with both the "BK" and "BW" populations, two populations that hit this upper limit in both the original and bootstrapped runs. I tried looking through them myself for clues, but there isn't much information there. Have you seen this phenomenon before? I noticed in the original paper that both of the finch populations exceed 10^7 individuals at some point, so I'm assuming it's not some limit intrinsic to the program. Please let me know if there's other information I can provide. Thank you in advance for your help!

Brock

BK.debug.txt BW.debug.txt

terhorst commented 5 years ago

See the --Nmax option (smc++ cv -h).

twooldridge commented 5 years ago

Hi,

When I specify the --Nmax option as 1e8, which I hoped would allow the demographic history to exceed the 9433962.264150944 "boundary", I still get the same effect:

smc++ cv 5.3e-9 smc_input/new/BW.chr*masked.smc.gz  -o estimates/BW/ --folds 4 --timepoints 1e3 5e7 --Nmax 1e8 --spline cubic

Here is the resulting plot: BW.pdf

The help info for smc++ cv doesn't indicate default values for Nmax or Nmin either, so it's not clear whether an upper limit was being imposed in the first place. I suppose it's possible in theory that there is extreme confidence in this being the Ne at those time points, but again, the fact that I observe the exact same value for different populations at different time points makes me suspicious. I've attached the debug file and the demographic history in .txt format in case those are helpful.

Thank you! Brock

BW.debug.txt BW.txt

terhorst commented 5 years ago

No, it's due to a threshold set in the code. What version are you using? The output from the most recent version of SMC++ should look like:

$ smc++ cv -h
usage: smc++ cv [-h] [-v] [--cores CORES] [-o OUTDIR] [--base BASE]
                [--timepoints TIMEPOINTS TIMEPOINTS]
                [--nonseg-cutoff NONSEG_CUTOFF] [--thinning k] [-w W]
                [--em-iterations EM_ITERATIONS]
                [--algorithm {Powell,L-BFGS-B,TNC}] [--multi] [--ftol FTOL]
                [--xtol XTOL] [--Nmax NMAX] [--Nmin NMIN]
                [--regularization-penalty REGULARIZATION_PENALTY]
                [--unfold | --polarization-error p] [--knots KNOTS]
                [--spline {cubic,pchip,piecewise}] [-r R] [--folds FOLDS]
                [--fold FOLD]
                mu data [data ...]

positional arguments:
  data                  data file(s) in SMC++ format

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         increase debugging output, specify multiply times for
                        more
  --cores CORES         Number of worker processes / threads to use in
                        parallel calculations
  -o OUTDIR, --outdir OUTDIR
                        output directory
  --base BASE           base for file output. outputted files will have the
                        form {base}.final.json, etc.
  --timepoints TIMEPOINTS TIMEPOINTS
                        start time of model (in generations)
  --unfold              use unfolded SFS (alias for -p 0.0)
  --polarization-error p, -p p
                        uncertainty parameter for polarized SFS: observation
                        (a,b) has probability [(1-p)*CSFS_{a,b} +
                        p*CSFS_{2-a,n-2-b}]. default: 0.5
  --folds FOLDS         number of folds for cross-validation
  --fold FOLD           run a specific fold only, useful for parallelizing

data parameters:
  --nonseg-cutoff NONSEG_CUTOFF, -c NONSEG_CUTOFF
                        recode nonsegregating spans > cutoff as missing.
                        default: do not recode.
  --thinning k          only emit full SFS every <k>th site. (k > 0)
  -w W                  window size. sites are grouped into blocks of size
                        <w>. each block is coded as polymorphic or
                        nonpolymorphic. the default w=100 matches PSMC.
                        setting w=1 performs no windowing but may be more
                        susceptible to model violations, in particular tracts
                        of hypermutated sites.

Optimization parameters:
  --em-iterations EM_ITERATIONS
                        number of EM steps to perform
  --algorithm {Powell,L-BFGS-B,TNC}
                        optimization algorithm. Powell's method is used by
                        {P,MS}MC and does not require gradients. It may be
                        faster in some cases.
  --multi               update multiple blocks of coordinates at once
  --ftol FTOL           stopping criterion for relative improvement in loglik
                        in EM algorithm. algorithm will terminate when
                        |loglik' - loglik| / loglik < ftol
  --xtol XTOL           x tolerance for optimizer. optimizer will stop when
                        |x' - x|_\infty < xtol
  --Nmax NMAX           upper bound on scaled effective population size
  --Nmin NMIN           lower bound on scaled effective population size
  --regularization-penalty REGULARIZATION_PENALTY, -rp REGULARIZATION_PENALTY
                        regularization penalty

Model parameters:
  --knots KNOTS         number of knots to use in internal representation
  --spline {cubic,pchip,piecewise}
                        type of model representation (smooth spline or
                        piecewise constant) to use

Population-genetic parameters:
  mu                    mutation rate per base pair per generation
  -r R                  recombination rate per base pair per generation.
                        default: estimate from data.

twooldridge commented 5 years ago

The version I'm using is 1.15.1, installed in early December. The help menu you showed looks exactly the same as mine, as far as I can tell.

(py36) [twooldridge@holy7b01215 SMC]$ smc++ version
SMC++ v1.15.1
(py36) [twooldridge@holy7b01215 SMC]$ smc++ cv -h
usage: smc++ cv [-h] [-v] [--cores CORES] [-o OUTDIR] [--base BASE] [--timepoints TIMEPOINTS TIMEPOINTS]
                [--nonseg-cutoff NONSEG_CUTOFF] [--thinning k] [-w W] [--em-iterations EM_ITERATIONS]
                [--algorithm {Powell,L-BFGS-B,TNC}] [--multi] [--ftol FTOL] [--xtol XTOL] [--Nmax NMAX] [--Nmin NMIN]
                [--regularization-penalty REGULARIZATION_PENALTY] [--unfold | --polarization-error p] [--knots KNOTS]
                [--spline {cubic,pchip,piecewise}] [-r R] [--folds FOLDS] [--fold FOLD]
                mu data [data ...]

positional arguments:
  data                  data file(s) in SMC++ format

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         increase debugging output, specify multiply times for more
  --cores CORES         Number of worker processes / threads to use in parallel calculations
  -o OUTDIR, --outdir OUTDIR
                        output directory
  --base BASE           base for file output. outputted files will have the form {base}.final.json, etc.
  --timepoints TIMEPOINTS TIMEPOINTS
                        start time of model (in generations)
  --unfold              use unfolded SFS (alias for -p 0.0)
  --polarization-error p, -p p
                        uncertainty parameter for polarized SFS: observation (a,b) has probability [(1-p)*CSFS_{a,b} +
                        p*CSFS_{2-a,n-2-b}]. default: 0.5
  --folds FOLDS         number of folds for cross-validation
  --fold FOLD           run a specific fold only, useful for parallelizing

data parameters:
  --nonseg-cutoff NONSEG_CUTOFF, -c NONSEG_CUTOFF
                        recode nonsegregating spans > cutoff as missing. default: do not recode.
  --thinning k          only emit full SFS every <k>th site. (k > 0)
  -w W                  window size. sites are grouped into blocks of size <w>. each block is coded as polymorphic or
                        nonpolymorphic. the default w=100 matches PSMC. setting w=1 performs no windowing but may be more
                        susceptible to model violations, in particular tracts of hypermutated sites.

Optimization parameters:
  --em-iterations EM_ITERATIONS
                        number of EM steps to perform
  --algorithm {Powell,L-BFGS-B,TNC}
                        optimization algorithm. Powell's method is used by {P,MS}MC and does not require gradients. It may be
                        faster in some cases.
  --multi               update multiple blocks of coordinates at once
  --ftol FTOL           stopping criterion for relative improvement in loglik in EM algorithm. algorithm will terminate when
                        |loglik' - loglik| / loglik < ftol
  --xtol XTOL           x tolerance for optimizer. optimizer will stop when |x' - x|_\infty < xtol
  --Nmax NMAX           upper bound on scaled effective population size
  --Nmin NMIN           lower bound on scaled effective population size
  --regularization-penalty REGULARIZATION_PENALTY, -rp REGULARIZATION_PENALTY
                        regularization penalty

Model parameters:
  --knots KNOTS         number of knots to use in internal representation
  --spline {cubic,pchip,piecewise}
                        type of model representation (smooth spline or piecewise constant) to use

Population-genetic parameters:
  mu                    mutation rate per base pair per generation
  -r R                  recombination rate per base pair per generation. default: estimate from data.

Thanks, Brock

terhorst commented 5 years ago

Ah, sorry, I misread your comment; I thought you meant the help does not show the --Nmax or --Nmin options at all. I'm not sure why cv is ignoring those options and will have to look into it. Does smc++ estimate respect them?
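E.g. something like (mirroring the cv invocation above; this is a suggested command, not a tested one):

smc++ estimate 5.3e-9 smc_input/new/BW.chr*masked.smc.gz -o estimates/BW/ --timepoints 1e3 5e7 --Nmax 1e8 --spline cubic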

twooldridge commented 5 years ago

No, it looks like smc++ estimate still has the same upper limit:

(py36) [twooldridge@holylogin01 new]$ head -n 5 BW_Est.csv 
label,x,y,plot_type,plot_num
BW,0.0,9433962.264150944,path,0
BW,1000.0,9433962.264150944,path,0
BW,1115.4865637172875,9433962.264150944,path,0
BW,1244.3102738338014,9433962.264150944,path,0

I've attached the debug file again, in case it's useful. There's much more output than when I use the cv command and average over the folds.
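As an aside: 0.05 / 5.3e-9 comes out to roughly 9433962.264, i.e. the ceiling matches 0.05 divided by the mutation rate I'm passing in, which would be consistent with a hard-coded threshold that scales with 1/mu. I haven't checked the code, so that's just an observation. A quick way to count how many time points in the plot CSV sit exactly at that value, using the column layout shown above:

# field 3 is the y (Ne) column per the header above
awk -F, '$3 == "9433962.264150944"' BW_Est.csv | wc -l
awk -F, '$3 == "9433962.264150944" {print $1}' BW_Est.csv | sort | uniq -c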

Thanks, Brock

BW.est.debug.txt

mariels commented 4 years ago

@twooldridge I would be interested in running bootstrap replicates for my Ne and split-time estimates. Could you please let me know how you proceeded? Thank you very much.
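One possible approach, which is not an official SMC++ feature and may well differ from what was done here, would be to resample the per-chromosome .smc.gz inputs with replacement and rerun the fit on each replicate. A rough sketch (paths follow the commands earlier in this thread, and I'm assuming estimate takes the same mu/--timepoints arguments as cv):

# sketch only: resample chromosome files with replacement for each bootstrap replicate
CHRS=(smc_input/new/BK.chr*.masked.smc.gz)
N=${#CHRS[@]}
for REP in $(seq 1 20); do
  FILES=()
  for i in $(seq 1 "$N"); do
    FILES+=("${CHRS[RANDOM % N]}")   # draw chromosomes with replacement
  done
  # I haven't checked whether smc++ objects to the same file being listed more than once
  smc++ estimate 5.3e-9 "${FILES[@]}" -o estimates/BK/boot_${REP} --timepoints 1e3 5e7
done

But I'd be glad to hear what you actually did.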

bioinfowheat commented 4 years ago

Hello, I am getting the same upper-limit problem, and I find that smc++ cv is not respecting --Nmax at all. Did this ever get resolved? I'm running the latest version from GitHub as of 2 months ago.

BioMatt commented 4 years ago

I am experiencing the same issue with cv not respecting --Nmax in the latest version, as well. Is there a solution out there?

lucasrocmoreira commented 4 years ago

I'm having the same problem; I would really appreciate it if anyone could find a solution for this.

bioinfowheat commented 4 years ago

Morning Lucas,

I never found a solution. Terhorst recommended to me and others to try using estimate rather than cv, in the hope that Nmax would work under those conditions. For me, this didn't really have an effect.

That was about three weeks ago. I also have a PSMC analysis of the same data from last year, and it gives a very different result from the SMC++ run. I've been trying to figure out why, and here are my current thoughts; I'd love to hear how they compare with your system/data/results.

1) I’ve got two populations and I know that each of their Ne should be < 1e6, and that they should have responded to the last glacial maximum differently, and should be dramatically different in current Ne.

the old PSCM analysis that was run showed exactly what I’d expect, while the smcpp run has Ne > 1e7 with ~current Ne pretty off.

when I put everything together with a bit of pondering, I’ve got some confounding effects: 1) the PSMC dataset was filtered such that each individual was filtered for SNPs separately. 2) the smcpp dataset was filtered jointly, such that only shared high quality sites were considered.

The two populations have likely hybridized in the last 7k generations. So my current thinking is this:

1) the PSMC run is working well because I'm using each population's own SNPs to assess the SFS; 2) the SMC++ run is giving strange results because I'm giving it only the SNPs that remain after joint filtering, so the population-specific SNPs are removed and only the shared SNPs remain.

Thus, I think the SMC++ dataset is enriched for SNPs resulting from the introgression, which is dramatically boosting my Ne.

When I have time, I'm going to go back and use the old dataset from the PSMC run as input to SMC++ to confirm this. I'm also going to do a new round of dataset generation where I explicitly filter at the population level; a rough sketch of what I mean is below.
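(Population names, sample names, filter thresholds, and paths in this sketch are all placeholders; the vcf2smc usage follows the SMC++ README.)

# filter each population's VCF separately, then convert per chromosome
for CHR in chr1 chr2 chr3; do
  bcftools view -S pop1.samples.txt -r ${CHR} all_samples.vcf.gz \
    | bcftools view -i 'QUAL>=30' -Oz -o pop1.${CHR}.vcf.gz
  bcftools index -t pop1.${CHR}.vcf.gz
  smc++ vcf2smc pop1.${CHR}.vcf.gz smc_input/pop1.${CHR}.smc.gz ${CHR} pop1:IND1,IND2,IND3
done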

In sum, I think SMC++ is giving back the right results for the samples I'm giving it, since the dataset is likely enriched for SNPs that reflect the introgression and thus should show a much higher Ne. Stated another way, because the SMC++ dataset is depleted of the within-population SNPs that don't carry the effects of introgression, my Ne isn't as small as it should be.

I should further state that when I look at Fst between the populations using PoolSeq data, much of the genome is low Fst, but there are many regions with very high Fst (these high-Fst regions are likely getting filtered out of the SMC++ run). Also, one of the populations is rather divergent from my reference, and its unique alleles are likely getting dropped during mapping, as divergent alleles likely don't map as well. Thus I need to use a mapper such as NextGenMap that is robust to divergence (i.e. will keep these high-Fst regions).

I hope this all makes sense and has something in common with what you're experiencing. If so, this might open an opportunity to write a short paper about these methods in light of introgression/mapping/filtering, as quantifying Ne estimates in light of these effects would benefit the community (and give us a chance to publish on our bioinformatic adventures).

Cheers, Chris


twooldridge commented 4 years ago

Hi Chris,

Thank you for the very informative and detailed reply. After reading this, I suspect that some form of gene flow might also be influencing the Ne cap I'm seeing. Regardless of the cause, I still haven't found a reason why both the estimate and cv commands show the same effect. I'd like to hear if anyone has found an effective way of dealing with this. Thanks!

Best, Brock