stevemussmann / BayesAss3-SNPs

Modification of BayesAss 3.0.4 to allow handling of large SNP datasets
GNU General Public License v3.0
15 stars 7 forks source link

Acceptance rate output question #17

Closed githubgig closed 3 months ago

githubgig commented 1 year ago

Hello,

I did a run with 10M generations and 1M burnin, and I'm trying to adjust parameters for better acceptance rates. I have the following output:

Line 1: logP(M): -150.64 logL(G): -3384698.26 logL: -3384848.90 % done: [0.00] % accepted: (0.26, 0.00, 0.23, 0.00, 0.00)

Line 1010: logP(M): -160.69 logL(G): -3614923.20 logL: -3615083.89 % done: (1.00) % accepted: (0.36, 0.00, 0.33, 0.22, 0.00)

My questions are:

  1. Why are there 1010 lines and what decides this number?
  2. Should the acceptance rate be optimal (20% to 60%) from the very first line or is it expected to happen by the last line?
  3. Lastly, all logP(M) lines show individual migrant ancestries as 0. Is this normal?

Thank you.

stevemussmann commented 1 year ago

The user manual for the original BA3 covers the outputs in detail. In particular, sections 5.1 onwards will be helpful: https://github.com/brannala/BA3/blob/master/doc/BA3Manual.pdf

In summary, 1) I'm not sure which output file you're looking at to answer this question. The number of lines in the mcmc trace file should be a function of the number of generations requested divided by your mcmc thinning interval. Others are going to depend upon the number of samples and loci input to the program. 2) In my experience the acceptance rate usually doesn't fluctuate much throughout the run once you find optimal mixing parameters. However, it's possible that you will never find optimal mixing parameters (see user manual linked above). 3) These lines aren't telling you the individual migrant ancestries. The value you see is the acceptance rate for the individual migrant ancestries. Unfortunately that's one of the acceptance rates that cannot be adjusted (see user manual section 5.2 for more information).

-Steve

githubgig commented 1 year ago

Hi Steve,

Thank for the reply. I have already read the manual.

With regards to 1 and 2: I'm looking at the screen output which has 1010 lines of lopP(M) (see attached file). The inbreeding co-efficient steadily increases from 0 to 22% from line 1 to line 1010. If the acceptance rates shouldn't fluctuate much throughout the run, I'm guessing I need to adjust the parameters so that all lines should have more or less optimal rates from line 1 to line 1010. Is my understanding correct?

Thanks. Screen output.log

stevemussmann commented 1 year ago

Thanks for posting the file.

That's odd that the acceptance rate for for the inbreeding co-efficient changes so much (0 to 22%) over the course of the run. I have run thousands of files through this program and always suspected something like that could happen, but I haven't actually seen it happen. To me it looks like the MCMC still hasn't reached convergence at 10M generations, which is a bit surprising (negative log likelihood values are still decreasing).

Looking at your output I have a couple of suspicions about what might have caused this.

1) 43k loci is a lot. Running that many loci is probably more detrimental to program runtime than it is beneficial to population resolution. I usually recommend people reduce their datasets down to at most a few thousand informative loci, which brings me to my next point... 2) In the bit of the output I pasted below, the first two loci appear to be monomorphic (i.e., locus0:1 and locus1:1). The number after the locus name indicates the number of alleles detected at the locus. Those loci shouldn't influence results either way, but they are wasting compute cycles and slowing down your run.

Individuals: 66 Populations: 4 Loci: 43134 Missing genotypes: 0

Locus:(Number of Alleles)

locus0:1 locus1:1 locus2:3 locus3:3 locus4:4 locus5:4 locus6:4 locus7:4 locus8:4 locus9:4 
locus10:4 locus11:4 locus12:4 locus13:3 locus14:4 locus15:4 locus16:4 locus17:3 locus18:4 locus19:4 
locus20:4 locus21:3 locus22:3 locus23:5 locus24:4 locus25:4 locus26:4 locus27:5 locus28:4 locus29:5 

3) I'm also a bit surprised that the program reported 0 missing genotypes among 43,134 loci. I'm wondering if something went wrong during file conversion to make the input immanc file? I'm seeing some loci show 5 alleles (e.g., locus27, locus29) and I'm wondering if that fifth allele is actually a missing data value that was converted to a value interpreted by BA3 as an allele? If you have missing data that is being erroneously considered as actual genotype data, I'm wondering if that could cause problems with MCMC convergence? Especially if those data are missing randomly rather than as a consequence of a real biological pattern among populations?

githubgig commented 1 year ago

Hi Steve,

Yes, tracer plots show that some of them haven't converged at 10M.

  1. What method do you recommended I use for filtering informative loci?
  2. I checked the file, and it so happens that there are only two monomorphic loci in the entire file (locus0:1 and locus1:1). I will remove these in any case.
  3. I just noticed missing data are coded using -9 instead of 0, but this is due to the original file used for conversion to immanc. I'll rectify this.

Thanks

stevemussmann commented 1 year ago

I don't know what your actual missing data per locus is, but if you have high missing data at some loci you can probably start there. You can probably also filter by minor allele frequency. If you know the locations of your SNPs in your genome I would also filter by distance between SNPs to ensure you're getting independent loci.

stevemussmann commented 3 months ago

I'm closing this issue since it seems to no longer be active. Feel free to re-open, or open a new issue if you experience problems.

-Steve