popgenmethods / smcpp

SMC++ infers population history from whole-genome sequence data.
GNU General Public License v3.0
149 stars 34 forks source link

Partially selfing organisms #82

Closed cstritt closed 5 years ago

cstritt commented 6 years ago

Hi Jonathan, I am studying a partially selfing plant species (typical selfing rate ~ 0.7) and am wondering how to best pre-process the SNP data for using SMC++.

At present I see three options: 1) Treat individuals as haploid and combine random individuals into fake diploids, as people do when using MSMC with the highly selfing Arabidopsis thaliana. All information on individual heterozygosity is lost in this case, which seems a pity because heterozygosity is substantial in our organism.

2) Create fake diploids by first phasing and then randomly combining phased chromosomes. This way all the variable sites would be used, at the cost of introducing phasing artefacts.

3) Leave the data as they are. As far as I know accounting for selfing in the ARG requires a rescaling of time in units of 2N/(1+F) and replacing the recombination rate R by R(1-s). Could this be done with a few changes in the source code, or are there complications I ignore?

Thanks! Christoph

mishaploid commented 6 years ago

Hi Jonathan and Cristoph,

Any updates on this? We have a similar problem and have tried options 2 and 3 (without corrections for time and recombination rate). The results are super strange when we use the fake diploids (some populations end up at Ne=1), but we may also be having issues with long runs of homozygosity in the data.

Thanks!

Best, Sarah

terhorst commented 6 years ago

The recombination rate is estimated by default from the data, so option 3) makes the most sense to me. Post hoc, you would need to manually rescale the estimated Ne by whatever your estimate for F is. I have not ever tried running SMC++ (or any other demographic inference method) on selfing organisms, so I would also advise verifying that this produces reasonable estimates on simulated data.

oushujun commented 6 years ago

Hi Jonathan and Christoph,

For option 3, could you confirm if F stands for selfing rate and s stands for selection coefficient?

My study organism has very high selfing rate (>90%) and homozygosity. To control for false mapping rate, basically, I throw away all heterozygous with some other filtering. The input data is phased and imputed so no missing data in the input vcf files.

I installed the SMC++ v1.14.0.dev0 via conda as instructed in the Readme.

Then, I encounter some warnings like this when running smc++ estimate: 19996 smcpp.estimation_tools WARNING Long runs of homozygosity in contig Chr11.Or1.smc.gz: [552378 81563 244946] (base pairs)

I believe these are high-confident homozygosities but not artifacts from missing data. In this case should I worry about this kind of wanrings?

Thanks! Shujun

cstritt commented 6 years ago

Hi Shujun, F stands for the inbreeding coefficient and s for the selfing rate. The rescalings above are from Nordborg & Donnelly 1997 and Nordborg 2000, where the coalescent with recombination and selfing is very well explained. Best, Christoph

oushujun commented 6 years ago

Thanks @cstritt !!

oushujun commented 5 years ago

@cstritt Based on Nordborg 2000, you just need to use the "effective mutation rate" instead of mutation rate to do smc++ estimate: mu_eff = mu/(1 + F), where F stands for inbreeding coefficient. As @terhorst mentioned, recombinatin rate will be estimated based on data, so you may not abe able to use the "effective recombination rate" here. Instead, you may use the original mu to estimate, then scale the resulting generation (or time) with Time_scaled = Time_ori (1 + F*).

terhorst commented 5 years ago

Closing as I think this has been resolved.

attrna commented 4 years ago

Hi!

@oushujun, do you recommend to do both of these scalings (the effective mutation rate and the generation time) or only one of them? I'm studying a population of selfers (C. elegans). If to do a scaling of the mutation rate, should I also adjust the recombination rate (e.g., set it to ~0 with "-r")?

Thank you! Anastasia

oushujun commented 4 years ago

Hi @attrna, based on my understanding, you need to scale both mutation rate and the generation time with F. Since the recombination rate is estimated by the program (at lease for the version I used), I don't know how to scale it but I think scaling the generation time at the end should work. As @terhorst suggested, a confirmation case can be made by stimulation and yet I have not done that...

attrna commented 4 years ago

Thank you!