popgenmethods / smcpp

SMC++ infers population history from whole-genome sequence data.
GNU General Public License v3.0
149 stars 34 forks source link

Confusion about the input of `split` #189

Open QianXiaobo opened 3 years ago

QianXiaobo commented 3 years ago

Hi terhorst,

I am a little confused whether data/*.smc.gz consisting of tag1, 2, 3, and 4 in the picture, total four (multiply by the numble of chromosomes) files or not? Or data/*.smc.gz just includes data/pop12.chr*.smc.gz and data/pop21.chr*.smc.gz?

image

Hope for your reply.

Best wishes, Xb

pstokespmb commented 2 years ago

@QianXiaobo were you ever able to figure this out?

Callithrix-omics commented 1 year ago

yeah this is a good question. I was wondering about this too... did anyone ever get a solution? right now I have been running split as

smc++ split -o split/ pop1/model.final.json pop2/model.final.json data/pop1_pop2*.smc.gz
smc++ split -o split/ pop2/model.final.json pop1/model.final.json data/pop2_pop1.*.smc.gz

as an additional comment, I would expect the resulting plots to be mirror images of each other but there are differences. how are others structuring their split command?

pstokespmb commented 1 year ago

@Callithrix-omics, were you able to reach a solution? I am just now looping back to this analysis with a rebuilt dataset.

@terhorst, could you provide some insight on this? The smc++ writeup suggests 4 SFS inputs (x #chrom). 1dSFS for each population, and 2dSFS for each population-order-combination. But in the issues section, it seems hazy still

Callithrix-omics commented 1 year ago

@pstokespmb I ran it both ways and essentially got the same results

pstokespmb commented 1 year ago

@Callithrix-omics Thanks for the quick reply! Much appreciated. I updated my question since your response here. Did you end up inputting all 4 sources of vcf2smc output? Or just the input of the estimate models and one set of 2dSFS vcf2smc output?

Callithrix-omics commented 1 year ago

@pstokespmb I tried it both ways with the one run with the set of 2dSFS vcf2smc and then another separate run with the 4 sources and the results came out the same for me. so I guess same difference. I did run each version of split 3 times and there was slight variability between runs but nothing major

pstokespmb commented 1 year ago

@Callithrix-omics I REALLY appreciate you for this! Thank you :).

I do think this should be more explicitly stated in the readme if possible, @terhorst

Callithrix-omics commented 1 year ago

yeah I agree with you @pstokespmb. good luck with your analyses.

ndussex commented 3 weeks ago

Hi all,

I may have to follow up on your question here as I am also confused.

I initially ran the analyses as suggested in the documentation (with the exception that I only looped over the chromosomes) such as:

##########################

1. Run smc++ vcf2smc

##########################

mkdir results_divergence myvcf='myfile.vcf.gz'

Pop1='Pop1:Pop1_ind1, Pop1_ind2, Pop1_ind3' LINES=$(cat 20_chrom_list.txt) for i in $LINES do smc++ vcf2smc "$myvcf" resultsdivergence/"Pop1.smc"$i".gz" "$i" "$Pop1" --ignore-missing -c 1000000 done

Pop2='Pop2:Pop2_ind1, Pop2_ind2, Pop2_ind3' LINES=$(cat 20_chrom_list.txt) for i in $LINES do smc++ vcf2smc "$myvcf" resultsdivergence/"Pop2.smc"$i".gz" "$i" "$Pop2" --ignore-missing -c 1000000 done

####################################

2. make joint SFS

####################################

Pop1

LINES=$(cat 20_chrom_list.txt) for i in $LINES do smc++ vcf2smc "$myvcf" results_divergence/"Pop1Pop2.smc"$i".gz" "$i" "$Pop1" "$Pop2" --ignore-missing -c 1000000 done

Pop2

LINES=$(cat 20_chrom_list.txt) for i in $LINES do smc++ vcf2smc "$myvcf" results_divergence/"Pop2Pop1.smc"$i".gz" "$i" "$Pop2" "$Pop1" --ignore-missing -c 1000000 done

####################################

3. estimate each pop. marginally and 4. marginal estimate of joint demography

####################################

smc++ estimate --base Pop1 -o Pop1_out/ 1.06e-8 Pop1. -c 100000 --polarization-error 0.5 --ftol 1e-7 --xtol 1e-7 --thinning 1300 --em-iterations 1000 smc++ estimate --base Pop2 -o Pop2_out/ 1.06e-8 Pop2. -c 1000000 --polarization-error 0.5 --ftol 1e-7 --xtol 1e-7 --thinning 1300 --em-iterations 1000

joint estimation

smc++ split -o Pop1_Pop2_out Pop1_out/Pop1.final.json Pop2_out/Pop2.final.json .smcgz -c 1000000 --thinning 1300 --em-iterations 10000 --regularization-penalty 6 --polarization-error 0.5 --ftol 1e-7 --xtol 1e-7 smc++ plot Pop1_Pop2_joint_years.pdf Pop1_Pop2_out/model.final.json -g 6 --xlim 1 500000 -c

Howver, now I wonder if I should also loop over my samples, such as:

##########################

1. Run smc++ vcf2smc

##########################

Pop1='Pop1:Pop1_ind1, Pop1_ind2, Pop1_ind3'

for c in {Chrom1, Chrom2} do for s in {Pop1_ind1, Pop1_ind2, Pop1_ind3}; do smc++ vcf2smc "$myvcf" resultsdivergence/"Pop1.smc"$c"_"$s".gz" $c -d $s $s "$Pop1" --ignore-missing -c 1000000 done done

Pop2='Pop2:Pop2_ind1, Pop2_ind2, Pop2_ind3' for c in {Chrom1, Chrom2} do for s in {Pop2_ind1, Pop2_ind2, Pop2_ind3}; do smc++ vcf2smc "$myvcf" resultsdivergence/"Pop2.smc"$c"_"$s".gz" $c -d $s $s "$Pop2" --ignore-missing -c 1000000 done done

####################################

2. make joint SFS

####################################

for c in {Chrom1, Chrom2} do for s in {Pop1_ind1, Pop1_ind2, Pop1_ind3, Pop2_ind1, Pop2_ind2, Pop2_ind3}; ## not sure I need the samples for pop2 or not do smc++ vcf2smc "$myvcf" results_divergence/"Pop1Pop2.smc"$c"_"$s".gz" $c -d $s $s "$Pop1" "$Pop2" --ignore-missing -c 1000000 done done

for c in {Chrom1, Chrom2} do ffor s in {Pop1_ind1, Pop1_ind2, Pop1_ind3, Pop2_ind1, Pop2_ind2, Pop2_ind3}; ## not sure I need the samples for pop1 or not do smc++ vcf2smc "$myvcf" results_divergence/"Pop2Pop1.smc"$c"_"$s".gz" $c -d $s $s "$Pop2" "$Pop1" --ignore-missing -c 1000000 done done

####################################

3. estimate each pop. marginally and 4. marginal estimate of joint demography

####################################

smc++ estimate --base Pop1 -o Pop1_out/ 1.06e-8 Pop1. -c 100000 --polarization-error 0.5 --ftol 1e-7 --xtol 1e-7 --thinning 1300 --em-iterations 1000 smc++ estimate --base Pop2 -o Pop2_out/ 1.06e-8 Pop2. -c 1000000 --polarization-error 0.5 --ftol 1e-7 --xtol 1e-7 --thinning 1300 --em-iterations 1000

joint estimation

smc++ split -o Pop1_Pop2_out Pop1_out/Pop1.final.json Pop2_out/Pop2.final.json .smcgz -c 1000000 --thinning 1300 --em-iterations 10000 --regularization-penalty 6 --polarization-error 0.5 --ftol 1e-7 --xtol 1e-7 smc++ plot Pop1_Pop2_joint_years.pdf Pop1_Pop2_out/model.final.json -g 6 --xlim 1 500000 -c

your help would be greatly appreciated! :)