stschiff / msmc-tools

Tools and Utilities for msmc and msmc2
44 stars 17 forks source link

Two subspecies populations don't coalesce further back in the past #50

Open kaede0e opened 1 year ago

kaede0e commented 1 year ago

Hello, I am facing a very unexpected result from MSMC2 and am very confused about what happened. Any insights would be helpful! Lingonberry_cross_coalescence_refgenome_minus.pdf Effective population size looked like this, which was also unexpectedly low on the recent years: Lingonberry_effective_population_size_refgenome_minus.pdf

I have a diploid plant species which has two identified subspecies based on their geographical origin: European and North American. My inputs are one individual per subspecies, and am treating these as two populations (paternal and maternal haplotype from each individual but unphased genome assembly). I want to know when the two subspecies/populations split in time, and so I performed the cross-coalescence analysis, assuming four haplotypes (2 from each subspecies) in total.

I've mapped short-read whole genome sequencing data of ~14X and ~32X coverage, respectively, on the North American genome assembly as the reference genome, called variants, made input mask files for genomic repeats and mappable regions based on sequencing coverage, and performed MSMC2 as follows:

#4. Single population - effective population size estimate with MSMC2. 
for chr in `cat Lingonberry_ragtag.scaffold.chrnames.txt`;
do
    generate_multihetsep.py --chr $chr \
    --mask $INDIR/Lingonberry_minus_F122991_paired_reads_${chr}.mask.bed.gz \
    --mask $INDIR/Lingonberry_minus_F122990_paired_reads_RedCandy_${chr}.mask.bed.gz \
  --negative_mask $INDIR/Lingonberry_minus_excluded_regions.mask.bed \
    $INDIR/Lingonberry_minus_F122991_paired_reads_${chr}.vcf.gz \
    $INDIR/Lingonberry_minus_F122990_paired_reads_RedCandy_${chr}.vcf.gz \
    > $OUTDIR/Lingonberry_minus_RedCandy_chr*.multihetsep.txt
done

msmc2 -t 1 -p 1*2+16*1+1*2 -o $OUTDIR/Lingonberry_minus.msmc2 -I 0,1 \
$OUTDIR/Lingonberry_minus_RedCandy_chr*.multihetsep.txt 

msmc2 -t 1 -p 1*2+16*1+1*2 -o $OUTDIR/Lingonberry_RedCandy.msmc2 -I 2,3 \
$OUTDIR/Lingonberry_minus_RedCandy_chr*.multihetsep.txt

#5. Population separation history - estimate of split between two populations
msmc2 -t 1 -I 0-2,0-3,1-2,1-3 -s -p 1*2+16*1+1*2 -o $OUTDIR/Lingonberry_minus_RedCandy.msmc2 \
$OUTDIR/Lingonberry_minus_RedCandy_chr*.multihetsep.txt

#6. Combine results for two population coalescence estimate
combineCrossCoal.py $OUTDIR/Lingonberry_minus_RedCandy.msmc2.final.txt $OUTDIR/Lingonberry_minus.msmc2.final.txt \
    $OUTDIR/Lingonberry_RedCandy.msmc2.final.txt > Lingonberry_minus_RedCandy.msmc2.combined.msmc2.final.txt

Do you spot a critical mistake on the code? Why does my cross-coalescence plot look like this?

Thank you, Kaede

stschiff commented 1 year ago

Is the plant selfing? Could the extremely low population size be due to long blocks of homozygosity? That would then also screw up the cross coalescence rate

kaede0e commented 1 year ago

Hmm the plant can self, but the main reproductive strategy is known to be outcrossing in nature... and that's why the results were unexpected. Our (draft) heterozygosity genome-wide plot looks like this: RedCandy_percent_het_masked_vcf.pdf minus_percent_het_masked_vcf.pdf There could be some runs of homozygosity where the heterozygote calls are reduced but I am not convinced yet... The gaps with no points are missing data, when there is no alleles called within that 100kb window.

If the two subspecies are both showing enough sign of inbreeding/homozygosity, could you please give me a brief explanation of why that could screw up the cross coalescence analysis? (I am very new to popgen models, sorry for taking a step-back!)

Thank you for your help.