kr-colab / diploSHIC

feature-based deep learning for the identification of selective sweeps
MIT License

diploSHIC detects too many soft / softlinked regions in the genome #42

Open melop opened 2 years ago

melop commented 2 years ago

I first estimated rho, the gene conversion rate, and tract lengths using mlrho, following the method in Lynch et al. (2014; Genetics), from my data consisting of 20 unphased diploid individuals (40 chromosomes). The demographic history was estimated with MSMC2 from 10 unphased individuals, with the very recent and ancient histories trimmed since they are not accurately inferred. The per-base pi for our species is only around 0.001, which resulted in very few segregating sites in a 2 kb subwindow (the size used in the Anopheles example), so I used a 20 kb subwindow size instead. After simulating with discoal and training, the confusion matrix (attached) shows that the false positive rate for calling neutral as non-neutral (mainly soft and soft-linked) is about 12%. Applying the trained model to the real data, out of 29253 20 kb subwindows, diploSHIC classified 8674 as soft, 20217 as linkedSoft, 46 as hard, 234 as linkedHard, and only 82 as neutral. These numbers don't look real and may suggest a mismatch between the simulated data and my real dataset. I would appreciate any pointers.
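The subwindow-size reasoning above can be sanity-checked against the Watterson expectation for the number of segregating sites. A minimal sketch (it approximates the per-base theta by the reported per-base pi, which holds only under neutrality):

```python
def expected_seg_sites(theta_per_bp, window_len, n_chroms):
    """Watterson's expectation E[S] = theta * L * a, where
    a = sum_{i=1}^{n-1} 1/i for a sample of n chromosomes."""
    a = sum(1.0 / i for i in range(1, n_chroms))
    return theta_per_bp * window_len * a

# with pi ~ 0.001 and 40 chromosomes, a 2 kb subwindow is expected to
# hold only a handful of segregating sites; 20 kb holds ten times more
s_2kb = expected_seg_sites(0.001, 2_000, 40)
s_20kb = expected_seg_sites(0.001, 20_000, 40)
```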

andrewkern commented 2 years ago

hey there-- can you share your complete simulations parameters for soft sweeps? i.e. the command line you are using. thanks.

melop commented 2 years ago

Sure, here is simsoft.sh:

```shell
mkdir -p exampleApplication
cd exampleApplication

arrSelPos=( 0.045454545454545456 0.13636363636363635 0.22727272727272727 0.3181818181818182 0.4090909090909091 0.5 0.5909090909090909 0.6818181818181818 0.7727272727272727 0.8636363636363636 0.9545454545454546 )

for ((i = 0; i < ${#arrSelPos[@]}; ++i)); do
    nPos=${arrSelPos[$i]}
    if [ -f soft${i}.msOut.gz.diploid.fvec ]; then
        echo skip
    else
        discoal 40 2000 220000 -Pt 120.26852 360.80556 -Pre 147.7131988 1477.131988 \
            -g 0.933432294 8674.13043355481 \
            -en 7.09013424e-08 0 1 -en 0.01312205314 0 0.837882528616645 -en 0.01763753208 0 0.851573738285996 \
            -en 0.023312474794 0 0.969786293294031 -en 0.030444590092 0 1.15371613098542 -en 0.039408323402 0 1.27013796790701 \
            -en 0.050673697942 0 1.2469086992679 -en 0.06483192408 0 1.11588014250159 -en 0.082625764718 0 0.954359530865228 \
            -en 0.1049892298 0 0.816377171215881 -en 0.13309458778 0 0.720400126023945 -en 0.16841721218 0 0.665880382949292 \
            -en 0.21281103774 0 0.647578184186863 -en 0.26860377236 0 0.661351960114393 -en 0.33872402928 0 0.704939323054157 \
            -en 0.42685041014 0 0.777986177565125 -en 0.53760540442 0 0.882390680398447 -en 0.67680233684 0 1.02107754482327 \
            -en 0.85174261886 0 1.19360795547232 -en 1.0716053768 0 1.39016901750973 -en 1.3479303712 0 1.58673872123383 \
            -en 1.695202439 0 1.74482534958698 \
            -ws 0 -Pa 250 2500 -Pu 0.000000 0.025 -Pf 0 0.2 -x $nPos \
            | gzip -c > soft${i}.msOut.gz &
    fi
done
wait
```

andrewkern commented 2 years ago

hard, soft, and neutral look decently separable to me, eyeballing PCAs from these simulation params. one note: your mean gene conversion tract length above seems very large.


you might next make a PCA plot of summary statistics from your empirical data and compare that to simulations
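A minimal sketch of that comparison in Python (not diploSHIC's own plotting code; it assumes the simulated and empirical feature vectors have already been parsed into numeric arrays with matching columns):

```python
import numpy as np

def pca_sim_vs_empirical(sim, emp, n_components=2):
    """Fit a PCA on the pooled, standardized feature matrix and return
    the low-dimensional projections of simulated and empirical windows."""
    X = np.vstack([sim, emp]).astype(float)
    X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each statistic
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    proj = X @ Vt[:n_components].T             # project onto the top PCs
    return proj[: len(sim)], proj[len(sim):]
```

If the empirical cloud falls largely outside the simulated cloud in this space, the baseline demographic model is likely misspecified.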

melop commented 2 years ago

Thanks for the analysis. I think I might have figured out a potential reason. Our samples appear to have a pretty high coefficient of kinship, distributed around 0.14 as estimated by popkin, but each individual has an inbreeding coefficient near 0. This suggests that the samples are probably related (half sibs?) while their parents are unrelated. I guess the discoal simulations probably represent the parental generation of our samples, but not these samples per se. I am thinking of performing an extra simulation step of drawing blocks from the discoal-simulated haplotypes to form samples, such that the distribution of the coefficient of kinship fits our data, before training with diploSHIC again. Do you have any suggestions for such cases?
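The block-resampling idea described above could be sketched roughly as follows. This is not a diploSHIC or discoal feature, just an illustration under the assumptions that haplotypes are 0/1 site arrays from a discoal simulation and that a fixed block length stands in for recombination:

```python
import numpy as np

def gamete(h1, h2, block_len, rng):
    """Mosaic of fixed-length blocks drawn from two parental haplotypes
    (a crude stand-in for meiotic recombination)."""
    out = np.empty_like(h1)
    for s in range(0, len(h1), block_len):
        src = h1 if rng.random() < 0.5 else h2
        out[s:s + block_len] = src[s:s + block_len]
    return out

def half_sib_pair(parent_haps, block_len, rng):
    """parent_haps: (6, n_sites) array holding three diploid parents
    (shared father = rows 0-1, two mothers = rows 2-3 and 4-5).
    Returns two half-sib diploids, each a (2, n_sites) array."""
    dad = parent_haps[0:2]
    kid1 = np.stack([gamete(dad[0], dad[1], block_len, rng),
                     gamete(parent_haps[2], parent_haps[3], block_len, rng)])
    kid2 = np.stack([gamete(dad[0], dad[1], block_len, rng),
                     gamete(parent_haps[4], parent_haps[5], block_len, rng)])
    return kid1, kid2
```

One would then tune the block length until the popkin kinship distribution of the resampled diploids matches the empirical one.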

jdaron commented 2 years ago

Hi, following the topic that melop opened, I also noticed an overabundance of soft and especially linked-soft classifications in my analysis.

Here is the number of windows classified in each of the 5 categories by diploSHIC for 3 different populations, based on 10 different simulation datasets with 10 replicates per simulation, yielding a total of 100 replicates.

wilding.rawnbPredictionByClass.w110000.reps.pdf

Looking at a specific example, the insecticide resistance gene Gste in Anopheles gambiae shows a similar result. However, I noticed that adding an extra layer of stringency based on the probability of being neutral drastically decreases the amount of soft and linkedSoft present in the region. Below are the different outputs I got after reassigning to neutral any window whose probability of being neutral is higher than the threshold. The vertical bar represents the genomic coordinates of Gste.

wilding.probaFilter.examplesGste.pdf

Based on those observations, in my own case I am thinking of calling a selective sweep only in windows with a probability of being neutral < 0.001. But I was wondering whether my method of using 100 replicates and increasing the stringency for calling a sweep is sound.
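The thresholding rule described above amounts to a small per-window helper. A sketch (not a diploSHIC feature; the class label and probability would come from the columns of diploSHIC's prediction output, whose exact names may differ from what is assumed here):

```python
def reclassify(pred_class, prob_neutral, neutral_cutoff=0.001):
    """Keep a non-neutral call only when the classifier assigns a
    very low probability to the window being neutral; otherwise
    fall back to calling the window neutral."""
    if pred_class != "neutral" and prob_neutral >= neutral_cutoff:
        return "neutral"
    return pred_class
```

Applied per row of the prediction table, this converts marginal soft/linkedSoft calls back to neutral while leaving confident sweep calls untouched.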

andrewkern commented 2 years ago

hi @jdaron-- when i see results like this I again think that the baseline demographic model probably isn't a good fit to your empirical data. have you plotted simulations vs empirical summaries at some point?

jdaron commented 2 years ago

Here is the PCA plot of the simulations, and contrary to what you showed, they don't separate nicely. I also tried to reproduce the same PCA plot with the data from melop and couldn't get the figure you show either. To make sure I am performing the PCA correctly: do you use as input the .fvec data present in the trainingSets folder?

LBVwil train_1 allPredClass pca

Here are the complete simulation parameters I used for soft sweeps:

```shell
discoal 40 2000 110000 -Pt 77.017769 770.177685 -Pre 2117.988635 6353.965905 \
    -en 0.001189 0 0.897678 -en 0.001205 0 0.773860 -en 0.001210 0 0.647825 -en 0.001312 0 0.543280 \
    -en 0.001371 0 0.435829 -en 0.001385 0 0.251009 -en 0.002704 0 0.352406 -en 0.002709 0 0.632063 \
    -en 0.002711 0 0.773860 -en 0.002719 0 0.998812 -en 0.002721 0 1.113136 -en 0.002727 0 1.213178 \
    -en 0.002732 0 1.470345 -en 0.339043 0 1.217539 -en 0.339047 0 1.081744 -en 0.339100 0 0.960444 \
    -en 0.339174 0 0.860333 -en 0.339249 0 0.422776 -en 0.342706 0 0.322176 -en 0.365408 0 0.204408 \
    -en 0.395423 0 0.958280 \
    -ws 0 -Pa 55.012692 2750.634591 -Pu 0 0.050000 -Pf 0 0.100000 -x 0.13636363636363635 > Soft_1.msOut
```

andrewkern commented 2 years ago

i was not using those features (but one could for sure)-- instead i was using other summary stats of the simulations. what i was suggesting however was to plot simulations against the real, empirical data. that will tell you if your simulations are an adequate representation of your data.

jdaron commented 2 years ago

Thanks for your quick answer. Here is the plot of the summary stats for the simulated data (divided into the 5 predClasses) against the empirical data. To me, the simulations for the neutral prediction fit the observed empirical data well. One difference I could spot is Tajima's D: in the empirical data it is centered around 0, suggesting a stable population demography, while in the simulated data the median is a bit below 0. But I am not sure such a difference would trigger my pattern of overrepresented linkedSoft and soft sweeps.

LBVwil simulation_vs_empirical

andrewkern commented 2 years ago

this is nice @jdaron -- any chance you could plot this in PCA space to see a dimensionality reduced version? also i'm a bit concerned about the soft and linked soft simulations here-- how are those producing nDiplos=0?

jdaron commented 2 years ago

Hi @andrewkern here is the PCA plot of the summary stat plotted above.

LBVwil summaryStat pca

As you mentioned, the mean value of the nDiplos stat for soft sweeps is 0.08773807 [+/- 0.0004394302] and ranges from 0.009433962 to 0.117187500, while the nDiplos stat for hard sweeps ranges from 1 to 20. For the empirical data, nDiplos ranges from 2 to 32.

andrewkern commented 2 years ago

okay now i'm more confused! the nDiplos stat you are reporting above is the output from diploSHIC in fvecSim mode, then?

the PCA plot above looks okay to me for hard, hard-linked, and neutral, but something strange happened with your soft sweep simulations and / or feature vector calcs I think. Any chance I could see the fvec files created?

jdaron commented 2 years ago

So I've re-performed some of the analysis and found the issue with the nDiplos stat. As you mentioned, all the stats I am reporting are calculated by diploSHIC fvecSim (makeFeatureVecsForSingleMsDiploid.py), for which I am skipping the normalization by normalizeFeatureVec in order to get the raw values. After a more careful inspection of all the fvecs present in the rawFVFiles folder, I realized the issue in the nDiplos stat was introduced during the creation of the training set by diploSHIC makeTrainingSets. I corrected the issue and now all the stats look normal. Sorry for the trouble.
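A quick way to catch this kind of artifact is to summarize each raw statistic column before any normalization. A sketch assuming a tab-separated feature-vector table with a header row (the column name here is illustrative):

```python
import csv
import io

def column_range(fvec_text, stat):
    """Return (min, max, mean) of one statistic column from the text
    of a tab-separated feature-vector table with a header row."""
    rows = list(csv.DictReader(io.StringIO(fvec_text), delimiter="\t"))
    vals = [float(r[stat]) for r in rows]
    return min(vals), max(vals), sum(vals) / len(vals)
```

Comparing these ranges per class (e.g., an nDiplos near 0 for soft sweeps vs 2-32 empirically) flags the broken class immediately.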

Here are the boxplots of all the stats and the PCA reduction. Based on the PCA, I have the feeling that the neutral simulations do not fit the upper-right part of the empirical data well enough, which may be causing my problem of overrepresented soft/linkedSoft predictions.

LBVwil simulation_vs_empirical LBVwil summaryStat pca

andrewkern commented 2 years ago

hey this looks great @jdaron! glad we were able to find the bug together here. i'm curious what this means for the number of sweeps in the empirical data-- there are clearly a bunch now!

MarySelifanova commented 2 years ago

Hello @andrewkern! I decided to plot simulations against empirical data to understand whether they approximate my data well enough, and my resulting PCA plot (attached) looks really bad. I then looked at boxplots (also attached) of all the features to find which ones ruin the picture. It turned out that the H statistics in the simulations and in the real data are very different. The H statistics in the simulations look strange to me, while in the real data they seem okay. Could you please help me understand what could have gone wrong?

I am working with haploid individuals. In order to do PCA I took files of central windows from outStatsDir for simulations and statFile for the empirical data. Sample size in simulations is (accidentally) 2 haplotypes smaller than in real data. Do I understand correctly that this should not have greatly affected the result?

Thank you!

feature_boxplots.pdf PCA_10percent.pdf

andrewkern commented 2 years ago

Hi @MarySelifanova !

Sorry to be so slow to respond to this. I had missed this message.

Looking at this it looks to me like H and Theta_w are both elevated in the 'real' data vs your training set. I agree the PCA plot looks way off too.

I wouldn't think that the sample size in your sims would greatly affect things here. Generally it looks like you might have to increase mutation rate in your sims?
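On that last point, discoal's -Pt bounds are on the locus-wide theta (following the ms convention), so a higher assumed mutation rate scales them linearly. A sketch of the conversion (the parameter values are illustrative, not recommendations):

```python
def locus_theta(ne, mu, window_len):
    """Locus-wide theta = 4 * Ne * mu * L, the scale on which
    discoal's -Pt prior bounds are specified (ms convention)."""
    return 4.0 * ne * mu * window_len

# e.g., hypothetical Ne = 10,000, mu = 1e-8 per bp per generation,
# and a 110 kb simulated window
theta = locus_theta(1e4, 1e-8, 110_000)
```

Doubling mu doubles this theta, so a suspected under-simulation of diversity translates directly into wider or shifted -Pt bounds.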