kr-colab / diploSHIC

feature-based deep learning for the identification of selective sweeps
MIT License

Hard and Soft sweeps are similar? #18

Closed SBNoor closed 5 years ago

SBNoor commented 5 years ago

I've simulated a dataset using discoal. On the raw data alone, my sweeps seem to be working; for example, the number of segregating sites in hard sweeps is skewed towards the left. However, when I make the feature vectors, the mean normalised values of my summary statistics do not make sense. For example:

[Two screenshots attached, taken 2019-03-08 at 10:08:36 and 10:08:25]

And because my soft and hard sweeps are so similar, the neural network can't learn anything, so the training and test accuracy is very low, i.e. less than 50%.

When simulating the dataset, I don't specify a demographic history, and when making the feature vectors I don't use a mask file, since both were optional. Do you think that could be the reason why my sweeps look so similar? Would adding these help me improve my results? The neutral simulations look fine, though. Could you provide some insights?

andrewkern commented 5 years ago

can you provide representative command lines to discoal that you are using for these simulations? also tagging @dschride on this conversation in case he has bandwidth to help troubleshoot

SBNoor commented 5 years ago

Command line for a hard sweep: ./discoal 60 2000 220000 -Pt 40 400 -Pre 183 550 -ws 0 -Pa 41 833 -Pu 0 0.12 -x 0.13636363636363635

Command line for a soft sweep: ./discoal 60 2000 220000 -Pt 40 400 -Pre 183 550 -ws 0 -Pa 41 833 -Pu 0 0.12 -Pf 0.0 0.2 -x 0.13636363636363635

andrewkern commented 5 years ago

so the figures above look like they make sense (hard sweeps have a deeper valley of polymorphism, for instance). however, it could be that you are exploring a part of parameter space where hard and soft sweeps are hard to tell apart. for instance, you are including quite old sweeps in your distribution of sweep times, ~U(0, 0.12); old sweeps will be very hard to detect, and older hard sweeps can actually look like soft sweeps. Also, you are specifying your mutation rate to be very, very low: theta / bp is 0.001. This means that after a sweep there should be little to no variation at a given locus.
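
The theta / bp figure can be checked with a little arithmetic. The sketch below assumes that the two numbers after -Pt in the command lines above are the bounds of a uniform prior on locus-wide theta, and that the third positional argument (220000) is the number of sites; the function and variable names are illustrative only.

```python
# Per-site theta implied by a locus-wide uniform prior such as "-Pt 40 400".
# Assumptions (not confirmed in this thread): the -Pt bounds are locus-wide
# theta values, and 220000 is the number of sites in the simulated locus.

def per_site_theta(theta_locus: float, n_sites: int) -> float:
    """Convert a locus-wide theta into theta per base pair."""
    return theta_locus / n_sites

N_SITES = 220_000
PT_LOW, PT_HIGH = 40.0, 400.0

theta_low = per_site_theta(PT_LOW, N_SITES)
theta_high = per_site_theta(PT_HIGH, N_SITES)
# The mean of a uniform prior is the midpoint of its bounds.
theta_mean = per_site_theta((PT_LOW + PT_HIGH) / 2, N_SITES)

print(f"theta/bp range: {theta_low:.2e} to {theta_high:.2e}")
print(f"theta/bp at the prior mean: {theta_mean:.2e}")
```

Under these assumptions the prior mean corresponds to a theta / bp of about 0.001, which matches the value quoted above.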

SBNoor commented 5 years ago

Thank you for your response. It seems I haven't understood how the parameters are scaled. I want my mutation rate per bp per generation to be 1.5e-8, with a current population size of 17000. Could you tell me how this parameter is scaled, then? Also, does sample size in discoal refer to the number of individuals or the number of haplotypes?

andrewkern commented 5 years ago

sorry to be slow in responding here. sample size refers to the number of haplotypes simulated. It looks like you are specifying theta correctly if indeed you wish theta/bp to be 0.001. This is quite low.
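
As a rough check on the scaling, here is a minimal sketch using the numbers given in this thread (population size 17000, mutation rate 1.5e-8 per bp per generation, 220000 sites, sweep-time prior upper bound 0.12). It assumes the usual coalescent conventions, which is how I understand discoal's documentation: locus-wide theta = 4 * N * u * L, and times measured in units of 4N generations. The helper names are hypothetical.

```python
# Coalescent scaling sketch. Assumed conventions (please verify against the
# discoal manual): theta = 4 * N * u * L for the whole locus, and sweep
# times (tau) expressed in units of 4N generations.

def locus_theta(pop_size: int, mu_per_bp: float, n_sites: int) -> float:
    """Locus-wide population mutation rate, theta = 4 * N * u * L."""
    return 4 * pop_size * mu_per_bp * n_sites

def tau_to_generations(tau: float, pop_size: int) -> float:
    """Convert a time in units of 4N generations into generations."""
    return tau * 4 * pop_size

N = 17_000        # current population size (from the thread)
MU = 1.5e-8       # mutation rate per bp per generation (from the thread)
L = 220_000       # number of sites (third discoal positional argument)
TAU_MAX = 0.12    # upper bound of the -Pu prior in the command lines

theta = locus_theta(N, MU, L)
print(f"locus-wide theta: {theta:.1f}")
print(f"theta per bp:     {theta / L:.5f}")
print(f"oldest sweep:     {tau_to_generations(TAU_MAX, N):.0f} generations ago")
```

Under these assumptions the locus-wide theta comes out near 224, which sits inside the -Pt 40 400 prior, and the oldest sweeps in the -Pu prior completed roughly 8000 generations ago.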

SBNoor commented 5 years ago

It's alright. But in the article 'Soft Sweeps Are the Dominant Mode of Adaptation in the Human Genome', the command line used for the PEL population is more or less the same. There, demographic history is taken into account; for now, I don't want to take demographic history into consideration. Also, SHIC (a random forest variant) is used there to make predictions. I expected that if these parameters work for SHIC, then similar parameters would work in diploSHIC. Or is that not necessarily the case?

andrewkern commented 5 years ago

PEL is an extremely bad example to be working off of. As we show in that paper, there is very little power to detect sweeps in that population.

SBNoor commented 5 years ago

Fair enough. Thank you.