kr-colab / diploSHIC

feature-based deep learning for the identification of selective sweeps
MIT License

Suggestions to improve training accuracy? #10

Closed: oushujun closed this issue 5 years ago

oushujun commented 5 years ago

I followed the mosquito guide; simulations were done using population parameters estimated from my data.

Below is an example of my hard sweep simulation (soft sweep and neutral were similar):

```
discoal 200 2000 55000 -Pt 89.43 894.3 -Pre 556.6 1669.8 -Pa 2000 200000 -Pu 0.0000 0.025 -ws 0 -en 0.0001 0 0.5 -en 0.0016 0 0.2 -en 0.0032 0 0.01 -en 0.006 0 0.03 -en 0.04 0 10 -x 0.22727272727272727 > hard_2_msOut
```

However the training accuracy was very low:

```
Epoch 00007: val_acc did not improve from 0.35100
Epoch 00007: early stopping
total time spent fitting and evaluating: 2713.370000 secs
evaluation on test set:
diploSHIC loss: 1.417434
diploSHIC accuracy: 0.335000
```

Then I thought maybe the severe bottleneck is erasing the selection signature in such a short sequence (5 kb per window), so I scaled the window length up to 30 kb (which eats up to 1200 GB of memory, by the way). Sample command for a hard sweep:

```
discoal 200 2000 330000 -Pt 536.58 5365.8 -Pre 3339.6 10018.8 -Pa 2000 200000 -Pu 0.0000 0.025 -ws 0 -en 0.0001 0 0.5 -en 0.0016 0 0.2 -en 0.0032 0 0.01 -en 0.006 0 0.03 -en 0.04 0 10 -x 0.5 > hard_5_msOut2
```

But the training accuracy does not improve much:

```
Epoch 00013: val_acc did not improve from 0.38400
Epoch 00013: early stopping
total time spent fitting and evaluating: 7415.870000 secs
evaluation on test set:
diploSHIC loss: 1.377291
diploSHIC accuracy: 0.392000
```

Since the time range I am simulating is very large (-Pu 0.0000 0.025, i.e., up to 10K generations), could it be that 2000 simulation replicates are not enough to sufficiently cover this range during training? I am now running 20,000 and 50,000 replicates in two independent trials but have not gotten the results yet. Could you provide some suggestions or insights?
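To get a feel for how thinly the replicates spread over the sweep-time prior, one can histogram uniform draws over the -Pu range. This is a minimal sketch of the coverage question, not anything from diploSHIC or discoal; the bin count and function name are my own choices:

```python
import random

def time_coverage(n_reps, n_bins=25, u_max=0.025, seed=1):
    """Histogram of sweep-completion times drawn uniformly on
    [0, u_max], mimicking discoal's -Pu 0.0 0.025 prior, to see
    how many replicates land in each slice of the time range."""
    rng = random.Random(seed)
    counts = [0] * n_bins
    for _ in range(n_reps):
        t = rng.uniform(0.0, u_max)
        counts[min(int(t / u_max * n_bins), n_bins - 1)] += 1
    return counts
```

With 2000 replicates and 25 time slices, each slice gets only ~80 examples on average, so the network sees relatively few sweeps of any particular age; whether that is limiting depends on how quickly the signature decays over the range.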

Thanks, Shujun

andrewkern commented 5 years ago

Rather than starting by training a classifier, you should start by examining your simulations: do the scenario and input parameters yield a significant, observable effect on patterns of diversity? For instance, what is the average loss in heterozygosity associated with sweeps as you have specified them?
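One quick way to run this check is to compute pi per subwindow from the simulated haplotypes. A minimal sketch, not diploSHIC's own feature code; the 0/1 haplotype-string encoding and the 11-window split are assumptions borrowed from the ms-style output and the S/HIC window layout:

```python
def window_pi(haplotypes, positions, n_windows=11):
    """Per-window nucleotide diversity from 0/1 haplotype strings and
    ms-style relative positions in [0, 1).  pi at a site is
    2*p*(1-p)*n/(n-1); contributions are summed within each window."""
    n = len(haplotypes)
    pi = [0.0] * n_windows
    for j, pos in enumerate(positions):
        w = min(int(pos * n_windows), n_windows - 1)
        p = sum(h[j] == "1" for h in haplotypes) / n
        pi[w] += 2.0 * p * (1.0 - p) * n / (n - 1)
    return pi
```

Applied to hard sweep replicates, the window containing the selected site (the -x target) should show a pronounced dip relative to the flanks; if all 11 windows are equally depressed, the classes are unlikely to be distinguishable by any classifier.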

oushujun commented 5 years ago

That is a good point! I checked the hard sweep simulation and found that all 11 windows have pi dropped down to 0.1, whereas in the mosquito example that is only the case for the selected window. I used the concept of an "effective mutation/recombination rate" = (u or rho)/(1+F), where F is the inbreeding coefficient, to mimic selfing, so presumably the population has almost zero heterozygosity. I will try a more recent time range and see whether the simulated signals are distinguishable. Thanks for the insight!

Shujun