kr-colab / diploSHIC

feature-based deep learning for the identification of selective sweeps
MIT License

Encountering the problem of too few SNPs #27

Closed by jtxtina 4 years ago

jtxtina commented 4 years ago

After training with the defaults (totalPhysLen: 1100000, numSubWins: 11), I tried to predict on a human VCF file containing a single sample. During the VCF-to-fvec step, however, the human chromosomes seem to have too few SNPs, so there are few "good" subwindows to write statistics for, and the resulting diploid.fvec file is empty. Any suggestions for improving this?

andrewkern commented 4 years ago

> I tried to predict using a single sample from human vcf file.

Do you mean a single human genome here?

jtxtina commented 4 years ago

> > I tried to predict using a single sample from human vcf file.
>
> do you mean a single human genome here?

Yes. It's a VCF file containing a single human's genome (chr1 through chrX/Y, etc.), and I ran "fvecVcf" on it. I tried "chrArm" = 21, 3, and 10. In every case there do not seem to be enough SNPs to yield enough good subwindows to output statistics, so the "fvecVcf" output is either empty or has very few lines.

jtxtina commented 4 years ago

Also, when I trained a model with numSubWins=5 and ran "fvecVcf", 50 of the statistic values in each row are always 0.2.

E.g. chrom classifiedWinStart classifiedWinEnd bigWinRange pi_win0 pi_win1 pi_win2 pi_win3 pi_win4 thetaW_win0 thetaW_win1 thetaW_win2 thetaW_win3 thetaW_win4 tajD_win0 tajD_win1 tajD_win2 tajD_win3 tajD_win4 distVar_win0 distVar_win1 distVar_win2 distVar_win3 distVar_win4 distSkew_win0 distSkew_win1 distSkew_win2 distSkew_win3 distSkew_win4 distKurt_win0 distKurt_win1 distKurt_win2 distKurt_win3 distKurt_win4 nDiplos_win0 nDiplos_win1 nDiplos_win2 nDiplos_win3 nDiplos_win4 diplo_H1_win0 diplo_H1_win1 diplo_H1_win2 diplo_H1_win3 diplo_H1_win4 diplo_H12_win0 diplo_H12_win1 diplo_H12_win2 diplo_H12_win3 diplo_H12_win4 diplo_H2/H1_win0 diplo_H2/H1_win1 diplo_H2/H1_win2 diplo_H2/H1_win3 diplo_H2/H1_win4 diplo_ZnS_win0 diplo_ZnS_win1 diplo_ZnS_win2 diplo_ZnS_win3 diplo_ZnS_win4 diplo_Omega_win0 diplo_Omega_win1 diplo_Omega_win2 diplo_Omega_win3 diplo_Omega_win4

21 33850001 33950000 33700001-34200000 0.18181933884703225 0.0909087603268219 0.3636350413072876 0.27272628098046564 0.09091057853839266 0.18181933884703225 0.0909087603268219 0.3636350413072876 0.27272628098046564 0.09091057853839266 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2

21 43250001 43350000 43100001-43600000 0.1358573562739464 0.18485133361474532 0.09057157084929761 0.04528487970894031 0.5434348595530702 0.1358573562739464 0.18485133361474532 0.09057157084929761 0.04528487970894031 0.5434348595530702 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2

21 45150001 45250000 45000001-45500000 0.18750082031960458 0.18749894529265113 0.5000021875189457 0.062499023434399406 0.062499023434399406 0.18750082031960458 0.18749894529265113 0.5000021875189457 0.062499023434399406 0.062499023434399406 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2
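One possible reading of the constant 0.2 values, offered as a sketch rather than a description of diploSHIC's actual code: in the rows above, the five pi values in each row sum to roughly 1, which suggests that each statistic is rescaled across its subwindows. If a statistic is zero or undefined in every subwindow (as several of them may be with only one diploid sample), a uniform fallback of 1/numSubWins would give exactly 0.2 when numSubWins=5. The helper name `normalize_across_subwindows` and the fallback rule are assumptions made for illustration.

```python
import math


def normalize_across_subwindows(values):
    """Rescale one statistic's per-subwindow values so they sum to 1.

    Hypothetical helper for illustration; not taken from diploSHIC.
    """
    n = len(values)
    # Assumed fallback: undefined values carry no signal, so return uniform.
    if any(not math.isfinite(v) for v in values):
        return [1.0 / n] * n
    # Shift so the smallest value is zero when any value is negative.
    low = min(values)
    if low < 0:
        values = [v - low for v in values]
    total = sum(values)
    # Assumed fallback: an all-zero statistic also becomes uniform.
    if total == 0:
        return [1.0 / n] * n
    return [v / total for v in values]
```

Under these assumptions, `normalize_across_subwindows([0, 0, 0, 0, 0])` returns five values of 0.2, while an informative statistic keeps its relative shape across subwindows.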

andrewkern commented 4 years ago

> > do you mean a single human genome here?
>
> Yes. It's a VCF file containing a single human's genome (chr1 through chrX/Y, etc.).

diploSHIC isn't meant to be run on a single genome. It calculates summaries from population-level samples, for instance n=10+ genomes.
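A quick way to check how many samples a VCF holds before running "fvecVcf" is to read its `#CHROM` header line, where sample names follow the nine fixed VCF columns. The sketch below is a standalone illustration, not part of diploSHIC; the function names are hypothetical, and the default threshold of 10 simply mirrors the "n=10+" example above rather than any hard limit in the tool.

```python
import gzip

MIN_SAMPLES = 10  # assumption: taken from the "n=10+ genomes" example


def vcf_sample_names(path):
    """Return the sample names listed on the #CHROM header line of a VCF."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # The first nine tab-separated columns are fixed by the
                # VCF format; everything after them is a sample name.
                return line.rstrip("\n").split("\t")[9:]
    raise ValueError("no #CHROM header line found in " + path)


def has_enough_samples(path, minimum=MIN_SAMPLES):
    """Report whether the VCF has at least `minimum` samples."""
    return len(vcf_sample_names(path)) >= minimum
```

For a single-sample file like the one described in this thread, `vcf_sample_names` would return a one-element list and `has_enough_samples` would return False.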

jtxtina commented 4 years ago

> > > do you mean a single human genome here?
> >
> > Yes. It's a VCF file containing a single human's genome (chr1 through chrX/Y, etc.).
>
> diploSHIC isn't meant to be run on a single genome. It calculates summaries from population-level samples, for instance n=10+ genomes.

I see. And how about numSubWins? What are your criteria for choosing a value other than 11 for fvecVcf, train, fvecSim, and so on?

jtxtina commented 4 years ago

OK, I used multiple samples and it worked perfectly. Thank you.