ksiewert / BetaScan

Genome-wide scan for balancing selection using beta statistic
27 stars 5 forks source link

problems about sample size #10

Closed LiShuhang-gif closed 3 years ago

LiShuhang-gif commented 3 years ago

Hi, I'm having a problem with running BetaScan. My command is:

python BetaScan.py -i LRS_chr3.beta.txt.gz -fold -o LRS_chr3.betascores.txt

and the error I'm receiving:

Error: Sample size must be greater than 3 haploid individuals to make inference, or else theta_beta will always equal theta_watterson's. You may wish to increase the m paramter value to exclude this SNP from being a core SNP.

But actually, my sample size was including 24 individuals, which is more than 3, I think? May I have some help from anybody? Thanks!

ksiewert commented 3 years ago

Hello! This sounds like there's an issue with the input file format. Could you post here or e-mail me (ksiewert@hsph.harvard.edu) a small portion of your input file that causes this error?

LiShuhang-gif commented 3 years ago

Hi, I want to show you the complete process from vcf file to betascores.txt, so we can find out which step went wrong. The first is my VCF file:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  A17HanZZ0060    A17HanZZ0139    A17HanZZ0149    A17Ha
nZZ0156    A17HanZZ0175    A17HanZZ0176    A17HanZZ0188    A17HanZZ0197    A17HanZZ0237    A17HanZZ0250    A17HanZZ0268    A1
7HanZZ0293    A17HanZZ0338    A17HanZZ0399    A17HanZZ0498    A17HanZZ0777    A17HanZZ0788    A17HanZZ0871    A17HanZZ0925
    A17HanZZ0938    A17HanZZ0952    A17HanZZ0953    A17HanZZ0954    A17HanZZ0973
chrY    2793653 .       A       G       7831.19 PASS    .       GT:AD:DP:GQ:PL  1/1:0,20:20:60:678,60,0 1/1:0,18:18:54:681,54
,0 ./.:0,0:0:.:0,0,0       1/1:0,23:24:69:848,69,0 1/1:0,22:22:66:814,66,0 1/1:0,20:20:60:762,60,0 ./.:0,0:0:.:0,0,0       1/
1:0,19:19:57:662,57,0 ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       1/1:0,1
4:14:42:451,42,0 ./.:0,0:0:.:0,0,0       1/1:0,12:12:36:468,36,0 ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:
0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       1/1:0,25:25:75:879,75,0 1/1:0,15:15:45:59
1,45,0 1/1:0,24:24:71:791,71,0
chrY    2794764 .       T       C       1668.74 PASS    .       GT:AD:DP:GQ:PL  0/0:23,0:23:63:0,63,753 0/0:24,0:24:66:0,66,9
90 ./.:0,0:0:.:0,0,0       1/1:0,22:22:66:771,66,0 0/0:14,0:14:42:0,42,507 1/1:0,27:27:80:916,80,0 ./.:0,0:0:.:0,0,0       0/
0:17,0:17:51:0,51,606 ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       0/0:18,
0:18:54:0,54,587 ./.:0,0:0:.:0,0,0       0/0:21,0:21:57:0,57,855 ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:
0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       0/0:16,0:16:45:0,45,675 0/0:15,0:15:45:0,
45,566 0/0:15,0:15:45:0,45,512
chrY    2799885 .       T       C       1559.74 PASS    .       GT:AD:DP:GQ:PL  1/1:0,27:27:81:958,81,0 0/0:23,0:23:66:0,66,9
90 ./.:0,0:0:.:0,0,0       0/0:19,0:19:57:0,57,674 0/0:13,0:13:36:0,36,540 0/0:16,0:16:48:0,48,553 ./.:0,0:0:.:0,0,0       0/
0:18,0:18:51:0,51,765 ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       1/1:0,1
7:17:51:620,51,0 ./.:0,0:0:.:0,0,0       0/0:15,0:15:45:0,45,570 ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:
0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       ./.:0,0:0:.:0,0,0       0/0:14,0:14:39:0,39,585 0/0:30,0:30:90:0,
90,1117        0/0:20,0:20:60:0,60,723

I converted the VCF file to an ACF.GZ file and then to a BETA.TXT.GZ file based on glactools using the following two commands.

glactools vcfm2acf --onlyGT --fai /public/home/fan_lab/shali/reference/hg38_22_XYM.fa.fai NGS_chrY.vcf > NGS_chrY.acf.gz
glactools acf2betascan --fold NGS_chrY.acf.gz | bgzip -c > NGS_chrY.beta.txt.gz

The resulting beta.txt.gz file looked like this:

10029   1       24
11436   8       38
11844   3       48
12103   8       48
12104   0       48
12133   3       48
12210   1       48
12250   2       48
12252   19      48
12253   10      48
12269   3       48
12393   12      48

Finally, I used Betascan to calculate the beta value, using the following command:

python /public/home/fan_lab/shali/BetaScan/BetaScan/BetaScan.py -i NGS_chrY.beta.txt.gz -fold -o NGS_chrY.betascores.txt

Actually, I did get a result file named NGS_chrY.betascores.txt, and it looked like this:

Position        Beta1*
2794764 0.0
2799885 0.0
2818381 0.205531
2818686 0.205531
2825426 0.0
2838636 0.0

But when I checked the log file, I found an error message, which I didn't know how to solve it:

Error: Sample size must be greater than 3 haploid individuals to make inference, or else theta_beta will always equal theta_watterson's. You may wish to increase the m paramter value to exclude this SNP from being a core SNP.

Any ideas? Thanks one more time!

ksiewert commented 3 years ago

When I run BetaScan on that portion of beta.txt.gz you gave me, it works fine. Could you send me the portion of beta.txt.gz that is causing the error? It should be the portion corresponding to the SNPs around where the last BetaScores in the results file are given. Basically BetaScan is outputting until it reaches the problematic line, then stops reporting results. If I had to bet, somewhere you have a line where the 3rd column has a value of 3 or less. Have you checked whether this is the case?

LiShuhang-gif commented 3 years ago

Yes! You are right. When I check the problematic lines of beta.txt.gz file, it looks like this:

10987935        1       2
10987939        1       2

The third column is actually less than 3. Considering that my purpose is to use Betascan to detect the balancing selection of my VCF files, if I tried to remove the problematic lines in beta.txt.gz file and then run Betascan again, would this cause any negative effect on the final results?

ksiewert commented 3 years ago

I would remove these SNPs and disregard any results for core SNPs within a window size of these two SNPs. But these SNPs won't affect the results for any SNPs that aren't in the same window.

LiShuhang-gif commented 3 years ago

Thanks. I still have a question which may be too basic. But I really want to know how to determine which SNPs are in the same window as these problematic two SNPs. if I could find them, I would disregard them.

ksiewert commented 3 years ago

The default window size is 1,000 basepairs (or 500 bp on either side of the core SNP). It's probably best to just remove all SNPs within 1,000 bp of these SNPs.