samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
634 stars 241 forks source link

Bcftools norm prints only the header #2143

Closed alexvasilikop closed 2 months ago

alexvasilikop commented 3 months ago

Hello,

I used bcftools norm option to split multiallelic variants and to normalized indels. However, the output is an empty vcf file and only the header is printed.

I am using bcftools 1.19 + htslib-1.19 Here is the command:

bcftools norm -m-both --fasta-ref "${ref_name}" --threads "${num_threads}" -O v "result/${INPUTVCF%.vcf}.single_sample.sorted.vcf" > "result/${INPUTVCF%.vcf}.normalized.vcf"
alexvasilikop commented 3 months ago

this was caused by wrong input vcf

alexvasilikop commented 3 months ago

Hello,

I tested again and indeed there seems to be an issue that is only for one of my 2 test cases though. This is running in a pipeline in nexflow so ! indicates nexflow variables.

bcftools view -s "!{sample_id}" -O v "${INPUTVCF}" | bcftools sort -O v | \
bcftools norm -m -both --multi-overlaps 0 -f "!{ref_name}" --threads "!{num_threads}" -o "result/${INPUTVCF%.vcf}.normalized.vcf"

The problem seems to be indeed in bcftools norm. In fact what I observed before (printing only the header) was the stdout. However, checking the stderr file there is the follwing indication:

Error: wrong number of fields in INFO/AS_MQRankSum at chr1:1793064, expected 3, found 1

which is the first variant in the original file which contains 3 samples (1 tumor and 2 normal), so I select the tumor and want to normalize. This is how the original file looks:

##contig=<ID=HLA-DRB1*15:03:01:02,length=11569,assembly=Homo_sapiens_assembly38.fasta>
##contig=<ID=HLA-DRB1*16:02:01,length=11005,assembly=Homo_sapiens_assembly38.fasta>
##filtering_status=Warning: unfiltered Mutect 2 calls.  Please run FilterMutectCalls to remove false positives.
##normal_sample=126697
##reference=file:///tmp/nxf.V9ygJBnOTq/Homo_sapiens_assembly38.fasta
##source=LeftAlignAndTrimVariants
##tumor_sample=126695
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  126695  126697  126698
chr1    1793064 .   CA  C,CAA,CAAA  .   .   AS_MQRankSum=0.000;AS_SB_TABLE=240,8|62,0|188,0|27,1;DP=659;ECNT=1;MBQ=37,37,38,38;MFRL=391,416,380,384;MMQ=60,60,60,60;MPOS=32,36,38;MQ0=0;MQRankSum=0.000;NALOD=-4.095e+01,-1.794e+02,-1.139e+01;NLOD=6.69,-1.902e+02,7.26;POPAF=6.00,6.00,6.00;RPA=13,12,14,15;RU=A;SOR=3.330;STR;TLOD=36.58,157.95,14.02    GT:AD:AF:DP:F1R2:F2R1:SB    0/1/2/3:105,30,86,15:0.118,0.372,0.062:236:44,17,47,10:60,12,39,5:103,2,131,0   0/0:76,19,40,8:0.132,0.282,0.060:143:36,9,19,4:38,10,21,4:71,5,67,0 0/0:67,13,62,5:0.082,0.437,0.034:147:39,6,37,4:27,7,25,1:66,1,79,1
chr1    1815530 .   A   T   .   .   AS_MQRankSum=0.000;AS_SB_TABLE=57,35|9,0;DP=120;ECNT=1;MBQ=37,21;MFRL=338,345;MMQ=60,60;MPOS=5;MQ0=0;MQRankSum=0.000;NALOD=-3.343e+00;NLOD=7.04;POPAF=6.00;SOR=3.677;TLOD=3.63  GT:AD:AF:DP:F1R2:F2R1:SB    0/1:39,4:0.105:43:22,3:17,1:25,14,4,0   0/0:37,4:0.111:41:17,3:20,1:23,14,4,0   0/0:16,1:0.044:17:11,1:5,0:9,7,1,0
chr1    1817466 .   A   T   .   .   AS_MQRankSum=0.000;AS_SB_TABLE=522,0|19,0;DP=544;ECNT=1;MBQ=38,24;MFRL=450,340;MMQ=60,60;MPOS=5;MQ0=0;MQRankSum=0.000;NALOD=-7.007e+00;NLOD=69.05;POPAF=6.00;SOR=0.001;TLOD=7.71    GT:AD:AF:DP:F1R2:F2R1:SB    0/1:210,9:0.043:219:106,5:100,4:210,0,9,0   0/0:159,3:0.023:162:79,2:75,1:159,0,3,0 0/0:153,7:0.047:160:86,4:66,3:153,0,7,0
chr1    1818301 .   C   A   .   .   AS_MQRankSum=0.000;AS_SB_TABLE=1313,528|0,273;DP=2248;ECNT=2;MBQ=30,12;MFRL=280,289;MMQ=60,60;MPOS=12;MQ0=0;MQRankSum=0.000;NALOD=-1.126e+01;NLOD=209.63;POPAF=6.00;SOR=11.226;TLOD=28.37   GT:AD:AF:DP:F1R2:F2R1:SB    0/1:725,134:0.066:859:280,13:289,10:547,178,0,134   0/0:631,50:0.019:681:223,8:228,3:397,234,0,50   0/0:485,89:0.056:574:197,0:185,5:369,116,0,89
chr1    1818305 .   TA  T,TAA,TAAA  .   .   AS_MQRankSum=0.000;AS_SB_TABLE=139,262|122,283|78,161|30,61;DP=2431;ECNT=2;MBQ=20,25,20,28;MFRL=280,286,277,322;MMQ=60,60,60,60;MPOS=23,21,20;MQ0=0;MQRankSum=0.000;NALOD=-1.099e+02,-6.413e+01,-1.463e+01;NLOD=-9.531e+01,-4.782e+01,23.02;POPAF=6.00,6.00,6.00;RPA=24,23,25,26;RU=A;SOR=0.859;STR;TLOD=108.11,85.75,16.71 GT:AD:AF:DP:F1R2:F2R1:SB    0/1/2/3:183,186,133,47:0.297,0.250,0.084:549:63,68,40,18:83,88,51,11:68,115,124,242 0/0:83,113,32,8:0.420,0.134,0.025:236:29,45,11,2:35,48,17,2:27,56,41,112    0/0:135,106,74,36:0.272,0.215,0.106:351:44,34,22,15:52,42,30,12:44,91,65,151
chr1    1824886 .   TTTAGATATATCTGCTCCATCTAGCAGTCTATCAGAAGCTGTAGATTCCAATCATCACCGCTAGGGAAGATC    T   .   .   AS_MQRankSum=0.099;AS_SB_TABLE=839,284|8,1;DP=1281;ECNT=1;MBQ=36,24;MFRL=364,268;MMQ=60,60;MPOS=6;MQ0=0;MQRankSum=0.099;NALOD=-5.626e+00;NLOD=146.26;POPAF=6.00;SOR=1.203;TLOD=4.26 GT:AD:AF:DP:F1R2:F2R1:SB    0/1:483,3:9.849e-03:486:172,0:143,3:355,128,3,0 0/0:337,2:9.777e-03:339:112,0:104,1:262,75,2,0  0/0:303,4:0.016:307:98,0:89,2:222,81,3,1
chr1    2558208 .   C   T   .   .   AS_MQRankSum=0.000;AS_SB_TABLE=408,4|16,0;DP=568;ECNT=2;MBQ=34,25;MFRL=366,333;MMQ=60,60;MPOS=5;MQ0=0;MQRankSum=0.000;NALOD=-3.856e+00;NLOD=52.18;POPAF=6.00;SOR=0.042;TLOD=8.44    GT:AD:AF:DP:F1R2:F2R1:SB    0/1:186,9:0.048:195:109,5:76,4:183,3,9,0    0/0:124,3:0.028:127:66,2:58,1:124,0,3,0 0/0:102,4:0.041:106:45,2:56,2:101,1,4,0
chr1    2558272 .   G   T   .   .   AS_MQRankSum=0.000;AS_SB_TABLE=119,72|9,0;DP=253;ECNT=2;MBQ=38,23;MFRL=312,285;MMQ=60,60;MPOS=3;MQ0=0;MQRankSum=0.000;NALOD=-7.642e-01;NLOD=24.27;POPAF=6.00;SOR=3.638;TLOD=5.56    GT:AD:AF:DP:F1R2:F2R1:SB    0/1:89,6:0.067:95:48,2:39,4:50,39,6,0   0/0:56,2:0.042:58:33,1:22,1:39,17,2,0   0/0:46,1:0.034:47:19,0:26,1:30,16,1,0
chr1    2559459 .   C   A,T .   .   AS_MQRankSum=-0.002;AS_SB_TABLE=2474,2223|2361,534|9,538;DP=8476;ECNT=1;MBQ=20,38,9;MFRL=258,259,256;MMQ=60,60,60;MPOS=38,19;MQ0=0;MQRankSum=-0.002;NALOD=-4.677e+03,-3.075e+01;NLOD=-4.716e+03,297.50;POPAF=6.00,6.00;SOR=1.598;TLOD=3613.74,17.54 GT:AD:AF:DP:F1R2:F2R1:SB    0/1/2:2051,1250,220:0.403,0.014:3521:934,593,9:723,464,8:1100,951,1028,442  0/0:93,1645,324:0.932,0.067:2062:0,743,18:0,637,9:0,93,1341,628 0/0:2553,0,3:4.555e-04,5.825e-04:2556:1208,0,0:885,0,1:1374,1179,1,2

After the command above you get the following output:

##contig=<ID=HLA-DRB1*15:03:01:01,length=11567,assembly=Homo_sapiens_assembly38.fasta>
##contig=<ID=HLA-DRB1*15:03:01:02,length=11569,assembly=Homo_sapiens_assembly38.fasta>
##contig=<ID=HLA-DRB1*16:02:01,length=11005,assembly=Homo_sapiens_assembly38.fasta>
##filtering_status=Warning: unfiltered Mutect 2 calls.  Please run FilterMutectCalls to remove false positives.
##normal_sample=126697
##reference=file:///tmp/nxf.V9ygJBnOTq/Homo_sapiens_assembly38.fasta
##source=LeftAlignAndTrimVariants
##tumor_sample=126695
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##bcftools_viewVersion=1.13+htslib-1.13+ds
##bcftools_viewCommand=view -s 126695 126695.Mutect2.merged.vcf; Date=Thu Mar 28 13:43:57 2024
##bcftools_normVersion=1.13+htslib-1.13+ds
##bcftools_normCommand=norm -m- sorted_single_sample.vcf; Date=Thu Mar 28 14:28:41 2024
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  126695
Error: wrong number of fields in INFO/AS_MQRankSum at chr1:1793064, expected 3, found 1

Any idea what the problem is?

pd3 commented 3 months ago

You are not showing how INFO/AS_MQRankSum is defined in the header, but the error message is clear: the program expects three values but found only one. Since at that position there are three alternate alleles, most likely the tag is defined as Number=A, meaning that there should be as many values as there are alternate alleles.

To get past this error, the program that produced the VCF should be fixed or the problem reported to the developer.

A quick workaround is to either remove the tag completely (bcftools annotate -x AS_MQRankSum) or redefine it as having variable number of values using bcftools reheader

bcftools view --header-only file.vcf > hdr.txt
# edit hdr.txt and replace Number=A with Number=.
bcftools reheader -h hdr.txt file.vcf > fixed.vcf