samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
678 stars 240 forks source link

bcftools merge does not work #1954

Open sjellerstrand opened 1 year ago

sjellerstrand commented 1 year ago

Hej! I have found a potential bug in the develomental version. I get an error message when trying to merge several vcf files with bcftools merge. I have tried the the latest source code commit b7b2a32 for the following code. However, version 1.14, 1.16 and 1.17 works fine.

Command:

merge_files=$(echo ./Rasolark_2021_CADDXX010000136.1_HETGAM_95694_hetsex.vcf.gz ./Rasolark_2021_CADDXX010000136.1_HETGAM_TT95866_hetsex.vcf.gz)

bcftools merge $merge_files | bgzip -c > output.vcf.gz

The vcf file is not complete. It stops abruptly and has the following message at the end of the file:

[W::vcf_parse_info] INFO 'C' is not defined in the header, assuming Type=String
[E::bcf_write] Broken VCF record, the number of columns at CADDXX010000136.1:99689 does not match the number of samples (0 vs 2)
[main_vcfview] Error: cannot write to (null)

There is no INFO field called "C". What I do have is:

##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1]">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">

The site in the individual vcf files look like this:

# Rasolark_2021_CADDXX010000136.1_HETGAM_95694_hetsex.vcf.gz
CADDXX010000136.1       99824   .       TA      T       0       .       AC=1;AF=1;AN=1;CM=0.091753;NS=18        GT      1

#Rasolark_2021_CADDXX010000136.1_HETGAM_TT95866_hetsex.vcf.gz
CADDXX010000136.1       99824   .       TA      T       0       .       AC=1;AF=1;AN=1;CM=0.091753;NS=18        GT      1

However, the problem does not seem to be that site in particular. The site changes depending on if I run more/other samples. For example if I merge 4 other samples the site 99824 is processed and the file instead ends at site 100730 with the message:

[E::bcf_write] Broken VCF record, the number of columns at CADDXX010000136.1:100730 does not match the number of samples (0 vs 4)
[main_vcfview] Error: cannot write to (null)

And if I merge all my 9 files I get the following message:

[E::vcf_parse_format] FORMAT column with no sample columns starting at CADDXX010000136.1:100636
Error: VCF parse error

Neither is there any obvious issue with this site in any of theindividual files:

#Rasolark_2021_CADDXX010000136.1_HETGAM_95694_hetsex.vcf.gz
CADDXX010000136.1       100636  .       C       G       0       .       AC=1;AF=0.25;AN=1;CM=0.092565;NS=18     GT      1
#Rasolark_2021_CADDXX010000136.1_HETGAM_TT95866_hetsex.vcf.gz
CADDXX010000136.1       100636  .       C       G       0       .       AC=1;AF=0.25;AN=1;CM=0.092565;NS=18     GT      1
#Rasolark_2021_CADDXX010000136.1_HETGAM_TT95867_hetsex.vcf.gz
CADDXX010000136.1       100636  .       C       G       0       .       AC=1;AF=0.25;AN=1;CM=0.092565;NS=18     GT      1
#Rasolark_2021_CADDXX010000136.1_HETGAM_TT95871_hetsex.vcf.gz
CADDXX010000136.1       100636  .       C       G       0       .       AC=1;AF=0.25;AN=1;CM=0.092565;NS=18     GT      1
#Rasolark_2021_CADDXX010000136.1_HETGAM_TT95879_hetsex.vcf.gz
CADDXX010000136.1       100636  .       C       G       0       .       AC=1;AF=0.25;AN=1;CM=0.092565;NS=18     GT      1
#Rasolark_2021_CADDXX010000136.1_HETGAM_TT95881_hetsex.vcf.gz
CADDXX010000136.1       100636  .       C       G       0       .       AC=1;AF=0.25;AN=1;CM=0.092565;NS=18     GT      1
#Rasolark_2021_CADDXX010000136.1_HETGAM_TT95884_hetsex.vcf.gz
CADDXX010000136.1       100636  .       C       G       0       .       AC=1;AF=0.25;AN=1;CM=0.092565;NS=18     GT      1
#Rasolark_2021_CADDXX010000136.1_HETGAM_TT95887_hetsex.vcf.gz
CADDXX010000136.1       100636  .       C       G       0       .       AC=1;AF=0.25;AN=1;CM=0.092565;NS=18     GT      1
#Rasolark_2021_CADDXX010000136.1_HETGAM_TT95888_hetsex.vcf.gz
CADDXX010000136.1       100636  .       C       G       0       .       AC=1;AF=0.25;AN=1;CM=0.092565;NS=18     GT      1

Both cases are fixed for the alternate allele, but other site like it are being processed without any issue.

Again, there is none of these problems in previous versions.

pd3 commented 1 year ago

I suspect some of the input VCF is broken in some way and while the earlier versions silently ignore it, returns an error, as clearly indicated by the error message

[E::vcf_parse_format] FORMAT column with no sample columns starting at CADDXX010000136.1:100636

For this we really need a proper test to reproduce the problem. Try to run bcftools view -o /dev/null input.vcf on all inputs, likely at least one of them will end with the same error.

wangmingcheng commented 1 year ago

bcftools merge Segmentation fault (core dumped)

图片

and Version: 1.15.1 (using htslib 1.15.1) work fine

pd3 commented 1 year ago

@wangmingcheng as stated above, any chance you could provide a small test case to reproduce the problem? It is not possible to help without being able to reproduce the problem locally.