Jolvii85 opened this issue 7 years ago
The error comes from here: https://github.com/samtools/htslib/blob/develop/vcf.c#L923. Most likely the header is too big to fit in `SIZE_MAX` bytes. I am not sure this check is a good idea; apparently the C standard merely requires `SIZE_MAX` to be at least 65535, which would be far too limiting here. As far as I know, the BCF specification does not impose any limit on the header size, certainly not one this small. @daviesrob do you want to comment?
You are right, the header could cause this error. With the same data and commands, if I use Pinus taeda as the reference, there are no errors.
Could you help me to fix this problem?
Thank you!
`SIZE_MAX` is the maximum value that a `size_t` can hold, which is either 4294967295 or 18446744073709551615 bytes, depending on whether you have a 32-bit or 64-bit platform. The check is there to ensure that `hlen + 1` does not overflow on 32-bit platforms.
BCF does have a limit on the size of the header, as it stores the length in the four bytes following the `BCF\2\2` "magic" string (interestingly, this does not seem to be mentioned in the specification). This means you can't have a header longer than 4294967295 bytes in a BCF2 file. Unfortunately, there's no overflow check when this length is written out, so currently HTSlib will write BCF headers that can't be read back in again if the header is too long.
Given the number of scaffolds, it's quite possible that this header is longer than the limit. `bcf_hdr_read` is probably not reading the entire header, and failing because `bcf_hdr_parse` finds an incomplete record.
I'm afraid the only solutions would be to use VCF instead of BCF, or try to shrink the header somehow. Given the number of records, you would need to use less than 73 bytes per scaffold.
As a quick temporary workaround, try commenting out this line and recompiling: https://github.com/samtools/htslib/blob/develop/vcf.c#L914
Do not remove this line. It exists to prevent a buffer overflow on 32-bit platforms where `malloc(0)` returns a valid pointer:

- `hlen` is encoded as `0xffffffff`.
- `malloc(hlen + 1)` becomes `malloc(0)` due to integer wrap-around.
- `malloc(0)` returns a non-NULL pointer in `htxt`.
- Reading `hlen` bytes of header text into `htxt` then writes far beyond the end of the allocation.

This would be a very bad vulnerability if it were allowed to happen.
@daviesrob My information about the `SIZE_MAX` macro was based on this thread; I did not investigate whether the assertion in the accepted answer is correct: https://stackoverflow.com/questions/22514803/maximum-size-of-size-t
The limit on the size of the BCF header is of course part of the implementation, but not yet part of the specification.
Perhaps the sanity check should be done differently and not rely on `SIZE_MAX`. We know exactly what the maximum value representable in a four-byte integer is, and the format cannot be platform dependent.
I will try commenting out that line and recompiling htslib and bcftools to see whether it works or not. I will give you feedback later. Thank you.
Unless you are using some obscure system, I think @daviesrob is right and it will not help - the header is probably too big to be represented in this version of BCF.
Hi @daviesrob, what do you mean by "use VCF instead of BCF"? I want to use the Samtools and Bcftools pipeline for my project. I have more than 400 BAM files; it is hard to shrink the header for each one. Can you give some suggestions? Thank you!
I mean you need to write out the text VCF format instead of the binary BCF format. You can get samtools and bcftools to do this by using the right command-line options, for example:
```
samtools mpileup --VCF [ ... other options here ... ]
bcftools mpileup --output-type z [ ... other options here ... ]
```
For bcftools, `--output-type z` will give you a compressed VCF file, which is best if you want to save the result. `--output-type v` will write uncompressed VCF, which you can view directly. When saving, you should also use a `.vcf` suffix instead of `.bcf`.
Hi,
I am using Samtools and Bcftools to call SNPs with sugar pine as a reference (58407655 scaffolds). I got the error message "[E::bcf_hdr_read] Failed to read BCF header". Could anyone help me fix this?
Best,
Wei