samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
661 stars 240 forks source link

[E::bcf_hdr_read] Failed to read BCF header #672

Open Jolvii85 opened 7 years ago

Jolvii85 commented 7 years ago

Hi,

I am using Samtools and Bcftools to call SNPs with sugar pine as a reference (58407655 scaffolds). I met an error message "[E::bcf_hdr_read] Failed to read BCF header". Could anyone can help me fix this?

Best,

Wei

pd3 commented 7 years ago

The error comes from here https://github.com/samtools/htslib/blob/develop/vcf.c#L923. Most likely the header is too big to fit in SIZE_MAX bytes. I am not sure this check is a good idea, apparently the standard merely says that the size must be at least 65535 bytes, which is too limiting. As far as I know, the BCF specification does not impose any limit on the header size, certainly not this small. @daviesrob do you want to comment?

Jolvii85 commented 7 years ago

You are right, the header could cause this error. For the same data and commands, if I use pinus taeda as reference, there are no errors.

Could you help me to fix this problem?

Thank you!

daviesrob commented 7 years ago

SIZE_MAX is the maximum value that a size_t can hold, which is either 4294967295 or 18446744073709551615 bytes, depending on if you have a 32-bit or 64 bit platform. The check is there to ensure that hlen + 1 does not overflow on 32-bit platforms.

BCF does have a limit on the size of the header, as it stores the length in the four bytes following the BCF\2\2 "magic" string (interestingly, this does not seem to be mentioned in the specification). This means you can't have a header longer than 4294967295 bytes in a BCF2 file. Unfortunately, there's no overflow check when this length is written out so currently HTSlib will write BCF headers that can't be read back in again if the header is too long.

Given the number of scaffolds, it's quite possible that this header is longer than the limit. bcf_hdr_read is probably not reading the entire header, and failing because bcf_hdr_parse finds an incomplete record.

I'm afraid the only solutions would be to use VCF instead of BCF, or try to shrink the header somehow. Given the number of records, you would need to use less than 73 bytes per scaffold.

pd3 commented 7 years ago

As a quick temporary workaround, try to comment this line and recompile https://github.com/samtools/htslib/blob/develop/vcf.c#L914 Do not remove this line. It exists to prevent a buffer overflow on 32 bit platforms where malloc(0) returns a valid pointer:

This would be a very bad vulnerability if it were allowed to happen.

pd3 commented 7 years ago

@daviesrob My information about the SIZE_MAX variable was based on this thread. I did not investigate whether the assertion in the answer marked as best is correct https://stackoverflow.com/questions/22514803/maximum-size-of-size-t

The limit on the size of the BCF header is of course part of the implementation, but not of the specification yet.

Perhpaps the sanity check should be done differently and not rely on SIZE_MAX. We know exactly what the maximum value can be in four-byte integer. The format cannot be platform dependent.

Jolvii85 commented 7 years ago

I will try to comment that line and recompile htslib and bcftools to see whether it will work or not. I will give you the feedback later, Thank you.

pd3 commented 7 years ago

Unless you are using some obscure system, I think @daviesrob is right and it will not help - the header is probably too big to be represented in this version of BCF.

Jolvii85 commented 7 years ago

Hi @daviesrob , what do you mean by " use VCF instead of BCF"? I want to use Samtools and Bcftools pipeline for my project. I have more than 400 bam files, it is hard to shrink the header for each file. Can you give some suggestion? Thank you!

daviesrob commented 7 years ago

I mean you need to write out the text VCF format instead of the binary BCF format. You can get samtools and bcftools to do this by using the right command-line options, for example:

samtools mpileup --VCF  [ ... other options here ... ]
bcftools mpileup --output-type z [ ... other options here ... ]

For bcftools --output-type z will give you a compressed VCF file, which is best if you want to save the result. --output-type v will write uncompressed VCF, which you can view directly. When saving, you should also use a .vcf suffix instead of .bcf.