Closed kachulis closed 1 month ago
Re your note (1), this behaviour comes from the days when many tools producing BAM files did not write EOF markers, so treating its absence as a hard error would have been counter-productive. Perhaps that behaviour could be revisited now — especially for CRAM files, which probably defined an EOF marker block right from the beginning.
It would be interesting to add testing of samtools mpileup
to your script too, to check that it still checks all input errors. Apparently the mpileup code was forked into bcftools before samtools/samtools#1379 was applied, and no-one has done a similar audit of the bcftools code.
The good news is that propagating ret
in mpileup_reg()
and actually checking mpileup_reg()
's return value will probably fix most of this at a stroke, though only for mpileup
.
@jmarshall You are quite right, samtools mpileup
does catch this input error.
samtools mpileup truncated_corrupted.cram
[mpileup] 1 samples in 1 input files
samtools mpileup: error reading from input file
returns 1
samtools mpileup truncated_corrupted.bam
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[mpileup] 1 samples in 1 input files
[E::bgzf_read_block] Failed to read BGZF block data at offset 289804 expected 9098 bytes; hread returned 8902
[E::bgzf_read] Read block operation failed with error 4 after 0 of 4 bytes
samtools mpileup: error reading from input file
returns 1
glad to hear it is an easy fix!
Thank you for the report, this is now fixed in https://github.com/samtools/bcftools/commit/57b90721e0abc45de861361f3f175ae0edd1e715, the command bcftools mpileup
should now exit with a non-zero code. I believe the rest of bcftools is already doing the right thing and throws an error on truncated vcf.gz/bcf files.
bcftools mpileup
doesn't seem to check if there were errors while reading read data in. This means thatbcftools mpileup
can appear to have run successfully, but there may have been an underlying data reading problem. As an example, I created 3 bams and crams as follows:good.{bam,cram}
: these are a small valid bam/cram.no_eof.{bam,cram}
: these aregood.{bam,cram}
but with the eof markers removed from the end of the file.truncated_corrupted.{bam,cram}
: these aregood.{bam,cram}
but with a random set of bytes (including the eof bytes) removed from the end of the file.I then ran these files through both
bcftools mpileup
andsamtools view
, using the following script.this results in the following output:
A few things to note.
samtools view
andbcftools mpileup
will print a warning, but return a code of 0, indicating success. This feels a little counterintuitive to me (I would expect an error return code here), but certainly I see the argument for this behavior.My main concern about this is that this mimics the behavior we would see if a totally valid cram or bam was being streamed from a google bucket, and the google endpoint decided to return a 429 (or 404, or some other error code) at some point in the middle of streaming the data. I think that in this scenario bcftools would output a valid pileup file, with no indication that anything was amiss.