This PR adds a safety check to ensure that input bams/crams are fully read in, as well as adding a retry mechanism to avoid failing in case of network issues during i/o.
Safety Check
Currently, if reading in of a bam/cram fails during pileup calling, Glimpse will happily move on with only part of the bam/cram read in. The result can be the appearance of 0 coverage, or an artificially high number of uncovered sites. This is actually a bug that was probably ported over from bcftools (see samtools/bcftools#2177, samtools/samtools#1379). The fix here, of checking that caller.n_plp >= 0 is based on the way this is handled in samtools phase, for example.
There may be other ways that this bug can be hit, but I specifically saw it when running on a large number of crams streaming from GCP in a highly parallelized cloud setup. The resulting queries to the google api can lead to 429 errors which can then hit this bug.
Technically, even with this fix, a cram can still be partially read in and then a failure to read the rest of the cram won't be caught, if the failure occurs right and the end of a cram container. This will at least print a warning from htslib that the cram doesn't contain an EOF, but Glimpse will continue on and return a non-zero exit code. As far as I can tell there is no way to catch this edge case in Glimpse without modifying htslib, so the best idea I've been able to come up with is to have pipelines watch for the htslib warning message and fail if it is seen.
Retries
Along this same situation as above, if the 429 error occurs during the loading of the file, Glimpse will fail, which is better than ignoring the error, but can lead to issues if the 429 error rate is high enough to make running the tool impractical. Unfortunately, retries are not implemented in the htslib curl library, so I instead implemented a simple retry on the Glimpse side.
This PR adds a safety check to ensure that input bams/crams are fully read in, as well as adding a retry mechanism to avoid failing in case of network issues during i/o.
Safety Check
Currently, if reading in of a bam/cram fails during pileup calling, Glimpse will happily move on with only part of the bam/cram read in. The result can be the appearance of 0 coverage, or an artificially high number of uncovered sites. This is actually a bug that was probably ported over from bcftools (see samtools/bcftools#2177, samtools/samtools#1379). The fix here, of checking that
caller.n_plp >= 0
is based on the way this is handled in samtools phase, for example.There may be other ways that this bug can be hit, but I specifically saw it when running on a large number of crams streaming from GCP in a highly parallelized cloud setup. The resulting queries to the google api can lead to 429 errors which can then hit this bug.
Technically, even with this fix, a cram can still be partially read in and then a failure to read the rest of the cram won't be caught, if the failure occurs right and the end of a cram container. This will at least print a warning from htslib that the cram doesn't contain an EOF, but Glimpse will continue on and return a non-zero exit code. As far as I can tell there is no way to catch this edge case in Glimpse without modifying htslib, so the best idea I've been able to come up with is to have pipelines watch for the htslib warning message and fail if it is seen.
Retries
Along this same situation as above, if the 429 error occurs during the loading of the file, Glimpse will fail, which is better than ignoring the error, but can lead to issues if the 429 error rate is high enough to make running the tool impractical. Unfortunately, retries are not implemented in the htslib curl library, so I instead implemented a simple retry on the Glimpse side.