samtools / htslib

C library for high-throughput sequencing data formats

Silent truncation of records in Tabix range retrieval after networking failure from S3 bucket #1851

Closed ChristopherWilks closed 2 weeks ago

ChristopherWilks commented 1 month ago

Hi,

First, thanks for the great tools. I use Tabix/Bgzip extensively in my work and am very grateful to you folks for continuing to improve them (especially the extension of S3/GCS support)!

I think this may be related to https://github.com/samtools/htslib/issues/1037; if it is, or if it duplicates another issue I missed in my brief search of the issues list, feel free to close/merge it there. On that note, @daviesrob may be interested in this ticket.

I noticed recently that when running many concurrent tabix queries (using GNU parallel with -j80) against a small set of bgzipped/indexed files on an S3 bucket, from an EC2 instance in the same AWS region, a few of the queries returned empty results when there should have been actual records pulled down, yet no errors were reported (return status was 0 for all queries).

I am using bash with set -exo pipefail, so I found this odd. (I'm fine with a minority of errors cropping up as long as they're reported; I'll just re-run those queries.)

My working hypothesis is that I'm overloading the networking stack (probably a receive buffer somewhere) on the system, and that libcurl is reporting errors for a few of the concurrent jobs which aren't being fully caught and reported by Tabix. That said, libcurl itself may be the culprit, though I'm assuming it isn't in this case.
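For illustration, here is a minimal sketch of the general libcurl multi-interface error check I have in mind (this is not the code in hfile_libcurl.c; drain_messages is an invented name): each completed transfer carries its own CURLcode, and if that result isn't inspected and propagated, a transfer that died with e.g. CURLE_RECV_ERROR looks just like one that finished cleanly.

/* Sketch only: the generic libcurl multi-interface completion check,
 * not the htslib implementation.  drain_messages() is a made-up helper. */
#include <curl/curl.h>
#include <stdio.h>

static CURLcode drain_messages(CURLM *multi)
{
    CURLMsg *msg;
    int pending;
    CURLcode worst = CURLE_OK;

    while ((msg = curl_multi_info_read(multi, &pending)) != NULL) {
        if (msg->msg != CURLMSG_DONE)
            continue;
        /* msg->data.result is CURLE_OK on success; under heavy concurrent
         * load it can instead be e.g. CURLE_RECV_ERROR.  If the reader never
         * looks at it, a failed transfer is indistinguishable from one that
         * simply ended. */
        if (msg->data.result != CURLE_OK) {
            fprintf(stderr, "transfer failed: %s\n",
                    curl_easy_strerror(msg->data.result));
            worst = msg->data.result;
        }
    }
    return worst;
}

int main(void)
{
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURLM *multi = curl_multi_init();
    /* In real use, easy handles would be added and driven with
     * curl_multi_perform() before draining the completion messages. */
    if (drain_messages(multi) != CURLE_OK)
        fprintf(stderr, "a download failed and should be reported upward\n");
    curl_multi_cleanup(multi);
    curl_global_cleanup();
    return 0;
}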

I'm using htslib 1.20, but the section of code where I think this issue lies (below) doesn't appear to differ between 1.20 and the current development branch.

I went back and added some manual debug fprintfs to hfile_libcurl.c where I think the problem may be occurring, just before this line: https://github.com/samtools/htslib/blob/ca920611fcd8be1180045589ac11bff2f04eafd8/hfile_libcurl.c#L859, and compiled without optimizations to get full debugging info (not shown here, but I also ran a number of straces):

fprintf(stderr,"in libcurl_read: got,fp->finished, fp->final_result, to_skip, errno: %ld,%d,%d,%ld,%d\n",got,fp->finished,fp->final_result, to_skip, errno);

The one test instance where I saw something relevant was here:

in libcurl_read: got,fp->finished, fp->final_result, to_skip, errno: 18882,0,-1,-1,0
in libcurl_read: got,fp->finished, fp->final_result, to_skip, errno: 32193,0,-1,-1,0
in libcurl_read: got,fp->finished, fp->final_result, to_skip, errno: 25206,1,0,-1,0
in libcurl_read: got,fp->finished, fp->final_result, to_skip, errno: 25206,1,0,-1,0
in libcurl_read: got,fp->finished, fp->final_result, to_skip, errno: 0,1,0,-1,0
[W::bgzf_read_block] EOF marker is absent. The input may be truncated
    Command being timed: "htslib-1.20/tabix -D s3://S3_PATH_TO_BUCKET/allpairs.byfeature.gz chr12:11456460-11457010"
....
Exit status: 0

That range has records in the bgzipped file on S3, but the output was empty, and I noticed that got was 0 in the last read above, which is not caught as an error by libcurl_read(...) in this case.

My quick and dirty solution was to simply add:

if(got == 0) { return -1; }

and that seemed to fix it (in the sense of reporting an error when this happens, which is all I want), though I haven't run extensive tests.

I'm not claiming this fixes all the issues, but it does seem to get at a potential gap in the error checking in that file.
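To make the failure mode concrete, here is a minimal self-contained sketch (stream_t and backend_read are invented names, not the htslib code path): a read that returns 0 after an interrupted transfer is indistinguishable from a clean end-of-file, so downstream code like bgzf just sees a short file and no error.

/* Illustrative only: stream_t and backend_read() are invented for this
 * sketch; they are not the htslib implementation. */
#include <stdio.h>
#include <string.h>

typedef struct {
    const char *data;        /* bytes that actually arrived from the server */
    size_t len;              /* may be shorter than the requested range     */
    size_t pos;
    int transfer_failed;     /* set if the underlying transfer errored      */
} stream_t;

/* Returns bytes copied, 0 at end of data, -1 on error. */
static long backend_read(stream_t *s, char *buf, size_t n)
{
    size_t avail = s->len - s->pos;
    size_t got = n < avail ? n : avail;
    memcpy(buf, s->data + s->pos, got);
    s->pos += got;
    /* Without a check here, got == 0 after a failed transfer is returned as
     * a normal EOF.  My quick fix above returns -1 whenever got == 0; this
     * sketch gates it on the recorded failure so a genuine EOF still reads
     * as 0. */
    if (got == 0 && s->transfer_failed)
        return -1;
    return (long) got;
}

int main(void)
{
    /* Pretend the server dropped the connection after 21 of 64 bytes. */
    stream_t s = { "truncated bgzf block.", 21, 0, 1 };
    char buf[64];
    long r;

    while ((r = backend_read(&s, buf, sizeof buf)) > 0)
        fprintf(stderr, "read %ld bytes\n", r);
    if (r < 0)
        fprintf(stderr, "error reported; the caller can fail loudly\n");
    else
        fprintf(stderr, "read 0 bytes: looks like EOF, data silently lost\n");
    return 0;
}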

Thanks, Chris

whitwham commented 1 month ago

Using your fprintf statement I get this at the end of every download from S3:

in libcurl_read: got,fp->finished, fp->final_result, to_skip, errno: 0,1,0,-1,0

It looks like a normal part of the process.

Can you check whether it also appears for your working tabix queries?

whitwham commented 3 weeks ago

@ChristopherWilks, did you have a chance to do more checks?

whitwham commented 2 weeks ago

Closing because of lack of response.