samtools / htslib

C library for high-throughput sequencing data formats

tabix randomly and silently dropping records retrieved from GCS bucket #1037

Open mwalker174 opened 4 years ago

mwalker174 commented 4 years ago

OS: Ubuntu 18.04 (within a Docker container)
htslib: v1.10.2, compiled from source (also seen when installed from the v0.7 apt package)
gcloud SDK: v235.0.0-0, installed from the apt package

We have a large workflow that retrieves tabix-indexed records from a large file (~10GB) in a Google bucket. The workflow runs on Google Compute Engine VMs and scatters across many shards. We have noticed that tabix seems to be dropping chunks of records in multiple shards. The records are lost randomly (different records every time we retry the workflow) but consistently (it is observed in multiple shards every run), and tabix does not produce any warnings or errors when this happens.

Example command:

GCS_OAUTH_TOKEN=`gcloud auth application-default print-access-token` \
      tabix -R region.bed $URI | bgzip -c > out.txt.gz

We haven't noticed any patterns other than that records seem to be dropped in contiguous groups, resulting in intervals that are only partially retrieved.

daviesrob commented 4 years ago

I suspect the web request to get data is failing, and the error is not being propagated up correctly. I'll see if I can find out where it's being lost.

daviesrob commented 4 years ago

Pull request #1040 adds more error checking to tabix, which should make it more likely to fail in a prompt fashion. Would it be possible to build that version and try it to see if the failures get reported correctly?
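
Roughly, fetching and building that branch would look something like the following sketch (prerequisites such as the zlib and libcurl development packages are assumed):

git clone https://github.com/samtools/htslib.git
cd htslib
git fetch origin pull/1040/head:pr-1040
git checkout pr-1040
autoreconf -i
./configure --enable-libcurl    # libcurl is needed for https:// and gs:// access
make                            # builds tabix alongside the library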

In my testing, the current develop branch did report an error from hts_close() on failures. However, my test was a bit crude (I simply killed the web server providing the files) so it's possible that more subtle errors might not have been reported. Hopefully the new version will make that less likely.

Also, if you're running tabix in a pipe (as in the example above), are you using the bash set -o pipefail option so that the pipeline will report the tabix failure? (If not, the result will be the one from bgzip, which presumably doesn't notice anything wrong and returns success).
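
For reference, a minimal sketch of that pattern (the URI and filenames here are placeholders):

set -o pipefail   # make the pipeline's exit status reflect a tabix failure, not just bgzip's

GCS_OAUTH_TOKEN=`gcloud auth application-default print-access-token` \
    tabix -R region.bed "$URI" | bgzip -c > out.txt.gz

if [ $? -ne 0 ]; then
    echo "tabix download failed" >&2
    exit 1
fi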

mwalker174 commented 4 years ago

Thank you for the quick turnaround @daviesrob. We have been using set -o pipefail in all our scripts. We will give your branch a try and get back to you. We may get in touch with Google to understand why the errors are occurring in the first place because if GCS streaming is this unreliable we may have to implement an alternative approach.

daviesrob commented 4 years ago

Currently tabix doesn't have a way to turn up HTSlib's logging. I'll try adding that to my pull request, as it would probably help you to diagnose the problem.

daviesrob commented 4 years ago

I've added a verbosity option to my PR branch. Running something like tabix --verbosity 9 should make the libcurl interface dump a lot of information to stderr about what it's doing.

Be warned that it's quite likely to dump out your OAUTH token along with the HTTP headers, so you may want to censor the debugging output before showing it to anyone.
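
For example, something along these lines captures the debug output and scrubs anything that looks like a bearer token before the log is shared (a sketch; the sed pattern may need adjusting for your token format):

GCS_OAUTH_TOKEN=`gcloud auth application-default print-access-token` \
    tabix --verbosity 9 -R region.bed "$URI" 2> tabix.debug.log | bgzip -c > out.txt.gz

# redact anything resembling an OAuth bearer token from the saved debug log
sed -i 's/Bearer [A-Za-z0-9._-]*/Bearer [REDACTED]/g' tabix.debug.log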

mwalker174 commented 4 years ago

Hello @daviesrob, thank you for adding the verbosity feature. I've tested your branch out and it does seem to be catching errors now, but we will need a bit more time to test thoroughly that there are no further issues. From what I can tell, everything seems to run smoothly until this message:

Reading "gs://talkowski-sv-gnomad-output/1KGP/matrix/1KGP_2504.PCRMINUS.batch_0_0.SR.txt.gz" failed

Full log here: CountSRUnder5kb-30.censored.log

Does this indicate that the issue is probably with GCS? I can open a ticket with them.

daviesrob commented 4 years ago

That's annoying. In spite of all that extra reporting, tabix hasn't emitted an error message. Do you know what emitted the Reading ... failed message, as it doesn't look like it came from HTSlib? Also, how much of the expected data did you get? It looks like it managed to download 400 regions before it stopped - how does that compare to the number of locations in your .bed file? (Note that it may be fewer, as HTSlib will simply discard data for short-ish forward hops.)

From the logs, it doesn't look like there were any problems from the GCS web server, and the last request proceeded in exactly the same way as the earlier ones.

mwalker174 commented 4 years ago

The bed contained 3504 regions: region.merged.bed.txt

I can't see what else besides tabix could have emitted the message, because gs://talkowski-sv-gnomad-output/1KGP/matrix/1KGP_2504.PCRMINUS.batch_0_0.SR.txt.gz does not appear anywhere else in the script:

set -euo pipefail
svtk vcf2bed --split-bnd --no-header lt5kb.aaaabe.vcf.gz test.bed
awk -v OFS="\t" '{if ($2-250>0){print $1,$2-250,$2+250}else{print $1,0,$2+250}}' test.bed  >> region.bed
awk -v OFS="\t" '{if ($3-250>0){print $1,$3-250,$3+250}else{print $1,0,$3+250}}' test.bed  >> region.bed
sort -k1,1 -k2,2n region.bed > region.sorted.bed
bedtools merge -i region.sorted.bed > region.merged.bed
GCS_OAUTH_TOKEN=`gcloud auth application-default print-access-token` \
  /usr/local/bin/tabix --verbosity 9 -R region.merged.bed gs://talkowski-sv-gnomad-output/1KGP/matrix/1KGP_2504.PCRMINUS.batch_0_0.SR.txt.gz | bgzip -c > SR.txt.gz
tabix -b 2 -e 2 SR.txt.gz
svtk count-sr --index SR.txt.gz.tbi -s write_lines_95c8145fee50a53ff0f2f0a5092b501c.tmp --medianfile 1KGP_2504.PCRMINUS.batch_0_0_medianCov.transposed.bed lt5kb.aaaabe.vcf.gz SR.txt.gz lt5kb.aaaabe.vcf.gz.sr_counts.txt
/opt/sv-pipeline/04_variant_resolution/scripts/sum_SR.sh lt5kb.aaaabe.vcf.gz.sr_counts.txt lt5kb.aaaabe.vcf.gz.sr_sum.txt.gz
gzip lt5kb.aaaabe.vcf.gz.sr_counts.txt

Edit: It looks like the message was emitted from here? https://github.com/samtools/htslib/blob/3b1ff68575c9db582a122048a7e95aa2b1a47b11/tabix.c#L289

daviesrob commented 4 years ago

Yes, you're correct. Not sure how I managed to miss that. Anyway, it looks like it's failing without setting errno.

Does it always fail on this file, or are the failures random? I notice that get_tid() in tbx.c could return -1 without setting errno if the sequence name doesn't exist in its dictionary (and is_add == 0, which is how it is called from the iterator). It's called from get_intv(), which interprets a -1 return as an out-of-memory error, and this will propagate all the way up to tbx_itr_next(), which will return a failure code (-2). So if this is happening, your file should contain a sequence name that's not in the .tbi index, and the failure will always happen in the same place.

The other place where it might fail silently is in bgzf_read_block(), for example if it doesn't read a whole BGZF header; there are a few other spots in the same function too. Again, if the file is malformed this will be deterministic, but it may also occur randomly if you're getting short reads or other failures while downloading the data (although those ought to be pretty noisy).

It looks like more logging messages would be useful, and the get_tid() function needs fixing so it's possible to tell unexpected sequences apart from memory failures.
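
One quick, if partial, check on that theory is to compare the sequence names stored in the index against the names being queried (it won't catch a name that appears only in the data records, but it is cheap to run) - a rough sketch, assuming the first column of the region file holds the sequence names:

# names the .tbi index knows about
GCS_OAUTH_TOKEN=`gcloud auth application-default print-access-token` \
    tabix -l "$URI" | sort > indexed_names.txt
# names being requested
cut -f1 region.merged.bed | sort -u > queried_names.txt
# names queried but absent from the index
comm -13 indexed_names.txt queried_names.txt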

mwalker174 commented 4 years ago

The errors are non-deterministic, and we usually see records dropped in contiguous chunks. This seems consistent with some kind of packet loss (though I'm not sure if that's possible here). If that were the case, would it be caught there?

daviesrob commented 4 years ago

Pull request #1049 adds more error reporting to BGZF. Could you give it a try, please? Hopefully it will provide more clues as to where the problem is.

mwalker174 commented 4 years ago

Thank you! This is what I'm seeing in the logs now:

...
* Closing connection 678
* Connection #679 to host talkowski-sv-gnomad-output.storage-download.googleapis.com left intact
[E::bgzf_read_block] Invalid BGZF header at offset 1626705469
Reading "gs://talkowski-sv-gnomad-output/1KGP/matrix/1KGP_2504.PCRMINUS.batch_0_0.SR.txt.gz" failed

daviesrob commented 4 years ago

That means it didn't find the expected GZIP signature at offset 1626705469. Was that at the start of the last request, or some distance into it? It should be possible to tell by looking for the Content-Range: bytes header in the log of the last HTTP request.
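
Something like this, assuming the verbose output was saved to a file (tabix.debug.log here is a placeholder name), should show the ranges of the last few requests:

grep 'Content-Range: bytes' tabix.debug.log | tail -n 5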

It would also be useful to know what's in the file at that position, which should be possible by running:

curl -s -r 1626705469- 'https://talkowski-sv-gnomad-output.storage-download.googleapis.com/1KGP/matrix/1KGP_2504.PCRMINUS.batch_0_0.SR.txt.gz' | hd |  head -n 2

It should look something like this (the first 12 bytes ought to match exactly; the later ones will vary):

00000000  1f 8b 08 04 00 00 00 00  00 ff 06 00 42 43 02 00  |............BC..|
00000010  fd 23 ed 9d 5d 73 1c b7  92 a6 af 39 bf a2 e3 9c  |.#..]s.....9....|

If it doesn't, HTSlib may have attempted to read the wrong part of the file; whereas if it does, it's more likely that the received data was corrupted in some way.

mwalker174 commented 4 years ago

Here is the full log: CountSRUnder5kb-5.censored.log

I noticed that the content range is much smaller than the previous chunks:

< Content-Range: bytes 1626705469-10964074170/10964074171
< Content-Length: 19

And there was an HTTP 503 error:

< HTTP/1.1 503 Service Unavailable

Curiously, it looks like this occurred in other failed tasks on this run. However, I looked back at the logs from previous runs and they do not have 503. Did you change anything in the latest commit that may have changed this behavior?

Edit: I realized there was a problem with the hexdump, will post momentarily.

mwalker174 commented 4 years ago

Hexdump:

$ curl -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" -s -r 1626705469- 'https://talkowski-sv-gnomad-output.storage-download.googleapis.com/1KGP/matrix/1KGP_2504.PCRMINUS.batch_0_0.SR.txt.gz' | hexdump -C |  head -n 2
00000000  1f 8b 08 04 00 00 00 00  00 ff 06 00 42 43 02 00  |............BC..|
00000010  95 24 8d 9d cb 8e e5 3c  93 18 d7 9f 9f 46 a2 c4  |.$.....<.....F..|

daviesrob commented 4 years ago

The hexdump looks OK. There are, as far as I'm aware, no changes that would have made the 503 error appear. It could be a different manifestation of the existing problem, or possibly a new one. At least with this, HTSlib knows something went wrong with the transfer. The earlier failure where the server apparently sent the wrong data is more annoying as it's impossible to tell a failed transfer apart from an actual bad file.

tabix does make rather a lot of web requests when attempting to download this data. I notice from your log files that quite a lot of them are reading from the same file location, probably because the data you want comes from the same BGZF block. There's a caching feature in the BGZF library that should stop this, but tabix doesn't use it yet. I'll try adding it to see if that makes your downloads more reliable.

mwalker174 commented 4 years ago

Hi @daviesrob, just checking in to see if you have progressed on implementing the block caching feature. I do agree that would probably help. Another idea is to add automatic retries in the case of failed requests.

daviesrob commented 4 years ago

Sorry, I've been diverted for the past few days by virtual conferences and an (unrelated) diversion into amplicon sequencing.

Meanwhile, I've tried adding BGZF caching to tabix, and disappointingly it ended up making the problem worse in my test. The trouble is that bgzf's load_from_cache() calls hseek on a cache hit so that the underlying file pointer keeps track of where it would have been if the block had actually been read. While this isn't much of a problem when reading normal files, it can cause lots of HTTP requests for remote files. To keep the number of requests down, commit ad2e24f added a work-around to hfile_libcurl that delayed doing the "seek" operation until an attempt was made to actually read some new data. Unfortunately it looks like this isn't working as well as it should. I'm currently trying to come up with a way of fixing it.

While retries would also be useful, I'm not sure they would solve the problem completely either. From your comments it looks like at least some of the errors you have been getting were bad data from an apparently-working connection, rather than a request failure or short output. Unfortunately, that sort of error gets detected fairly high up in the stack, and attempting to retry from that point could get rather complicated.

daviesrob commented 4 years ago

Right, I think I may have a solution now. Could you give pull request #1053 a try, please?

In my tests, it considerably reduced the number of http requests. Hopefully that will lead to less load on the server providing your data and make the transfers a bit more reliable.

mwalker174 commented 4 years ago

Thank you for the update, @daviesrob. It may be reducing the number of failures (2 this run compared to 6 last time). Neither failure involved the HTTP 503 this time, but the errors appear to be different:

tabix: hfile_libcurl.c:782: libcurl_read: Assertion `fp->base.offset == fp->delayed_seek && fp->base.begin == fp->base.buffer && fp->base.end == fp->base.buffer' failed.

and

[E::bgzf_read_block] Invalid BGZF header at offset 10211680883

Here are the two log files:

CountPE-9.censored.log CountPE-15.censored.log

daviesrob commented 4 years ago

The first error was due to a bug - I've pushed an update to the pull request that should fix it.

The second one looks more like a manifestation of the original problem.

mwalker174 commented 4 years ago

That seems to have cleared up the libcurl error, but I am still seeing the Invalid BGZF header error: CountPEUnder5kb-27.censored.log

It seems to me we may have gotten as far as we can go on the client end, and we can probably proceed with the failure rate this low in our workflows. Are there any additional tweaks you would like to try? If not, I can bring the remaining issue up with GCP engineers.

daviesrob commented 4 years ago

Yes, I think we're reaching the limits of what we can do, apart maybe from adding even more logging.

I guess if the failure rate is low enough now, you can at least work around the problem by repeating the tabix download if it fails the first time.
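
For what it's worth, the wrapper could be as simple as the following sketch (placeholder filenames, and an arbitrary limit of three attempts):

set -o pipefail   # so the pipeline status reflects a tabix failure
for attempt in 1 2 3; do
    GCS_OAUTH_TOKEN=`gcloud auth application-default print-access-token` \
        tabix -R region.merged.bed "$URI" | bgzip -c > out.txt.gz && break
    echo "tabix attempt $attempt failed, retrying" >&2
    [ "$attempt" -eq 3 ] && exit 1    # give up after the last attempt
done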

mwalker174 commented 4 years ago

Hello @daviesrob, thank you for all your work on this issue so far. I have done some more testing and seen some peculiar behavior with the Invalid BGZF header bug: the error seems to be reproducible. I have a shell script that runs tabix twice back-to-back and verifies that the two downloads are identical:

GCS_OAUTH_TOKEN=`gcloud auth application-default print-access-token` \
  /usr/local/bin/tabix --verbosity 9 -h -R "${INTERVALS}" "${URI}" | bgzip -c > "${OUTPUT}"
GCS_OAUTH_TOKEN=`gcloud auth application-default print-access-token` \
  /usr/local/bin/tabix --verbosity 9 -h -R "${INTERVALS}" "${URI}" | bgzip -c > "__test__"
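
The verification boils down to comparing the decompressed outputs - a sketch along these lines (the exact check used may differ):

# illustrative check: compare the uncompressed content of the two downloads
if cmp -s <(bgzip -dc "${OUTPUT}") <(bgzip -dc __test__); then
    echo "downloads identical"
else
    echo "downloads differ" >&2
fi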

Natively on my local machine (MacOS High Sierra) this works fine, no problems. In Ubuntu 18.04 running on Google Compute VMs (both "natively" and within a docker container) it often gives the header error at the exact same place (retrieving the same blocks), but does succeed about half of the time. (Strangely, tabix gets a Forbidden error when I try an Ubuntu docker container on my local machine, but this may simply be a docker config issue). There does not seem to be any problem with the index file (hexdump showed |............BC..| as before, and a fresh index was identical). Do you have any thoughts? Could this be an issue with a dependency like zlib?

daviesrob commented 4 years ago

I doubt that it's a problem with zlib as that should be completely deterministic. It must either be GCS not always sending the same data back, or something randomly altering it between libcurl decrypting it and BGZF trying to uncompress the block.

I think the next step might be to get HTSlib to dump out the bytes that caused it to report a failure, so we can see exactly what came back.