samtools / htslib

C library for high-throughput sequencing data formats

htslib seek/libcurl_seek() doesn't fail on errors #604

Open ramyala opened 7 years ago

ramyala commented 7 years ago

The following Python code calls into htslib's seek (hseek()) and hts_itr_next() to fetch reads. When libcurl_seek() fails, the code still runs to completion and produces erroneous output. I've seen this happen when connections fail with SSL issues or 503 Service Unavailable errors. The ideal behavior would be for libcurl_seek() to return a failure code instead of success, which would let pysam raise an exception so the user can either re-establish the connection or handle the issue gracefully.

import pysam

# url, chrom, start, end and process() stand in for the real values;
# the BAM is fetched over HTTPS via htslib's libcurl handler.
samfile = pysam.AlignmentFile(url)
num_reads = 0
for read in samfile.fetch(chrom, start, end):
    process(read)
    num_reads += 1
print "my_process:Total reads: ", num_reads

Error 1:
[I::hseek] fp: 0xc6d37e8520, mode: SET, seek to: 5895168049 
* Found bundle for host gdc-api.nci.nih.gov: 0xc6d3816310 [can pipeline] 
* Hostname gdc-api.nci.nih.gov was found in DNS cache 
*   Trying 192.170.230.228... 
* TCP_NODELAY set 
* Connected to gdc-api.nci.nih.gov (192.170.230.228) port 443 (#71) 
* found 664 certificates in /etc/ssl/certs 
* ALPN, offering http/1.1 
* gnutls_handshake() failed: Error in the pull function. 
* Curl_http_done: called premature == 1 
* stopped the pause stream! 
* Closing connection 71 
INFO:my_process:Total reads: 0 

Error 2:

[I::hts_itr_query] idx: 0x6525a40da0, tid: 6, beg: 51396407, end: 51396425 
[I::hts_itr_next] GRAB MORE: bgzf_fp: 0x65249c7a10, iter: finished: 0, read_rest: 0, curr: (0:0-0@0x0), inst: (6:51396407-51396425@1), i: -1, n_off: 1, offset: 0x65249642d0 
[I::hseek] fp: 0x65239704d0, mode: SET, seek to: 4518809043 
* Found bundle for host gdc-api.nci.nih.gov: 0x652399b310 [can pipeline] 
* Hostname gdc-api.nci.nih.gov was found in DNS cache 
*   Trying 192.170.230.228... 
* TCP_NODELAY set 
* Connected to gdc-api.nci.nih.gov (192.170.230.228) port 443 (#49) 
* found 664 certificates in /etc/ssl/certs 
* ALPN, offering http/1.1 
* SSL connection using TLS1.2 / ECDHE_RSA_AES_256_GCM_SHA384 
*    server certificate verification OK 
*    server certificate status verification SKIPPED 
*    common name: gdc-api.nci.nih.gov (matched) 
*    server certificate expiration date OK 
*    server certificate activation date OK 
*    certificate public key: RSA 
*    certificate version: #3 
*    subject: C=US,postalCode=60305,ST=IL,L=River Forest,street=400 Lathrop Avenue,O=Open Cloud Consortium,OU=PremiumSSL,CN=gdc-api.nci.nih.gov 
*    start date: Thu, 18 Jun 2015 00:00:00 GMT 
*    expire date: Sun, 17 Jun 2018 23:59:59 GMT 
*    issuer: C=GB,ST=Greater Manchester,L=Salford,O=COMODO CA Limited,CN=COMODO RSA Organization Validation Secure Server CA 
*    compression: NULL 
* ALPN, server did not agree to a protocol 
> GET /data/a8fe0928-3d0c-4d16-be11-a5dd077bdb2c HTTP/1.1  
Host: gdc-api.nci.nih.gov  
Range: bytes=4518809043-  
User-Agent: htslib/1.4.1-86-g4a68964 libcurl/7.52.1  
Accept: */*  
X-Auth-Token:  REMOVED
* The requested URL returned error: 503 SERVICE UNAVAILABLE 
* Curl_http_done: called premature == 1 
* stopped the pause stream! 
* Closing connection 49 
INFO:my_process:Total reads: 0 
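
For now I'm working around this on the pysam side with a retry loop along the lines of the sketch below (MAX_RETRIES and the exception types are guesses; it can only help once libcurl_seek() actually reports the failure):

import pysam

MAX_RETRIES = 3  # placeholder value

def count_reads(url, chrom, start, end):
    # Count reads in a region, re-establishing the connection on failure.
    for attempt in range(MAX_RETRIES):
        samfile = pysam.AlignmentFile(url)
        try:
            return sum(1 for read in samfile.fetch(chrom, start, end))
        except (IOError, OSError):
            # Only reachable once libcurl_seek() reports its failure;
            # today the fetch just ends early with no error raised.
            pass
        finally:
            samfile.close()
    raise IOError("fetch failed after %d attempts" % MAX_RETRIES)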
daviesrob commented 7 years ago

What logging level have you set? If hts_verbose is greater than 8, line 722 stops CURLOPT_FAILONERROR from being set. This stops curl from reporting the HTTP result code at the point where libcurl_seek() expects to find out about it.
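
The effect is easy to sketch outside htslib (pycurl here as a stand-in for the equivalent C calls in hfile_libcurl.c; the coupling of verbose logging to FAILONERROR is simplified, not the actual htslib logic):

import pycurl

def ranged_get(url, offset, verbose=False):
    # Returns 0 on success, -1 on failure (mimicking an hseek-style API).
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.RANGE, "%d-" % offset)
    c.setopt(pycurl.WRITEFUNCTION, lambda data: len(data))  # discard body
    c.setopt(pycurl.VERBOSE, 1 if verbose else 0)
    # With verbose logging on, FAILONERROR is left unset, so an HTTP 503
    # does NOT make perform() raise -- the failure is silent unless the
    # caller checks the response code itself.
    c.setopt(pycurl.FAILONERROR, 0 if verbose else 1)
    try:
        c.perform()
    except pycurl.error:
        return -1                      # covers 4xx/5xx when FAILONERROR is set
    status = c.getinfo(pycurl.RESPONSE_CODE)
    return -1 if status >= 400 else 0  # explicit check needed otherwise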

We should probably make the overlap between enhanced logging and not setting CURLOPT_FAILONERROR a bit smaller, as it's quite easy to be caught out by this at the moment.

ramyala commented 7 years ago

@daviesrob I made sure that wasn't the case by setting the log level to 8.

Another issue: for workloads where fetch() is called across many compact BED-file ranges, an onerous number of new connections is established, one per seek. It might actually be better to read forward on the existing connection instead of seeking when the old and new offsets differ by less than, say, 1 MB. This is especially a problem for services like TCGA, which rate-limit or have trouble handling a large number of range queries at once. If you have better ideas around this, I'm happy to take them into consideration as I work around this issue.
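
Something along these lines inside the seek path (a pseudocode-level sketch; SMALL_SEEK, self.pos, self.read() and self.reopen_at() are placeholders for the real hfile internals):

SMALL_SEEK = 1024 * 1024  # 1 MiB threshold (placeholder value)

def seek(self, offset):
    delta = offset - self.pos
    if 0 <= delta < SMALL_SEEK:
        # Cheaper to read and discard a small gap than to tear down the
        # connection and issue a fresh HTTP Range request.
        while delta > 0:
            chunk = self.read(min(delta, 65536))
            if not chunk:              # connection dropped mid-gap
                return -1
            delta -= len(chunk)
    elif self.reopen_at(offset) < 0:   # new connection + Range request
        return -1
    self.pos = offset
    return 0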

daviesrob commented 7 years ago

If you leave the log level at the default (3, which is HTS_LOG_WARNING), does it work properly? And if not, can you get the same thing to happen with htslib's test/test_view program (which lets you specify a region)? Getting the Python part of this out of the way would make for easier debugging.

Making libcurl_seek() avoid small seeks has crossed my mind, but I haven't got around to implementing it yet. Unfortunately, even with this I suspect it's not going to behave very well: HTTP is designed for streaming, not random access, so it will never be a good fit.