samtools / htslib

C library for high-throughput sequencing data formats
Other
804 stars 446 forks source link

Calling tabix on S3 files via jupyter notebook fails #1625

Closed agilly closed 6 months ago

agilly commented 1 year ago

I am experiencing a very peculiar issue. I am using jupyter notebook to test some R code that calls tabix to query some files stored in S3 buckets.

The following command is an example of what I would be running on the bash command line:

AWS_PROFILE=tabix tabix s3://path/to.tsv.gz 2:1234-1234

The above command works. Now wrapping it in R (4.1.2):

out=system("AWS_PROFILE=tabix tabix s3://path/to.tsv.gz 2:1234-1234", intern=T)

This also works.

But now, using the exact same code in a jupyter notebook running IRKernel crashes. The log of the kernel reports the following:

[E::bgzf_read_block] Invalid BGZF header at offset 356561
Reading "s3://path/to.tsv.gz" failed: Resource temporarily unavailable

I am querying 100s of files but this happens only for a handful of them, systematically, for seemingly any region. I tried reindexing the file, the bug still occurs. I tried with htslib 1.12 and the latest 1.17, same thing.

This is probably going to be difficult to debug because there are many moving parts: tabix, R, and jupyter. Just checking if you have an idea why specifically calling something from jupyter would cause the error above? I cannot send the tsv files as they are confidential (and very large) but I am happy to do more exploration/debugging if you tell me what to do.

may be related to https://github.com/samtools/htslib/issues/1037

TIA!

daviesrob commented 1 year ago

The "Resource temporarily unavailable" message comes from errno getting set to EAGAIN or EWOULDBLOCK, which indicates that a system call was interrupted. This may be harmless, if the call was restarted, or it may indicate that something isn't handling such interruptions correctly.

You could try running tabix --verbosity=8 to increase the amount of debugging output printed (be warned this could be quite a lot). If we're lucky, that might give some clues as to what's going wrong.

Otherwise, if I were debugging this I'd attempt to find out what's going on at the point where bgzf_read_block reports its failure. As you're running lots of tabix processes, this might involve adding something before this return statement to either print more information about what's going on, or allow a debugger to be attached (either by dumping a core file with abort() and inspecting the results, or making it sleep for long enough for a debugger to be manually attached using gdb --pid assuming this is possible).

daviesrob commented 1 year ago

Did you manage to get any more information on this problem?

agilly commented 1 year ago

Hi, I tried increasing the verbosity, but I don't think there is much that's helpful there, the connection just seems to close (I replaced sensitive info and hashes by ... or xxx. Maybe you'll see something I missed?

If it's indeed not much help, I can go on adding tracers and recompiling tabix, but this is likely to take a long time on my side because I have to juggle between several projects at the moment.

[D::init_add_plugin] Loaded "mem"                 
[D::init_add_plugin] Loaded "crypt4gh-needed"      
[D::init_add_plugin] Loaded "libcurl"                
[D::init_add_plugin] Loaded "gcs"     
[D::init_add_plugin] Loaded "s3"                    
[D::init_add_plugin] Loaded "s3w"                                                                                                                            
*   Trying 3.5.29.212...                                 
* TCP_NODELAY set                        
* Connected to xxx.s3.amazonaws.com (x.x.x.x) port 443 (#0)
* found 137 certificates in /etc/ssl/certs/ca-certificates.crt                                                                                                                                                                                                                                                             
* found 411 certificates in /etc/ssl/certs                                                                                                                                                                                                                                                                                 
* ALPN, offering http/1.1                                                                                                                                                                                                                                                                                                  
* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256                                                                                                                                                                                                                                                               
*        server certificate verification OK                                   
*        server certificate status verification SKIPPED                                                                                                                                                                                              
*        common name: *.s3.amazonaws.com (matched)                                               
*        server certificate expiration date OK                                                           
*        server certificate activation date OK
*        certificate public key: RSA
*        certificate version: 
*        subject: CN=*.s3.amazonaws.com
*        start date: Tue, 21 Mar 2023 00:00:00 GMT
*        expire date: Tue, 19 Dec 2023 23:59:59 GMT
*        issuer: C=US,O=Amazon,CN=Amazon RSA 2048 M01
*        compression: NULL            
* ALPN, server accepted to use http/1.1             
> GET /path/to.tsv.gz HTTP/1.1
Host: rgc-tag-data.s3.amazonaws.com      
User-Agent: htslib/1.12 libcurl/7.58.0
Accept: */*                  
x-amz-security-token: ...
Authorization: AWS4-HMAC-SHA256 Credential=.../.../us-east-1/s3/aws4_request,SignedHeaders=host;x-amz-content-sha256;x-amz-date;x-amz-security-token,Signature=...
x-amz-date: 20230530T173337Z                                  
x-amz-content-sha256: xxx

< HTTP/1.1 200 OK                                           
< x-amz-id-2: xxx
< x-amz-request-id: xxx
< Date: Tue, 30 May 2023 17:33:38 GMT             
< Last-Modified: Fri, 26 May 2023 14:32:19 GMT
< ETag: "xxx"
< x-amz-server-side-encryption: AES256
< x-amz-version-id: xxx
< Accept-Ranges: bytes                 
< Content-Type: text/tab-separated-values         
< Server: AmazonS3                                 
< Content-Length: 25973005583                        
<                         
* stopped the pause stream!            
* Closing connection 0                                                                                                                                       
* Hostname rgc-tag-data.s3.amazonaws.com was found in DNS cache
*   Trying 3.5.29.212...
* TCP_NODELAY set                     
* Connected to rgc-tag-data.s3.amazonaws.com (3.5.29.212) port 443 (#0)
* found 137 certificates in /etc/ssl/certs/ca-certificates.crt
* found 411 certificates in /etc/ssl/certs
*        ALPN, offering http/1.1
*        SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
*        server certificate verification OK                                   
*        server certificate status verification SKIPPED
*        common name: *.s3.amazonaws.com (matched)
*        server certificate expiration date OK                                        
*        server certificate activation date OK
*        certificate public key: RSA
*        certificate version: #3                                                                                          
*        subject: CN=*.s3.amazonaws.com
*        start date: Tue, 21 Mar 2023 00:00:00 GMT
*        expire date: Tue, 19 Dec 2023 23:59:59 GMT
*        issuer: C=US,O=Amazon,CN=Amazon RSA 2048 M01
*        compression: NULL            
* ALPN, server accepted to use http/1.1             
> GET /path/to.tsv.gz HTTP/1.1
Host: rgc-tag-data.s3.amazonaws.com                      
User-Agent: htslib/1.12 libcurl/7.58.0   
Accept: */*       
x-amz-security-token: xxx
Authorization: AWS4-HMAC-SHA256 Credential=xxx/xxx/us-east-1/s3/aws4_request,SignedHeaders=host;x-amz-content-sha256;x-amz-date;x-amz-security-token,Signature=xxx
x-amz-date: 20230530T173338Z                                                                     
x-amz-content-sha256: xxx  
< HTTP/1.1 200 OK
< x-amz-id-2: xxx
< x-amz-request-id: xxx
< Date: Tue, 30 May 2023 17:33:39 GMT
< Last-Modified: Fri, 26 May 2023 14:32:19 GMT
< ETag: "xxx"
< x-amz-server-side-encryption: AES256
< x-amz-version-id: xxx
< Accept-Ranges: bytes
< Content-Type: text/tab-separated-values
< Server: AmazonS3
< Content-Length: 25973005583
< 
[I::hts_idx_check_local] Using alignment file 'to.tsv.gz'
* Found bundle for host rgc-tag-data.s3.amazonaws.com: xxx [can pipeline]
*   Trying x.x.x.x...
* TCP_NODELAY set
* Connected to rgc-tag-data.s3.amazonaws.com (3.5.29.212) port 443 (#1)
* found 137 certificates in /etc/ssl/certs/ca-certificates.crt
* found 411 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
*        server certificate verification OK
*        server certificate status verification SKIPPED
*        common name: *.s3.amazonaws.com (matched)
*        server certificate expiration date OK
*        server certificate activation date OK
*        certificate public key: RSA
*        certificate version: #3
*        subject: CN=*.s3.amazonaws.com
*        start date: Tue, 21 Mar 2023 00:00:00 GMT
*        expire date: Tue, 19 Dec 2023 23:59:59 GMT
*        issuer: C=US,O=Amazon,CN=Amazon RSA 2048 M01
*        compression: NULL
* ALPN, server accepted to use http/1.1
> GET /path/to.tsv.gz HTTP/1.1
Host: xxx.s3.amazonaws.com
Range: bytes=3565615677-
User-Agent: htslib/1.12 libcurl/7.58.0
Accept: */*
x-amz-security-token: xxx
Authorization: AWS4-HMAC-SHA256 Credential=xxx/xxx/us-east-1/s3/aws4_request,SignedHeaders=host;x-amz-content-sha256;x-amz-date;x-amz-security-token,Signature=xxx
x-amz-date: 20230530T173338Z
x-amz-content-sha256: xxx

< HTTP/1.1 206 Partial Content
< x-amz-id-2: xxx
< x-amz-request-id: xxx
< Date: Tue, 30 May 2023 17:33:39 GMT
< Last-Modified: Fri, 26 May 2023 14:32:19 GMT
< ETag: "xxx"
< x-amz-server-side-encryption: AES256
< x-amz-version-id: xxx
< Accept-Ranges: bytes
< Content-Range: bytes 3565615677-25973005582/25973005583
< Content-Type: text/tab-separated-values
< Server: AmazonS3
< Content-Length: 22407389906
< 
* stopped the pause stream!
* Closing connection 0
[E::bgzf_read_block] Invalid BGZF header at offset 3565615
Reading "s3://path/to.tsv.gz" failed: Resource temporarily unavailable
daviesrob commented 1 year ago

Unfortunately there's not much to go on there. It does look like the connection got closed before the "Invalid BGZF header" message was printed, but it's not clear why. It looks like it hangs up very quickly - it that the case? One other mystery is that the request said:

Range: bytes=3565615677-

but the error shows offset 3565615. Where did the 677 go? Was it lost when copying the messages over to GitHub?

Is the version of libcurl you're using the same when running on the command line and in the Jupyter notebook? 7.58.0 is quite old now, it might be worth trying a more up-to-date version. Apart from that, I fear you may have to resort to using strace or gdb to see if you can get more information about what's happening.

daviesrob commented 6 months ago

Hopefully this was fixed by https://github.com/samtools/htslib/pull/1676. Please add a comment if you're still having problems.