samtools / htslib

C library for high-throughput sequencing data formats

Issue accessing S3-stored VCFs #1541

Closed · agilly closed this issue 1 year ago

agilly commented 1 year ago

This issue arises for both the latest release (1.16) and today's pull from the development branch.

Situation: Both a vcf.gz and its vcf.gz.tbi index file are stored in an S3 bucket. The commands below are run from an AWS instance running Ubuntu.

Behavior: Tabix queries (for example ./tabix -l s3://path/to/file.vcf.gz) fail with the error:

[E::idx_find_and_load] Could not retrieve index file for 's3://path/to/file.vcf.gz'
Could not load .tbi index of s3://path/to/file.vcf.gz: Permission denied

This error does not occur when a local copy of the .tbi exists in the cwd (after fetching with aws s3 cp). This makes sense since the index is now local, and -l triggers tbx_seqnames(...).
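In other words, this sequence works (same paths as above, just fetching the index locally first):

aws s3 cp s3://path/to/file.vcf.gz.tbi .
./tabix -l s3://path/to/file.vcf.gz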

When a range is provided, like:

./tabix s3://path/to/file.vcf.gz 1:1000-1001

an error is still raised, which would suggest S3 access is not possible at all. Here is what I get with --verbosity 9:

[D::init_add_plugin] Loaded "mem"
[D::init_add_plugin] Loaded "crypt4gh-needed"
[D::init_add_plugin] Loaded "libcurl"
[D::init_add_plugin] Loaded "gcs"
[D::init_add_plugin] Loaded "s3"
[D::init_add_plugin] Loaded "s3w"
*   Trying X.X.X.X...
* TCP_NODELAY set
* Connected to host.s3.amazonaws.com (X.X.X.X) port 443 (#0)
* found 124 certificates in /etc/ssl/certs/ca-certificates.crt
* found 372 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
*        server certificate verification OK
*        server certificate status verification SKIPPED
*        common name: *.s3.amazonaws.com (matched)
*        server certificate expiration date OK
*        server certificate activation date OK
*        certificate public key: RSA
*        certificate version: #3
*        subject: CN=*.s3.amazonaws.com
*        start date: Wed, 21 Sep 2022 00:00:00 GMT
*        expire date: Sat, 26 Aug 2023 23:59:59 GMT
*        issuer: C=US,O=Amazon,OU=Server CA 1B,CN=Amazon
*        compression: NULL
* ALPN, server accepted to use http/1.1
> GET /path/to/file.vcf.gz HTTP/1.1
Host: rgc-ag-data.s3.amazonaws.com
User-Agent: htslib/1.16-32-gd7737aa-dirty libcurl/7.58.0
Accept: */*

< HTTP/1.1 403 Forbidden
< x-amz-request-id: ******************
< x-amz-id-2: ********************************************************
< Content-Type: application/xml
< Transfer-Encoding: chunked
< Date: Wed, 21 Dec 2022 15:31:23 GMT
< Server: AmazonS3
< 
* stopped the pause stream!
* Closing connection 0
[E::hts_open_format] Failed to open file "s3://path/to/file.vcf.gz" : Permission denied
Could not open "s3://path/to/file.vcf.gz": Permission denied

I can confirm that I have access to both files via e.g. aws s3 cp. Does tabix require special permissions to be enabled?

daviesrob commented 1 year ago

As far as I'm aware it shouldn't need any special permissions. If aws s3 cp works but tabix doesn't, then I'd suspect a mismatch between how the two programs are getting the credentials needed to access your S3 bucket. Depending on how your AWS instance was set up, you may have to give HTSlib some hints about how to get them. The files it looks at, and the environment variables that can be used to influence it, are documented in the htslib-s3-plugin manual page.
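For example, something along these lines usually does the trick (a sketch only; the man page has the authoritative list of files and variables, and exactly which ones are honoured depends on your HTSlib version):

# Option 1: pass credentials via environment variables read by the S3 plugin
export AWS_ACCESS_KEY_ID=AKIA................
export AWS_SECRET_ACCESS_KEY=....................
# AWS_SESSION_TOKEN / AWS_DEFAULT_REGION may also be needed for your setup

# Option 2: put the same keys in the default profile of ~/.aws/credentials,
# the file the aws CLI also reads, so both tools see the same credentials
./tabix -l s3://path/to/file.vcf.gz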

agilly commented 1 year ago

Thank you for your reply @daviesrob. We figured it was some kind of permissions issue, so we tried a few things with colleagues. We used the curl command curl http://169.254.169.254/latest/meta-data/iam/security-credentials/team-name and added the key ID, access key and token to the ~/.aws/credentials file, which is one of the files htslib is compatible with. However, that only got us one step further, to another error. This time the file seems to be stat-able but not actually readable:

[E::test_and_fetch] Failed to close remote file s3://path/to/vcf.gz.tbi
[E::bgzf_read] Read block operation failed with error 4 after 0 of 4 bytes
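
For reference, the entry we added follows the usual AWS credentials layout, roughly like this (values redacted; the session token line is the part taken from the metadata response):

[default]
aws_access_key_id = ASIA................
aws_secret_access_key = ....................
aws_session_token = ....long.token.from.the.metadata.endpoint....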

Any pointers on what to investigate next?

daviesrob commented 1 year ago

It's a bit difficult to say from the information available. You might want to try boosting the verbosity again to see if it gives any hints. Also, have you switched to an older version of HTSlib? The "[E::test_and_fetch] Failed to close remote file" message only existed in that form between releases 1.5 and 1.10 (after which the function was renamed to idx_test_and_fetch).

It looks like you're using IAM credentials. Could they have expired while your process was running (it would have been going for quite a long time)? If that's the case, you could try the script in the short-lived credentials section of the htslib-s3-plugin manual page. The idea is that you run it in the background, where it wakes up occasionally and downloads a new set of credentials before the old ones have expired. HTSlib's S3 plugin will then refresh its stored credentials from the file if they're about to expire (note that this only works in version 1.16).
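The gist of that script is something like the following (a simplified sketch, not the exact script from the man page; jq and the team-name role from your curl command above are stand-ins for whatever is on your instance, and expiry_time is, as I understand it, the field the 1.16 refresh logic looks at):

#!/bin/sh
# Sketch of a credentials keep-alive loop; see the htslib-s3-plugin man page
# for the real script. Assumes jq is installed and team-name is the IAM role.
ROLE=team-name
CREDS="$HOME/.aws/credentials"
while true; do
    JSON=$(curl -s "http://169.254.169.254/latest/meta-data/iam/security-credentials/$ROLE")
    printf '[default]\naws_access_key_id = %s\naws_secret_access_key = %s\naws_session_token = %s\nexpiry_time = %s\n' \
        "$(printf '%s' "$JSON" | jq -r .AccessKeyId)" \
        "$(printf '%s' "$JSON" | jq -r .SecretAccessKey)" \
        "$(printf '%s' "$JSON" | jq -r .Token)" \
        "$(printf '%s' "$JSON" | jq -r .Expiration)" > "$CREDS"
    # Sleep well short of the credential lifetime so the file is rewritten
    # before the old keys stop working.
    sleep 1800
done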

agilly commented 1 year ago

Thanks @daviesrob, there was indeed a hiccup where we inadvertently switched to 1.12 while using your script. Running the keepalive script you mentioned solves the issue when using 1.16. Thanks! Closing issue.