samtools / htslib

C library for high-throughput sequencing data formats

S3 plugin does not correctly handle 307 redirects for newly created buckets #1760

Closed. andaca closed this issue 6 months ago

andaca commented 6 months ago

When you specify an S3 URL that does not include a region, AWS may return either a 400 response whose body specifies where to retry or, if the bucket is relatively new, a 307 response.

The HTSlib S3 plugin does not correctly follow the 307 redirects that AWS returns for newly created buckets. In the logs below, rather than sending an HTTPS request to the specified location (https://samtools-reads-data.s3.eu-west-2.amazonaws.com/r.cram), it sends an HTTP request that leaves out the bucket name (http://s3.eu-west-2.amazonaws.com/r.cram), and so always gets a 404.

The logs below are from 1.18 on macOS, but I have also seen this in 1.19 on Amazon Linux.

In my experience, AWS starts sending 400 responses rather than 307s the day after the bucket is created, at which point the "samtools view" command works.

❯ aws s3api create-bucket --bucket samtools-reads-data --create-bucket-configuration LocationConstraint=eu-west-2

❯ samtools view s3://samtools-reads-data/r.cram --verbosity 10
[D::init_add_plugin] Loaded "mem"
[D::init_add_plugin] Loaded "crypt4gh-needed"
[D::init_add_plugin] Loaded "libcurl"
[D::init_add_plugin] Loaded "gcs"
[D::init_add_plugin] Loaded "s3"
[D::init_add_plugin] Loaded "s3w"
*   Trying 3.5.7.115:443...
* Connected to samtools-reads-data.s3.amazonaws.com (3.5.7.115) port 443
* ALPN: curl offers h2,http/1.1
*  CAfile: /etc/ssl/cert.pem
*  CApath: none
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN: server accepted http/1.1
* Server certificate:
*  subject: CN=*.s3.amazonaws.com
*  start date: Oct 10 00:00:00 2023 GMT
*  expire date: Jul  3 23:59:59 2024 GMT
*  subjectAltName: host "samtools-reads-data.s3.amazonaws.com" matched cert's "*.s3.amazonaws.com"
*  issuer: C=US; O=Amazon; CN=Amazon RSA 2048 M01
*  SSL certificate verify ok.
* using HTTP/1.1
> GET /r.cram HTTP/1.1
Host: samtools-reads-data.s3.amazonaws.com
User-Agent: htslib/1.18 libcurl/8.4.0
Accept: */*
Authorization: REDACTED
x-amz-date: 20240319T145000Z
x-amz-content-sha256: REDACTED
X-Amz-Security-Token: REDACTED

< HTTP/1.1 307 Temporary Redirect
< x-amz-bucket-region: eu-west-2
< x-amz-request-id: REDACTED
< x-amz-id-2: REDACTED
< Location: https://samtools-reads-data.s3.eu-west-2.amazonaws.com/r.cram
< Content-Type: application/xml
< Transfer-Encoding: chunked
< Date: Tue, 19 Mar 2024 14:50:00 GMT
< Server: AmazonS3
<
*   Trying 52.95.191.45:80...
* Connected to s3.eu-west-2.amazonaws.com (52.95.191.45) port 80
> GET /r.cram HTTP/1.1
Host: s3.eu-west-2.amazonaws.com
User-Agent: htslib/1.18 libcurl/8.4.0
Accept: */*
Authorization: REDACTED
x-amz-date: 20240319T145000Z
x-amz-content-sha256: REDACTED
X-Amz-Security-Token: REDACTED

< HTTP/1.1 404 Not Found
< x-amz-request-id: REDACTED
< x-amz-id-2: REDACTED
< Content-Type: application/xml
< Transfer-Encoding: chunked
< Date: Tue, 19 Mar 2024 14:50:00 GMT
< Server: AmazonS3
<
* Closing connection
* Closing connection
[E::hts_open_format] Failed to open file "s3://samtools-reads-data/r.cram" : No such file or directory
samtools view: failed to open "s3://samtools-reads-data/r.cram" for reading: No such file or directory

Documentation of the 307 redirect: https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingRouting.html
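
For illustration, here is a minimal Python sketch of the behaviour the report expects when a 307 is followed. It is not HTSlib code, and `redirect_target` is a hypothetical helper name. It resolves the `Location` header against the original request URL, so an absolute `Location` such as the one in the log keeps both its `https` scheme and its bucket-qualified host.

```python
from urllib.parse import urljoin, urlsplit


def redirect_target(request_url: str, location: str) -> str:
    """Resolve a Location header against the URL that produced it.

    An absolute Location (as in the log above) is returned unchanged, so the
    scheme and the bucket-qualified host both survive the redirect. A relative
    Location inherits the scheme and host of the original request.
    """
    target = urljoin(request_url, location)
    parts = urlsplit(target)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"cannot resolve redirect target: {location!r}")
    return target


# Values copied from the 307 response in the log above.
request_url = "https://samtools-reads-data.s3.amazonaws.com/r.cram"
location = "https://samtools-reads-data.s3.eu-west-2.amazonaws.com/r.cram"

target = redirect_target(request_url, location)
parts = urlsplit(target)
print(parts.scheme, parts.hostname, parts.path)
# prints: https samtools-reads-data.s3.eu-west-2.amazonaws.com /r.cram
```

Compared with the second request in the log, the resolved target still uses HTTPS and still carries the bucket name in the host.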

daviesrob commented 6 months ago

The develop branch recently got an update (#1756) which made it follow 307 redirects. Does that fix your problem?

andaca commented 6 months ago

Legend!

The redirect now finds the file, so we get the correct data. However, the redirected request is still sent over HTTP rather than the expected HTTPS, which isn't ideal, though that may need a ticket of its own?

[root@10c2175e13ea samtools]# samtools view s3://MYBUCKET/NA12878/alignment/NA12878.cram chr2:178523221-178523636 --verbosity 10 1>/dev/null
[D::init_add_plugin] Loaded "mem"
[D::init_add_plugin] Loaded "crypt4gh-needed"
[M::hfile_s3_write.init] version 1.19.1-27-g78e507db
[D::init_add_plugin] Loaded "/usr/local/libexec/htslib/hfile_s3_write.so"
[D::init_add_plugin] Loaded "/usr/local/libexec/htslib/hfile_libcurl.so"
[M::hfile_s3.init] version 1.19.1-27-g78e507db
[D::init_add_plugin] Loaded "/usr/local/libexec/htslib/hfile_s3.so"
[M::hfile_gcs.init] version 1.19.1-27-g78e507db
[D::init_add_plugin] Loaded "/usr/local/libexec/htslib/hfile_gcs.so"
* processing: https://MYBUCKET.s3.amazonaws.com/NA12878/alignment/NA12878.cram
*   Trying 3.5.29.112:443...
* Connected to MYBUCKET.s3.amazonaws.com (3.5.29.112) port 443
* ALPN: offers h2,http/1.1
*  CAfile: /etc/pki/tls/certs/ca-bundle.crt
*  CApath: none
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN: server accepted http/1.1
* Server certificate:
*  subject: CN=*.s3.amazonaws.com
*  start date: Oct 10 00:00:00 2023 GMT
*  expire date: Jul  3 23:59:59 2024 GMT
*  subjectAltName: host "MYBUCKET.s3.amazonaws.com" matched cert's "*.s3.amazonaws.com"
*  issuer: C=US; O=Amazon; CN=Amazon RSA 2048 M01
*  SSL certificate verify ok.
* using HTTP/1.1
> GET /NA12878/alignment/NA12878.cram HTTP/1.1
Host: MYBUCKET.s3.amazonaws.com
User-Agent: htslib/1.19.1-27-g78e507db libcurl/8.2.1
Accept: */*
Authorization: REDACTED
x-amz-date: 20240321T145922Z
x-amz-content-sha256: REDACTED
X-Amz-Security-Token: REDACTED

< HTTP/1.1 307 Temporary Redirect
< x-amz-bucket-region: eu-west-2
< x-amz-request-id: P98TDRFRFMBFFX0J
< x-amz-id-2: REDACTED
< Location: https://MYBUCKET.s3.eu-west-2.amazonaws.com/NA12878/alignment/NA12878.cram
< Content-Type: application/xml
< Transfer-Encoding: chunked
< Date: Thu, 21 Mar 2024 14:59:22 GMT
< Server: AmazonS3
<
* processing: MYBUCKET.s3.eu-west-2.amazonaws.com/NA12878/alignment/NA12878.cram
*   Trying 3.5.245.174:80...
* Connected to MYBUCKET.s3.eu-west-2.amazonaws.com (3.5.245.174) port 80
> GET /NA12878/alignment/NA12878.cram HTTP/1.1
Host: MYBUCKET.s3.eu-west-2.amazonaws.com
User-Agent: htslib/1.19.1-27-g78e507db libcurl/8.2.1
Accept: */*
Authorization: REDACTED 
x-amz-date: 20240321T145922Z
x-amz-content-sha256: REDACTED
X-Amz-Security-Token: REDACTED

< HTTP/1.1 200 OK
< x-amz-id-2: REDACTED
< x-amz-request-id: REDACTED
< Date: Thu, 21 Mar 2024 14:59:24 GMT
< Last-Modified: Thu, 21 Mar 2024 14:54:31 GMT
< ETag: REDACTED
< x-amz-server-side-encryption: AES256
< Accept-Ranges: bytes
< Content-Type: binary/octet-stream
< Server: AmazonS3
< Content-Length: 15797182294
<
* Closing connection
[I::hts_idx_check_local] Using alignment file 'NA12878.cram'
* processing: http://MYBUCKET.s3.eu-west-2.amazonaws.com/NA12878/alignment/NA12878.cram
* Found bundle for host: 0x626800 [serially]
* Can not multiplex, even if we wanted to
* Hostname MYBUCKET.s3.eu-west-2.amazonaws.com was found in DNS cache
*   Trying 3.5.245.174:80...
* Connected to MYBUCKET.s3.eu-west-2.amazonaws.com (3.5.245.174) port 80
> GET /NA12878/alignment/NA12878.cram HTTP/1.1
Host: MYBUCKET.s3.eu-west-2.amazonaws.com
Range: bytes=2192306292-
User-Agent: htslib/1.19.1-27-g78e507db libcurl/8.2.1
Accept: */*
Authorization: REDACTED 
x-amz-date: 20240321T145922Z
x-amz-content-sha256: REDACTED
X-Amz-Security-Token: REDACTED

< HTTP/1.1 206 Partial Content
< x-amz-id-2: REDACTED
< x-amz-request-id: REDACTED
< Date: Thu, 21 Mar 2024 14:59:24 GMT
< Last-Modified: Thu, 21 Mar 2024 14:54:31 GMT
< ETag: REDACTED
< x-amz-server-side-encryption: AES256
< Accept-Ranges: bytes
< Content-Range: bytes 2192306292-15797182293/15797182294
< Content-Type: binary/octet-stream
< Server: AmazonS3
< Content-Length: 13604876002
<
* Closing connection
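
One detail in the log above: the "* processing:" line after the 307 shows the redirect target without a scheme, and the next connection goes to port 80. libcurl guesses the protocol for a URL that has no scheme and falls back to HTTP, which would be consistent with what is seen here. The Python sketch below only illustrates the idea of giving a bare "host/path" string the scheme of the request that triggered the redirect. `with_scheme` is a hypothetical helper, not part of HTSlib or libcurl, and this is not a claim about how the plugin should be fixed.

```python
def with_scheme(url: str, request_scheme: str = "https") -> str:
    """Return url unchanged if it already has a scheme, else prefix one.

    Hypothetical helper: a bare "host/path" string is given the scheme of the
    request that triggered the redirect, rather than being left for the HTTP
    client to guess.
    """
    if "://" in url:
        return url
    return f"{request_scheme}://{url}"


# Scheme-less target as it appears on the "* processing:" line in the log.
bare = "MYBUCKET.s3.eu-west-2.amazonaws.com/NA12878/alignment/NA12878.cram"
print(with_scheme(bare))
# prints: https://MYBUCKET.s3.eu-west-2.amazonaws.com/NA12878/alignment/NA12878.cram
```

A URL that already carries a scheme passes through untouched, so the helper is safe to apply to every redirect target.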

Here's the version I built:

samtools 1.19.2-18-gd4c981b
Using htslib 1.19.1-27-g78e507db
Copyright (C) 2024 Genome Research Ltd.

Samtools compilation details:
    Features:       build=configure curses=yes 
    CC:             gcc
    CPPFLAGS:       
    CFLAGS:         -Wall -g -O2
    LDFLAGS:        
    HTSDIR:         
    LIBS:           
    CURSES_LIB:     -lncursesw

HTSlib compilation details:
    Features:       build=configure libcurl=yes S3=yes GCS=yes libdeflate=no lzma=yes bzip2=yes plugins=yes plugin-path=/usr/local/libexec/htslib: htscodecs=1.6.0
    CC:             gcc
    CPPFLAGS:       
    CFLAGS:         -Wall -g -O2 -fvisibility=hidden
    LDFLAGS:        -fvisibility=hidden -rdynamic

HTSlib URL scheme handlers present:
    built-in:    preload, data, file
    Google Cloud Storage:    gs+http, gs+https, gs
    Amazon S3:   s3+https, s3+http, s3
    libcurl:     imaps, pop3, gophers, http, smb, gopher, sftp, ftps, imap, smtp, smtps, rtsp, scp, ftp, telnet, mqtt, ldap, https, ldaps, smbs, tftp, pop3s, dict
    S3 Multipart Upload:     s3w, s3w+https, s3w+http
    crypt4gh-needed:     crypt4gh
    mem:     mem

And here's how I built it, in case I made a silly mistake somewhere:

FROM fedora:latest

ENV LD_LIBRARY_PATH="/usr/local/lib:$LD_LIBRARY_PATH"

RUN dnf update -y
RUN dnf install -y tar bzip2 wget gcc make vim gdb zlib-devel bzip2-devel \
           xz-devel openssl-devel ncurses-devel automake autoconf git \
           libcurl-devel

WORKDIR /git
RUN git clone https://github.com/samtools/samtools
RUN git clone --recurse-submodules https://github.com/samtools/htslib

WORKDIR /git/htslib
RUN git fetch && git switch develop
RUN autoreconf -i
RUN ./configure --enable-plugins --enable-libcurl --enable-s3
RUN make
RUN make install

WORKDIR /git/samtools
RUN autoreconf -i
RUN ./configure --with-htslib=system
RUN make
RUN make install

daviesrob commented 6 months ago

Hmm, that shouldn't happen. I suspect the problem may be in redirect_endpoint_callback, but I'll have to check.

daviesrob commented 6 months ago

I found a way to reproduce the problem on the public 1000genomes bucket. Hopefully #1762 will fix it.