samtools / htslib

C library for high-throughput sequencing data formats

support for non-amazonaws endpoint urls (e.g. s3.<company>.com) #436

Open blajoie opened 8 years ago

blajoie commented 8 years ago

We are using EMC on-premises S3-compatible storage.

Htslib does not accommodate endpoint URLs other than .s3.amazonaws.com.

We can successfully modify, recompile, and use htslib/samtools if we swap in our own endpoint URL, as below:

At line 846 of hfile_libcurl.c:

    // Use virtual hosted-style access if possible, otherwise path-style.
    if (is_dns_compliant(bucket, path)) {
        kputsn(bucket, path - bucket, &url);
        kputs("<custom_endpoint_url>", &url);
    }
    else {
        kputs("<custom_endpoint_url>/", &url);
        kputsn(bucket, path - bucket, &url);
    }
    kputs(path, &url);

Can htslib be modified to accept custom endpoint URLs from all potential config locations? e.g.:

  1. embedded within the URL - s3://id:secret:endpoint@bucket/ (or similar)
  2. env variable - AWS_ENDPOINT_URL=
  3. within ~/.aws/config - aws_endpoint_url= (or some other standard, extracted from aws profile?)

Or modify it to accept an 's3_domain' parameter?
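To make the options above concrete, they might look something like this (entirely hypothetical syntax and option names - none of this is currently supported, and s3.example.com is a placeholder endpoint):

# 1. endpoint embedded in the URL
samtools view 's3://id:secret:s3.example.com@mybucket/my.cram'
# 2. environment variable
AWS_ENDPOINT_URL=https://s3.example.com samtools view 's3://mybucket/my.cram'
# 3. in ~/.aws/config, within the chosen profile section
[default]
aws_endpoint_url = https://s3.example.com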

Looks like you've already touched on a few of these related to gcs: https://github.com/samtools/htslib/pull/390

- It would also be nice to modify the S3 logic to accept a --profile parameter, as used in aws-cli.

jmarshall commented 8 years ago

Yes, we'll soon support host_base in .s3cfg and some equivalent in .aws/config. I'm not really keen to invent new environment variables — are there other tools using such an environment variable?

At the moment you can select a profile to be used with s3://profile@bucket/…. What we actually need is some documentation of how the S3 URLs are parsed by htslib and what environment variables and configuration files are used when…

blajoie commented 8 years ago

Thanks!

No - that environment variable was a fictional creation on my part. Sticking to a well-known standard is best. Or perhaps additional information could be parsed from the supplied S3 (default or named) profile?

I will play a bit with the s3://profile@bucket/... syntax! Happy to provide additional feedback and/or some basic documentation regarding the htslib/S3 interaction.

sb10 commented 7 years ago

I can see that in samtools 1.4 there is an hfile_s3.c that seems to have all the necessary code for getting the correct domain from ~/.s3cfg, but is it actually supposed to be working now?

I tried installing by:

sudo apt-get install gcc make zlib1g-dev libbz2-dev liblzma-dev libcurl4-openssl-dev libssl-dev libncurses5-dev -y
wget "https://github.com/samtools/samtools/releases/download/1.4/samtools-1.4.tar.bz2"
tar -xvjf samtools-1.4.tar.bz2
rm samtools-1.4.tar.bz2
cd samtools-1.4/htslib-1.4
autoheader
autoconf
./configure --enable-libcurl --enable-s3 --enable-plugins
make
cd ..
./configure
make
sudo make install

But when I try to view a file in S3 (which I can get with s3cmd):

samtools view s3://mybucket/my.cram chr20:61000-61100
samtools view: failed to open "s3://mybucket/my.cram" for reading: Permission denied

I don't have a ~/.aws/credentials file, which btw should not stop it reading ~/.s3cfg to get the host_base.

How do I debug this?

daviesrob commented 7 years ago

You can get more debugging output by using the htsfile program that comes with htslib. This should get you a lot of output:

htsfile -vvvvvv -c s3://mybucket/my.cram

Htslib doesn't understand v4 signatures yet. Depending on where your data is, this may explain the problem.

sb10 commented 7 years ago

I'm using Ceph Object Gateway, which sort of works with v4, but is best used with v2 signatures.

cd ~/samtools-1.4/htslib-1.4
./htsfile -vvvvvv -c s3://mybucket/my.cram
[M::load_hfile_plugins] loaded "knetfile"
[W::hts_path_itr] can't scan directory "/usr/local/libexec/htslib": No such file or directory
htsfile: can't open "s3://mybucket/my.cram": Protocol not supported

So I guess my first question is: how are you supposed to build samtools and get it to use your existing configuration/build of htslib in the subdirectory? I rebuilt htslib:

make clean
autoheader
autoconf
./configure --enable-libcurl --enable-s3 --enable-plugins
make
sudo make install

And tried again:

$ htsfile -vvvvvv -c s3://mybucket/my.cram
[M::load_hfile_plugins] loaded "knetfile"
[M::hfile_gcs.init] version 1.4
[M::load_hfile_plugins] loaded "/usr/local/libexec/htslib/hfile_gcs.so"
[M::hfile_s3.init] version 1.4
[M::load_hfile_plugins] loaded "/usr/local/libexec/htslib/hfile_s3.so"
[M::load_hfile_plugins] loaded "/usr/local/libexec/htslib/hfile_libcurl.so"
*   Trying 52.216.65.144...
* Connected to mybucket.s3.amazonaws.com (52.216.65.144) port 443 (#0)
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server did not agree to a protocol
* Server certificate:
*    subject: C=US; ST=Washington; L=Seattle; O=Amazon.com Inc.; CN=*.s3.amazonaws.com
*    start date: Jul 29 00:00:00 2016 GMT
*    expire date: Nov 29 12:00:00 2017 GMT
*    subjectAltName: mybucket.s3.amazonaws.com matched
*    issuer: C=US; O=DigiCert Inc; OU=www.digicert.com; CN=DigiCert Baltimore CA-2 G2
*    SSL certificate verify ok.
> GET /samtools_testing/NA20903.cram HTTP/1.1
Host: mybucket.s3.amazonaws.com
User-Agent: htslib/1.4 libcurl/7.47.0
Accept: */*
Date: Tue, 11 Apr 2017 12:42:20 GMT
Authorization: AWS E9Z5LE3GCLGXH8TZSPYG:gBAuzTe2847C+7pqnK/U/w86Xdg=

< HTTP/1.1 403 Forbidden
< x-amz-request-id: CBF3402A40624AD6
< x-amz-id-2: CLEX9uRSjpobDV5Rf2JDYStpBG4guQYz7t5ot4BUWXc5afT9tgP6lqErUr7H04//CVv/MmINvHY=
< Content-Type: application/xml
< Transfer-Encoding: chunked
< Date: Tue, 11 Apr 2017 12:42:20 GMT
< Server: AmazonS3
<
* Connection #0 to host mybucket.s3.amazonaws.com left intact
[W::sam_read1] parse error at line 1
htsfile: reading "s3://mybucket/my.cram" failed: Invalid argument

So, it's not reading my config file:

$ perl -ple 's/(_key =).*/$1/' ~/.s3cfg
[default]
access_key =
encrypt = False
host_base = cog.sanger.ac.uk
host_bucket = %(bucket)s.cog.sanger.ac.uk
secret_key =
use_https = True

$ ls -atlh ~/.s3cfg
-rw------- 1 ubuntu ubuntu 202 Apr  6 09:31 /home/ubuntu/.s3cfg

jkbonfield commented 7 years ago

Could you please test pulling in this PR https://github.com/samtools/htslib/pull/506 to see if it fixes the issue?

It sounds very much like you're hitting the same problem of .s3cfg not being read.

jmarshall commented 7 years ago

Probably you also have a ~/.aws/credentials file, which htslib reads first and so has already provided settings to be used. And this very issue is about not being able to specify the endpoint in ~/.aws/credentials at present — we should just bless endpoint_url by fiat and fix this.

Remove the default section of ~/.aws/credentials or delete the file entirely. Alternatively use a distinctive profile that's defined in your ~/.s3cfg: introduce it with [foo] in the config file, and use URLs like s3://foo@mybucket/my.cram or set $AWS_PROFILE.
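For example, a distinctive profile in ~/.s3cfg might look like this (profile name and key values are placeholders; the host_base value is copied from the .s3cfg shown above):

[foo]
access_key = ACCESSKEY
secret_key = SECRETKEY
host_base = cog.sanger.ac.uk

$ htsfile -c s3://foo@mybucket/my.cram
$ AWS_PROFILE=foo htsfile -c s3://mybucket/my.cram   # equivalent, via the environment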

sb10 commented 7 years ago

My reading of the existing code is that it should be working already if I don't have a ~/.aws/credentials file (and I don't), but I'll give the PR a go...

jmarshall commented 7 years ago

@sb10: In that case, running under strace to see it opening and reading from ~/.s3cfg may be interesting…

sb10 commented 7 years ago

$ echo $AWS_SHARED_CREDENTIALS_FILE

$ ls -alth ~/.aws/credentials
ls: cannot access '/home/ubuntu/.aws/credentials': No such file or directory

Trying the PR:

cd ~/samtools-1.4/htslib-1.4
wget "https://raw.githubusercontent.com/blajoie/htslib/2230dc6f1b610a4be2fd869500d99ae784035e12/hfile_s3.c"
mv hfile_s3.c.1 hfile_s3.c
make clean
autoheader
autoconf
./configure --enable-libcurl --enable-s3 --enable-plugins
make
sudo make install

And tried again:

$ htsfile -vvvvvv -c s3://mybucket/my.cram
[worked]

... so somehow the PR helped, but I don't know why.

Remaking samtools then also made that work.

sb10 commented 7 years ago

Oh, the current code also doesn't parse the config files if the $AWS_ACCESS_KEY_ID env var is set, and I had it set. Hence the PR working.
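For anyone else caught out by this, checking for and clearing those variables in the current shell (so that the config files get parsed instead) looks like:

env | grep '^AWS_'
unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY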

jmarshall commented 7 years ago

Yes indeed, and sorry for the confusion. As noted previously on this issue, this S3 stuff needs documentation about how it works and is configured. Probably an HTML page on htslib.org rather than a man page.

While implementing it, it became apparent that mixing up config settings from different sources would be hard to implement, impossible to document, and confusing to use. So it takes the first of the following that provides access_key:

  1. URL like s3://ACCESS_KEY:SECRET[:TOKEN]@BUCKET/…

  2. Provided that there is no s3://PROFILE@BUCKET/… in the URL, it looks at environment variables: $AWS_ACCESS_KEY_ID, $AWS_SECRET_ACCESS_KEY, $AWS_SESSION_TOKEN.

  3. ~/.aws/credentials (or as specified by $AWS_SHARED_CREDENTIALS_FILE), looking at the profile specified in the URL or in $AWS_DEFAULT_PROFILE or $AWS_PROFILE or otherwise "default".

  4. ~/.s3cfg, looking at a profile as described.

  5. ~/.awssecret.
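To make that precedence concrete, here is a minimal sketch in C of the "first source that provides access_key wins" rule. This is not htslib's actual code; every function and type name below is invented for illustration, and the file-parsing helpers are stubs:

#include <stdio.h>
#include <stdlib.h>

typedef struct { char access_key[128]; char secret_key[128]; } s3_auth;

/* Stub: would parse s3://ACCESS_KEY:SECRET[:TOKEN]@BUCKET/... */
static int auth_from_url(const char *url, s3_auth *a)
{ (void)url; (void)a; return 0; }

/* Environment variables, only consulted when no profile was named in the URL. */
static int auth_from_env(s3_auth *a)
{
    const char *id = getenv("AWS_ACCESS_KEY_ID");
    const char *secret = getenv("AWS_SECRET_ACCESS_KEY");
    if (!id || !secret) return 0;
    snprintf(a->access_key, sizeof a->access_key, "%s", id);
    snprintf(a->secret_key, sizeof a->secret_key, "%s", secret);
    return 1;
}

/* Stub: would look for [profile] in an INI-style file and read its keys. */
static int auth_from_file(const char *path, const char *profile, s3_auth *a)
{ (void)path; (void)profile; (void)a; return 0; }

static int resolve_auth(const char *url, const char *url_profile, s3_auth *a)
{
    const char *profile = url_profile;
    if (!profile) profile = getenv("AWS_DEFAULT_PROFILE");
    if (!profile) profile = getenv("AWS_PROFILE");
    if (!profile) profile = "default";

    const char *creds = getenv("AWS_SHARED_CREDENTIALS_FILE");
    if (!creds) creds = "~/.aws/credentials";   /* "~" left unexpanded in this sketch */

    if (auth_from_url(url, a)) return 1;                    /* 1. credentials embedded in the URL  */
    if (!url_profile && auth_from_env(a)) return 1;         /* 2. env vars, only if no URL profile */
    if (auth_from_file(creds, profile, a)) return 1;        /* 3. ~/.aws/credentials               */
    if (auth_from_file("~/.s3cfg", profile, a)) return 1;   /* 4. ~/.s3cfg                         */
    return auth_from_file("~/.awssecret", NULL, a);         /* 5. ~/.awssecret                     */
}

int main(void)
{
    s3_auth a = { "", "" };
    if (resolve_auth("s3://mybucket/my.cram", NULL, &a))
        printf("access_key = %s\n", a.access_key);
    else
        printf("no credentials found\n");
    return 0;
}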

blajoie commented 7 years ago

Adding visibility from the https://github.com/samtools/htslib/pull/506 discussion - it would also be helpful to add a way to force path_style/virtual_style URLs. Our use case is to always force path_style.

Something like:

~/.aws/credentials

[profile]
aws_access_key_id = keyid
aws_secret_access_key = secretkey
endpoint_url = endpointurl
url_mode = auto|virtual|path

wresch commented 7 years ago

We patch our htslib locally to force path_style URLs via a 'url_mode' key in .s3cfg so that we can access a local cleversafe store. It would be nice if this mechanism or one like it could be made official.
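For reference, with a placeholder endpoint of s3.example.com, the two styles differ only in where the bucket name appears:

# virtual-hosted style: bucket in the hostname
https://mybucket.s3.example.com/my.cram
# path style: bucket in the path
https://s3.example.com/mybucket/my.cram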

sb10 commented 6 years ago

I've hit this issue again: this time I had a ~/.aws/credentials file, and it didn't work until I renamed it.

> While implementing it, it became apparent that mixing up config settings from different sources would be hard to implement, impossible to document, and confusing to use.

I disagree. May I suggest you take an "it just works" approach, like the one I use for muxfys:

S3ConfigFromEnvironment makes an S3Config with Target, AccessKey, SecretKey and possibly Region filled in for you.

It determines these by looking primarily at the given profile section of ~/.s3cfg (s3cmd's config file). If profile is an empty string, it comes from $AWS_DEFAULT_PROFILE or $AWS_PROFILE or defaults to "default".

If ~/.s3cfg doesn't exist or isn't fully specified, missing values will be taken from the file pointed to by $AWS_SHARED_CREDENTIALS_FILE, or ~/.aws/credentials (in the AWS CLI format) if that is not set.

If this file also doesn't exist, ~/.awssecret (in the format used by s3fs) is used instead.

AccessKey and SecretKey values will always preferably come from $AWS_ACCESS_KEY_ID and $AWS_SECRET_ACCESS_KEY respectively, if those are set.

If no config file specifies host_base, the default domain used is s3.amazonaws.com. Region is set by the $AWS_DEFAULT_REGION environment variable, or if that is not set, by checking the file pointed to by $AWS_CONFIG_FILE (~/.aws/config if unset).

To allow the use of a single configuration file, users can create a non-standard file that specifies all relevant options: use_https, host_base, region, access_key (or aws_access_key_id) and secret_key (or aws_secret_access_key) (saved in any of the files except ~/.awssecret).
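For example, such a single non-standard file could look like this (the option names are the ones listed above; the values are placeholders):

[default]
use_https = True
host_base = s3.example.com
region = us-east-1
access_key = ACCESSKEY
secret_key = SECRETKEY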