samtools / htslib

C library for high-throughput sequencing data formats

Obtain S3 credentials via IAM roles #344

Open · jmarshall opened this issue 8 years ago

jmarshall commented 8 years ago

In addition to the use case where a session token is provided as part of a Federated access model, probably the more common use case is using IAM roles in combination with EC2 instances. Briefly, this makes temporary credentials available to the instance at a link-local IP address, obviating the need to ever put long-term security credentials on an instance. s3cmd supports this method of key delivery and will 'look' for the availability of these keys out of the box.

[As noted by @obenshaindw on PR #303]

jmarshall commented 8 years ago

I see botocore also falls back to contacting the IAM well-known endpoint. HTSlib could also do this, although I would have thought there would be some common script for copying these into environment variables, so that such code was not multiplied across clients and SDKs…? e.g. eval the output of something like

baseurl='http://169.254.169.254/latest/meta-data/iam/security-credentials'
role=`curl $baseurl`
curl $baseurl/$role | jq -r "\"export AWS_ACCESS_KEY_ID='\(.AccessKeyId)' AWS_SECRET_ACCESS_KEY='\(.SecretAccessKey)' AWS_SESSION_TOKEN='\(.Token)'\""

HTSlib-based tools (e.g. samtools) are used on EC2 and especially outside EC2. If you're not on EC2, contacting random link-local IP addresses and asking them for credentials would be rather unfortunate (to put it mildly), so I'd really rather not do this by default. I would have thought there would be a convention of only doing this when $AWS_PROFILE is set to a special :IAM value or something like that?

obenshaindw commented 8 years ago

Any script that reads the instance metadata for security credentials and sets environment variables from them would need to be run on a schedule or immediately prior to running samtools (at least every hour, according to the previously linked AWS documentation, because credentials are automatically rotated).

The advantage here is not having to leave any credentials on the instance, and if using an application that reads directly from instance metadata, not having to set anything environment-specific.

I agree that this would seem like a special case for samtools (i.e., a piece of code that is environment-specific). It makes sense for boto to have this functionality since it is an SDK for Amazon Web Services. As you say, if one is using samtools on EC2 they could write a script to read from the instance metadata endpoint and place the security credentials in a location where samtools will find them.

delagoya commented 6 years ago

Using the AWS C++ SDK instead of the current custom request-signing code would solve this. But that is a big project, and would likely be better done by pulling the hfile_s3.c plugin out of the root source tree and into its own contrib project.

In the meantime, I've built a Docker container that has S3 protocol support and also a wrapper script that queries the metadata and then calls samtools. This should work for both EC2 instances and ECS Tasks. The wrapper script contents are simply:

#!/bin/bash
# ECS task-role credentials take precedence; otherwise fall back to the
# EC2 instance-role metadata endpoint.
B='http://169.254.169.254/latest/meta-data/iam/security-credentials'
CREDS=''
if [ -n "${AWS_CONTAINER_CREDENTIALS_RELATIVE_URI}" ]; then
    CREDS=$(curl -s --connect-timeout 0.1 "169.254.170.2${AWS_CONTAINER_CREDENTIALS_RELATIVE_URI}" | jq -r "\"AWS_ACCESS_KEY_ID='\(.AccessKeyId)' AWS_SECRET_ACCESS_KEY='\(.SecretAccessKey)' AWS_SESSION_TOKEN='\(.Token)'\"")
elif R=$(curl -s --connect-timeout 0.1 "$B/") && [ -n "$R" ]; then
    CREDS=$(curl -s --connect-timeout 0.1 "$B/$R" | jq -r "\"AWS_ACCESS_KEY_ID='\(.AccessKeyId)' AWS_SECRET_ACCESS_KEY='\(.SecretAccessKey)' AWS_SESSION_TOKEN='\(.Token)'\"")
fi
eval ${CREDS} samtools-s3 "$@"

Docker container is at https://hub.docker.com/r/delagoya/samtools-ecs-s3/

I've tested it out on AWS Batch to pull the header from a private S3 BAM file object.

hguturu commented 4 years ago

Are there any updates on whether S3 credentials via IAM roles will be supported? The workaround runs into trouble because the credentials are rotated, so any job that runs for more than a few minutes can suddenly fail in the middle.

delagoya commented 4 years ago

Can you give an example of a failure? The code above should pull fresh credentials every time the Docker container is executed.

hguturu commented 4 years ago

The curl command above (curl -s --connect-timeout 0.1 $B/$R) returns an object with a field such as "Expiration" : "2020-04-14T01:20:09Z". This expiration is a fixed time from when the credentials were last refreshed. So if the command is launched 10-15 minutes before the token expires and runs for more than 15 minutes, it breaks in the middle since its token has expired. I assume the SDK handles this by refreshing tokens as they expire. It's hard to reproduce since the command itself works fine; the problem is just the token refresh timing.

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html#instance-metadata-security-credentials seems to suggest new credentials are issued 5 minutes in advance, so there is a tight window of failure cases, but the window gets longer when streaming large files from S3 (e.g. a 200 GB+ BAM file).

daviesrob commented 4 years ago

Is the problem caused by trying to do lots of index look-ups, or by having a very long delay between generating the credentials and actually trying to open the file?

The first one isn't too easy to fix, as currently htslib only reads the credentials on S3 file open (although it does know how to re-sign the request if necessary). The second problem might not be so difficult. Instead of passing the credentials in environment variables, as shown above, you could store them in one of the configuration files supported by hfile_s3. Then you could run a program that occasionally requests a new token and writes it into the configuration file (ensuring you do this safely, i.e. write to a new file and then rename it on top of the old one). That way, when hfile_s3 tries to get the file, it should always have a token with a reasonable amount of lifetime left.
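For reference, a configuration file of the kind hfile_s3 reads might look like this sketch (~/.aws/credentials syntax; the profile name and values below are placeholders, and the htslib-s3-plugin manual page describes the keys actually supported):

[default]
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
aws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
aws_session_token = FwoGZXIvYXdzEXAMPLETOKEN

The refresher program would then rewrite this file (via a temporary file and a rename) each time it fetched new credentials.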

hguturu commented 4 years ago

Yeah, it's unfortunate that the first type of use case doesn't have an easy fix with this workaround. I think it also applies to larger streams, which can be cut off midstream. But I guess the only fix would be to update hfile_s3 to obtain those credentials itself, so it knows when they are invalid and can refresh them, or to switch to the SDK, which can handle credentials for you. Either of those is a bigger task, which is why I was wondering whether there are any plans for such an update?

daviesrob commented 4 years ago

Looking at this, I think the best solution would be to add the AWS JSON metadata format to the list of configuration files that hfile_s3 can use. It can then extract the expiration time and use it to renew the credentials when it needs to.

I'm not sure exactly when we can get it working, but we can certainly put it on our to-do list.

hguturu commented 4 years ago

It would be great if it gets added to the to-do list! Thanks!

markotitel commented 2 years ago

Hi, I'm curious whether you are still considering this. It would be very useful.

whitwham commented 2 years ago

It is still on the list of things to do. Unfortunately it is not that high up on the list at the moment.

bioIT-UZA commented 2 years ago

I just want to express my interest in this as well. samtools in AWS Batch + Cromwell for genome data currently requires a lot of localization (copying) that could be avoided if samtools could use the IAM role when accessing S3 data.

cariaso commented 11 months ago

When handling non-public data, the value of this feature is substantial. It has been allowed to linger for more than 7 years and remains important, as demonstrated by a steady stream of comments and upvotes over the years.

I thought this solution from igv.js was notable and should be linked to in this thread: https://github.com/igvteam/igv.js/issues/1709. Besides its technical value, it's a clear message that other parts of the bioinformatics community rely on S3.

I'd like to request that the "Low Priority" tag be removed from this issue.

daviesrob commented 11 months ago

Hmm, I obviously forgot to tag this ticket when working on PR #1462, which added support for refreshing credentials like the ones you get from IAM. It turns out that there are a few ways of getting credentials out of AWS, which all have subtle (or not-so-subtle) differences. Rather than try to support them all in HTSlib, it's left to an external script to get the credentials and save them in a format that HTSlib can use.

A simple IAM example can be found in the htslib-s3-plugin manual page, although for production use you might want something a bit more robust. Basically you just run it in the background and it grabs the IAM credentials and saves them in .aws/credentials format for use by HTSlib. It also adds a slightly-unofficial expiry_time key, which tells HTSlib when it may need to re-read the file. The script wakes up and replaces the file well before the expiry so that the stored credentials should always be fresh. (Note also that it takes care to replace the file atomically so that HTSlib will never see a half-written copy). Hopefully it shouldn't be too difficult to integrate something like this into your workflows.
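For illustration, a minimal sketch of such a refresher (this is not the manual-page script itself; the profile name, timings and IMDSv1 usage here are simplifying assumptions) could look like:

#!/bin/bash
# Illustrative sketch only: fetch EC2 instance-role credentials from the
# metadata service and atomically rewrite an .aws/credentials-style file.
set -euo pipefail

IMDS='http://169.254.169.254/latest/meta-data/iam/security-credentials'
CREDFILE="$HOME/.aws/credentials"
mkdir -p "$(dirname "$CREDFILE")"

while true; do
    role=$(curl -s "$IMDS/")
    tmp=$(mktemp "$CREDFILE.XXXXXX")
    curl -s "$IMDS/$role" | jq -r '"[default]",
        "aws_access_key_id = \(.AccessKeyId)",
        "aws_secret_access_key = \(.SecretAccessKey)",
        "aws_session_token = \(.Token)",
        "expiry_time = \(.Expiration)"' > "$tmp"
    mv "$tmp" "$CREDFILE"   # rename is atomic, so readers never see a partial file
    sleep 600               # refresh well before the roughly hourly expiry
done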

markjschreiber commented 9 months ago

The simplest way for HTSlib to do this would be to use the AWS C++ SDK; that way you get all of the logic of the AWS credential chain, the SigV4 implementation, and APIs that construct the calls to the service endpoints.

It would require some wrapper code to expose the relevant calls of the C++ library with C linkage, but that is probably easier (and a lot safer) than attempting to implement the credential chain yourself. Example: https://stackoverflow.com/questions/56842849/how-to-call-aws-cpp-sdk-functions-from-c

Use of a credentials file is not recommended for anything other than a local machine (a laptop, etc.), and even then it should only contain temporary credentials. Use of a credentials file in a container, on an EC2 instance, etc. is not recommended, especially with long-lived credentials. This is where roles should be used exclusively.

jmarshall commented 9 months ago

When this code was originally written, I investigated using the AWS SDK rather than the AWS protocol documentation. HTSlib is a C library used largely in environments without strong sysadmin skills or abilities to manage dependencies effectively. Adopting a multi-million line dependency written in C++ was a non-starter.

markjschreiber commented 9 months ago

I can see that; it will certainly add some overhead.

The use of the lib in environments without strong sysadmin skills is part of the reason that I am worried about the current requirement for credentials files. People may not be aware of the risks they are taking, especially as I know many users will probably be running with fairly privileged IAM Roles that are, or are close to, Admin.

I would imagine the most user-friendly way to do it would be to install from a package manager (https://docs.aws.amazon.com/sdk-for-cpp/v1/developer-guide/sdk-from-pm.html), only install the relevant service packages (to avoid the multi-million lines of code), and then check for them in your install script in the same way you do for other dependencies. Certainly requiring someone to build the lib along with samtools is not something most would want to attempt.

It's certainly not no work, but it might be easier to maintain in the long term, and it gives you easy access to the full S3 API and to all aspects of the credential chain, which would make it easier to use in more exotic scenarios like supporting users with SSO-provided credentials, containers on Kubernetes, etc.

jmarshall commented 9 months ago

…Or to get around to locally implementing IAM role parsing e.g. from a well-known endpoint, as discussed on #303. Implementing that is one of the reasons HTSlib acquired a JSON parser, and I may have a partial draft still lying around from 2016 somewhere…

jkbonfield commented 9 months ago

As the debate has been opened, here's my two pennies' worth.

My heart sinks whenever I'm attempting to install something that has a dependency on vast C++ libraries. I don't know anything about the AWS one and I'm sure it's fine, but far too many are not and I've wasted hours battling cmake and its bizarre choices of which C++ compiler and library versions it wants to use. I seem to have a disproportionate rate of failure with such things. Yes I could install binaries instead, but that's not always an option depending on the environment, and we support a wide range of platforms for which binaries may not be available.

Htslib aims to be small and lightweight, with a minimal number of dependencies. Where required due to complexity it's fine (eg we're really not going to rewrite lzma or the various encryption libraries), but still we've limited ourselves to code written in the same language as htslib itself (even to the extent of porting a little bit here and there). I think this is sensible.

I've previously explored linking C++ from C, and while it's possible and not even that hard for an executable, it has knock-on effects for every user of a library, and their environments may well be very different. You may even need users of htslib to start linking with C++ too, if there are static classes with constructors in use. Note that some use htslib in other libraries, and some end up in other language bindings. The downwards dependency chain is vast, and by switching languages we're making demands on all of that dependency chain too. So it's a "write once, test everywhere" scenario that is simply best avoided.

We could migrate the S3 code into a plugin and avoid some of the language shenanigans, but that's also a bit messy and it feels a bit like the tail wagging the dog. In this case, the extra code to support this ourselves is potentially less of a maintenance burden than attempting to use the official library, so it's worth investigating that route first.

If you wish, consider this as feedback: C is much more portable than C++, so a lightweight cut-down AWS C API would have an appreciative audience.

Regarding security, I do agree that avoiding credentials files would be preferable.

daviesrob commented 9 months ago

Actually, it turns out that there is a C library for authentication, which I found two submodules down from the C++ API. Other C interfaces to AWS can be found here; the auth library depends on some of them, which unfortunately makes it non-trivial to use. The Apache license everything comes under might cause us problems too, depending on how we link everything together.

One of the reasons why I didn't go for reading the JSON directly was that there are now so many interfaces for getting credentials on AWS, which all work with different formats. This can be seen in the C library, where functions can be found for all the different options. It seemed easier to delegate this complexity to an external program. I do note that there's a comment suggesting the more popular options, so we might be able to get away with concentrating on just those.

It would also be fairly easy to make an external module that used the C++ API in the same manner as the ones in htslib-plugins. If given a suitably high priority it would override the built-in S3 handler, allowing for an easy replacement.

markjschreiber commented 9 months ago

Apache 2 is fairly permissive, so I think you will be OK to link to it. It doesn't have the so-called "viral" properties of the GPL and the like. If you had to copy the code, that part would likely need to remain Apache 2, so it's best as a module/plugin.

I agree that the real goal is to find the most pain-free way of dealing with the multitude of authorization options supported by AWS credentials. Probably that means delegating to an external implementation, as long as it doesn't bring too many complexities. A plugin seems like a nice way to go and keeps the main library lightweight for people who don't need S3 support.

Speaking from personal experience, being able to use HTSlib and samtools from an EC2 instance or a container on ECS, reading directly from S3, would solve a lot of headaches with staging/unstaging objects to block storage.

markjschreiber commented 6 months ago

As well as a C lib for Auth there is also a C lib for some core parts of S3 functionality. These C libs are part of what AWS refers to as the "Common Runtime Libraries".

By using the common auth lib you would get the standard credential configuration chain and be able to support all the various ways Users, Instances and Containers can be provisioned with credentials. The standard S3 library would give you great support for handling multi-part uploads and downloads, memory optimization etc.

I think these would be preferable to the more heavyweight C++ SDK (which, like all our SDKs, builds on this core).

fitzlane86 commented 3 months ago

The use of IAM roles (temporary credentials) is strongly recommended by AWS as a security best practice. Expect this to become a bigger issue for AWS users as time goes on.

markjschreiber commented 3 months ago

I have also found scenarios where using htslib-based applications in containers on ECS fails to obtain credentials from the service in the recommended way. The multiple ways in which credentials can be obtained, and the order in which they are tried, are a very strong reason to use the implementation in the C common runtime library. This is now the basis of all SDK implementations and is the easiest way to get complete and correct support for credentials.

markjschreiber commented 1 month ago

@daviesrob @jkbonfield - what would be involved in creating a module that could check credentials, and maybe even handle S3 requests, using the relevant AWS CRT C libs? By switching to these libs you would unburden yourself from needing to remain aware of, and compatible with, the multitude of ways AWS can verify credentials, as well as more easily stay up to date with new S3 features you might want. Also, the AWS CRT S3 libs are highly optimized for transfer, which might speed up htslib when it interacts with S3.

jkbonfield commented 1 month ago

I don't know the ins and outs of the modules / plugins so I'll let Rob answer that, but it could be a useful addition.

However I think there are plans for a large reorganisation of the base https layer to automatically handle things like reconnects, so this may impact the layers above it, such as s3, gcs, etc. I don't know enough about the networking side of htslib to know how this would affect your proposal.

These dependencies dwarf htslib itself in size, but that may not be an issue if they're widely available as existing OS packages (as we do with things like libcurl, which in turn uses e.g. openssl, and so on). I note that conda has things like the aws-c-auth and aws-c-common libraries bundled up as packages, but I couldn't find them in Ubuntu. Similarly for the s2n-tls prerequisite. I'm not sure why this is, but having dependencies on packages which are non-standard adds a considerable burden on people building our code, if it doesn't make it impossible. Eg maybe Ubuntu has explicit policies forbidding them from including certain types of code. (We have also had political issues with htslib including crypto code directly, which is why crypt4gh is in a separate module.) It does concern me that they're apparently not widely adopted by the mainstream distros. Can you comment as to why this may be?

Using a plugin resolves a lot of this complexity. Eg see the hfile_irods plugin in https://github.com/samtools/htslib-plugins, which hides a multitude of sins and complexities. However I don't know if the current plugin system is sufficient to replace things all the way down to the https TLS layer. If it is, then you could start building and testing your own plugin to htslib that enables a new aws: access protocol, for example.

jmarshall commented 1 month ago

However I don't know if the current plugin system is sufficient to replace things all the way down to the https TLS layer.

It certainly is, and it would be a good way to insulate htslib and programs using it from needing to know about any large AWS dependencies.

jkbonfield commented 1 month ago

Thanks John. In that case, I suggest that @markjschreiber create a plugin as a demonstration, and we can consider how to proceed from there. Eg adopting it into our own plugins, or adding it as a supported third-party plugin in the build system.

It doesn't necessarily make this issue obsolete though, if a simple workaround could be found so that a self-contained htslib still works out of the box.

markjschreiber commented 1 month ago

Sounds like a worthy challenge, and time to brush up on my C. @jmarshall are there any docs/guidance on how plugins work for htslib and how to make one?

The workaround for a more self-contained htslib would involve updating the methods used to retrieve credentials so that they cover more (or all) of the ways that you can obtain credentials. Most importantly, this would need to cover being able to use the EC2 metadata and ECS metadata endpoints, so that you can obtain credentials from EC2 and container roles respectively. You would also need to check credentials for an expiry time and only cache them until (or just before) that time. Once they expire, the lib should automatically attempt to acquire new credentials. The downside of rolling your own would be the burden of staying up to date with the AWS IAM credentials process and all its various options. You also need to maintain a SigV4 implementation, which, as you probably know, is tricky to get right.

daviesrob commented 1 month ago

Unfortunately the plugin interface isn't too well documented, but you could take a look at one of the simpler ones, for example htslib-plugins/hfile_mmap.c, although note that these don't support the vopen() interface. You also need to define a couple of structures that currently live in hfile_internal.h ... we should really get around to making a public header with these in it, so that making plugins is a bit easier...

Anyway, you'll need to define these structures:

struct hFILE_backend {
    /* As per read(2), returning the number of bytes read (possibly 0) or
       negative (and setting errno) on errors.  Front-end code will call this
       repeatedly if necessary to attempt to get the desired byte count.  */
    ssize_t (*read)(hFILE *fp, void *buffer, size_t nbytes);

    /* As per write(2), returning the number of bytes written or negative (and
       setting errno) on errors.  Front-end code will call this repeatedly if
       necessary until the desired block is written or an error occurs.  */
    ssize_t (*write)(hFILE *fp, const void *buffer, size_t nbytes);

    /* As per lseek(2), returning the resulting offset within the stream or
       negative (and setting errno) on errors.  */
    off_t (*seek)(hFILE *fp, off_t offset, int whence);

    /* Performs low-level flushing, if any, e.g., fsync(2); for writing streams
       only.  Returns 0 for success or negative (and sets errno) on errors. */
    int (*flush)(hFILE *fp);

    /* Closes the underlying stream (for output streams, the buffer will
       already have been flushed), returning 0 for success or negative (and
       setting errno) on errors, as per close(2).  */
    int (*close)(hFILE *fp);
};

struct hFILE_scheme_handler {
    /* Opens a stream when dispatched by hopen(); should call hfile_init()
       to malloc a struct "derived" from hFILE and initialise it appropriately,
       including setting base.backend to its own backend vector.  */
    hFILE *(*open)(const char *filename, const char *mode);

    /* Returns whether the URL denotes remote storage when dispatched by
       hisremote().  For simple cases, use one of hfile_always_*() below.  */
    int (*isremote)(const char *filename);

    /* The name of the plugin or other code providing this handler.  */
    const char *provider;

    /* If multiple handlers are registered for the same scheme, the one with
       the highest priority is used; range is 0 (lowest) to 100 (highest).
       This field is used modulo 1000 as a priority; thousands indicate
       later revisions to this structure, as noted below.  */
    int priority;

    /* Fields below are present when priority >= 2000.  */

    /* Same as the open() method, used when extra arguments have been given
       to hopen().  */
    hFILE *(*vopen)(const char *filename, const char *mode, va_list args);
};

struct hFILE_plugin {
    /* On entry, HTSlib's plugin API version (currently 1).  */
    int api_version;

    /* On entry, the plugin's handle as returned by dlopen() etc.  */
    void *obj;

    /* The plugin should fill this in with its (human-readable) name.  */
    const char *name;

    /* The plugin may wish to fill in a function to be called on closing.  */
    void (*destroy)(void);
};

You also need to declare a structure with hFILE base; as its first member which will be returned when you open a file.

You should create a const struct hFILE_backend with your read, write etc. function pointers in it. This is used to populate base.backend in your file handle.

Your plug-in should export the entry-point int hfile_plugin_init(struct hFILE_plugin *self). This should populate self->name and optionally self->destroy, fill out a const static struct hFILE_scheme_handler and pass it back by calling hfile_add_scheme_handler("s3", &handler). If you set hFILE_scheme_handler::priority greater than 50 (or 2050 if you want to support vopen()) then you will override the default hfile_s3 plug-in for handling s3:// URLs.

Your open() function should call hfile_init() to create a file handle. The first argument is a size, which is used to reserve the extra space needed by your internal structure. You populate any fields needed by your internal structure along with base.backend and then return a pointer to base.

vopen(), should you need it, is similar to open(), apart from accepting a va_list so you can pass extra arguments (e.g. in hfile_libcurl, where it's used to set additional headers, among other things). It will only be called if you set your hFILE_scheme_handler::priority greater than 2000.

The read(), write() etc. callbacks are all passed an hFILE *. This is cast into your internal type to get access to the full structure.

Finally, the close() callback should clean up anything made by your plug-in, and return 0 on success, or less than zero on error. It should not try to free the hFILE structure, as this is done by the caller.

If you build HTSlib with --enable-plugins and point HTS_PATH at the location of your plug-in (or drop your .so file into libexec/htslib in an HTSlib install tree), HTSlib should find it and attempt to use it to access your files.
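Putting those pieces together, a minimal plug-in skeleton might look like the following. This is a sketch only: the hFILE_aws type, the aws_* callback names and the "aws-sketch" provider string are illustrative, the I/O callbacks are stubs to be replaced with real S3 logic, and it assumes you can include hfile_internal.h (or a local copy of the structures shown above).

#include <errno.h>
#include <sys/types.h>

#include "hfile_internal.h"   /* hFILE, hfile_init(), hfile_add_scheme_handler(), ... */

typedef struct {
    hFILE base;               /* must be the first member */
    /* ... per-file state (e.g. an AWS client handle) would go here ... */
} hFILE_aws;

static ssize_t aws_read(hFILE *fpv, void *buffer, size_t nbytes)
{
    hFILE_aws *fp = (hFILE_aws *) fpv;
    (void) fp; (void) buffer; (void) nbytes;
    errno = ENOSYS;           /* stub: real code would read the next chunk from S3 */
    return -1;
}

static off_t aws_seek(hFILE *fpv, off_t offset, int whence)
{
    (void) fpv; (void) offset; (void) whence;
    errno = ESPIPE;           /* stub: real code would arrange a ranged GET */
    return -1;
}

static int aws_close(hFILE *fpv)
{
    (void) fpv;               /* free per-file state here; the caller frees the hFILE itself */
    return 0;
}

static const struct hFILE_backend aws_backend =
    { aws_read, NULL /* write */, aws_seek, NULL /* flush */, aws_close };

static hFILE *aws_open(const char *filename, const char *mode)
{
    hFILE_aws *fp = (hFILE_aws *) hfile_init(sizeof (hFILE_aws), mode, 0);
    if (fp == NULL) return NULL;
    /* ... resolve credentials and prepare the request for `filename` here ... */
    fp->base.backend = &aws_backend;
    return &fp->base;
}

int hfile_plugin_init(struct hFILE_plugin *self)
{
    /* Priority > 50 overrides the built-in hfile_s3 handler for s3:// URLs;
       vopen() is not provided, so the priority stays below 2000. */
    static const struct hFILE_scheme_handler handler =
        { aws_open, hfile_always_remote, "aws-sketch", 100 };

    self->name = "aws-sketch";
    hfile_add_scheme_handler("s3", &handler);
    return 0;
}

Building that as a shared object and setting HTS_PATH as described above should then let HTSlib dispatch s3:// URLs to it in preference to the built-in handler.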