iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org

fs: explicitly declaring the possibility of falling back to anonymous login #5797

Open isidentical opened 3 years ago

isidentical commented 3 years ago

Currently, there is no way of knowing whether a user intends to log in anonymously (less likely) or simply forgot to add the other required credential fields (more likely). For clarity in error messages, and to make this easier for us to work with, I think it would be nice to add a new config option to adlfs and gcsfs (and then s3fs) for explicitly authorizing anonymous login, called allow_anonymous_login.

We could interpret this in two different ways when allow_anonymous_login is given together with two other options, account_name (mandatory) and account_key (which can be used with account_name to create a different authentication method); the two interpretations are spelled out in a follow-up comment below.

CC: @shcheklein @efiop @jorgeorpinel

efiop commented 3 years ago

@isidentical Could you share more info? What is anonymous login? How does that look/work with things like awscli?

isidentical commented 3 years ago

@isidentical Could you share more info? What is anonymous login?

It is simply a way of accessing public buckets without providing any credential information. I've never used it, though I'd assume one use case would be downloading a public dataset without having an AWS account (some buckets still require one even though they are public, to make you pay the transfer costs, but that is a different use case).

How does that look/work with things like awscli?

It seems to be something like this: https://stackoverflow.com/a/35978867, which is close to what s3fs does: https://github.com/dask/s3fs/blob/753ee7bbf6d3cf1dd258f6827fcb9126d5c2bbe8/s3fs/core.py#L349-L365 (--no-sign-request corresponds to UNSIGNED as the signature_version, I presume).
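
For reference, here is a minimal sketch (not from the thread; the bucket and key names are made up) of what unsigned/anonymous access looks like with boto3, which is what --no-sign-request maps to:

import boto3
from botocore import UNSIGNED
from botocore.client import Config

# No credentials are looked up or sent, so only public buckets are readable.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
s3.download_file("some-public-bucket", "path/to/object", "local-file")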

efiop commented 3 years ago

@isidentical Oh, interesting. I didn't know it was possible in the first place. We usually just use HTTPS for that (e.g. our default remote in this repo points to an S3 bucket through HTTPS).

isidentical commented 3 years ago
  • We could check whether there is anything available besides anonymous login and prioritize it. If there is nothing, fall back to anonymous login.
  • We could disregard any other login information if allow_anonymous_login is set to true, so that it would give users the ability to temporarily check whether their remote accepts/works with anonymous login (if that might be a use case, though I don't think so).

What do you think about the possible interpretations of such a flag, @efiop? Should we check whether there is any other information we can leverage to authenticate even though the user allows anonymous auth, or solely try to connect with anonymous auth by omitting all other config options?
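
Purely as an illustration (not actual DVC code; the function names are made up, and the config keys are the ones discussed above), the two interpretations could look roughly like this:

# Interpretation 1: prefer real credentials, fall back to anonymous login.
def resolve_auth_prefer_credentials(config):
    creds = {k: config[k] for k in ("account_name", "account_key") if config.get(k)}
    if creds:
        return {"mode": "credentials", **creds}
    if config.get("allow_anonymous_login"):
        return {"mode": "anonymous"}
    raise ValueError("missing credentials and anonymous login is not allowed")

# Interpretation 2: the flag overrides any other login information.
def resolve_auth_flag_wins(config):
    if config.get("allow_anonymous_login"):
        return {"mode": "anonymous"}
    return {"mode": "credentials",
            "account_name": config["account_name"],
            "account_key": config.get("account_key")}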

efiop commented 3 years ago

@isidentical I agree with you that it is better to have an explicit config option for that, similar to --no-sign-request in awscli.

jorgeorpinel commented 3 years ago

Hey sorry, late to this party.

For the clarity on error messages, and also for us to better work with this stuff

Is it possible to improve the messaging without adding an extra step for users, @isidentical? What other advantage does the user get from this? Thanks

better to have an explicit config option for that, similar to --no-sign-request in awscli

P.S. That prevents AWS from loading the default credentials, I think, which is a little different (as @efiop implied). Whether it will still fall back to anonymous access for public buckets is probably not defined by that option.

isidentical commented 3 years ago

What other advantage does the user get from this? Thanks

We don't know whether a user simply forgot to add necessary information like account_key to their config or actually meant anonymous login (it seems a bit unorthodox to assume a remote is added with anonymous login by default, since that use case can generally be satisfied by get-url/import-url with one-off operations). This would allow us to raise better error messages, for example like the ones in #5833.

jorgeorpinel commented 3 years ago

Sure, I get that. But what's the benefit to the user? It sounds like a benefit for DVC (it makes the implementation easier), which is great, unless we're making anonymous auth harder to use, or even harder to realize that it's available.

the use case can be generally satisfied by get-url/import-url

Those are very different features IMO. And HTTP remotes have anonymous auth by default, which is intuitive, so why not public buckets? BTW, will you need to remote modify an HTTP remote to set allow_anonymous_login before being able to connect?

This would allow us to raise better error messages

Maybe we can still improve the messages without adding this extra step? Idk, I haven't looked into the code (obviously), so it's up to you. Not a huge deal either way, but in principle I think that if the user loses flexibility we should think twice about it (maybe ask @dberenbaum). Thanks

dberenbaum commented 3 years ago

Is the question whether we should always try anonymous login when other methods fail (vs. requiring allow_anonymous_login = true)?

jorgeorpinel commented 3 years ago

I think this mainly relates to Azure and/or S3 remotes, which already fall back to anonymous access (if the bucket is public).

isidentical commented 3 years ago

But what's the benefit to the user?

Most of the time, they will get a proper error message. We could include allow_anonymous_login= in those error messages to make the option discoverable for users too, which I guess satisfies both needs. What do you think, @jorgeorpinel?

And HTTP remotes have anonymous auth by default which is intuitive, why not public buckets?

This is not going to be a common option for all remotes; rather, we'll use the same convention for the ones that actually support anonymous login but not as a first-class option (unlike HTTP).

Not a huge deal either way but in principle I think if the user loses flexibility we should think this twice

AFAIK it wasn't even possible to do anonymous login with Azure before DVC 2.0, so I don't think we are restricting users in any way, just making the whole authentication flow more solid.

isidentical commented 3 years ago

Is the question whether we should always try anonymous login when other methods fail (vs. requiring allow_anonymous_login = true)?

Yes. Should we warn users if they forget to set secondary config values (like account_key), or just let it slide if the user explicitly declares that they might end up in an anonymous-login situation?

jorgeorpinel commented 3 years ago

As you guys prefer, @isidentical. I just wanted to express my PoV in case it was useful. I'm still wondering if there's a way to improve messaging without needing the option though.

We could include allow_anonymous_login= in those error messages

TBH I didn't quite get the suggestion but it sounds like it could help the UX so 👍

dberenbaum commented 3 years ago

I think the explicit option here makes sense. Unlike HTTP, login to S3 and Azure requires credentials by default (see the Stack Overflow link above).

@isidentical Do you still have a question about what the behavior should be for allow_anonymous_login? IMO we should follow the UI of those remotes. For example, if you use --no-sign-request in awscli, does it ignore credentials?

shcheklein commented 3 years ago

(no opinion on my end, but maybe it's worth taking a look at how tools like rclone do this)

isidentical commented 3 years ago

rclone seems to use anonymous login when both of the credentials are blank: https://rclone.org/s3/#anonymous-access-to-public-buckets

dberenbaum commented 3 years ago

Here's some background on the s3fs and rclone implementations for anonymous login and authentication:

s3fs

https://github.com/dask/s3fs/blob/753ee7bbf6d3cf1dd258f6827fcb9126d5c2bbe8/s3fs/core.py (code)
https://github.com/dask/s3fs/pull/51 (PR)
https://github.com/dask/dask/issues/1178 (related issue)

The parameter description gives a nice summary:

Parameters
----------
anon : bool (False)
    Whether to use anonymous connection (public buckets only). If False,
    uses the key/secret given, or boto's credential resolver (client_kwargs,
    environment variables, config files, EC2 IAM server, in that order)
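
As a quick illustration of that parameter (a minimal sketch; the bucket name is made up):

import s3fs

# anon=True skips boto's credential resolver entirely (public buckets only).
fs = s3fs.S3FileSystem(anon=True)
print(fs.ls("some-public-bucket"))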

rclone

https://rclone.org/s3/#anonymous-access-to-public-buckets (docs for anonymous login)
https://rclone.org/s3/#authentication (docs for auth generally)
https://github.com/rclone/rclone/issues/154 (issue)

The Authentication section of the docs provides a summary:

Authentication

There are a number of ways to supply rclone with a set of AWS credentials, with and without using the environment.

The different authentication methods are tried in this order:

Directly in the rclone configuration file (env_auth = false in the config file):
    access_key_id and secret_access_key are required.
    session_token can be optionally set when using AWS STS.
Runtime configuration (env_auth = true in the config file):
    Export the following environment variables before running rclone:
        Access Key ID: AWS_ACCESS_KEY_ID or AWS_ACCESS_KEY
        Secret Access Key: AWS_SECRET_ACCESS_KEY or AWS_SECRET_KEY
        Session Token: AWS_SESSION_TOKEN (optional)
    Or, use a named profile:
        Profile files are standard files used by AWS CLI tools
        By default it will use the profile in your home directory (e.g. ~/.aws/credentials on unix-based systems) and the "default" profile; to change this, set these environment variables:
            AWS_SHARED_CREDENTIALS_FILE to control which file.
            AWS_PROFILE to control which profile to use.
    Or, run rclone in an ECS task with an IAM role (AWS only).
    Or, run rclone on an EC2 instance with an IAM role (AWS only).
    Or, run rclone in an EKS pod with an IAM role that is associated with a service account (AWS only).

If none of these options actually end up providing rclone with AWS credentials then S3 interaction will be non-authenticated (see below).


In summary, s3fs requires anon to be true for public buckets, in which case it will not try any credentials. If anon is false (the default), s3fs will look for credentials for every other authentication method.

rclone will fall back to anonymous login if no credentials are provided for any other auth method. However, if env_auth is false (the default), rclone will not look for credentials for auth methods outside of the rclone config (so it will ignore environment variables, ~/.aws/credentials, and IAM roles).

In the end, both require some kind of flag to differentiate auth methods. As explained in https://github.com/dask/dask/issues/1178, this is needed for cases like an EC2 instance with an IAM role. In that case, S3 access may be available through the IAM role (https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2.html), but there is no easy way to differentiate between this circumstance and anonymous login. We could probably try IAM authentication first and then fall back to anonymous login here, but that seems potentially expensive if it's the only way to use an anonymous bucket.
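
For context, that "try credentials first, then fall back" approach would look roughly like the boto3 sketch below (illustrative only, not DVC code); the extra failed, signed request per operation is the cost mentioned above:

import boto3
from botocore import UNSIGNED
from botocore.client import Config
from botocore.exceptions import ClientError, NoCredentialsError

def list_objects(bucket):
    try:
        # First try the regular credential chain (env vars, config files, IAM role, ...).
        return boto3.client("s3").list_objects_v2(Bucket=bucket)
    except (NoCredentialsError, ClientError):
        # Only then retry with an unsigned request (public buckets only).
        s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
        return s3.list_objects_v2(Bucket=bucket)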

I think it still makes sense to have a flag for anonymous login or something similar, and the behavior should be (since this parallels s3fs and rclone):

* We could disregard any other login information if `allow_anonymous_login` is set to true, so that it would give users the ability to temporarily check whether their remote accepts/works with anonymous login (if that might be a use case, though I don't think so)

I have no idea whether the same logic should apply to Azure.

jorgeorpinel commented 3 years ago

have a flag for anonymous login or something similar, and the behavior should be ...

I'd rename it to something like anonymous_auth in that case, as "allow" would be misleading if it's more of an auth mode switch.

And for consistency I think it makes sense to have the same behavior for any other remote that supports public "anonymous" (non) auth — referring to Azure

dberenbaum commented 1 year ago

@efiop Do you know what it would take to implement this? See https://iterativeai.slack.com/archives/C01R00PPQ1L/p1678361774191489 for context.

efiop commented 1 year ago

@dberenbaum You are talking about S3 specifically, right? We already support that for Azure with the allow_anonymous_login dvc config option, and s3fs/gcsfs have similar options that we just need to add to the dvc config and pass through to the corresponding implementations. There are some questions about naming, but the rest is straightforward.
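
For reference, the gcsfs counterpart of s3fs's anon flag is passing token="anon" (a minimal sketch; the bucket name is made up):

import gcsfs

# token="anon" requests anonymous access, which only works for public buckets.
fs = gcsfs.GCSFileSystem(token="anon")
print(fs.ls("some-public-gcs-bucket"))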