nsidc / earthaccess

Python Library for NASA Earthdata APIs
https://earthaccess.readthedocs.io/
MIT License
396 stars 79 forks source link

[proposed enhancement] AWS us-west-2 checking method #231

Open battistowx opened 1 year ago

battistowx commented 1 year ago

In some of our DAAC notebooks, we include the following Boto3 snippet to check if the notebook is being executed inside us-west-2, and throws a ValueError (with emojis) if you are not, preventing the notebook from being fully executed:

if (boto3.client('s3').meta.region_name == 'us-west-2'):
    display(Markdown('### us-west-2 Region Check: ✅'))
else:
    display(Markdown('### us-west-2 Region Check: ❌'))
    raise ValueError('Your notebook is not running inside the AWS us-west-2 region, and will not be able to directly access NASA Earthdata S3 buckets')

It may be useful to include a method that the user can call to check if they are in us-west-2 for direct S3 access and will throw an error like this, possibly using an fsspec transitive dependency, or the existing authorization checks.

betolink commented 1 year ago

This sounds great @battistowx we should pair this week or the next one if you have time, it should be a simple thing to implement with your code!

battistowx commented 1 year ago

Found methods to use requests to grab the current instance's region from the locally-linked Instance Identity Document, and it works great! https://gist.github.com/doublenns/7e3e4b72df4aaeccbeabf87ba767f44e

Wondering if this should be its own function that a user could call anywhere, or if it should be called only when an S3 granule is opened, which would print a detailed exception statement.

JessicaS11 commented 9 months ago

I'd be interested in helping implement this in some fashion - we're updating related functionality in icepyx and I'd love to just put it upstream and then use it.

betolink commented 9 months ago

I think it would be great if we can expose earthaccess.__store__.in_region to earthaccess.in_region(default="us-west-2") and perhaps using an nicely formatted repr output besides the boolean value?

JessicaS11 commented 8 months ago

Could someone clarify for #424:

was the spirit to check specifically for us-west-2, or to enable the user to see what region they are running in?

Thanks!

chuckwondo commented 2 weeks ago

Perhaps a silly question, but why do we need to make such an explicit check? I ran across another thread somewhere (I can't find it at the moment) where it seems that this is a hard problem (impossible in some cases?) to get such a check to work reliably/confidently in various environments.

What specifically is such a check to help a user do or not do or avoid?

jhkennedy commented 2 weeks ago

A lot of this has been discussed in #444.

@JessicaS11 my general understanding is that we want to provide an up-front check to confirm the "direct access tokens" requested by an Earthdata Login user will work because you can request and receive them from anywhere (locally/on-prem, another cloud, in a different region than the, in the same region as the data etc.) but you'll get an Access Denied error unless you are in the same AWS region as the data. Many users we work with don't really understand the cloud in general, let alone the specifics of AWS regions, and "looking before you leap" (LBYL) is helpful for users. This kind of check could be done right up front in a notebook so users don't waste time on something that's ultimately not going to work. Is that correct, or did I miss anything?

Sorry to be a downer here, but I'm now generally of the opinion that we should not provide a LBYL method/functionality like this at all. There are many challenges in doing this well, as described in #444, starting with https://github.com/nsidc/earthaccess/issues/444#issuecomment-1930731605. As the discussion has gone on, more problems and edge cases have come up.

I think for this, we'll be stuck with an easier-to-ask forgiveness-than-permission (EAFP) pattern, and time would be best spent on robust error handling/warnings so users understand what's happening. And really, for the high-level methods in earthaccess (e.g., smart_open), users don't actually have to care about where the data is other than it might be slow to work with the data in their location.

That said, there's one possible way to provide a LBYL method: actually open a file and see if you end up with a signed S3 URL in the region you're interested in (https://github.com/nsidc/earthaccess/issues/444#issuecomment-1940489666), which should probably be a granule-level method (you need something to open); something like granule.check_region('us-west-2') or granule.confirm_direct_s3_access(). There are a lot of other things that can go wrong in this process (e.g., EULAs, outages, etc.), so we especially want to handle errors well with this method.

A high-level LBYL method could be implemented if we knew of good "canary" files or hosted canary files in every region (or at least the region(s) of interest), but that's a lot of infrastructure to maintain and potential cloud costs we don't have a clear way to pay for.

chuckwondo commented 2 weeks ago

@jhkennedy, thanks for jumping in and supplying the link to the thread I was looking for.

I'm glad to hear that you are in favor of not trying to do this (if I understand your comment correctly -- if not, please correct me) precisely because of the difficulties involved. This is why I asked the question, because I also feel that we should not be trying to implement such a LBYL function/method for the user.

Instead, as you mentioned, we should do our best to produce a helpful error message with guidance for the user, to minimize (or eliminate) a great deal of head-scratching.

itcarroll commented 2 weeks ago

@jhkennedy @chuckwondo

Could one of you clarify how earthaccess.open would work without the LBYL/in-region check? Would it always first attempt direct access and fall back on https? Would the user have to enable logging to see how they are opening a granule, or know how to check the type of opened file? I understand the challenges of implementing the check, but want to understand the proposed UX without such a check. I'm skeptical users will be happily agnostic; I think they want confirmation they're doing the cloud thingy correctly.

A thought: if we fix the cloud_hosted filter so that it's useful (#565), we could use that as a user input on whether or not to raise an error when direct access fails.