nsidc / earthaccess

Python Library for NASA Earthdata APIs
https://earthaccess.readthedocs.io/
MIT License

Change us-west-2 check, as only working in EC2 instances #444

Open sabinehelm opened 7 months ago

sabinehelm commented 7 months ago

In pull request #424, API access to store._running_in_us_west_2() was added in the form of a printed statement telling the user whether they are running in AWS region us-west-2. Unfortunately, the check in store._running_in_us_west_2() only works on EC2 instances. It does not work, for example, on ECS (Elastic Container Service) instances running in us-west-2. So the big question is: is it intended that only EC2 instances in us-west-2 can access the data, or can it be any compute instance running in us-west-2?

In the second case, store._running_in_us_west_2() could be adapted to check the region with boto3, following the code snippet in issue #231:

import boto3

def _running_in_us_west_2() -> bool:
    # Compare the region of a default S3 client with us-west-2.
    if boto3.client('s3').meta.region_name == 'us-west-2':
        return True
    raise ValueError('Your notebook is not running inside the AWS us-west-2 region, and will not be able to directly access NASA Earthdata S3 buckets')

betolink commented 7 months ago

The check is intended to verify in-region execution but shouldn't be limited to EC2. I think your change would be a valid PR if it works the same way in EC2!

sabinehelm commented 7 months ago

Thanks @betolink. This is great to hear! I did not use exactly the same code snippet as given in #231; instead, we used the following to get the current region, which should also work on EC2:

import boto3

my_session = boto3.session.Session()
my_region = my_session.region_name

JessicaS11 commented 7 months ago

@sabinehelm Thanks for sharing your updated solution. We worked on this a bunch today and decided to try and use botocore directly: botocore.session.get_session().get_config_variable("region")

Can you confirm whether or not this will work for your use case?

jhkennedy commented 7 months ago

I don't think boto3/botocore is going to do what you want -- namely, determine which region you're actually running in.

Boto is going to pull the session information from your AWS config or AWS_* environment variables, so it's more checking what region you're configured to access (what APIs to hit) than what region you're actually running in.

For example, on my laptop:

>>> import botocore.session
>>> botocore.session.get_session().get_config_variable("region")
'us-west-2'

because I have my default region in my ~/.aws/config set up like so:

[default]
region = us-west-2

Likewise, if I instead do:

>>> import os
>>> import botocore.session
>>> os.environ['AWS_DEFAULT_REGION'] = 'us-east-1'
>>> botocore.session.get_session().get_config_variable("region")
'us-east-1'

AFAIK, from an EC2 instance, the only way to determine what region the AWS instance is running in is to hit this special IP address:

http://169.254.169.254/latest/meta-data/placement/region

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html#instance-metadata-ex-1

And you'll need to handle IMDSv1 or IMDSv2 metadata requests -- see this answer on SO: https://stackoverflow.com/a/77902397

This should work on ECS configured for EC2, but I don't know if it works on ECS configured for Fargate (I suspect it will though).
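
A minimal sketch of the metadata lookup described above, using only the Python standard library. It asks for an IMDSv2 session token first and, if that fails, falls back to a plain IMDSv1-style request. The function name get_ec2_region and the one-second timeout are illustrative assumptions, not part of earthaccess.

```python
import urllib.error
import urllib.request
from typing import Optional

IMDS_BASE = "http://169.254.169.254/latest"


def get_ec2_region(timeout: float = 1.0) -> Optional[str]:
    """Return the region reported by EC2 instance metadata, or None if unreachable."""
    headers = {}
    try:
        # IMDSv2: obtain a short-lived session token with a PUT request.
        token_req = urllib.request.Request(
            f"{IMDS_BASE}/api/token",
            method="PUT",
            headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
        )
        with urllib.request.urlopen(token_req, timeout=timeout) as resp:
            headers["X-aws-ec2-metadata-token"] = resp.read().decode()
    except (urllib.error.URLError, OSError):
        # No token available: try an IMDSv1-style request without the header.
        pass
    try:
        region_req = urllib.request.Request(
            f"{IMDS_BASE}/meta-data/placement/region", headers=headers
        )
        with urllib.request.urlopen(region_req, timeout=timeout) as resp:
            return resp.read().decode().strip() or None
    except (urllib.error.URLError, OSError):
        return None
```

On connection errors and timeouts (for example, when the metadata endpoint is blocked or absent) this returns None rather than raising, so the caller still needs a fallback.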

jhkennedy commented 7 months ago

It looks like the AWS_REGION environment variable is set on Fargate, so the botocore method should work there. It's not robust, though: it just checks environment variables, which are mutable and are primarily used to select which APIs to interact with, not to determine what region you're running in.

jhkennedy commented 7 months ago

On ECS, here's a summary of available metadata: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-metadata-endpoint.html

So, determining this is non-trivial and requires knowing a bit about the service you're running in.
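
A rough sketch for the ECS case, assuming version 4 of the task metadata endpoint: the container agent exposes the endpoint URL in the ECS_CONTAINER_METADATA_URI_V4 environment variable, and the /task response includes a TaskARN whose fourth colon-separated field is the region. The function names are hypothetical.

```python
import json
import os
import urllib.request
from typing import Optional


def region_from_arn(arn: str) -> Optional[str]:
    """Extract the region field from an ARN (arn:partition:service:region:account:resource)."""
    parts = arn.split(":")
    if len(parts) >= 6 and parts[0] == "arn" and parts[3]:
        return parts[3]
    return None


def get_ecs_task_region(timeout: float = 1.0) -> Optional[str]:
    """Return the region of the running ECS task, or None when not on ECS."""
    base = os.environ.get("ECS_CONTAINER_METADATA_URI_V4")
    if not base:
        # The variable is only injected inside ECS tasks.
        return None
    try:
        with urllib.request.urlopen(f"{base}/task", timeout=timeout) as resp:
            task = json.load(resp)
    except (OSError, ValueError):
        # Endpoint unreachable or response is not valid JSON.
        return None
    return region_from_arn(task.get("TaskARN", ""))
```

Parsing the ARN avoids depending on optional fields in the response, at the cost of assuming the standard ARN layout.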

sabinehelm commented 7 months ago

> @sabinehelm Thanks for sharing your updated solution. We worked on this a bunch today and decided to try and use botocore directly: botocore.session.get_session().get_config_variable("region")
>
> Can you confirm whether or not this will work for your use case?

@JessicaS11: Thanks for your response. I tested the code snippet. It would work for our use case. But I fear it is already outdated after the last comments.

JessicaS11 commented 7 months ago

Thanks @sabinehelm, and agreed!

@jhkennedy, can you comment on what you'd recommend for #231 and #424? If I understand what you're saying correctly, our current implementation in #424 would not properly keep the user from running a notebook out of region (or tell us if someone really is working out of region), depending on how their config parameters are set.

jhkennedy commented 7 months ago

> If I understand what you're saying correctly, our current implementation in https://github.com/nsidc/earthaccess/pull/424 would not properly keep the user from running a notebook out of region (or tell us if someone really is working out of region), depending on how their config parameters are set.

Yes, correct. The current implementation will determine which AWS API you're configured to hit. That could be something the user manually configured or the one set by a service, but since it's all via config files or environment variables, there's no good way of knowing which.

> @jhkennedy, can you comment on what you'd recommend for https://github.com/nsidc/earthaccess/issues/231 and https://github.com/nsidc/earthaccess/pull/424?

@JessicaS11 I'm not sure -- this is hard.

So, I'll start by saying the answer to:

> Was the spirit to check specifically for us-west-2, or to enable the user to see what region they are running in?

It should probably be yes to both, or at least confirm what region they are in.

So, overall, I think I'd recommend:

  1. creating an earthaccess.aws module with:
    1. get_ec2_region, which has most of the current implementation inside _running_in_us_west_2 https://github.com/nsidc/earthaccess/blob/main/earthaccess/store.py#L146-L155
    2. get_fargate_region, which would need to do the same, but for Fargate as detailed here: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/fargate-metadata.html
    3. (maybe?) get_ecs_container_region as detailed here: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/container-metadata.html
    4. (maybe) get_config_region which would use boto3/botocore as in #424
    5. And then a get_region (get_running_region?) method that'd go through the above 4 in order and return the first one that didn't raise (or return None depending on implementation). If it hit 4, I'd probably also throw a warning like "Region inferred from AWS config or environment variable and may not represent the region you're running in."

Then you could also have a convenience method like

def ensure_in_region(region: str = 'us-west-2') -> bool: ...  # in the proposed earthaccess.aws module

on top of it and use it in the Earthaccess Store.

Note: Actual method/function names could be improved
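
The fallback chain proposed above could be sketched roughly as follows. This is only an illustration under stated assumptions: the service-specific detectors are passed in as plain callables, get_config_region is a stand-in that reads environment variables instead of calling boto3/botocore, and every name here is hypothetical rather than existing earthaccess API.

```python
import os
import warnings
from typing import Callable, Optional, Sequence

Detector = Callable[[], Optional[str]]


def get_config_region() -> Optional[str]:
    """Stand-in for the config lookup: region from environment variables only."""
    return os.environ.get("AWS_REGION") or os.environ.get("AWS_DEFAULT_REGION")


def get_region(
    detectors: Sequence[Detector], fallback: Detector = get_config_region
) -> Optional[str]:
    """Return the first region a detector reports, else the configured region."""
    for detect in detectors:
        try:
            region = detect()
        except Exception:
            # A detector that raises is treated the same as one that found nothing.
            continue
        if region:
            return region
    region = fallback()
    if region:
        warnings.warn(
            "Region inferred from AWS config or environment variable and may "
            "not represent the region you're running in."
        )
    return region


def ensure_in_region(
    region: str = "us-west-2", detectors: Sequence[Detector] = ()
) -> bool:
    """True when the detected (or configured) region equals the requested one."""
    return get_region(detectors) == region
```

Passing the detectors in keeps the ordering explicit and lets each service-specific lookup be tested or swapped independently.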


All that said, this seems like something that should already exist so it'd be worth spending some time searching GitHub/PyPI... I don't see anything though, oddly.

jhkennedy commented 7 months ago

For EC2, it might be worth just using this package: https://github.com/adamchainz/ec2-metadata

It looks well maintained, is sponsored, and is a "critical" project on PyPI.

I still don't see anything for ECS and Fargate, however.

betolink commented 7 months ago

I like what you're proposing, @jhkennedy. One would think this should already exist in a package that works in all AWS execution environments: EC2, ECS, Lambda, etc.

yuvipanda commented 7 months ago

ec2-metadata will not work on any z2jh instances by default, as access to the metadata server is explicitly blocked by default (https://z2jh.jupyter.org/en/stable/administrator/security.html#block-cloud-metadata-api-with-a-privileged-initcontainer-running-iptables).

betolink commented 7 months ago

When I test the current approach to verifying in-region execution from Openscapes (2i2c), it works. Is this the same endpoint, @yuvipanda?

http://169.254.169.254/latest/meta-data/placement/region

https://github.com/nsidc/earthaccess/blob/e187ae001162d7b844e621c53abbbf26d87305f3/earthaccess/store.py#L140

yuvipanda commented 7 months ago

@betolink yes, because we've intentionally unblocked that access point in the openscapes hub :) But we coupled it with appropriate IRSA roles so it is secure. At least when I last looked, just unblocking access to the metadata server without setting up IRSA or similar was pretty insecure.

betolink commented 7 months ago

Ah, this reminds me of an issue the VEDA hub reported: when using earthaccess, the library didn't detect in-region execution and used the HTTP links. This is probably why. cc @abarciauskas-bgse

yuvipanda commented 7 months ago

Yeah, I'd suggest (similar to @jhkennedy elsewhere) possibly looking at the redirects that come back to figure out whatever you need to do internally, as ultimately everything else is only going to be a heuristic. For example, we're going to do https://github.com/2i2c-org/infrastructure/issues/3273 soon for the Openscapes hub, and I'm not sure what effect that will have on ec2-metadata.

abarciauskas-bgse commented 7 months ago

@betolink sorry for the delay here but I verified that an updated earthaccess (v0.8.2) does open via S3 direct access on VEDA, whereas the current version installed on VEDA's Hub (v0.5.2) does not properly register that the instance is in-region. Hopefully this will be resolved once https://github.com/NASA-IMPACT/veda-jh-environments/issues/41 is completed.

itcarroll commented 5 months ago

Overheard from maintainers of oss.smce.nasa.gov:

yes, we do indeed block the instance metadata (for security reasons), so the check that earthaccess is making is not ideal for what they’re trying to do.

meteodave commented 1 week ago

I am using an AWS pcluster instance, and I need to add earthaccess.__store__.in_region = True to my script to enable the S3 earthdata.download() transfer, even though I am already located in the us-west-2 region.