gjoseph92 opened 2 years ago
I just found this issue after working on a NASA-deployed JupyterHub instance that can access data on S3 without any additional configuration: I can do xr.open_dataset(<s3_url>, engine="rasterio") and it works fine. When I use stackstac, though, the default AWS config does not seem to be passed through.

As a workaround I can pass the default session in via gdal_env, kind of like https://github.com/gjoseph92/stackstac/discussions/154#discussioncomment-3961529:

gdal_env = stackstac.DEFAULT_GDAL_ENV.updated(always=dict(session=rio.session.AWSSession(boto3.Session())))

Does that seem like something that could be upstreamed into stackstac? Happy to open a PR if so.
Update: I just tried to use distributed with this setup and unsurprisingly the session is not picklable.
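One possible picklable alternative (a sketch, not something stackstac does today): GDAL's /vsis3/ driver reads credentials from the plain config options AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN, so passing those strings through gdal_env instead of an AWSSession object could sidestep the pickling problem. The values below are placeholders:

```python
import pickle

# Sketch: plain GDAL config options instead of a rasterio AWSSession.
# GDAL's /vsis3/ driver reads these standard options; the values here
# are placeholders standing in for real credentials.
aws_options = dict(
    AWS_ACCESS_KEY_ID="<access-key-id>",
    AWS_SECRET_ACCESS_KEY="<secret-access-key>",
    AWS_SESSION_TOKEN="<session-token>",
    AWS_REGION="us-west-2",
)

# Unlike an AWSSession, a dict of strings pickles cleanly, so it can be
# shipped to distributed workers:
roundtrip = pickle.loads(pickle.dumps(aws_options))
assert roundtrip == aws_options
```

These options could then go into stackstac.DEFAULT_GDAL_ENV.updated(always=aws_options) in place of the session object, though I haven't verified that end to end.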
+1 for inheriting rasterio environment!
This week I came across a weird case where I needed to read data from two S3 sources, each with different access credentials (a company bucket and a NASA bucket). Unfortunately, something about the AWS credentials that I passed to stackstac via gdal_env to read the NASA data seems to persist in the environment and break subsequent attempts to read from my company bucket!

I have my company AWS access credentials stored in environment variables, which has never failed me, but when I add separate credentials into the mix via gdal_env, I get unexpected results.

To access the NASA data directly from S3, you can get a set of temporary S3 credentials with your Earthdata login credentials. I figured out that I could pass those credentials to stackstac with the gdal_env argument, following ideas in #133 and https://github.com/gjoseph92/stackstac/discussions/154. This works great until I need to read data from the other private bucket!
I can't produce a truly reproducible example with the private bucket situation, but here is what I am seeing:
import os

import boto3
import pystac
import rasterio
import requests
import stackstac

items = pystac.ItemCollection(...)
# the items describe image assets in a private bucket that I can access with
# AWS credentials stored in environment variables
stack = stackstac.stack(items=items)

nasa_items = pystac.ItemCollection(...)

# request AWS credentials for direct read access
netrc_creds = {}
with open(os.path.expanduser("~/.netrc")) as f:
    for line in f:
        key, value = line.strip().split(" ")
        netrc_creds[key] = value

url = requests.get(
    "https://data.lpdaac.earthdatacloud.nasa.gov/s3credentials",
    allow_redirects=False,
).headers["Location"]
creds = requests.get(
    url, auth=(netrc_creds["login"], netrc_creds["password"])
).json()

nasa_stack = stackstac.stack(
    items=nasa_items,
    gdal_env=stackstac.DEFAULT_GDAL_ENV.updated(
        always=dict(
            session=rasterio.session.AWSSession(
                boto3.Session(
                    aws_access_key_id=creds["accessKeyId"],
                    aws_secret_access_key=creds["secretAccessKey"],
                    aws_session_token=creds["sessionToken"],
                    region_name="us-west-2",
                )
            )
        )
    ),
)

items = pystac.ItemCollection(...)
# the items describe image assets in a private bucket that I can access with
# AWS credentials stored in environment variables
stack = stackstac.stack(items=items)
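As an aside on the snippet above: the manual ~/.netrc parsing can be replaced with Python's stdlib netrc module. A small sketch, assuming the Earthdata entry lives under the conventional machine name urs.earthdata.nasa.gov (a temporary file with made-up credentials stands in for the real ~/.netrc here):

```python
import netrc
import os
import tempfile

# Sketch: parse netrc entries with the stdlib instead of splitting lines
# by hand. The machine entry and credentials below are made up.
sample = "machine urs.earthdata.nasa.gov login alice password s3cret\n"
with tempfile.NamedTemporaryFile("w", suffix=".netrc", delete=False) as f:
    f.write(sample)
    path = f.name

# authenticators() returns a (login, account, password) tuple
login, _account, password = netrc.netrc(path).authenticators(
    "urs.earthdata.nasa.gov"
)
os.unlink(path)
```

In real use you'd call netrc.netrc() with no argument to read ~/.netrc directly.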
This fails with AWS access denied errors! Maybe I am setting up gdal_env incorrectly, but I am surprised by the credential problems. I even tried setting gdal_env in my private bucket read operation, pulling credentials from environment variables via os, but it still didn't work.
A very basic read operation using rasterio.Env to set the AWS credentials via boto3.Session works as expected:
hls_tif = "s3://lp-prod-protected/HLSL30.020/HLS.L30.T15UXP.2022284T165821.v2.0/HLS.L30.T15UXP.2022284T165821.v2.0.Fmask.tif"
private_tif = "s3://private-bucket/lol.tif"

# read from NASA
with rasterio.Env(
    session=boto3.Session(
        aws_access_key_id=creds["accessKeyId"],
        aws_secret_access_key=creds["secretAccessKey"],
        aws_session_token=creds["sessionToken"],
        region_name="us-west-2",
    )
):
    with rasterio.open(hls_tif) as src:
        print(src.profile)

# read from private bucket
with rasterio.Env(
    session=boto3.Session(
        aws_access_key_id=os.environ.get("AWS_ACCESS_KEY_ID"),
        aws_secret_access_key=os.environ.get("AWS_SECRET_ACCESS_KEY"),
        region_name="us-east-1",
    )
):
    with rasterio.open(private_tif) as src:
        print(src.profile)

# read from NASA again
with rasterio.Env(
    session=boto3.Session(
        aws_access_key_id=creds["accessKeyId"],
        aws_secret_access_key=creds["secretAccessKey"],
        aws_session_token=creds["sessionToken"],
        region_name="us-west-2",
    )
):
    with rasterio.open(hls_tif) as src:
        print(src.profile)
My workaround for now is to do all of the work in my original private bucket first, then do the work in the NASA bucket afterwards. It works but it is not a satisfying solution!
Does it work if you have different sessions?
In https://github.com/gjoseph92/stackstac/issues/132 I noticed the snippet:

which doesn't currently work the way you'd expect (the environment settings you've just created will be ignored at compute time), but might be a pretty intuitive way to set extra GDAL options without mucking around with LayeredEnvs and the defaults. We could even deprecate support for passing in a LayeredEnv directly, since it's far more complexity than most users would need, and erring on the side of fewer options is usually better.

There's some complexity around the fact that theoretically different types of Readers are supported, though in practice this is not at all true. Nonetheless, it might be worth extending the Reader protocol to expose either a DEFAULT_ENV: ClassVar[LayeredEnv] attribute or a get_default_env() -> LayeredEnv classmethod. Then ultimately, within items_to_dask, we'd pull the default env for the specified reader type and merge it with any currently-set options (via rio.env.getenv()).
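The merge step at the end could look roughly like this sketch. merged_gdal_options is a hypothetical helper name, not stackstac API; inside items_to_dask the "current" dict would come from rio.env.getenv() when a rasterio environment is active, but it's passed in explicitly here so the behavior is easy to see:

```python
# Sketch of the proposed merge: reader defaults, overridden by whatever
# GDAL options are currently set in the active rasterio Env.
def merged_gdal_options(reader_defaults: dict, current: dict) -> dict:
    # options the user has set in an active Env win over the reader's
    # baked-in defaults
    return {**reader_defaults, **current}


defaults = {"GDAL_HTTP_MAX_RETRY": "3", "AWS_REGION": "us-east-1"}
active = {"AWS_REGION": "us-west-2"}  # e.g. from rio.env.getenv()
merged = merged_gdal_options(defaults, active)
# merged keeps GDAL_HTTP_MAX_RETRY from the defaults and takes
# AWS_REGION from the active environment
```

Dict-merge semantics keep this predictable: user settings always shadow defaults, key by key.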