Closed rsignell-usgs closed 1 year ago
Do you know which line gave you that error? Was it
catalog = pystac_client.Client.open(
"https://planetarycomputer.microsoft.com/api/stac/v1/",
modifier=planetary_computer.sign_inplace,
)
?
Doh! Can't believe I didn't include that. But yep, you guessed right!
@rsignell-usgs can you try the snippet from https://pystac-client.readthedocs.io/en/stable/usage.html#using-custom-certificates? If you have the path to your pem file, something like
from pystac_client.stac_api_io import StacApiIO
from pystac_client.client import Client
stac_api_io = StacApiIO()
stac_api_io.session.verify = "/path/to/certfile"
client = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1", stac_io=stac_api_io)
should do the trick.
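If you'd rather not build the `StacApiIO` yourself, `requests` (which pystac-client uses under the hood) also honors the `REQUESTS_CA_BUNDLE` environment variable, so exporting it before any requests are made should work too. A minimal sketch with a placeholder path:

```python
import os

# Placeholder path -- point this at your actual .pem file.
os.environ["REQUESTS_CA_BUNDLE"] = "/path/to/certfile"

import requests

# A plain requests.Session picks the variable up (trust_env is True by default);
# merge_environment_settings shows what verify= the session would actually use.
session = requests.Session()
settings = session.merge_environment_settings(
    "https://planetarycomputer.microsoft.com/api/stac/v1", {}, None, None, None
)
print(settings["verify"])  # -> /path/to/certfile
```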
I tried and this works without error:
from pystac_client.stac_api_io import StacApiIO
from pystac_client.client import Client
import planetary_computer

stac_api_io = StacApiIO()
stac_api_io.session.verify = "/home/la.signell/PA-RootCA-Cert-2023-Pub.pem"
catalog = Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    stac_io=stac_api_io,
    modifier=planetary_computer.sign_inplace,
)
but when I try the next step:
catalog.get_collection("cil-gdpcir-cc0")
it is busy for a long time (7.5 minutes!) before failing with:
MaxRetryError: HTTPSConnectionPool(host='planetarycomputer.microsoft.com', port=443): Max retries exceeded with url: /api/sas/v1/token/rhgeuwest/cil-gdpcir (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1007)')))
Hmm that's unfortunate. That's eventually coming from the call to our SAS API, using the requests session at https://github.com/microsoft/planetary-computer-sdk-for-python/blob/80006b988948d4e1decc0b4f1afd01dd06cb41de/planetary_computer/sas.py#L444-L460.
Unfortunately, that session is created in the function, so you wouldn't have a chance to modify it like your earlier example.
You might be best off emulating what the planetary-computer package does yourself, removing the modifier=planetary_computer.sign_inplace argument. Something like
from pystac_client.stac_api_io import StacApiIO
from pystac_client.client import Client
import xarray as xr

stac_api_io = StacApiIO()
stac_api_io.session.verify = "/path/to/certfile"  # your .pem file
catalog = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1", stac_io=stac_api_io)
collection = catalog.get_collection("cil-gdpcir-cc0")

# get the SAS token
# use either /token/collection-name
# or /token/account-name/container-name
sas_token = stac_api_io.session.get(
    "https://planetarycomputer.microsoft.com/api/sas/v1/token/cil-gdpcir-cc0"
).json()["token"]

item = next(collection.get_all_items())

# Update the item's assets to inject the SAS token
for k, v in item.assets.items():
    v.extra_fields["xarray:open_kwargs"]["storage_options"]["sas_token"] = sas_token

asset = item.assets["pr"]
ds = xr.open_dataset(asset.href, **asset.extra_fields["xarray:open_kwargs"])
ds
I suspect that next you'll say that the call to Azure Blob Storage will fail, but let's see :)
Okay, I'm getting close, but:
for k, v in item.assets.items():
    v.extra_fields["xarray:open_kwargs"]["storage_options"]["sas_token"] = sas_token
failed with:
KeyError: 'xarray:open_kwargs'
so I tried:
# Update the items to inject the SAS token
for k, v in item.assets.items():
    v.extra_fields["xarray:open_kwargs"] = {"storage_options": {"sas_token": sas_token}}
and that worked, but then:
asset = item.assets["pr"]
ds = xr.open_dataset(asset.href, **asset.extra_fields["xarray:open_kwargs"])
ds
produced
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[18], line 2
1 asset = item.assets["pr"]
----> 2 ds = xr.open_dataset(asset.href, **asset.extra_fields["xarray:open_kwargs"])
3 ds
File ~/miniforge3/envs/pangeo/lib/python3.10/site-packages/xarray/backends/api.py:566, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, inline_array, chunked_array_type, from_array_kwargs, backend_kwargs, **kwargs)
554 decoders = _resolve_decoders_kwargs(
555 decode_cf,
556 open_backend_dataset_parameters=backend.open_dataset_parameters,
(...)
562 decode_coords=decode_coords,
563 )
565 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 566 backend_ds = backend.open_dataset(
567 filename_or_obj,
568 drop_variables=drop_variables,
569 **decoders,
570 **kwargs,
571 )
572 ds = _dataset_from_backend_dataset(
573 backend_ds,
574 filename_or_obj,
(...)
584 **kwargs,
585 )
586 return ds
TypeError: NetCDF4BackendEntrypoint.open_dataset() got an unexpected keyword argument 'storage_options'
Interesting... this is still with an item from cil-gdpcir-cc0?
Maybe it's best to just use the href
object off the asset and be explicit about everything else. So something like
ds = xr.open_dataset(asset.href, engine="zarr", consolidated=True, chunks={}, storage_options={"account_name": "rhgeuwest", "sas_token": sas_token})
"account_name" here refers to the storage account the data live in. For https://rhgeuwest.blob.core.windows.net it's rhgeuwest. You can get this off the collection metadata at msft:storage_account.
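For reference, here's how that lookup would go, sketched on a trimmed-down stand-in for the collection record (with pystac-client the real fields would come from `catalog.get_collection(...).extra_fields`):

```python
# Trimmed-down stand-in for the collection record returned by the STAC API;
# only the fields we care about are shown.
extra_fields = {"msft:storage_account": "rhgeuwest", "msft:container": "cil-gdpcir"}

storage_options = {
    "account_name": extra_fields["msft:storage_account"],
    # "sas_token" would come from the /api/sas/v1/token/... request shown earlier
}
print(storage_options["account_name"])  # -> rhgeuwest
```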
TypeError: ClientSession._request() got an unexpected keyword argument 'account_name'
Sorry to be such a dunce here. Maybe Monday my synapses will be firing.
Can you print out your asset.href? It might be an https URL. If so, you could scrap the storage_options={"account_name": "rhgeuwest", "sas_token": sas_token} and instead append the sas_token to the URL:
ds = xr.open_dataset(f"{asset.href}?{sas_token}", engine="zarr", consolidated=True, chunks={})
That said, I think the cil-gdpcir items all use abfs, since fsspec didn't quite handle query parameters on Zarr files like we needed.
@TomAugspurger, yes, they are abfs!
This works, returning data:
from pystac_client.stac_api_io import StacApiIO
from pystac_client.client import Client
import xarray as xr

stac_api_io = StacApiIO()
stac_api_io.session.verify = "/home/rsignell/PA-RootCA-Pub.pem"
catalog = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1", stac_io=stac_api_io)
sas_token = stac_api_io.session.get("https://planetarycomputer.microsoft.com/api/sas/v1/token/cil-gdpcir-cc0").json()["token"]
collection = catalog.get_collection("cil-gdpcir-cc0")
item = next(collection.get_all_items())
asset = item.assets["pr"]
ds = xr.open_dataset(
    asset.href,
    engine="zarr",
    consolidated=True,
    chunks={},
    storage_options={"account_name": "rhgeuwest", "sas_token": sas_token},
)
Thanks for sticking with me on this!
@TomAugspurger, revisiting this issue as I'm working with the NATO folks again on accessing the Sentinel-1 data, for which the item assets do have hrefs that are https URLs.
So I can open these successfully in xarray by appending the sas_token as you suggested:
tokenized_href = f"{item.assets['hh'].href}?{sas_token}"
da = xr.open_dataset(tokenized_href, engine="rasterio", chunks={"x": 512, "y": 512}).squeeze(drop=True)
but of course I really want to access the data using stackstac or odc.stac. Do you know how I would pass my sas_token to those tools?
In the planetary-computer Python package, we walk the assets of each item and, if an asset looks like a Blob Storage URL, we replace its href with one that includes a SAS token. All that logic is at https://github.com/microsoft/planetary-computer-sdk-for-python/blob/main/planetary_computer/sas.py. I think you could do something similar.
At that point the asset hrefs are just regular HTTPS URLs that all of those tools understand.
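A stripped-down sketch of that walk, assuming the token has already been fetched the way we did above (the function names here are mine, not the package's):

```python
from urllib.parse import urlparse

def sign_asset_href(href: str, sas_token: str) -> str:
    """Append a SAS token to an Azure Blob Storage href; leave other hrefs alone."""
    parsed = urlparse(href)
    if parsed.netloc.endswith(".blob.core.windows.net") and not parsed.query:
        return f"{href}?{sas_token}"
    return href

def sign_item_assets(item, sas_token):
    """Rewrite every asset href on a pystac Item (anything with .assets) in place."""
    for asset in item.assets.values():
        asset.href = sign_asset_href(asset.href, sas_token)
    return item

# After signing, the hrefs are plain HTTPS URLs that stackstac / odc.stac accept:
signed = sign_asset_href(
    "https://rhgeuwest.blob.core.windows.net/cil-gdpcir/pr.zarr", "st=2023&sig=abc"
)
print(signed)
```

You'd call sign_item_assets(item, sas_token) on each item from your search before handing the list to stackstac.stack or odc.stac.load.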
Doh! Okay, makes sense!
I got odc.stac to work using this approach! https://nbviewer.org/gist/rsignell/c32949c59574beca9262e27dcc1de2eb Thanks for the help @TomAugspurger !
I tried running the CMIP6 ensemble notebook locally (not on the hub) at my current organization and got:
I know I'm supposed to be pointing at a custom .pem file I have in my home dir, but not sure how to specify that in the context of planetary computer notebooks.
In another notebook, I was able to overcome this by adding verify=False to the fsspec client_kwargs, but I know that (a) that's not a great solution, and (b) I'm not sure how to even do that here.
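One less drastic option than verify=False: several of the libraries involved will pick a custom CA bundle up from environment variables, so setting these at the top of the notebook (with your actual .pem path) may cover requests, GDAL/rasterio, and Python's ssl module in one go. A sketch, with the caveat that SSL_CERT_FILE replaces the default trust store, so the .pem needs to contain the full chain:

```python
import os

pem = "/path/to/PA-RootCA-Pub.pem"  # placeholder -- your .pem file

os.environ["REQUESTS_CA_BUNDLE"] = pem  # requests / pystac-client
os.environ["CURL_CA_BUNDLE"] = pem      # GDAL / rasterio / other curl-based readers
os.environ["SSL_CERT_FILE"] = pem       # Python's ssl default context (fsspec / aiohttp)
```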