microsoft / PlanetaryComputerExamples

Examples of using the Planetary Computer
MIT License


Closed rsignell-usgs closed 1 year ago

rsignell-usgs commented 1 year ago

I tried running the CMIP6 ensemble notebook locally (not on the hub) at my current organization and got:

SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain

I know I'm supposed to be pointing at a custom .pem file I have in my home dir, but not sure how to specify that in the context of planetary computer notebooks.

In another notebook, I was able to work around this by adding verify=False to the fsspec client_kwargs, but (a) that's not a great solution, and (b) I'm not sure how to even do that here.

TomAugspurger commented 1 year ago

Do you know which line gave you that error? Was it

catalog =


rsignell-usgs commented 1 year ago

Doh! Can't believe I didn't include that. But yep, you guessed right!

TomAugspurger commented 1 year ago

@rsignell-usgs can you try the snippet from If you have the path to your pem file, something like

from pystac_client.stac_api_io import StacApiIO
from pystac_client.client import Client
stac_api_io = StacApiIO()
stac_api_io.session.verify = "/path/to/certfile"
client = Client.open("", stac_io=stac_api_io)

should do the trick.

rsignell-usgs commented 1 year ago

I tried and this works without error:

from pystac_client.stac_api_io import StacApiIO
from pystac_client.client import Client
stac_api_io = StacApiIO()
stac_api_io.session.verify = "/home/la.signell/PA-RootCA-Cert-2023-Pub.pem"
catalog = Client.open("", stac_io=stac_api_io, modifier=planetary_computer.sign_inplace)

but when I try the next step:


it is busy for a long time (7.5 minutes!) before failing with:

MaxRetryError: HTTPSConnectionPool(host='', port=443): Max retries exceeded with url: /api/sas/v1/token/rhgeuwest/cil-gdpcir (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1007)')))

TomAugspurger commented 1 year ago

Hmm that's unfortunate. That's eventually coming from the call to our SAS API, using the requests session at

Unfortunately, that session is created in the function, so you wouldn't have a chance to modify it like your earlier example.
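One possible workaround, not tried in this thread: requests falls back to the REQUESTS_CA_BUNDLE environment variable when a session keeps its default trust_env=True and verify is left unset. Assuming the internally created session does not override those defaults, setting the variable before making any calls may be enough. The helper name use_ca_bundle below is hypothetical, a minimal sketch only:

```python
import os


def use_ca_bundle(pem_path: str) -> str:
    """Point requests-based clients at a custom CA bundle via the environment.

    Hypothetical helper: it only sets REQUESTS_CA_BUNDLE, which requests
    consults when a session has trust_env=True and verify is not overridden.
    """
    resolved = os.path.abspath(os.path.expanduser(pem_path))
    if not os.path.isfile(resolved):
        raise FileNotFoundError(f"CA bundle not found: {resolved}")
    os.environ["REQUESTS_CA_BUNDLE"] = resolved
    return resolved
```

Call it once at the top of the notebook, before any client or session is created. Libraries that do not go through requests will not pick this variable up.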

You might be best off emulating what the planetary-computer package does: drop modifier=planetary_computer.sign_inplace and inject the SAS token yourself. Something like

from pystac_client.stac_api_io import StacApiIO
from pystac_client.client import Client
import xarray as xr

stac_api_io = StacApiIO()
stac_api_io.session.verify = ""
catalog = Client.open("", stac_io=stac_api_io)

catalog = catalog.get_collection("cil-gdpcir-cc0")

# get the SAS token
# use either /token/collection-name
# or         /token/account-name/container-name
sas_token = stac_api_io.session.get("").json()["token"]

item = next(catalog.get_all_items())

# Update the items to inject the SAS token
for k, v in item.assets.items():
    v.extra_fields["xarray:open_kwargs"]["storage_options"]["sas_token"] = sas_token

asset = item.assets["pr"]
ds = xr.open_dataset(asset.href, **asset.extra_fields["xarray:open_kwargs"])

I suspect that next you'll say that the call to Azure Blob Storage will fail, but let's see :)
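For reference, the two token endpoint forms mentioned in the code comments can be sketched as a small path builder. The /api/sas/v1/token prefix is taken from the MaxRetryError earlier in the thread; the function name sas_token_path is hypothetical, and the host is deliberately left out:

```python
from typing import Optional

# Path prefix as it appears in the MaxRetryError message in this thread.
SAS_TOKEN_PREFIX = "/api/sas/v1/token"


def sas_token_path(
    collection: Optional[str] = None,
    account: Optional[str] = None,
    container: Optional[str] = None,
) -> str:
    """Build the token path from a collection name or an account/container pair."""
    if collection is not None and account is None and container is None:
        return f"{SAS_TOKEN_PREFIX}/{collection}"
    if collection is None and account is not None and container is not None:
        return f"{SAS_TOKEN_PREFIX}/{account}/{container}"
    raise ValueError("pass either collection, or both account and container")
```

The resulting path would be appended to the API host and fetched with the same stac_api_io.session, so the custom certificate setting is reused.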

rsignell-usgs commented 1 year ago

Okay, I'm getting close, but:

for k, v in item.assets.items():
    v.extra_fields["xarray:open_kwargs"]["storage_options"]["sas_token"] = sas_token

failed with:

KeyError: 'xarray:open_kwargs'
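A KeyError like this means the asset's extra_fields has no "xarray:open_kwargs" entry at all. One defensive way to inject the token, shown here only as a sketch and not necessarily what was tried in this thread, is to create the nested dictionaries on demand with setdefault. The SimpleNamespace object stands in for an asset, and inject_sas_token is a hypothetical name:

```python
from types import SimpleNamespace


def inject_sas_token(assets, sas_token):
    """Set storage_options["sas_token"] on every asset, creating missing dicts."""
    for asset in assets.values():
        open_kwargs = asset.extra_fields.setdefault("xarray:open_kwargs", {})
        storage_options = open_kwargs.setdefault("storage_options", {})
        storage_options["sas_token"] = sas_token


# Stand-in for item.assets: one asset whose extra_fields lacks the key.
assets = {"pr": SimpleNamespace(href="abfs://container/path.zarr", extra_fields={})}
inject_sas_token(assets, "example-token")
```

This avoids the KeyError, but an asset that arrives without "xarray:open_kwargs" also lacks the engine and other options, so those still need to be passed explicitly when opening.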

so I tried:

# Update the items to inject the SAS token
for k, v in item.assets.items():

and that worked, but then:

asset = item.assets["pr"]
ds = xr.open_dataset(asset.href, **asset.extra_fields["xarray:open_kwargs"])


TypeError                                 Traceback (most recent call last)
Cell In[18], line 2
      1 asset = item.assets["pr"]
----> 2 ds = xr.open_dataset(asset.href, **asset.extra_fields["xarray:open_kwargs"])
      3 ds

File ~/miniforge3/envs/pangeo/lib/python3.10/site-packages/xarray/backends/, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, inline_array, chunked_array_type, from_array_kwargs, backend_kwargs, **kwargs)
    554 decoders = _resolve_decoders_kwargs(
    555     decode_cf,
    556     open_backend_dataset_parameters=backend.open_dataset_parameters,
    562     decode_coords=decode_coords,
    563 )
    565 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 566 backend_ds = backend.open_dataset(
    567     filename_or_obj,
    568     drop_variables=drop_variables,
    569     **decoders,
    570     **kwargs,
    571 )
    572 ds = _dataset_from_backend_dataset(
    573     backend_ds,
    574     filename_or_obj,
    584     **kwargs,
    585 )
    586 return ds

TypeError: NetCDF4BackendEntrypoint.open_dataset() got an unexpected keyword argument 'storage_options'

TomAugspurger commented 1 year ago

Interesting... this is still with an item from cil-gdpcir-cc0?

Maybe it's best to just use the href object off the asset and be explicit about everything else. So something like

ds = xr.open_dataset(asset.href, engine="zarr", consolidated=True, chunks={}, storage_options={"account_name": "rhgeuwest", "sas_token": sas_token})

"account_name" here refers to the storage account the data live in. For this collection, it's rhgeuwest. You can get this off the collection metadata at msft:storage_account.
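To avoid hard-coding the account name, the storage options could be built from the collection metadata. A minimal sketch, assuming the metadata is available as a plain dictionary (for example a collection's extra_fields); the function name is hypothetical:

```python
def storage_options_from_collection(extra_fields, sas_token):
    """Build storage_options from collection metadata and a SAS token."""
    # "msft:storage_account" is the metadata key named in the comment above.
    account_name = extra_fields["msft:storage_account"]
    return {"account_name": account_name, "sas_token": sas_token}


# Stand-in metadata using the account name quoted in this thread.
options = storage_options_from_collection(
    {"msft:storage_account": "rhgeuwest"}, "example-token"
)
```

The returned dictionary is what would be passed as storage_options= in the xr.open_dataset call.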

rsignell-usgs commented 1 year ago

TypeError: ClientSession._request() got an unexpected keyword argument 'account_name'

rsignell-usgs commented 1 year ago

Sorry to be such a dunce here. Maybe Monday my synapses will be firing.

TomAugspurger commented 1 year ago

Can you print out your asset.href? It might be an https url.

If so, you could scrap the storage_options={"account_name": "rhgeuwest", "sas_token": sas_token}, and instead append the sas_token to the url:

ds = xr.open_dataset(asset.href + f"?{sas_token}", engine="zarr", consolidated=True, chunks={})

That said, I think the cil-gdpcir items all use abfs, since fsspec didn't quite handle query parameters on Zarr files like we needed.
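The two cases (https hrefs that take the token as a query string, abfs hrefs that take it through storage_options) can be told apart from the URL scheme. A rough sketch with a hypothetical helper name, returning the href to open and the storage_options to pass:

```python
from urllib.parse import urlsplit


def href_and_storage_options(href, sas_token, account_name):
    """Return (href, storage_options) suited to the href's URL scheme."""
    parts = urlsplit(href)
    if parts.scheme in ("http", "https"):
        # Append the token as a query string; no storage_options needed.
        separator = "&" if parts.query else "?"
        return f"{href}{separator}{sas_token}", {}
    if parts.scheme == "abfs":
        # Leave the href alone and hand the credentials to the filesystem.
        return href, {"account_name": account_name, "sas_token": sas_token}
    raise ValueError(f"unsupported scheme: {parts.scheme!r}")
```

Passing account_name only in the abfs case matters: with an https href the extra keyword ends up in the HTTP client, which is consistent with the ClientSession._request() TypeError reported above.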

rsignell-usgs commented 1 year ago

@TomAugspurger , yes, they are abfs:!

This works, returning data:

import xarray as xr
from pystac_client.stac_api_io import StacApiIO
from pystac_client.client import Client
stac_api_io = StacApiIO()
stac_api_io.session.verify = "/home/rsignell/PA-RootCA-Pub.pem"
catalog = Client.open("", stac_io=stac_api_io)

sas_token = stac_api_io.session.get("").json()["token"]

catalog = catalog.get_collection("cil-gdpcir-cc0")

item = next(catalog.get_all_items())

asset = item.assets["pr"]

ds = xr.open_dataset(asset.href, engine="zarr", consolidated=True, chunks={}, 
                     storage_options={"account_name": "rhgeuwest", "sas_token": sas_token})

Thanks for sticking with me on this!

rsignell commented 3 months ago

@TomAugspurger, revisiting this issue as I'm working with the NATO folks again on accessing the Sentinel-1 data, for which the item assets do have hrefs that are https: URLs.

So I can open these successfully in xarray by appending the sas_token as you suggest:

tokenized_href = f"{item.assets['hh'].href}?{sas_token}"
da = xr.open_dataset(tokenized_href, engine='rasterio', chunks={'x':512, 'y':512}).squeeze(drop=True)

but of course I really want to access the data using stackstac or odc.stac.

Do you know how I would pass my sas_token to those tools?

TomAugspurger commented 3 months ago

In the planetary-computer Python package, we walk the assets of each item and, if an asset looks like a Blob Storage URL, we replace the asset href with the HREF that includes a SAS token. All of that logic lives in the package, and I think you could do something similar.

At that point the asset HREFs are just regular HTTPS URLs that all of those tools understand.
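A minimal sketch of that idea, not the package's actual implementation: treat any https href whose host ends in .blob.core.windows.net as a Blob Storage URL and append the token to it. The function names are hypothetical, and the assets only need an href attribute:

```python
from urllib.parse import urlsplit

BLOB_HOST_SUFFIX = ".blob.core.windows.net"


def append_sas_token(href, sas_token):
    """Append the SAS token to Blob Storage https URLs; leave others unchanged."""
    parts = urlsplit(href)
    host = parts.hostname or ""
    if parts.scheme != "https" or not host.endswith(BLOB_HOST_SUFFIX):
        return href
    separator = "&" if parts.query else "?"
    return f"{href}{separator}{sas_token}"


def sign_assets(assets, sas_token):
    """Rewrite asset.href in place for every asset in the mapping."""
    for asset in assets.values():
        asset.href = append_sas_token(asset.href, sas_token)
```

Calling sign_assets(item.assets, sas_token) on each item before handing the items to a downstream tool would follow the approach described above. Note that this sketch is not idempotent: signing the same item twice appends the token twice.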

rsignell commented 3 months ago

Doh! Okay, makes sense!

rsignell commented 3 months ago

I got odc.stac to work using this approach! Thanks for the help @TomAugspurger !