microsoft / PlanetaryComputerExamples

Examples of using the Planetary Computer
MIT License
360 stars · 176 forks

SSL: CERTIFICATE_VERIFY_FAILED error #270

Closed rsignell-usgs closed 1 year ago

rsignell-usgs commented 1 year ago

I tried running the CMIP6 ensemble notebook locally (not on the hub) at my current organization and got:

SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain

I know I'm supposed to be pointing at a custom .pem file I have in my home dir, but I'm not sure how to specify that in the context of Planetary Computer notebooks.

In another notebook, I was able to overcome this by adding verify=false to the fsspec client_kwargs, but I know that (a) that's not a great solution, and (b) I'm not sure how to even do that here.
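For reference, the earlier workaround amounted to something like this (a sketch, not a recommendation, since it disables verification entirely; the session name here is just for illustration):

```python
import requests

# Sketch of the workaround: disabling certificate verification on a
# requests session. Insecure -- it stops the self-signed certificate
# from being rejected, but it also stops checking any certificate.
session = requests.Session()
session.verify = False

# The better fix is pointing verify at the custom CA bundle instead:
# session.verify = "/path/to/custom-root-ca.pem"
```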

TomAugspurger commented 1 year ago

Do you know which line gave you that error? Was it

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1/",
    modifier=planetary_computer.sign_inplace,
)

?

rsignell-usgs commented 1 year ago

Doh! Can't believe I didn't include that. But yep, you guessed right!

TomAugspurger commented 1 year ago

@rsignell-usgs can you try the snippet from https://pystac-client.readthedocs.io/en/stable/usage.html#using-custom-certificates? If you have the path to your pem file, something like

from pystac_client.stac_api_io import StacApiIO
from pystac_client.client import Client
stac_api_io = StacApiIO()
stac_api_io.session.verify = "/path/to/certfile"
client = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1", stac_io=stac_api_io)

should do the trick.

rsignell-usgs commented 1 year ago

I tried and this works without error:

import planetary_computer
from pystac_client.stac_api_io import StacApiIO
from pystac_client.client import Client
stac_api_io = StacApiIO()
stac_api_io.session.verify = "/home/la.signell/PA-RootCA-Cert-2023-Pub.pem"
catalog = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1", stac_io=stac_api_io, modifier=planetary_computer.sign_inplace)

but when I try the next step:

catalog.get_collection("cil-gdpcir-cc0")

it is busy for a long time (7.5 minutes!) before failing with:

MaxRetryError: HTTPSConnectionPool(host='planetarycomputer.microsoft.com', port=443): Max retries exceeded with url: /api/sas/v1/token/rhgeuwest/cil-gdpcir (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1007)')))

TomAugspurger commented 1 year ago

Hmm that's unfortunate. That's eventually coming from the call to our SAS API, using the requests session at https://github.com/microsoft/planetary-computer-sdk-for-python/blob/80006b988948d4e1decc0b4f1afd01dd06cb41de/planetary_computer/sas.py#L444-L460.

Unfortunately, that session is created in the function, so you wouldn't have a chance to modify it like your earlier example.
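One thing that might work around that (untested sketch): requests consults the REQUESTS_CA_BUNDLE environment variable when no explicit verify is set, so exporting it before the SDK builds its internal session should make that session trust your custom root certificate too.

```python
import os

# Untested sketch: requests honors the REQUESTS_CA_BUNDLE environment
# variable (when trust_env is on, the default), so setting it before the
# planetary-computer SDK creates its internal session should make that
# session use the custom root certificate for verification.
os.environ["REQUESTS_CA_BUNDLE"] = "/path/to/certfile"
```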

You might be best off removing the modifier=planetary_computer.sign_inplace and emulating what the planetary-computer package does yourself. Something like

from pystac_client.stac_api_io import StacApiIO
from pystac_client.client import Client
import xarray as xr

stac_api_io = StacApiIO()
stac_api_io.session.verify = "/path/to/certfile"  # your custom root certificate
catalog = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1", stac_io=stac_api_io)

catalog = catalog.get_collection("cil-gdpcir-cc0")

# get the SAS token
# use either /token/collection-name
# or         /token/account-name/container-name
sas_token = stac_api_io.session.get("https://planetarycomputer.microsoft.com/api/sas/v1/token/cil-gdpcir-cc0").json()["token"]

item = next(catalog.get_all_items())

# Update the items to inject the SAS token
for k, v in item.assets.items():
    v.extra_fields["xarray:open_kwargs"]["storage_options"]["sas_token"] = sas_token

asset = item.assets["pr"]
ds = xr.open_dataset(asset.href, **asset.extra_fields["xarray:open_kwargs"])
ds

I suspect that next you'll say that the call to Azure Blob Storage will fail, but let's see :)

rsignell-usgs commented 1 year ago

Okay, I'm getting close, but:

for k, v in item.assets.items():
    v.extra_fields["xarray:open_kwargs"]["storage_options"]["sas_token"] = sas_token

failed with:

KeyError: 'xarray:open_kwargs'

so I tried:

# Update the items to inject the SAS token
for k, v in item.assets.items():
    v.extra_fields['xarray:open_kwargs']={'storage_options':{'sas_token':sas_token}}

and that worked, but then:

asset = item.assets["pr"]
ds = xr.open_dataset(asset.href, **asset.extra_fields["xarray:open_kwargs"])
ds

produced

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[18], line 2
      1 asset = item.assets["pr"]
----> 2 ds = xr.open_dataset(asset.href, **asset.extra_fields["xarray:open_kwargs"])
      3 ds

File ~/miniforge3/envs/pangeo/lib/python3.10/site-packages/xarray/backends/api.py:566, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, inline_array, chunked_array_type, from_array_kwargs, backend_kwargs, **kwargs)
    554 decoders = _resolve_decoders_kwargs(
    555     decode_cf,
    556     open_backend_dataset_parameters=backend.open_dataset_parameters,
   (...)
    562     decode_coords=decode_coords,
    563 )
    565 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 566 backend_ds = backend.open_dataset(
    567     filename_or_obj,
    568     drop_variables=drop_variables,
    569     **decoders,
    570     **kwargs,
    571 )
    572 ds = _dataset_from_backend_dataset(
    573     backend_ds,
    574     filename_or_obj,
   (...)
    584     **kwargs,
    585 )
    586 return ds

TypeError: NetCDF4BackendEntrypoint.open_dataset() got an unexpected keyword argument 'storage_options'

TomAugspurger commented 1 year ago

Interesting... this is still with an item from cil-gdpcir-cc0?

Maybe it's best to just use the href off the asset and be explicit about everything else. So something like

ds = xr.open_dataset(asset.href, engine="zarr", consolidated=True, chunks={},
                     storage_options={"account_name": "rhgeuwest", "sas_token": sas_token})

"account_name" here refers to the storage account the data live in. For https://rhgeuwest.blob.core.windows.net it's rhgeuwest. You can get this off the collection metadata at msft:storage_account.
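A minimal sketch of pulling that value out (assuming pystac exposes the collection's extra JSON under extra_fields; the helper name here is made up):

```python
def storage_account(collection_extra_fields: dict) -> str:
    """Read the Azure storage account name recorded on a Planetary
    Computer collection under the `msft:storage_account` key."""
    return collection_extra_fields["msft:storage_account"]

# With a live collection this would be:
#   storage_account(catalog.get_collection("cil-gdpcir-cc0").extra_fields)
# Against a stub of the collection metadata:
print(storage_account({"msft:storage_account": "rhgeuwest"}))  # rhgeuwest
```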

rsignell-usgs commented 1 year ago

TypeError: ClientSession._request() got an unexpected keyword argument 'account_name'

rsignell-usgs commented 1 year ago

Sorry to be such a dunce here. Maybe Monday my synapses will be firing.

TomAugspurger commented 1 year ago

Can you print out your asset.href? It might be an https url.

If so, you could scrap the storage_options={"account_name": "rhgeuwest", "sas_token": sas_token}, and instead append the sas_token to the url:

ds = xr.open_dataset(asset.href + f"?{sas_token}", engine="zarr", consolidated=True, chunks={})

That said, I think the cil-gdpcir items all use abfs, since fsspec didn't quite handle query parameters on Zarr files like we needed.

rsignell-usgs commented 1 year ago

@TomAugspurger , yes, they are abfs:!

This works, returning data:

import xarray as xr
from pystac_client.stac_api_io import StacApiIO
from pystac_client.client import Client
stac_api_io = StacApiIO()
stac_api_io.session.verify = "/home/rsignell/PA-RootCA-Pub.pem"
catalog = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1", stac_io=stac_api_io)

sas_token = stac_api_io.session.get("https://planetarycomputer.microsoft.com/api/sas/v1/token/cil-gdpcir-cc0").json()["token"]

catalog = catalog.get_collection("cil-gdpcir-cc0")

item = next(catalog.get_all_items())

asset = item.assets["pr"]

ds = xr.open_dataset(asset.href, engine="zarr", consolidated=True, chunks={}, 
                     storage_options={"account_name": "rhgeuwest", "sas_token": sas_token})

Thanks for sticking with me on this!

rsignell commented 3 months ago

@TomAugspurger, revisiting this issue as I'm working with the NATO folks again on accessing the Sentinel-1 data, for which the item assets do have hrefs that are https: URLs.

So I can open these successfully in xarray by appending the sas_token as you suggest:

tokenized_href = f"{item.assets['hh'].href}?{sas_token}"
da = xr.open_dataset(tokenized_href, engine='rasterio', chunks={'x':512, 'y':512}).squeeze(drop=True)

but of course I really want to access the data using stackstac or odc.stac.

Do you know how I would pass my sas_token to those tools?

TomAugspurger commented 3 months ago

In the planetary-computer Python package, we walk the assets of each item and, if an asset looks like a Blob Storage URL, we replace its href with one that includes a SAS token. All that logic is at https://github.com/microsoft/planetary-computer-sdk-for-python/blob/main/planetary_computer/sas.py. I think you could do something similar.

At that point the asset hrefs are just regular HTTPS URLs that all of those tools understand.
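A rough sketch of that idea, operating on plain href strings rather than pystac objects (the real sas.py handles more cases; the function name here is made up):

```python
from urllib.parse import urlparse

def sign_hrefs(hrefs: dict, sas_token: str) -> dict:
    """Append a SAS token to every asset href that points at Azure Blob
    Storage, leaving other hrefs untouched. A simplified sketch of what
    planetary_computer.sign does for an item's assets."""
    signed = {}
    for key, href in hrefs.items():
        if urlparse(href).netloc.endswith(".blob.core.windows.net"):
            signed[key] = f"{href}?{sas_token}"
        else:
            signed[key] = href
    return signed

hrefs = {
    "hh": "https://rhgeuwest.blob.core.windows.net/container/scene.tif",
    "thumbnail": "https://example.com/thumb.png",
}
print(sign_hrefs(hrefs, "sv=2021&sig=abc")["hh"])
# https://rhgeuwest.blob.core.windows.net/container/scene.tif?sv=2021&sig=abc
```

Once the hrefs are rewritten this way, stackstac / odc.stac can load them like any other signed Planetary Computer item.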

rsignell commented 3 months ago

Doh! Okay, makes sense!

rsignell commented 3 months ago

I got odc.stac to work using this approach! https://nbviewer.org/gist/rsignell/c32949c59574beca9262e27dcc1de2eb Thanks for the help @TomAugspurger !