microsoft / PlanetaryComputer

Issues, discussions, and information about the Microsoft Planetary Computer
https://planetarycomputer.microsoft.com/
MIT License

Accessing CMIP6 Zarr Files with R #366

Open laurenkwick opened 1 month ago

laurenkwick commented 1 month ago

Hi!

I want to start off by saying how amazing it is to work with the datasets available through Microsoft PC.

I'm currently having some trouble figuring out how to access the CMIP6 Public Domain Collection using R. According to a tutorial I'm referencing from the r-spatial blog, it is possible to read Zarr files with the R package stars, because GDAL has a Zarr driver that can read these data through its multidimensional array API. However, I don't know if the GDAL driver can implicitly handle the connection to Azure Blob Storage the way xarray does in Python.

For example, in the example notebook for CMIP6 CC0-1.0, the STAC item asset property xarray:open_kwargs informs how the dataset should be opened. The xarray:open_kwargs dictionary defines the chunks, the engine to use, whether metadata is consolidated, and the storage account name under storage_options.

import xarray as xr

# Open each asset with its xarray:open_kwargs and merge along shared coordinates
ds = xr.combine_by_coords(
    [
        xr.open_dataset(asset.href, **asset.extra_fields["xarray:open_kwargs"])
        for asset in item.assets.values()
    ],
    combine_attrs="drop_conflicts",
)
ds
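If it helps to see what storage_options is doing: my understanding is that the asset hrefs use fsspec's abfs:// protocol, and the account name in storage_options tells fsspec which Azure storage account to resolve them against. A rough Python sketch of that resolution (the href and account name below are illustrative, taken from URLs that appear later in this thread, not read from the STAC item):

```python
def abfs_to_https(href: str, account_name: str) -> str:
    """Translate an abfs://container/path href into the HTTPS blob URL
    that tools like GDAL's /vsicurl/ can understand."""
    assert href.startswith("abfs://")
    container_and_path = href[len("abfs://"):]
    return f"https://{account_name}.blob.core.windows.net/{container_and_path}"

url = abfs_to_https(
    "abfs://cil-gdpcir/CMIP/INM/INM-CM5-0/historical/r1i1p1f1/day/tasmax/v1.1.zarr",
    account_name="rhgeuwest",
)
print(url)
```

GDAL only understands the HTTPS form, which would explain why the abfs:// href alone doesn't work with /vsicurl/.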

When looking at the GDAL Zarr driver open options, I don't see a way to include all of these components (specifically the storage_options account name), so I end up receiving a "File Not Found" error. I'm not quite sure how xarray is able to open the Zarr dataset located in Azure Blob Storage, or whether this can be replicated in R through the GDAL Zarr driver. Below is the code I was using to test CMIP6 access with the rstac and stars libraries.

library(rstac)
library(stars)

# Planetary Computer STAC API connection
pc <- rstac::stac("https://planetarycomputer.microsoft.com/api/stac/v1")

# Just grab one item, then sign its asset URLs
cmi_item <- pc |>
  rstac::stac_search(collections = "cil-gdpcir-cc0",
                     ids = "cil-gdpcir-INM-INM-CM5-0-historical-r1i1p1f1-day") |>
  rstac::post_request() |>
  rstac::items_sign_planetary_computer()

# Grab only the tasmax asset
item_tasmax <- cmi_item |>
  assets_select(asset_names = "tasmax")

# Get the asset url
tasmax_url <- assets_url(item_tasmax)

# Set file name
dsn <- paste0("/vsicurl/", tasmax_url)

# Read data using GDAL's multidimensional array API
d <- read_mdim(dsn)

I understand that Zarr was developed in the Python numpy/xarray community; however, I am hoping to work with these data in R because the rest of my workflow is built in it. Is this possible?

Thank you!

TomAugspurger commented 1 month ago

I think that with the right URL GDAL will be able to read it. There is an msft:https-url for each asset, like https://rhgeuwest.blob.core.windows.net/cil-gdpcir/ScenarioMIP/INM/INM-CM5-0/ssp585/r1i1p1f1/day/pr/v1.1.zarr that could be used with /vsicurl?
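Something like this, assembled in Python (a sketch; the dict below just mimics what pystac would expose as asset.extra_fields, with the URL copied from above):

```python
# Hedged sketch: the msft:https-url lives in the asset's extra fields.
asset_extra_fields = {
    "msft:https-url": (
        "https://rhgeuwest.blob.core.windows.net/cil-gdpcir/"
        "ScenarioMIP/INM/INM-CM5-0/ssp585/r1i1p1f1/day/pr/v1.1.zarr"
    )
}

# Candidate DSN for GDAL over HTTP
dsn = "/vsicurl/" + asset_extra_fields["msft:https-url"]
print(dsn)
```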

laurenkwick commented 1 month ago

Hey, thanks for the quick response.

I tried using the msft:https-url and I'm still running into some issues:

> dsn <- 'ZARR:"/vsicurl/https://rhgeuwest.blob.core.windows.net/cil-gdpcir/CMIP/INM/INM-CM5-0/historical/r1i1p1f1/day/tasmax/v1.1.zarr"'
> d <- read_mdim(dsn)
Error: file not found
In addition: Warning message:
In CPL_read_mdim(file, array_name, options, offset, count, step,  :
GDAL Error 4: `ZARR:"/vsicurl/https://rhgeuwest.blob.core.windows.net/cil-gdpcir/CMIP/INM/INM-CM5-0/historical/r1i1p1f1/day/tasmax/v1.1.zarr"' does not exist in the file system, and is not recognized as a supported dataset name.

I also tested without the 'ZARR:' prefix and got the same error.

I'm wondering if setting credentials through environment variables is required, as documented here.
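For example, something like this might work (a sketch based on my reading of the GDAL /vsiaz/ documentation; the AZURE_STORAGE_ACCOUNT / AZURE_STORAGE_SAS_TOKEN option names come from those docs, I haven't confirmed them here, and the token value is a placeholder):

```python
import os

# Hedged sketch: GDAL's /vsiaz/ filesystem reads Azure credentials from
# configuration options / environment variables, so a Planetary Computer SAS
# token could be supplied out-of-band instead of in the URL.
os.environ["AZURE_STORAGE_ACCOUNT"] = "rhgeuwest"
os.environ["AZURE_STORAGE_SAS_TOKEN"] = "st=...&se=...&sp=r&sig=..."  # placeholder

# With credentials in the environment, the DSN uses /vsiaz/<container>/<path>
dsn = (
    'ZARR:"/vsiaz/cil-gdpcir/'
    'CMIP/INM/INM-CM5-0/historical/r1i1p1f1/day/tasmax/v1.1.zarr"'
)
print(dsn)
```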

TomAugspurger commented 1 month ago

Thanks for the tip about the ZARR prefix. I think it might be a conflict with how we do auth (as a query parameter) and how GDAL Zarr discovers the files under that prefix.

(gdal) taugspurger@DESKTOP-D37TN6N:~$ export TOKEN=$(curl --silent "https://planetarycomputer.microsoft.com/api/sas/v1/token/cil-gdpcir-cc0" | jq -r .token)
(gdal) taugspurger@DESKTOP-D37TN6N:~$ gdalmdiminfo 'ZARR:"/vsicurl/https://rhgeuwest.blob.core.windows.net/cil-gdpcir/ScenarioMIP/BCC/BCC-CSM2-MR/ssp126/r1i1p1f1/day/pr/v1.1.zarr?"'$TOKEN'""'
Warning 1: HTTP response code on https://rhgeuwest.blob.core.windows.net/cil-gdpcir/ScenarioMIP/BCC/BCC-CSM2-MR/ssp126/r1i1p1f1/day/pr/v1.1.zarr?st=2024-06-30T18%3A51%3A38Z&se=2024-07-01T19%3A36%3A38Z&sp=r...&sig=...%3D/.zarray: 403

Notice that it's looking for the .zarray file at https://.../v1.1.zarr?$TOKEN/.zarray instead of https://.../v1.1.zarr/.zarray?$TOKEN. I'm not sure if there's another way to tell GDAL what query parameters to use.
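Spelled out as plain string manipulation (the token value is a placeholder):

```python
# GDAL appends subpaths such as "/.zarray" to the raw DSN string, so a SAS
# token tacked onto the store URL lands before the subpath instead of after it.
store = (
    "https://rhgeuwest.blob.core.windows.net/cil-gdpcir/"
    "ScenarioMIP/BCC/BCC-CSM2-MR/ssp126/r1i1p1f1/day/pr/v1.1.zarr"
)
token = "st=...&se=...&sp=r&sig=..."  # placeholder SAS token

requested = f"{store}?{token}/.zarray"  # what GDAL ends up fetching -> 403
expected = f"{store}/.zarray?{token}"   # what Blob Storage would accept
print(requested)
print(expected)
```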

laurenkwick commented 1 month ago

Thanks! That definitely explains my trouble accessing the dataset, and it's helping me dive deeper into this rabbit hole...

I was digging around to see if there is a way to tell GDAL which query parameters to use, and just wanted to document a finding here. I'm still trying to get it to work, but I saw that "options can be passed in the filename with the following syntax: /vsicurl?[option_i=val_i&]*url=http://... where each option name and value (including the value of "url") is URL-encoded" (from the /vsicurl/ (http/https/ftp files: random access) docs).
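In Python terms, that syntax would look something like this (just string building to show the URL-encoding; I haven't confirmed that the Zarr driver then appends subpaths correctly with this form):

```python
from urllib.parse import quote

url = (
    "https://rhgeuwest.blob.core.windows.net/cil-gdpcir/"
    "CMIP/INM/INM-CM5-0/historical/r1i1p1f1/day/tasmax/v1.1.zarr"
)
# Per the /vsicurl/ docs, the value of "url" must itself be URL-encoded
dsn = "/vsicurl?url=" + quote(url, safe="")
print(dsn)
```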

A few of the supported options stand out to me, but I'm not quite there with getting them to work.

In this stack exchange post, it looks like they ended up using the href url, instead of the msft:https-url. It's not quite the same workflow as what I have here, but seems promising?

I'll post an update if I get somewhere with this! Just wanted to update where I'm at now.