pangeo-forge / pangeo-forge-recipes

Python library for building Pangeo Forge recipes.
https://pangeo-forge.readthedocs.io/
Apache License 2.0

Fix intake catalog reference.yaml for kerchunked datasets #449

Open rsignell-usgs opened 1 year ago

rsignell-usgs commented 1 year ago

For kerchunked dataset recipes, the currently generated intake catalogs don't work because the OSN endpoint_url is not included. For example, the NWM-2.1-grid1km-LDAS recipe produces:

sources:
  data:
    args:
      chunks: {}
      consolidated: false
      storage_options:
        fo: Pangeo/pangeo-forge/test/pangeo-forge/staged-recipes/recipe-run-1393/NWM-2.1-grid1km-LDAS.zarr/reference.json
        remote_options:
          anon: true
        remote_protocol: s3
        skip_instance_cache: true
        target_options: {}
        target_protocol: s3
      urlpath: reference://
    description: ''
    driver: intake_xarray.xzarr.ZarrSource

but fo cannot be read with target_protocol: s3 on OSN, because no endpoint_url is specified and the s3 client therefore falls back to the default AWS endpoint.

Two solutions:

  1. Keep target_protocol: s3, but specify target_options that include endpoint_url under client_kwargs.
  2. Switch to target_protocol: https, and specify fo with the full https path.

These solutions both work:

Solution 1:

sources:
  data:
    driver: intake_xarray.xzarr.ZarrSource
    description: ''
    args:
      urlpath: "reference://"
      consolidated: false
      storage_options:
        target_options:
          anon: true
          client_kwargs: {'endpoint_url': 'https://ncsa.osn.xsede.org'}
        fo: 's3://Pangeo/pangeo-forge/test/pangeo-forge/staged-recipes/recipe-run-1393/NWM-2.1-grid1km-LDAS.zarr/reference.json'
        remote_options:
          anon: true
        remote_protocol: "s3"

Solution 2:

sources:
  data:
    args:
      chunks: {}
      consolidated: false
      storage_options:
        fo: 'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/test/pangeo-forge/staged-recipes/recipe-run-1393/NWM-2.1-grid1km-LDAS.zarr/reference.json'
        remote_options:
          anon: true
        remote_protocol: s3
        skip_instance_cache: true
        target_options: {}
      urlpath: reference://
    description: ''
    driver: intake_xarray.xzarr.ZarrSource
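
For anyone who wants to try the two configurations outside of intake, here is a minimal sketch that builds the same storage_options as plain Python dictionaries. The helper names `s3_storage_options` and `https_storage_options` are hypothetical (they are not part of pangeo-forge-recipes); the endpoint and key are copied from the catalogs above. As in the Solution 1 YAML, the `s3://` prefix on fo is assumed to carry the protocol for the reference file.

```python
from typing import Any, Dict

# Values copied from the catalogs in this issue.
ENDPOINT_URL = "https://ncsa.osn.xsede.org"
REFERENCE_KEY = (
    "Pangeo/pangeo-forge/test/pangeo-forge/staged-recipes/"
    "recipe-run-1393/NWM-2.1-grid1km-LDAS.zarr/reference.json"
)


def s3_storage_options(key: str, endpoint_url: str) -> Dict[str, Any]:
    """Solution 1 (hypothetical helper): read reference.json via s3 on a custom endpoint."""
    return {
        "fo": f"s3://{key}",
        # target_options apply to the reference file itself (fo).
        "target_options": {
            "anon": True,
            "client_kwargs": {"endpoint_url": endpoint_url},
        },
        # remote_* apply to the data chunks the references point at.
        "remote_protocol": "s3",
        "remote_options": {"anon": True},
    }


def https_storage_options(key: str, endpoint_url: str) -> Dict[str, Any]:
    """Solution 2 (hypothetical helper): read reference.json over plain https."""
    return {
        "fo": f"{endpoint_url.rstrip('/')}/{key}",
        "target_options": {},
        "remote_protocol": "s3",
        "remote_options": {"anon": True},
    }


solution_1 = s3_storage_options(REFERENCE_KEY, ENDPOINT_URL)
solution_2 = https_storage_options(REFERENCE_KEY, ENDPOINT_URL)
```

With xarray, zarr, fsspec and s3fs installed, either dictionary should be usable as `xr.open_dataset("reference://", engine="zarr", backend_kwargs={"consolidated": False, "storage_options": solution_1})`. That call is left out of the sketch because it needs network access.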

The relevant code is at https://github.com/pangeo-forge/pangeo-forge-recipes/blob/master/pangeo_forge_recipes/recipes/reference_hdf_zarr.py#L77-L83
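
I have not reproduced the linked lines here, so the following is only a rough sketch of what a fix along the lines of Solution 1 might look like, written against an already-parsed catalog dictionary rather than the recipe code itself. The function name `add_endpoint_to_catalog` is hypothetical, and setting `anon: true` on target_options is an assumption that mirrors Solution 1.

```python
import copy
from typing import Any, Dict


def add_endpoint_to_catalog(catalog: Dict[str, Any], endpoint_url: str) -> Dict[str, Any]:
    """Return a copy of a parsed catalog in which every s3 reference source
    carries endpoint_url in target_options (hypothetical helper)."""
    patched = copy.deepcopy(catalog)  # leave the caller's catalog untouched
    for source in patched.get("sources", {}).values():
        opts = source.get("args", {}).get("storage_options")
        if not opts or "fo" not in opts:
            continue  # not a reference-style source
        if opts.get("target_protocol") != "s3":
            continue  # only s3 targets need an endpoint
        target = opts.get("target_options") or {}
        target.setdefault("anon", True)  # assumption: public bucket, as in Solution 1
        target.setdefault("client_kwargs", {})["endpoint_url"] = endpoint_url
        opts["target_options"] = target
        if "://" not in opts["fo"]:
            opts["fo"] = f"s3://{opts['fo']}"  # make the protocol explicit
    return patched


# The catalog currently generated for the NWM recipe, as a Python dict.
broken = {
    "sources": {
        "data": {
            "args": {
                "chunks": {},
                "consolidated": False,
                "storage_options": {
                    "fo": "Pangeo/pangeo-forge/test/pangeo-forge/staged-recipes/"
                          "recipe-run-1393/NWM-2.1-grid1km-LDAS.zarr/reference.json",
                    "remote_options": {"anon": True},
                    "remote_protocol": "s3",
                    "skip_instance_cache": True,
                    "target_options": {},
                    "target_protocol": "s3",
                },
                "urlpath": "reference://",
            },
            "description": "",
            "driver": "intake_xarray.xzarr.ZarrSource",
        }
    }
}

fixed = add_endpoint_to_catalog(broken, "https://ncsa.osn.xsede.org")
```

The copy is deliberate: the original dictionary stays available for comparison, and the patched one can be dumped back to YAML with whatever serializer the catalog was written with.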

@sharkinsspatial is this something you can fix?

cisaacstern commented 1 year ago

Thanks for spotting this issue and documenting it here, @rsignell-usgs!