intake / intake-xarray

Intake plugin for xarray
https://intake-xarray.readthedocs.io/
BSD 2-Clause "Simplified" License

Trouble reading from on-premise s3 storage #85

Closed cwerner closed 3 years ago

cwerner commented 3 years ago

Hi 👋
[and sorry for bothering the list again - please let me know if usage questions should go elsewhere].

I hit another roadblock using intake-xarray and was wondering whether this is a known problem or whether I'm doing something wrong...

I want to read a NetCDF file from our local S3 server (NetApp). I can do it manually like so:

import s3fs
import xarray as xr
fs = s3fs.S3FileSystem(anon=True, default_fill_cache=False, client_kwargs={"endpoint_url": 'https://s3.imk-ifu.kit.edu:8082'})
fobj = fs.open("ldndcdata/GLOBAL_WISESOIL_S1_LR.nc")
ds = xr.open_dataset(fobj, engine='h5netcdf')
print(ds)

However, with my catalog.yml I cannot seem to load this file. This is currently the intake catalog:

plugins:
  source:
    - module: intake_xarray
sources:
  soil:
    name: 'SOIL'
    description: 'Default soil data for ldndctools (site file generation)'
    driver: netcdf
    parameters:
      res:
        default: 'LR'
        allowed: ['LR', 'MR', 'HR']
        description: 'Resolution (LR, MR or HR).'
        type: str
    args:
      urlpath: 'simplecache::s3://ldndcdata/GLOBAL_WISESOIL_S1_{{res}}.nc'
      storage_options:
        anon: true
        default_fill_cache: false
        client_kwargs:
          endpoint_url: 'https://s3.imk-ifu.kit.edu:8082'

      chunks: {}
      xarray_kwargs:
        decode_times: false
        engine: h5netcdf

I try to load the data like this:

import intake
cat = intake.open_catalog('catalog.yml')
ds = cat.soil(res='LR').read()

This results in a huge traceback (NoCredentialsError). I wonder if intake parses any environment variables that interfere or if the catalog is not correct?

martindurant commented 3 years ago

I don't know if this is well documented, but when the URL chains several protocols (here simplecache:: in front of s3://), storage_options must be split per protocol, keyed by protocol name, so the block would look like

      storage_options:
        s3:
          anon: true
          default_fill_cache: false
          client_kwargs:
            endpoint_url: 'https://s3.imk-ifu.kit.edu:8082'

and this would allow you to specify options for the simplecache part separately, if you want to.
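
For comparison, the same per-protocol layout can be sketched directly against fsspec, outside of intake. This is a minimal sketch, not a tested recipe for this server: the `s3` values are copied from the catalog above, while the `simplecache` entry with its `cache_storage` path and the helper name `open_soil` are assumptions added purely for illustration. The function needs fsspec, s3fs, xarray and h5netcdf installed, plus network access to the endpoint.

```python
# Per-protocol options for the chained URL "simplecache::s3://...".
ENDPOINT = "https://s3.imk-ifu.kit.edu:8082"

storage_options = {
    # Options for the s3 layer, copied from the catalog in this thread.
    "s3": {
        "anon": True,
        "default_fill_cache": False,
        "client_kwargs": {"endpoint_url": ENDPOINT},
    },
    # Assumption for illustration: a local cache directory for simplecache.
    "simplecache": {"cache_storage": "/tmp/soil-cache"},
}


def open_soil(res="LR"):
    """Hypothetical helper: open the soil file through the chained URL."""
    # Imported here so the options above can be inspected without the
    # optional dependencies being installed.
    import fsspec
    import xarray as xr

    url = f"simplecache::s3://ldndcdata/GLOBAL_WISESOIL_S1_{res}.nc"
    # Each top-level keyword is routed to the protocol of the same name.
    with fsspec.open(url, **storage_options) as fobj:
        # Load into memory before the file object is closed.
        return xr.open_dataset(fobj, engine="h5netcdf").load()
```

Calling `open_soil("LR")` should then behave like the manual s3fs snippet at the top of the thread, with the downloaded file kept in the cache directory.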

cwerner commented 3 years ago

Oh wow! So easy... and totally not what I'd have expected to be the cause. Thanks a bunch!

martindurant commented 3 years ago

Would be glad to see examples like this added to the docs or examples folder - or even mentioned in the intake or fsspec pages.