intake / intake

Intake is a lightweight package for finding, investigating, loading and disseminating data.
https://intake.readthedocs.io/
BSD 2-Clause "Simplified" License

Error reading an `xarray.Dataset` #850

Open NathanCummings opened 2 weeks ago

NathanCummings commented 2 weeks ago

I'm trying to define a catalog with an Xarray reader for my Zarr files using intake v2. Looking at the available readers, I think the following should work, but I am getting the exception below.

import intake
import xarray as xr

reader = intake.reader_from_call(
    "xr.open_dataset('https://s3.echo.stfc.ac.uk/mast/level1/shots/30420.zarr/amc', engine='zarr')"
)

This is a public bucket, and the data are licensed under CC-BY-SA, so this url is fine to use for testing.

Traceback (most recent call last):
  File "/Users/nathan/fair-mast/testing/test.py", line 5, in <module>
    reader = intake.reader_from_call(
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/nathan/fair-mast/testing/.venv/lib/python3.12/site-packages/intake/readers/readers.py", line 1726, in reader_from_call
    datacls = datacls(**data_kw)
              ^^^^^^^^^^^^^^^^^^
TypeError: Service.__init__() got an unexpected keyword argument 'storage_options'

I was working through the debugger, following readers.reader_from_call() and into datatypes.recommend(), but I couldn't follow well enough to be sure where things are going wrong.

martindurant commented 2 weeks ago

Intake doesn't seem to be clever enough to guess that the URL 'https://s3.echo.stfc.ac.uk/mast/level1/shots/30420.zarr/amc' is zarr, even though the xarray and engine context make this clear. Currently, the pattern we follow is to guess the file type first, and then check whether the function called belongs to one of the readers that can act on that type. This only works for file types.

In this case, the function to call is clear, and we know which readers can produce that kind of data:

[_ for _ in intake.readers.utils.subclasses(intake.BaseReader) if "xarray:Dataset" == _.output_instance]

or which readers use that exact function:

[_ for _ in intake.readers.utils.subclasses(intake.BaseReader) if "xarray:open_dataset" == _.func or "xarray:open_dataset" in _.other_funcs]

so it really should be possible to guess this case too.
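To illustrate the idea behind the two lookups above, here is a toy model in plain Python. It is only a sketch: `ToyBaseReader`, `ToyXarrayReader`, `ToyCSVReader` and `readers_for_function` are hypothetical names, and the attribute values are made up for the example. Only the attribute names (`func`, `other_funcs`, `output_instance`) mirror the ones used in the snippets above.

```python
class ToyBaseReader:
    """Hypothetical stand-in for a reader base class."""
    func = None             # "module:function" string for the main loader
    other_funcs = set()     # alternative functions the reader also wraps
    output_instance = None  # "module:Class" string for what the reader returns


class ToyXarrayReader(ToyBaseReader):
    # Values below are assumptions chosen for illustration only.
    func = "xarray:open_dataset"
    other_funcs = {"xarray:open_zarr"}
    output_instance = "xarray:Dataset"


class ToyCSVReader(ToyBaseReader):
    func = "pandas:read_csv"
    output_instance = "pandas:DataFrame"


def subclasses(cls):
    """Return every direct and indirect subclass of cls."""
    found = set()
    for sub in cls.__subclasses__():
        found.add(sub)
        found |= subclasses(sub)
    return found


def readers_for_function(name):
    """Find readers whose main or alternative function matches name."""
    matches = [
        r for r in subclasses(ToyBaseReader)
        if r.func == name or name in r.other_funcs
    ]
    return sorted(matches, key=lambda r: r.__name__)


print([r.__name__ for r in readers_for_function("xarray:open_dataset")])
```

With a lookup like this, the function named in the call is enough to pick a reader, without first guessing a file type from the URL.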

Of course, you can still construct the reader explicitly:

intake.readers.readers.XArrayDatasetReader(intake.datatypes.Zarr("https://s3.echo.stfc.ac.uk/mast/level1/shots/30420.zarr/amc"), engine="zarr")

Note to self: this should still not be an exception, though; either the recommender should only test for file-like types, or it should not pass storage_options when it's not appropriate.
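One way the second option could look, as a rough sketch and not Intake's actual code: filter the keyword arguments against the constructor signature before instantiating the datatype. `ToyService`, `ToyFileData` and `accepted_kwargs` are hypothetical names introduced for this example only.

```python
import inspect


class ToyService:
    """Hypothetical datatype whose constructor takes only a URL."""

    def __init__(self, url):
        self.url = url


class ToyFileData:
    """Hypothetical file-like datatype that does accept storage_options."""

    def __init__(self, url, storage_options=None):
        self.url = url
        self.storage_options = storage_options or {}


def accepted_kwargs(cls, kwargs):
    """Keep only the kwargs that cls.__init__ can actually receive."""
    params = inspect.signature(cls.__init__).parameters
    # A **kwargs parameter means the constructor accepts everything.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)
    return {k: v for k, v in kwargs.items() if k in params}


data_kw = {
    "url": "https://example.com/data.zarr",  # placeholder URL
    "storage_options": {"anon": True},
}

# ToyService(**data_kw) would raise TypeError, as in the traceback above.
service = ToyService(**accepted_kwargs(ToyService, data_kw))
filedata = ToyFileData(**accepted_kwargs(ToyFileData, data_kw))
print(service.url, filedata.storage_options)
```

Silently dropping arguments can hide mistakes, so restricting the recommender to file-like types may still be the cleaner fix.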

NathanCummings commented 2 weeks ago

Cool, thank you.

Using:

intake.readers.readers.XArrayDatasetReader(intake.datatypes.Zarr("https://s3.echo.stfc.ac.uk/mast/level1/shots/30420.zarr/amc"), engine="zarr")

worked.

As an extra tip, it took me a beat to realise that I needed to add chunks="auto" to make xarray use Dask arrays for the variables, so:

reader = intake.readers.readers.XArrayDatasetReader(
    intake.datatypes.Zarr(
        "https://s3.echo.stfc.ac.uk/mast/level1/shots/30420.zarr/amc"
    ),
    engine="zarr",
    chunks="auto",  # needed so that xarray loads the variables as Dask arrays
)

does what I want.

martindurant commented 2 weeks ago

I'm somewhat surprised that "auto" is not the default. Intake is, of course, mostly passing arguments through to the library that actually does the reading.