intake / intake-xarray

Intake plugin for xarray
https://intake-xarray.readthedocs.io/
BSD 2-Clause "Simplified" License

`intake_xarray` does not lazy read metadata from files #137

Open kadykov opened 11 months ago

kadykov commented 11 months ago

Entries powered by the intake_xarray driver do not lazily read metadata from the files.

# %%
import intake
import xarray as xr

ds = xr.Dataset(
    {
        "test_var": [0],
    },
    attrs={"xarray_metadata": "The metadata in the xarray file"},
)
ds.to_netcdf("test_metadata.nc")
ds.to_zarr("test_metadata.zarr", mode="w")
# %%
catalog_content = """sources:
  netcdf:
    driver: netcdf
    args:
      urlpath: '{{ CATALOG_DIR }}/test_metadata.nc'
      metadata:
        catalog_metadata: The metadata in the catalog entry
  zarr_intake_xarray:
    description: zarr archive read by intake_xarray
    driver: zarr
    args:
      urlpath: '{{ CATALOG_DIR }}/test_metadata.zarr'
      metadata:
        catalog_metadata: The metadata in the catalog entry
  zarr_intake:
    description: zarr archive read by intake
    driver: zarr_cat
    args:
      urlpath: '{{ CATALOG_DIR }}/test_metadata.zarr'
      metadata:
        catalog_metadata: The metadata in the catalog entry
"""

with open("catalog.yml", "w") as f:
    f.write(catalog_content)

cat = intake.open_catalog("catalog.yml")
print(f"{cat.netcdf.metadata = }")
print(f"{cat.zarr_intake_xarray.metadata = }")
print(f"{cat.zarr_intake.metadata = }")

As the output shows, only the entry powered by the intake driver (zarr_cat) includes the field from the zarr file:

cat.netcdf.metadata = {'catalog_metadata': 'The metadata in the catalog entry'}
cat.zarr_intake_xarray.metadata = {'catalog_metadata': 'The metadata in the catalog entry'}
cat.zarr_intake.metadata = {'catalog_metadata': 'The metadata in the catalog entry', 'xarray_metadata': 'The metadata in the xarray file'}

However, after reading the files, the metadata is complete:

cat.netcdf.read()
cat.zarr_intake_xarray.read()

print(f"Netcdf metadata after reading: {cat.netcdf.metadata}")
print(f"Zarr metadata after reading: {cat.zarr_intake_xarray.metadata}")

Output:

Netcdf metadata after reading: {'catalog_metadata': 'The metadata in the catalog entry', 'dims': {'test_var': 1}, 'data_vars': {}, 'coords': ('test_var',), 'xarray_metadata': 'The metadata in the xarray file'}
Zarr metadata after reading: {'catalog_metadata': 'The metadata in the catalog entry', 'dims': {'test_var': 1}, 'data_vars': {}, 'coords': ('test_var',), 'xarray_metadata': 'The metadata in the xarray file'}

OS: Windows 10, Python 3.11.5, intake 0.7.0, intake_xarray 0.7.0, xarray 2023.8.0, zarr 2.16.1

martindurant commented 11 months ago

What do you think the right behaviour should be? Catalog entries are special in Intake (<2.0) in that they get their subentries eagerly, so they have access to the file metadata immediately. Is this what you are getting at?

kadykov commented 11 months ago

I expected cat.netcdf.metadata to also include the metadata from the file, like this: {'catalog_metadata': 'The metadata in the catalog entry', 'xarray_metadata': 'The metadata in the xarray file'}. But currently the xarray_metadata key appears only after reading the whole file by executing cat.netcdf.read().

I think it would be better to have "lazy" metadata reading from files, because the files may also contain useful information... What do you think?

martindurant commented 11 months ago

The .discover() method is meant exactly for this purpose: to get information from the file with a minimum of reads. Its usefulness varies by file type.

Actually, xarray is lazy by default, so even if you do a .read(), you do not load all the data into memory, only enough for xarray to be able to understand the file's layout (typically the attributes and coordinate arrays).
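For reference, the same laziness can be seen with plain xarray, without intake. This is only a sketch: the file name lazy_demo.nc is hypothetical, and it assumes a NetCDF backend (netCDF4, h5netcdf or scipy) is installed:

```python
import os
import tempfile

import xarray as xr

# Write a small file with a global attribute, mirroring the report above.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "lazy_demo.nc")  # hypothetical file name
xr.Dataset(
    {"test_var": ("x", [0, 1, 2])},
    attrs={"xarray_metadata": "The metadata in the xarray file"},
).to_netcdf(path)

# open_dataset() does not load the data variables into memory;
# their values are fetched on access or on an explicit .load().
with xr.open_dataset(path) as ds:
    file_attrs = dict(ds.attrs)  # global attributes, available right away
    dims = dict(ds.sizes)        # dimension sizes, also available right away

print(file_attrs)
print(dims)
```

The attributes and dimension sizes are available as soon as the file is opened, which is the information the metadata field would need.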