pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

mfdataset - ds.encoding["source"] to retrieve filename not valid key #9142

Closed lbesnard closed 5 days ago

lbesnard commented 2 weeks ago

What happened?

Looking at the doc https://docs.xarray.dev/en/stable/generated/xarray.open_mfdataset.html

preprocess (callable(), optional) – If provided, call this function on each dataset prior to concatenation. You can find the file-name from which each dataset was loaded in ds.encoding["source"].

I expected to be able to use ds.encoding["source"] in my preprocess function to retrieve the filename. However, I get KeyError: 'source' (see "Relevant log output").

What did you expect to happen?

I expected the doc to be correct, unless I missed something trivial.

Minimal Complete Verifiable Example

import xarray as xr


def preprocess_xarray_no_class(ds):
    # Raises KeyError when xarray did not record a source path for this dataset
    filename = ds.encoding["source"]
    # Add a new filename variable with the time dimension
    # (a one-element list only fits when each file holds a single time step)
    ds = ds.assign(filename=(("time",), [filename]))
    return ds


ds = xr.open_mfdataset(
    fileset,
    preprocess=preprocess_xarray_no_class,
    engine="h5netcdf",
    concat_characters=True,
    mask_and_scale=True,
    decode_cf=True,
    decode_times=True,
    use_cftime=True,
    parallel=True,
    decode_coords=True,
    compat="equals",
)

Relevant log output

...
      1 def preprocess_xarray_no_class(ds):
----> 2     filename = ds.encoding["source"]
      3     ds = ds.assign(
      4         filename=("time",), [filename])
      5     )  # add new filename variable with time dimension

KeyError: 'source'

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 6.5.0-1023-oem
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: en_IE.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.2
libnetcdf: 4.9.3-development
xarray: 2024.6.0
pandas: 2.2.2
numpy: 1.26.4
scipy: 1.13.1
netCDF4: 1.7.1
pydap: None
h5netcdf: 1.3.0
h5py: 3.11.0
zarr: 2.18.2
cftime: 1.6.4
nc_time_axis: 1.4.1
iris: None
bottleneck: 1.3.8
dask: 2024.6.0
distributed: 2024.6.0
matplotlib: 3.9.0
cartopy: None
seaborn: 0.13.2
numbagg: 0.8.1
fsspec: 2024.6.0
cupy: None
pint: None
sparse: None
flox: 0.9.7
numpy_groupies: 0.11.1
setuptools: 70.0.0
pip: 24.0
conda: None
pytest: 8.2.2
mypy: 1.10.0
IPython: 7.34.0
sphinx: None

keewis commented 2 weeks ago

This depends on what fileset is, unfortunately. If it is a list of strings (file paths or URLs) or a glob, then yes, the missing key might be a bug. If it contains fsspec file objects, though, we need #8923.
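To make the distinction above concrete, here is a minimal sketch of a preprocess function that tolerates a missing "source" key instead of raising. It is not taken from the thread: the function name add_source_filename and the "unknown" placeholder are illustrative assumptions, and it presumes each dataset has a "time" dimension as in the original example.

```python
import numpy as np
import xarray as xr


def add_source_filename(ds: xr.Dataset) -> xr.Dataset:
    """Attach the source name as a variable along "time" (hypothetical helper)."""
    # encoding["source"] may be absent, e.g. when xarray was handed an
    # already-open file object rather than a path string, so fall back
    # to a placeholder instead of raising KeyError.
    source = ds.encoding.get("source", "unknown")
    # Repeat the name once per time step so the shape matches the dimension.
    names = np.full(ds.sizes["time"], source, dtype=object)
    return ds.assign(filename=("time", names))
```

Passing this as preprocess=add_source_filename would avoid the KeyError, but the filename variable only carries real information when xarray was able to record a source path.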

lbesnard commented 1 week ago

fileset is a list of file objects opened through s3fs:

import s3fs

# bucket_name and object_keys are defined elsewhere (not shown here)
s3_fs = s3fs.S3FileSystem(anon=True)
remote_files = [f"s3://{bucket_name}/{key}" for key in object_keys]
fileset = [s3_fs.open(file) for file in remote_files]

keewis commented 1 week ago

then #8923 will fix this!
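Until such a fix is available, one possible workaround is to carry the paths yourself rather than rely on ds.encoding["source"]. The sketch below is an assumption, not something proposed in the thread: tag_and_concat is a hypothetical helper that takes datasets that are already open, together with the matching list of paths, and labels every time step before concatenating.

```python
import numpy as np
import xarray as xr


def tag_and_concat(datasets, paths, dim="time"):
    """Label each dataset with its path along `dim`, then concatenate.

    Hypothetical helper: `datasets` and `paths` must be in the same order.
    """
    if len(datasets) != len(paths):
        raise ValueError("datasets and paths must have the same length")
    tagged = []
    for ds, path in zip(datasets, paths):
        # One copy of the path per step along `dim`
        names = np.full(ds.sizes[dim], path, dtype=object)
        tagged.append(ds.assign(filename=(dim, names)))
    return xr.concat(tagged, dim=dim)
```

With the fileset from the comment above, the datasets could be opened one at a time, for example with xr.open_dataset(f, engine="h5netcdf") for each f in fileset, and then passed in together with remote_files. This gives up the parallel opening that open_mfdataset provides.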