pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

XArray does not support Latin characters in netCDF file names #9282

Open devos0024 opened 1 month ago

devos0024 commented 1 month ago

What happened?

When you try to open an existing netCDF file named "bépo.nc", for example, you get the following error:

FileNotFoundError: [Errno 2] No such file or directory: 'E:\\temp\\bépo.nc'

What did you expect to happen?

Internally, netCDF transforms the file path into an array of bytes. This transformation can be configured by setting the appropriate encoding in the netCDF4.Dataset constructor. What I expect is to be able to pass this encoding to netCDF when the file is opened with xarray.

Perhaps the solution would be to send the encoding along with the backend_kwargs parameter:

dataset = xr.open_dataset(r"E:\temp\bépo.nc", mode="r", engine="netcdf4", backend_kwargs={'encoding': 'latin-1'})

Transmitting the encoding would also be necessary in the to_netcdf() function.
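
To illustrate why the encoding matters, here is a minimal standard-library sketch (not netCDF4's actual code) showing that the same file name becomes different byte sequences depending on the encoding chosen. The name `encoded` is only for illustration.

```python
# The file name from the report, written with an escape so the source
# file's own encoding cannot affect the result.
name = "b\u00e9po.nc"  # "bépo.nc"

# Encode the same name with three encodings and compare the raw bytes.
encoded = {enc: name.encode(enc) for enc in ("utf-8", "latin-1", "cp1252")}

for enc, raw in encoded.items():
    print(f"{enc:8} -> {raw!r}")
```

UTF-8 uses two bytes for "é" while Latin-1 and CP1252 use one, so a layer that expects one byte sequence will not recognize the other as the same file name.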

Minimal Complete Verifiable Example

import os
import tempfile as tmp

import netCDF4 as nc
import xarray as xr

if __name__ == "__main__":
    with tmp.TemporaryDirectory() as temp_dir:
        # Creating a netCDF file
        tmp_folder = os.path.join(temp_dir, "bèpo")
        os.mkdir(tmp_folder)
        file_path = os.path.join(tmp_folder, "bépo.nc")
        print(f"Try to created {file_path}")
        with nc.Dataset(file_path, mode="w", encoding="Latin-1") as ds:
            print(f"{file_path} successfully created")

        # Open with netCDF
        with nc.Dataset(file_path, mode="r", encoding="Latin-1"):
            print(f"{file_path} successfully opened with netCDF")

        # Open with xarray
        with xr.open_dataset(file_path, mode="r", engine="netcdf4") as xr_ds:
            print(f"{file_path} successfully opened")

Relevant log output

E:\temp\bépo.nc successfully created
E:\temp\bépo.nc successfully opened with netCDF
Traceback (most recent call last):
  File "E:\Tools\Anaconda3\envs\myenv\lib\site-packages\xarray\backends\file_manager.py", line 211, in _acquire_with_cache_info
    file = self._cache[self._key]
  File "E:\Tools\Anaconda3\envs\myenv\lib\site-packages\xarray\backends\lru_cache.py", line 56, in __getitem__
    value = self._cache[key]
KeyError: [<class 'netCDF4._netCDF4.Dataset'>, ('E:\\temp\\bépo.nc',), 'r', (('clobber', True), ('diskless', False), ('format', 'NETCDF4'), ('persist', False)), '0916cd5f-58f9-49df-8e74-6c9b109a77cf']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "e:\temp\test_open_dataset.py", line 19, in <module>
    with xr.open_dataset(file_path, mode="r", engine="netcdf4") as xr_ds:
  File "E:\Tools\Anaconda3\envs\myenv\lib\site-packages\xarray\backends\api.py", line 571, in open_dataset
    backend_ds = backend.open_dataset(
  File "E:\Tools\Anaconda3\envs\myenv\lib\site-packages\xarray\backends\netCDF4_.py", line 645, in open_dataset
    store = NetCDF4DataStore.open(
  File "E:\Tools\Anaconda3\envs\myenv\lib\site-packages\xarray\backends\netCDF4_.py", line 408, in open
    return cls(manager, group=group, mode=mode, lock=lock, autoclose=autoclose)
  File "E:\Tools\Anaconda3\envs\myenv\lib\site-packages\xarray\backends\netCDF4_.py", line 355, in __init__
    self.format = self.ds.data_model
  File "E:\Tools\Anaconda3\envs\myenv\lib\site-packages\xarray\backends\netCDF4_.py", line 417, in ds
    return self._acquire()
  File "E:\Tools\Anaconda3\envs\myenv\lib\site-packages\xarray\backends\netCDF4_.py", line 411, in _acquire
    with self._manager.acquire_context(needs_lock) as root:
  File "E:\Tools\Anaconda3\envs\myenv\lib\contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "E:\Tools\Anaconda3\envs\myenv\lib\site-packages\xarray\backends\file_manager.py", line 199, in acquire_context
    file, cached = self._acquire_with_cache_info(needs_lock)
  File "E:\Tools\Anaconda3\envs\myenv\lib\site-packages\xarray\backends\file_manager.py", line 217, in _acquire_with_cache_info
    file = self._opener(*self._args, **kwargs)
  File "src\\netCDF4\\_netCDF4.pyx", line 2469, in netCDF4._netCDF4.Dataset.__init__
  File "src\\netCDF4\\_netCDF4.pyx", line 2028, in netCDF4._netCDF4._ensure_nc_success
FileNotFoundError: [Errno 2] No such file or directory: 'E:\\temp\\bépo.nc'

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.13 | packaged by conda-forge | (main, Oct 26 2023, 18:01:37) [MSC v.1935 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('fr_FR', 'cp1252')
libhdf5: 1.14.0
libnetcdf: 4.9.1
xarray: 2024.6.0
pandas: 2.2.2
numpy: 1.26.4
scipy: 1.13.1
netCDF4: 1.6.3
pydap: None
h5netcdf: None
h5py: 3.9.0
zarr: None
cftime: 1.6.4
nc_time_axis: None
iris: None
bottleneck: 1.4.0
dask: 2024.7.1
distributed: 2024.7.1
matplotlib: 3.9.1
cartopy: None
seaborn: 0.13.2
numbagg: None
fsspec: 2024.6.1
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 71.0.4
pip: 24.0
conda: 24.5.0
pytest: 8.3.2
mypy: None
IPython: None
sphinx: 7.4.7

welcome[bot] commented 1 month ago

Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!

max-sixty commented 1 month ago

I get an error in netCDF — any ideas why yours succeeds in nc? Is it Windows vs Mac?

(also note I needed to change the MCVE path, worth updating the example)

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[3], line 11
      7 with tmp.TemporaryDirectory() as temp_dir:
      8
      9     # Creating a netCDF file
     10     file_path = Path(temp_dir) / "bépo.nc"
---> 11     with nc.Dataset(file_path, mode="w", encoding="Latin-1") as ds:
     12         print(f"{file_path} successfully created")
     14     # Open with netCDF

File src/netCDF4/_netCDF4.pyx:2469, in netCDF4._netCDF4.Dataset.__init__()

File src/netCDF4/_netCDF4.pyx:2027, in netCDF4._netCDF4._ensure_nc_success()

INSTALLED VERSIONS
------------------
commit: 42ed6d30e81dce5b9922ac82f76c5b3cd748b19e
python: 3.11.9 (main, Apr 2 2024, 08:25:04) [Clang 15.0.0 (clang-1500.3.9.4)]
python-bits: 64
OS: Darwin
OS-release: 23.5.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: en_US.UTF-8
LANG: None
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.3-development
xarray: 2024.3.1.dev31+gb9163a6f
pandas: 2.2.2
numpy: 1.26.4
scipy: 1.13.0
netCDF4: 1.6.5
pydap: None
h5netcdf: 1.3.0
h5py: 3.11.0
zarr: 2.17.2
cftime: 1.6.3
nc_time_axis: 1.4.1
iris: None
bottleneck: 1.3.8
dask: 2024.4.1
distributed: 2024.4.1
matplotlib: 3.8.4
cartopy: None
seaborn: 0.13.2
numbagg: 0.8.1
fsspec: 2024.3.1
cupy: None
pint: 0.23
sparse: None
flox: None
numpy_groupies: 0.10.2
setuptools: 69.2.0
pip: 24.0
conda: None
pytest: 8.1.1
mypy: 1.8.0
IPython: 8.24.0
sphinx: None

devos0024 commented 1 month ago

Yes, there is an error in the path. Sorry about that.

What I see is that Windows can't manage without the encoding parameter: otherwise, the file created on the file system is named bépo.nc.

On Linux, you don't need to specify it; the file is created with the correct name. Creating a netCDF file with Latin-1 encoding doesn't raise an error there, but the file ends up as b?po.nc on the file system.
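
Both garbled names are consistent with the name's bytes being written under one encoding and read back under another. The sketch below only uses the standard library and is an assumption about the mechanism, not a trace of what netCDF or the operating system actually does. The variable names are illustrative.

```python
name = "b\u00e9po.nc"  # "bépo.nc"

# UTF-8 bytes interpreted as CP1252: each of the two bytes of "é"
# becomes its own character, giving the name reported on Windows.
as_seen_on_windows = name.encode("utf-8").decode("cp1252")

# Latin-1 bytes interpreted as UTF-8: the single byte 0xE9 is not valid
# UTF-8 on its own, so it is shown as a replacement character, which
# many tools display as "?".
as_seen_on_linux = name.encode("latin-1").decode("utf-8", errors="replace")

print(as_seen_on_windows)
print(as_seen_on_linux)
```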

I don't have a Mac, so I can't do the test. But it's possible that encoding on Mac is managed differently from Windows and Linux...

max-sixty commented 1 month ago

> Yes, there is an error in the path. Sorry about that.

Is the example you posted correct? If not, could you update it?

devos0024 commented 1 month ago

Example fixed (using temporary folder from context manager)

max-sixty commented 1 month ago

Thanks.

FYI I still get this error — I'm guessing that's down to it being a Mac vs Linux issue...

Try to created /var/folders/wf/s6ycxvvs4ln8qsdbfx40hnc40000gn/T/tmp8rywv9ge/bèpo/bépo.nc
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[1], line 14
     12 file_path = os.path.join(tmp_folder, "bépo.nc")
     13 print(f"Try to created {file_path}")
---> 14 with nc.Dataset(file_path, mode="w", encoding="Latin-1") as ds:
     15     print(f"{file_path} successfully created")
     17 # Open with netCDF

File src/netCDF4/_netCDF4.pyx:2469, in netCDF4._netCDF4.Dataset.__init__()

File src/netCDF4/_netCDF4.pyx:2027, in netCDF4._netCDF4._ensure_nc_success()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 62: invalid continuation byte
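
The byte and position in this message match what you get when the printed path is encoded as Latin-1 and then decoded as UTF-8. The following is a standard-library sketch of that round trip. It is an assumption about where the error comes from, not a trace of netCDF4 internals.

```python
# Path printed in the output above, with the accented letters escaped.
path = (
    "/var/folders/wf/s6ycxvvs4ln8qsdbfx40hnc40000gn/T/tmp8rywv9ge/"
    "b\u00e8po/b\u00e9po.nc"
)

# Encode with Latin-1 (as requested via encoding="Latin-1") ...
raw = path.encode("latin-1")

# ... then try to read the bytes back as UTF-8.
error = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc

print(error)
print(error.start, hex(raw[error.start]))
```

The first offending byte is the "è" of the folder name, which Latin-1 stores as the single byte 0xe8.
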
devos0024 commented 3 weeks ago

This test demonstrates the issue on a Windows platform configured with the CP1252 code page. Reproducing it requires such a Windows setup.

dcherian commented 3 weeks ago

Forwarding encoding to netCDF4 seems like a good idea in general. Though since that clashes with an Xarray kwarg, perhaps rename it to filename_encoding at the Xarray level.
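
A rough sketch of what that renaming could look like, assuming a hypothetical helper `netcdf4_open_kwargs` that is not part of xarray. It accepts `filename_encoding` at the xarray level and forwards it as the `encoding` keyword that `netCDF4.Dataset` takes in the example above.

```python
def netcdf4_open_kwargs(mode="r", filename_encoding=None, **other):
    """Build keyword arguments for netCDF4.Dataset (hypothetical helper).

    `filename_encoding` is the proposed xarray-level name; it is passed on
    as `encoding`, the keyword netCDF4.Dataset uses for the file name.
    """
    kwargs = dict(other, mode=mode)
    if filename_encoding is not None:
        kwargs["encoding"] = filename_encoding
    return kwargs


# Example: the arguments that would be forwarded for the file in this issue.
print(netcdf4_open_kwargs(mode="r", filename_encoding="latin-1"))
```

Leaving `encoding` out when `filename_encoding` is not given keeps netCDF4's default behaviour unchanged.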