pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0
3.62k stars 1.08k forks source link

' ' at end of variable name causes to_netcdf() to crash #8029

Closed 28raining closed 1 year ago

28raining commented 1 year ago

What happened?

If variable name ends in a ' ' (space) then to_netcdf crashes.

In my opinion

This is related to https://github.com/pydata/xarray/issues/7943

Our tool converts csv files to XARRAY. So these kind of errors need to self-heal, otherwise we have to feedback to the user and get them to change the csv file. Which would be frustrating for everyone

What did you expect to happen?

No response

Minimal Complete Verifiable Example

with open('endSpace.csv','w') as f:
  f.write('''PASS ,temperature
PASS, 10
FAIL, 20''')

import pandas as pd
df = pd.read_csv('endSpace.csv')
index = ['temperature']
df = df.set_index(index)
xarr = df.to_xarray()
xarr.to_netcdf("new.nc")

MVCE confirmation

Relevant log output

Traceback (most recent call last):
  File "/Users/will/Documents/GitHub/czdashboard/old/coordNameBug.py", line 11, in <module>
    xarr.to_netcdf("new.nc")
  File "/opt/homebrew/lib/python3.11/site-packages/xarray/core/dataset.py", line 1911, in to_netcdf
    return to_netcdf(  # type: ignore  # mypy cannot resolve the overloads:(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/xarray/backends/api.py", line 1217, in to_netcdf
    dump_to_store(
  File "/opt/homebrew/lib/python3.11/site-packages/xarray/backends/api.py", line 1264, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File "/opt/homebrew/lib/python3.11/site-packages/xarray/backends/common.py", line 271, in store
    self.set_variables(
  File "/opt/homebrew/lib/python3.11/site-packages/xarray/backends/common.py", line 309, in set_variables
    target, source = self.prepare_variable(
                     ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/xarray/backends/netCDF4_.py", line 488, in prepare_variable
    nc4_var = self.ds.createVariable(
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "src/netCDF4/_netCDF4.pyx", line 2945, in netCDF4._netCDF4.Dataset.createVariable
  File "src/netCDF4/_netCDF4.pyx", line 4164, in netCDF4._netCDF4.Variable.__init__
  File "src/netCDF4/_netCDF4.pyx", line 2014, in netCDF4._netCDF4._ensure_nc_success
RuntimeError: NetCDF: Name contains illegal characters

Anything else we need to know?

No response

Environment

commit : 2e218d10984e9919f0296931d92ea851c6a6faf5 python : 3.11.2.final.0 python-bits : 64 OS : Darwin OS-release : 22.5.0 Version : Darwin Kernel Version 22.5.0: Thu Jun 8 22:21:34 PDT 2023; root:xnu-8796.121.3~7/RELEASE_ARM64_T8112 machine : arm64 processor : arm byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8 pandas : 1.5.3 numpy : 1.24.2 pytz : 2023.3 dateutil : 2.8.2 setuptools : 65.6.3 pip : 23.1.1 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : 3.1.2 IPython : None pandas_datareader: None bs4 : None bottleneck : None brotli : None fastparquet : None fsspec : 2023.3.0 gcsfs : None matplotlib : None numba : None numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : None pyreadstat : None pyxlsb : None s3fs : None scipy : 1.10.1 snappy : None sqlalchemy : None tables : None tabulate : None xarray : 2023.2.0 xlrd : None xlwt : None zstandard : None tzdata : None
headtr1ck commented 1 year ago

Thanks for the report.

According to the netcdf standard:

"Names that have trailing space characters are also not permitted."

See: https://docs.unidata.ucar.edu/nug/current/netcdf_data_set_components.html

Maybe you could add a quick cleanup script before writing?

28raining commented 1 year ago

I have done that. I have over 100 variables so to figure it was a trailing space was time consuming.

And as in other ticket, how many of these random gotchas are there.

At least the error message needs to be better. Ideally the tool should just handle it; either to_netcdf cleans up, or to_xarray cleans up

headtr1ck commented 1 year ago

This error is raised by netCDF4, xarray doesn't care about the variable names, they don't even need to be strings.

You could open an issue with netCDF4 to improve the error message. I agree that it should at least contain the invalid name and possibly even the violated rule.

TomNicholas commented 1 year ago

Ideally the tool should just handle it; either to_netcdf cleans up, or to_xarray cleans up

to_xarray shouldn't clean it up because spaces (and forward slashes) are (currently) valid characters in xarray variable names.

to_netcdf probably shouldn't automatically clean it up either because then you will silently write a file with different metadata than specified, which would break round-tripping.

What we could do is raise a better error message (as @headtr1ck says), which ideally would be implemented upstream in netCDF4 but could also be implemented as a check within the corresponding xarray backend, similar to https://github.com/pydata/xarray/pull/7953.

@28raining if you want to open a pull request to add an error message like https://github.com/pydata/xarray/pull/7953 does then I think that would be welcome (others please point out if you see an issue with that).

headtr1ck commented 1 year ago

I totally agree.

In general I would advise against implementing the same checks on xarray side, this just leads to divergent behavior in the future. But in this case, I would say that the netcdf rules are put in stone and additionally we can advice users to try a different backend that allows invalid netcdfs.

ZedThree commented 1 year ago

I've opened a PR upstream to at least make the error messages more helpful: https://github.com/Unidata/netcdf4-python/pull/1268.

Suggestions welcome on further improvements!