leap-stc / climsim_feedstock


What is up with the climsim files (netcdf3 or not)? #2

Open jbusecke opened 3 months ago

jbusecke commented 3 months ago

So I stumbled upon Charles' code here and wanted to formalize this a bit more.

I would love to get rid of the copy_to_local step, but there currently seems to be no way!
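
For context, the recipe step in question looks roughly like this (a sketch, assuming pangeo-forge-recipes' Beam transforms; `pattern` and `cache_target` are placeholders for the feedstock's actual file pattern and cache location):

import apache_beam as beam
from pangeo_forge_recipes.transforms import OpenURLWithFSSpec, OpenWithXarray

# Minimal sketch of the relevant pipeline stage; `pattern` and `cache_target`
# are placeholders, not the feedstock's real names.
datasets = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec(cache=cache_target)
    | OpenWithXarray(copy_to_local=True)  # <- the step I would love to drop
)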

So I took a look at some of the cached files:

import fsspec
import xarray as xr

# this cache path is going to get wiped eventually, but any file under
# https://huggingface.co/datasets/LEAP/ClimSim_low-res/tree/main/train should work to reproduce
path = 'gs://leap-scratch/jbusecke/climsim_feedstock/cache/4cb51c5b7e05c6f2661474eca4281969-https_huggingface.co_datasets_leap_climsim_low-res_resolve_main_train_0001-02_e3sm-mmf.mli.0001-02-01-00000.nc'

with fsspec.open(path, mode='rb') as f:
    ds = xr.open_dataset(f)
ds

gives

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[23], line 2
      1 with fsspec.open(path, mode='rb') as f:
----> 2     ds = xr.open_dataset(f)

File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/backends/api.py:571, in open_dataset(filename_or_obj, engine, ...)
--> 571 backend_ds = backend.open_dataset(filename_or_obj, drop_variables=drop_variables, **decoders, **kwargs)

File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/backends/scipy_.py:326, in ScipyBackendEntrypoint.open_dataset(...)
--> 326 store = ScipyDataStore(filename_or_obj, mode=mode, format=format, group=group, mmap=mmap, lock=lock)

File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/backends/scipy_.py:178, in ScipyDataStore.__init__(...)
--> 178 scipy_dataset = _open_scipy_netcdf(filename_or_obj, mode=mode, mmap=mmap, version=version)

File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/backends/scipy_.py:126, in _open_scipy_netcdf(filename, mode, mmap, version)
--> 126 return scipy.io.netcdf_file(filename, mode=mode, mmap=mmap, version=version)

File /srv/conda/envs/notebook/lib/python3.11/site-packages/scipy/io/_netcdf.py:279, in netcdf_file.__init__(...)
--> 279 self._read()

File /srv/conda/envs/notebook/lib/python3.11/site-packages/scipy/io/_netcdf.py:610, in netcdf_file._read(self)
--> 610 self._read_dim_array()

File /srv/conda/envs/notebook/lib/python3.11/site-packages/scipy/io/_netcdf.py:625, in netcdf_file._read_dim_array(self)
--> 625 length = self._unpack_int() or None  # None for record dimension

File /srv/conda/envs/notebook/lib/python3.11/site-packages/scipy/io/_netcdf.py:786, in netcdf_file._unpack_int(self)
--> 786 return int(frombuffer(self.fp.read(4), '>i')[0])

IndexError: index 0 is out of bounds for axis 0 with size 0

I can reproduce this locally with engine='scipy', but with engine='netcdf4' (the default) it just works. This is super weird.

Is there something broken with fsspec? Or with the files themselves?
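
One quick sanity check is to look at the first bytes of the file against the known netCDF magic numbers (a minimal sketch; note that scipy's reader only understands the version 1 and 2 variants):

import fsspec

# Known netCDF signatures; scipy.io.netcdf_file only reads versions 1 and 2.
MAGIC = {
    b"CDF\x01": "netCDF-3 classic",
    b"CDF\x02": "netCDF-3 64-bit offset",
    b"CDF\x05": "netCDF-3 64-bit data (CDF-5)",
    b"\x89HDF": "netCDF-4 (HDF5)",
}

with fsspec.open(path, mode="rb") as f:
    head = f.read(4)
print(MAGIC.get(head, f"unknown signature: {head!r}"))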

SammyAgrawal commented 2 months ago

Interestingly, I seem to get different engine/error combinations. However, I am trying to load directly from the URL, without any Pangeo Forge wrappers (straight from a hub notebook).

with fsspec.open(url, mode='rb').open() as file:
    # url is 'https://huggingface.co/.../ClimSim_low-res...E3SM-MMF.mli.0002-01-01-00000.nc'
    xr.open_dataset(file, chunks={}, use_cftime=True)

Specifying engine='scipy' yields the same:

IndexError: index 0 is out of bounds for axis 0 with size 0

engine='h5netcdf' gives me

ValueError: b'CDF\x05\x00\x00\x00\x00' is not the signature of a valid netCDF4 file

engine="netcdf4" yields

ValueError: can only read bytes or file-like objects with engine='scipy' or 'h5netcdf'
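
For reference, all three failures can be reproduced in one go with a small probe (a sketch; raw_bytes is assumed to hold the file contents, e.g. from fsspec.open(url, mode='rb').open().read()):

import io
import xarray as xr

# Try each backend on the same in-memory bytes and print how it fails.
for engine in ("scipy", "h5netcdf", "netcdf4"):
    try:
        xr.open_dataset(io.BytesIO(raw_bytes), engine=engine)
    except Exception as e:
        print(f"{engine}: {type(e).__name__}: {e}")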

SammyAgrawal commented 2 months ago

I played around with the huggingface datasets library but was bottlenecked by the fact that it can't parse netCDF files (.nc is not a supported file extension).

So I threw up my hands, used requests to make a GET request, and manually read the bytes.

import io
import requests
import xarray as xr

resp = requests.get(url)  # status code 200
resp.content  # b'CDF\x05\x00\x00\x00\x00 ... '
xr.open_dataset(resp.content)  # IndexError: index 0 is out of bounds for axis 0 with size 0
xr.open_dataset(io.BytesIO(resp.content))  # also gives IndexError (io.BytesIO(resp.content) by itself throws no error)

with open("file.nc", 'wb') as f:
    f.write(resp.content)
xr.open_dataset("file.nc")  # works??

So I guess this echoes copy_to_local being necessary for some strange reason. But to me this indicates that it is not an fsspec error, but rather either an error in the files themselves or in how xarray processes raw binary data versus file pointers?

*Using requests was not necessary: fsspec.open().read() gives me the same byte string, which I can save and use.
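
Given that, the only workaround I currently see for remote bytes is to spill them to a local temporary file so the netcdf4 engine can open a real path (a sketch under that assumption):

import os
import tempfile

import fsspec
import xarray as xr

# Spill the remote bytes to a local temporary file, then let the
# netcdf4 backend open it by path; .load() pulls the data into memory
# before the file is deleted.
with fsspec.open(url, mode="rb") as f:
    data = f.read()

tmp = tempfile.NamedTemporaryFile(suffix=".nc", delete=False)
try:
    tmp.write(data)
    tmp.close()
    ds = xr.open_dataset(tmp.name, engine="netcdf4").load()
finally:
    os.unlink(tmp.name)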

jbusecke commented 1 month ago

I just confirmed that the newly uploaded 'expanded' data (https://huggingface.co/datasets/LEAP/ClimSim_low-res-expanded/tree/main/train/0001-02) (thanks to @zyhu-hu!) does not suffer from these issues. I will try to target these files for virtual Zarr dataset generation (https://github.com/leap-stc/climsim_feedstock/issues/9), but they might also serve as a good comparison case here.
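
A quick way to double-check one of the expanded files over fsspec (a sketch; the exact filename under train/0001-02 is an assumption on my part):

import fsspec
import xarray as xr

# Hypothetical example file; the exact name under train/0001-02 is an assumption.
url = (
    "https://huggingface.co/datasets/LEAP/ClimSim_low-res-expanded/"
    "resolve/main/train/0001-02/E3SM-MMF.mli.0001-02-01-00000.nc"
)
with fsspec.open(url, mode="rb") as f:
    ds = xr.open_dataset(f)  # opens directly, no copy_to_local needed
print(ds)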