pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

MADIS netCDF to Pandas Dataframe: ValueError: iterator is too large #838

Open akrherz opened 8 years ago

akrherz commented 8 years ago

Getting started with xarray 0.7 (Python 2.7, RHEL 7.2, 64-bit), I am having trouble converting a MADIS-produced netCDF file to a pandas DataFrame. Here's the traceback:

Traceback (most recent call last):
  File "test.py", line 3, in <module>
    df = nc.to_dataframe()
  File "/usr/lib/python2.7/site-packages/xarray/core/dataset.py", line 1867, in to_dataframe
    return self._to_dataframe(self.dims)
  File "/usr/lib/python2.7/site-packages/xarray/core/dataset.py", line 1856, in _to_dataframe
    for k in columns]
  File "/usr/lib/python2.7/site-packages/xarray/core/variable.py", line 715, in expand_dims
    expanded_data = ops.broadcast_to(self.data, tmp_shape)
  File "/usr/lib/python2.7/site-packages/xarray/core/ops.py", line 65, in f
    return getattr(eager_module, name)(data, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/numpy/lib/stride_tricks.py", line 115, in broadcast_to
    return _broadcast_to(array, shape, subok=subok, readonly=True)
  File "/usr/lib64/python2.7/site-packages/numpy/lib/stride_tricks.py", line 70, in _broadcast_to
    op_flags=[op_flag], itershape=shape, order='C').itviews[0]
ValueError: iterator is too large

Here's the code:

import xarray as xr
nc = xr.open_dataset('20160430_1600.nc', decode_cf=False)
df = nc.to_dataframe()

Here's the netCDF file

Thanks!

shoyer commented 8 years ago

This is (arguably) a NumPy bug -- the problem is that the to_dataframe() call is trying to create an array with ~8.7e30 elements!

ipdb> shape
(72, 55, 60, 10, 4, 512, 51, 51, 12, 80, 3, 8, 6, 11, 5000, 25, 24, 6277, 24)
ipdb> np.prod(shape)
-8804073483760828416
ipdb> np.prod(np.asarray(shape, dtype=float))
8.6981676921852312e+30

The problem is that these MADIS netCDFs have loads of dimensions, corresponding to strings (and other stuff, if I recall correctly):

<xarray.Dataset>
Dimensions:                (ICcheckNameLen: 72, ICcheckNum: 55, QCcheckNameLen: 60, QCcheckNum: 10, maxHomeWFOlen: 4, maxLDADmessageLen: 512, maxLDADtestLen: 51, maxNameLength: 51, maxProviderIdLen: 12, maxRemark: 80, maxSkyCover: 3, maxSkyLen: 8, maxStaIdLen: 6, maxStaTypeLen: 11, maxStaticIds: 5000, maxWeatherLen: 25, nInventoryBins: 24, recNum: 6277, totalIdLen: 24)

xarray here tries to build a MultiIndex for the DataFrame out of the outer product of all these dimensions. It would be nice to have a better fix here, but it's not immediately obvious to me what that would be.
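
As a stopgap, you can at least fail fast before NumPy overflows; a rough sketch (this is not an xarray API, and the threshold is arbitrary):

import numpy as np

# Estimate the row count to_dataframe() would produce: the outer
# product of every dimension size in the Dataset.
n_rows = np.prod(list(ds.dims.values()), dtype=float)
if n_rows > 1e8:  # arbitrary cutoff
    raise MemoryError("to_dataframe() would create ~%.3g rows" % n_rows)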

mogismog commented 8 years ago

@shoyer: You're right that MADIS netCDF files are (imo) poorly formatted. There is also the issue of xarray.decode_cf() not concatenating the string arrays even after fixing the _FillValue/missing_value conflict (hence the need to pass decode_cf=False when opening the MADIS netCDF file). After looking at the decode_cf code, though, I don't think this is a bug per se (some quick debugging suggested that no variable in this netCDF file gets past this check), though if you feel it may in fact be a bug, I can look into it a bit more.

Unfortunately, this does mean I have to do a lot of "manual cleaning" of the netCDF file before exporting it as a DataFrame, though it is easy enough to write a set of functions to handle it. That said, I can't c/p the exact code (for work-related reasons). I'm not sure how helpful this is, but when working with MADIS netCDF data, I more or less do the following as a workaround:

  1. Open up the MADIS netCDF file, fix the _FillValue and missing_value conflict in the variables.
  2. Drop the variables I don't want (and there is a lot of filler in MADIS netCDF files).
  3. Concatenate the string arrays (e.g. stationId, dataProvider).
  4. Turn into a pandas DataFrame.

Reading over it, that is kind of a draw-the-owl-esque response, though. :/ To fill in a couple of owl feathers, here is a rough sketch of what steps 1 and 3 might look like (variable and attribute names are taken from this particular file; details will vary):
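
import numpy as np
import xarray as xr

ds = xr.open_dataset('20160430_1600.nc', decode_cf=False)

# Step 1: drop one of the two conflicting attributes so decode_cf() can run.
for var in ds.variables.values():
    if '_FillValue' in var.attrs and 'missing_value' in var.attrs:
        del var.attrs['missing_value']
ds = xr.decode_cf(ds, decode_times=False)

# Step 3: join each (recNum, strLen) char array into one string per record,
# stripping null/space padding (decode_cf may leave these as char arrays).
def join_chars(char_var):
    return np.array([b''.join(row).strip(b'\x00 ')
                     for row in char_var.values.astype('S1')])

ds['stationId'] = ('recNum', join_chars(ds['stationId']))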

akrherz commented 8 years ago

I thought of something: could the issue here be the unlimited record dimension?

netcdf \20160430_1600 {
dimensions:
       ....
    recNum = UNLIMITED ; // (2845 currently)
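
For reference, a quick way to list the unlimited dimensions directly with the netCDF4 library (just an inspection sketch):

import netCDF4

# Print every dimension the file marks as UNLIMITED.
with netCDF4.Dataset('20160430_1600.nc') as nc:
    print([name for name, dim in nc.dimensions.items() if dim.isunlimited()])
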
mogismog commented 8 years ago

Redeeming myself (only a little bit) from my previous message here:

@akrherz I was messing around with this a bit, and the following seems to work OK. It gets rid of the unnecessary dimensions, concatenates the string arrays, and turns the result into a pandas DataFrame:

In [1]: import xarray as xr

In [2]: ds = xr.open_dataset('20160430_1600.nc', decode_cf=True, mask_and_scale=False, decode_times=False) # xarray has issues decoding the times, so you'll have to do this in pandas.

In [3]: vars_to_drop = [k for k in ds.variables.iterkeys() if ('recNum' not in ds[k].dims)]

In [4]: ds = ds.drop(vars_to_drop)

In [5]: df = ds.to_dataframe()

In [6]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6277 entries, 0 to 6276
Data columns (total 93 columns):
invTime                  6277 non-null int32
prevRecord               6277 non-null int32
isOverflow               6277 non-null int32
secondsStage1_2          6277 non-null int32
secondsStage3            6277 non-null int32
providerId               6277 non-null object
stationId                6277 non-null object
handbook5Id              6277 non-null object
~snip~

A bit hacky, but it works.
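
Since decode_times=False was passed above, the time columns still hold raw numeric values; a sketch of decoding them in pandas (observationTime and its seconds-since-epoch units are assumptions here -- check the variable's units attribute first):

import pandas as pd

# MADIS time variables are typically "seconds since 1970-01-01 00:00:00 UTC";
# confirm via ds['observationTime'].attrs['units'] before converting.
df['observationTime'] = pd.to_datetime(df['observationTime'], unit='s')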

akrherz commented 8 years ago

@mogismog Awesome, thanks so much for the workaround :)

guytcc commented 6 years ago

Something maybe of interest.

I recently converted some tools we have that do the above from Python 2 to 3. When the files were read in, the byte chars were not converted to strings. I couldn't actually get this to work on the xarray side and had to loop through the DataFrame columns with .apply(lambda s: s.decode("utf-8")) to decode them properly. I'm assuming this might be in the netCDF4 library, but I'm not 100% sure.
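
A minimal sketch of that decoding step (the all-bytes column check is an assumption about how the columns come back from to_dataframe()):

import pandas as pd

def decode_bytes_columns(df: pd.DataFrame) -> pd.DataFrame:
    # Decode every object column whose values are all bytes into str.
    for col in df.select_dtypes(include=[object]).columns:
        if df[col].map(type).eq(bytes).all():
            df[col] = df[col].str.decode("utf-8")
    return df

df = decode_bytes_columns(df)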

stale[bot] commented 4 years ago

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.

akrherz commented 4 years ago

Just to note that this issue still happens today with numpy=1.18.1, xarray=0.15.0, and pandas=1.0.1:

>>> df = nc.to_dataframe()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/miniconda3/envs/prod/lib/python3.8/site-packages/xarray/core/dataset.py", line 4465, in to_dataframe
    return self._to_dataframe(self.dims)
  File "/opt/miniconda3/envs/prod/lib/python3.8/site-packages/xarray/core/dataset.py", line 4451, in _to_dataframe
    data = [
  File "/opt/miniconda3/envs/prod/lib/python3.8/site-packages/xarray/core/dataset.py", line 4452, in <listcomp>
    self._variables[k].set_dims(ordered_dims).values.reshape(-1)
  File "/opt/miniconda3/envs/prod/lib/python3.8/site-packages/xarray/core/variable.py", line 1345, in set_dims
    expanded_data = duck_array_ops.broadcast_to(self.data, tmp_shape)
  File "/opt/miniconda3/envs/prod/lib/python3.8/site-packages/xarray/core/duck_array_ops.py", line 47, in f
    return wrapped(*args, **kwargs)
  File "<__array_function__ internals>", line 5, in broadcast_to
  File "/opt/miniconda3/envs/prod/lib/python3.8/site-packages/numpy/lib/stride_tricks.py", line 182, in broadcast_to
    return _broadcast_to(array, shape, subok=subok, readonly=True)
  File "/opt/miniconda3/envs/prod/lib/python3.8/site-packages/numpy/lib/stride_tricks.py", line 125, in _broadcast_to
    it = np.nditer(
ValueError: iterator is too large