akrherz opened this issue 8 years ago
This is (arguably) a NumPy bug -- the problem is that the to_dataframe()
call is trying to create an array with 8e30 elements!
ipdb> shape
(72, 55, 60, 10, 4, 512, 51, 51, 12, 80, 3, 8, 6, 11, 5000, 25, 24, 6277, 24)
ipdb> np.prod(shape)
-8804073483760828416
ipdb> np.prod(np.asarray(shape, dtype=float))
8.6981676921852312e+30
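(For context: np.prod on the tuple silently wraps around int64, which is why the first result is negative.) A minimal sketch of a safe size check -- math.prod is Python 3.8+, and the shape is copied from above:
import numpy as np
from math import prod  # Python 3.8+: exact integer arithmetic, no overflow

shape = (72, 55, 60, 10, 4, 512, 51, 51, 12, 80, 3, 8, 6, 11,
         5000, 25, 24, 6277, 24)
print(np.prod(shape))               # wraps around int64 and goes negative
print(np.prod(shape, dtype=float))  # ~8.7e30 -- far too many elements to allocate
print(prod(shape))                  # exact Python int, never overflows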
The problem is that these MADIS netCDFs have loads of dimensions, corresponding to strings (and other stuff, if I recall correctly):
<xarray.Dataset>
Dimensions: (ICcheckNameLen: 72, ICcheckNum: 55, QCcheckNameLen: 60, QCcheckNum: 10, maxHomeWFOlen: 4, maxLDADmessageLen: 512, maxLDADtestLen: 51, maxNameLength: 51, maxProviderIdLen: 12, maxRemark: 80, maxSkyCover: 3, maxSkyLen: 8, maxStaIdLen: 6, maxStaTypeLen: 11, maxStaticIds: 5000, maxWeatherLen: 25, nInventoryBins: 24, recNum: 6277, totalIdLen: 24)
xarray here tries to build a MultiIndex for the DataFrame from the outer product of all these dimensions. It would be nice to have a better fix here, but it's not immediately obvious to me what that would be.
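To make the blow-up concrete, here is a toy illustration (dimension names taken from the repr above, truncated to the first three) of how to_dataframe() builds its index from the Cartesian product of every dimension:
import pandas as pd

# The row count is the product of *all* dimension sizes, so even a few
# modest dimensions multiply quickly.
idx = pd.MultiIndex.from_product(
    [range(72), range(55), range(60)],
    names=["ICcheckNameLen", "ICcheckNum", "QCcheckNameLen"],
)
print(len(idx))  # 72 * 55 * 60 = 237600 rows from just three dimensions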
@shoyer: You're right that MADIS netCDF files are (imo) poorly formatted. There is also the issue of xarray.decode_cf() not concatenating the string arrays even after fixing the _FillValue / missing_value conflict (hence requiring passing decode_cf=False when opening the MADIS netCDF file). Having looked at the decode_cf code, though, I don't think this is a bug per se (some quick debugging revealed that no variable in this netCDF file gets past this check), though if you feel this may in fact be a bug, I can look into it a bit more.
Unfortunately, this does mean I have to do a lot of "manual cleaning" of the netCDF file before exporting it as a DataFrame, but it is easy enough to write a set of functions to accomplish this. That said, I can't copy/paste the exact code (for work-related reasons). I'm not sure how helpful this is, but when working with MADIS netCDF data, I more or less do the following as a workaround (see the sketch below):
1. Fix the _FillValue and missing_value conflict in the variables.
2. Concatenate the character arrays into strings (e.g. stationId, dataProvider).
Though reading it over, that is kind of a draw-the-owl-esque response. :/
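To make it slightly less draw-the-owl, here is a rough sketch of those two steps. This is illustrative only, not the poster's actual code; the filename comes from later in the thread, and dropping missing_value in favor of _FillValue is just one way to resolve the conflict:
import numpy as np
import xarray as xr

# Open raw, since decode_cf() chokes on the attribute conflict.
ds = xr.open_dataset("20160430_1600.nc", decode_cf=False)

# 1. Fix the _FillValue / missing_value conflict so decode_cf() can run.
for var in ds.variables.values():
    if "_FillValue" in var.attrs and "missing_value" in var.attrs:
        del var.attrs["missing_value"]

# 2. Join the per-character arrays into plain strings.
for name in ("stationId", "dataProvider"):
    chars = ds[name].values.astype("U1")  # S1 bytes -> 1-char unicode
    ds[name] = ("recNum", np.array(["".join(row).strip() for row in chars]))

ds = xr.decode_cf(ds)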
I thought of something: is the issue here the unlimited record dimension?
netcdf \20160430_1600 {
dimensions:
....
recNum = UNLIMITED ; // (2845 currently)
Redeeming myself (only a little bit) from my previous message here:
@akrherz I was messing around with this a bit, and this seems to work OK. It gets rid of unnecessary dimensions, concatenates string arrays, and turns the Dataset into a pandas DataFrame:
In [1]: import xarray as xr
In [2]: ds = xr.open_dataset('20160430_1600.nc', decode_cf=True, mask_and_scale=False, decode_times=False)  # xarray has issues decoding the times, so you'll have to do that in pandas.
In [3]: vars_to_drop = [k for k in ds.variables.keys() if 'recNum' not in ds[k].dims]
In [4]: ds = ds.drop(vars_to_drop)
In [5]: df = ds.to_dataframe()
In [6]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6277 entries, 0 to 6276
Data columns (total 93 columns):
invTime 6277 non-null int32
prevRecord 6277 non-null int32
isOverflow 6277 non-null int32
secondsStage1_2 6277 non-null int32
secondsStage3 6277 non-null int32
providerId 6277 non-null object
stationId 6277 non-null object
handbook5Id 6277 non-null object
~snip~
A bit hacky, but it works.
@mogismog Awesome, thanks so much for the workaround :)
Something that may be of interest: I recently converted some tools we have that do the above from Python 2 to 3. When the files were read in, the byte chars were not converted to strings. I couldn't get this to work on the xarray side and had to loop through the DataFrame columns with apply() calling .decode("utf-8") to decode them properly. I'm assuming this might be in the netCDF4 library, but I'm not 100% sure.
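For anyone hitting the same thing, a small sketch of that cleanup; the helper name and the bytes-detection heuristic are mine, not from the thread:
import pandas as pd

def decode_bytes_columns(df: pd.DataFrame, encoding: str = "utf-8") -> pd.DataFrame:
    """Decode object columns holding raw bytes into str (Python 3)."""
    for col in df.columns:
        if df[col].dtype == object and isinstance(df[col].iat[0], bytes):
            df[col] = df[col].str.decode(encoding)
    return df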
In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.
If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.
Just to note that the issue still happens today with numpy=1.18.1, xarray=0.15.0, and pandas=1.0.1:
>>> df = nc.to_dataframe()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/miniconda3/envs/prod/lib/python3.8/site-packages/xarray/core/dataset.py", line 4465, in to_dataframe
return self._to_dataframe(self.dims)
File "/opt/miniconda3/envs/prod/lib/python3.8/site-packages/xarray/core/dataset.py", line 4451, in _to_dataframe
data = [
File "/opt/miniconda3/envs/prod/lib/python3.8/site-packages/xarray/core/dataset.py", line 4452, in <listcomp>
self._variables[k].set_dims(ordered_dims).values.reshape(-1)
File "/opt/miniconda3/envs/prod/lib/python3.8/site-packages/xarray/core/variable.py", line 1345, in set_dims
expanded_data = duck_array_ops.broadcast_to(self.data, tmp_shape)
File "/opt/miniconda3/envs/prod/lib/python3.8/site-packages/xarray/core/duck_array_ops.py", line 47, in f
return wrapped(*args, **kwargs)
File "<__array_function__ internals>", line 5, in broadcast_to
File "/opt/miniconda3/envs/prod/lib/python3.8/site-packages/numpy/lib/stride_tricks.py", line 182, in broadcast_to
return _broadcast_to(array, shape, subok=subok, readonly=True)
File "/opt/miniconda3/envs/prod/lib/python3.8/site-packages/numpy/lib/stride_tricks.py", line 125, in _broadcast_to
it = np.nditer(
ValueError: iterator is too large
Getting started with xarray 0.7 (Python 2.7, RHEL 7.2, 64-bit), I am having trouble converting a MADIS-produced netCDF file to a pandas DataFrame.
Here's the traceback
Here's the code
Here's the netcdf file
Thanks!