pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Save arbitrary Python objects to netCDF #1415

Open lewisacidic opened 7 years ago

lewisacidic commented 7 years ago

I am looking to transition from pandas to xarray, and the only feature that I am really missing is the ability to seamlessly save arrays of Python objects to HDF5 (or netCDF). This might be an issue for the backend netCDF4 libraries instead, but I thought I would post it here first to see what the opinions were about this functionality.

For context, pandas allows this by using PyTables' ObjectAtom to serialize the object with pickle, then saving the result as a variable-length bytes data type. It is already possible to do this with netCDF4 by applying np.fromstring(pickle.dumps(obj), dtype=np.uint8) to each object in the array and saving the results using a uint8 VLType. Retrieving is then simply pickle.loads(buf.tostring()) for each element.
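A minimal sketch of that round trip, for illustration only. The helper names (encode_object, decode_object, write_objects, read_objects) and the type name "pickled_bytes" are made up for this example; np.frombuffer and tobytes are used as the current NumPy spellings of the calls mentioned above. The netCDF4 import is done lazily so the pickle helpers can be tried without the library installed.

```python
import pickle

import numpy as np


def encode_object(obj):
    """Pickle obj and expose the bytes as a writable 1-D uint8 array."""
    return np.frombuffer(pickle.dumps(obj), dtype=np.uint8).copy()


def decode_object(buf):
    """Inverse of encode_object: turn a uint8 array back into the object."""
    return pickle.loads(np.asarray(buf, dtype=np.uint8).tobytes())


def write_objects(path, objs, name="objs"):
    """Store each pickled object as one element of a uint8 VLType variable."""
    import netCDF4  # imported lazily; only needed for file I/O

    with netCDF4.Dataset(path, "w", format="NETCDF4") as nc:
        vlen_t = nc.createVLType(np.uint8, "pickled_bytes")
        nc.createDimension(name + "_dim", len(objs))
        var = nc.createVariable(name, vlen_t, (name + "_dim",))
        # A VLType variable is filled from an object array of 1-D arrays.
        data = np.empty(len(objs), dtype=object)
        for i, obj in enumerate(objs):
            data[i] = encode_object(obj)
        var[:] = data


def read_objects(path, name="objs"):
    """Read the variable back and unpickle every element."""
    import netCDF4

    with netCDF4.Dataset(path) as nc:
        return [decode_object(buf) for buf in nc.variables[name][:]]
```

Only unpickle files from a trusted source, since pickle.loads can execute arbitrary code.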

I know pickle can be a security problem, that it can cause a problem if you try to save a numerical array that accidentally has dtype=object (pandas gives a warning), and that this is probably quite slow (I think pandas pickles a list containing all the objects for speed), but it would be incredibly convenient.
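One way to catch the accidental dtype=object case before pickling is to check whether the array would convert cleanly to a numeric dtype. This is a sketch under that assumption, not what pandas does internally; the function name warn_if_accidental_object is hypothetical.

```python
import warnings

import numpy as np


def warn_if_accidental_object(arr):
    """Return True (and warn) if an object array holds only plain numbers."""
    arr = np.asarray(arr)
    if arr.dtype != object:
        return False
    try:
        # Let NumPy re-infer the dtype from the Python-level values.
        inferred = np.asarray(arr.tolist())
    except ValueError:
        # Ragged contents cannot form a regular array, so not plain numbers.
        return False
    # Kinds: b=bool, i=int, u=unsigned int, f=float, c=complex.
    if inferred.dtype.kind in "biufc" and inferred.shape == arr.shape:
        warnings.warn(
            f"object array could be stored as {inferred.dtype}; "
            "pickling it is probably unintended"
        )
        return True
    return False
```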

shoyer commented 7 years ago

I would be OK with this if it required explicitly setting a keyword argument, e.g., ds.to_netcdf(..., allow_pickle=True) and xarray.open_dataset(..., allow_pickle=True). This could be hooked into xarray's existing coding/decoding layer in a relatively straightforward fashion: see ensure_dtype_not_object for where this is caught in the current code. (We would also need something at a lower level in the netCDF4 specific reader/writer to handle uint8 VLType.)
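The opt-in behaviour described here could look roughly like the following. This is a standalone sketch, not xarray code: encode_variable and decode_variable are hypothetical names, and only the allow_pickle keyword comes from the comment above. Object arrays are rejected unless the caller explicitly opts in, on both the write and the read side.

```python
import pickle

import numpy as np


def encode_variable(values, allow_pickle=False):
    """Hypothetical encoding step: pickle object arrays only on request."""
    values = np.asarray(values)
    if values.dtype != object:
        return values  # native dtypes pass through unchanged
    if not allow_pickle:
        raise TypeError(
            "cannot serialize dtype=object; pass allow_pickle=True to pickle"
        )
    out = np.empty(values.shape, dtype=object)
    for idx in np.ndindex(values.shape):
        # Each element becomes a variable-length uint8 buffer.
        out[idx] = np.frombuffer(pickle.dumps(values[idx]), dtype=np.uint8)
    return out


def decode_variable(encoded, allow_pickle=False):
    """Hypothetical decoding step: unpickle only when explicitly allowed."""
    if encoded.dtype != object:
        return encoded
    if not allow_pickle:
        raise TypeError("refusing to unpickle; pass allow_pickle=True")
    out = np.empty(encoded.shape, dtype=object)
    for idx in np.ndindex(encoded.shape):
        out[idx] = pickle.loads(encoded[idx].tobytes())
    return out
```

Defaulting to allow_pickle=False on both sides keeps the existing error for accidental object arrays and avoids unpickling untrusted files by surprise.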

lewisacidic commented 7 years ago

I would certainly be interested in giving this a try, although I'm not exactly sure what would go where yet. It seems like this might possibly be something that would be more appropriate in the netCDF4-python library - should I start an issue over there?

shoyer commented 7 years ago

Sure, there's no harm in asking. My guess is that this isn't a good fit, but I'm not entirely sure.

lewisacidic commented 7 years ago

Yeah, looking at it, it's probably not a thing for them. I thought something like:

# implement something like
# strs = nc.createVariable('strs', str, ('strs_dim',))
objs = nc.createVariable('objs', object, ('objs_dim',))

But I see that the str datatype is a netCDF spec type.

stale[bot] commented 5 years ago

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.