linnarsson-lab / loompy

Python implementation of the Loom file format - http://loompy.org
BSD 2-Clause "Simplified" License

TypeError: Object dtype dtype('O') has no native HDF5 equivalent #12

Closed olgabot closed 6 years ago

olgabot commented 6 years ago

Trying to create a loom dataset here and getting errors after converting my pandas DataFrame of cell/gene attributes to a dictionary. I suspect the problem is on my side, since pandas stores strings as object arrays rather than fixed-width string arrays:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<timed exec> in <module>()

~/miniconda3/envs/loom-env/lib/python3.6/site-packages/loompy/loompy.py in create(filename, matrix, row_attrs, col_attrs, file_attrs, chunks, chunk_cache, dtype, compression_opts)
   1036 
   1037         for key, vals in col_attrs.items():
-> 1038                 ds.set_attr(key, vals, axis=1)
   1039 
   1040         for vals in file_attrs:

~/miniconda3/envs/loom-env/lib/python3.6/site-packages/loompy/loompy.py in set_attr(self, name, values, axis, dtype)
    561 
    562                 self.delete_attr(name, axis, raise_on_missing=False)
--> 563                 self._save_attr(name, values, axis)
    564                 self._load_attr(name, axis)
    565 

~/miniconda3/envs/loom-env/lib/python3.6/site-packages/loompy/loompy.py in _save_attr(self, name, values, axis)
    174                 if self._file[a].__contains__(name):
    175                         del self._file[a + name]
--> 176                 self._file[a + name] = values
    177                 self._file.flush()
    178 

~/miniconda3/envs/loom-env/lib/python3.6/site-packages/h5py/_hl/group.py in __setitem__(self, name, obj)
    289 
    290             else:
--> 291                 ds = self.create_dataset(None, data=obj, dtype=base.guess_dtype(obj))
    292                 h5o.link(ds.id, self.id, name, lcpl=lcpl)
    293 

~/miniconda3/envs/loom-env/lib/python3.6/site-packages/h5py/_hl/group.py in create_dataset(self, name, shape, dtype, data, **kwds)
    104         """
    105         with phil:
--> 106             dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
    107             dset = dataset.Dataset(dsid)
    108             if name is not None:

~/miniconda3/envs/loom-env/lib/python3.6/site-packages/h5py/_hl/dataset.py in make_new_dset(parent, shape, dtype, data, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times)
     98         else:
     99             dtype = numpy.dtype(dtype)
--> 100         tid = h5t.py_create(dtype, logical=1)
    101 
    102     # Legacy

h5py/h5t.pyx in h5py.h5t.py_create()

h5py/h5t.pyx in h5py.h5t.py_create()

h5py/h5t.pyx in h5py.h5t.py_create()

TypeError: Object dtype dtype('O') has no native HDF5 equivalent
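For context, the failure is reproducible with numpy and pandas alone (a minimal sketch, independent of the loompy codebase; the column name is made up): pandas hands back string columns as object arrays, and h5py has no native HDF5 type for dtype('O').

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"cell_type": ["neuron", "glia", "neuron"]})
arr = df["cell_type"].values
print(arr.dtype)  # object -- pandas stores strings as Python objects

# Converting to a fixed-width numpy unicode dtype is one way to avoid
# the TypeError when writing to HDF5 via h5py:
fixed = arr.astype("U")
print(fixed.dtype.kind)  # 'U', with the width inferred from the data
```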
slinnarsson commented 6 years ago

Are these arrays of UTF8 strings? We may want/need to support that. Arrays of arbitrary objects will not be supported (you would have to manually serialize them to e.g. JSON first).


olgabot commented 6 years ago

Yes, these are UTF8 strings. I filed a bug with xarray with a different problem but there's an example dataset there, too: https://github.com/pydata/xarray/issues/1680

shoyer commented 6 years ago

Just to be clear, these are unicode strings but not UTF-8. Python uses either UTF-16 or UTF-32 for unicode, but those details are pretty well hidden from the user-level API: https://stackoverflow.com/questions/3547534/what-encoding-do-normal-python-strings-use

slinnarsson commented 6 years ago

I pushed a fix that does more extensive normalization of inputs during create() and set_attr(). You should now be able to pass list, tuple, np.ndarray, np.matrix or scipy.sparse, and the elements can be any kind of string, string object, or number. All will be normalized to conform to the spec.

If the input contains unicode strings, any non-ASCII characters are XML entity encoded. E.g. 25 µl will be written to the HDF5 file as the ASCII string 25 &#181;l. When read back, any XML entities are unescaped, so you get the original string back. This is all transparent to the Python user: you work with unicode arrays and never mind how they are stored. At the same time, it ensures interoperability with languages that do not support unicode in HDF5 (such as MATLAB).
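The round-trip described above can be sketched as follows (hypothetical helper names, not loompy's actual functions; a real implementation must also escape literal "&" characters so they survive unescaping):

```python
import html

def encode_ascii(s: str) -> bytes:
    # XML-entity-encode every non-ASCII character, e.g. "µ" -> "&#181;"
    return "".join(
        c if ord(c) < 128 else f"&#{ord(c)};" for c in s
    ).encode("ascii")

def decode_ascii(b: bytes) -> str:
    # Unescape the entities to recover the original unicode string
    return html.unescape(b.decode("ascii"))

encode_ascii("25 µl")        # b'25 &#181;l'
decode_ascii(b"25 &#181;l")  # '25 µl'
```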

You can now directly convert a pandas DataFrame to a row/col dictionary for create(), like so:

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [0.5, 0.75, 1]},
                  index=['a', 'b', 'c'])
col_attrs = df.to_dict("list")  # {'col1': [1, 2, 3], 'col2': [0.5, 0.75, 1.0]}
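One caveat worth noting (a sketch; the "CellID" attribute name is illustrative, not mandated by loompy): to_dict("list") drops the DataFrame index, so cell identifiers stored there must be added explicitly, and every attribute list must have one entry per matrix column.

```python
import numpy as np
import pandas as pd

cells = pd.DataFrame({'col1': [1, 2, 3], 'col2': [0.5, 0.75, 1]},
                     index=['a', 'b', 'c'])

col_attrs = cells.to_dict("list")
col_attrs["CellID"] = cells.index.tolist()  # to_dict() drops the index

# Each column attribute needs one value per column of the matrix:
matrix = np.zeros((5, 3))  # e.g. 5 genes x 3 cells
assert all(len(v) == matrix.shape[1] for v in col_attrs.values())
```

With the normalization fix above, these lists (including ones holding unicode strings) can then be passed as the col_attrs argument to create().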