Closed olgabot closed 6 years ago
Are these arrays of UTF8 strings? We may want/need to support that. Arrays of arbitrary objects will not be supported (you would have to manually serialize them to e.g. JSON first).
-- Sten Linnarsson, PhD Professor of Molecular Systems Biology Karolinska Institutet Unit of Molecular Neurobiology Department of Medical Biochemistry and Biophysics Scheeles väg 1, 171 77 Stockholm, Sweden +46 8 52 48 75 77 (office) +46 70 399 32 06 (mobile)
On 1 Nov 2017, at 19:08, Olga Botvinnik notifications@github.com wrote:
Trying to create a loom dataset here and am getting errors after I convert my pandas DataFrame of cell/gene attributes to a dictionary. I suspect that since pandas does object arrays instead of strings, this may be a problem on my side:
TypeError                                 Traceback (most recent call last)
in ()

~/miniconda3/envs/loom-env/lib/python3.6/site-packages/loompy/loompy.py in create(filename, matrix, row_attrs, col_attrs, file_attrs, chunks, chunk_cache, dtype, compression_opts)
   1036
   1037     for key, vals in col_attrs.items():
-> 1038         ds.set_attr(key, vals, axis=1)
   1039
   1040     for vals in file_attrs:

~/miniconda3/envs/loom-env/lib/python3.6/site-packages/loompy/loompy.py in set_attr(self, name, values, axis, dtype)
    561
    562         self.delete_attr(name, axis, raise_on_missing=False)
--> 563         self._save_attr(name, values, axis)
    564         self._load_attr(name, axis)
    565

~/miniconda3/envs/loom-env/lib/python3.6/site-packages/loompy/loompy.py in _save_attr(self, name, values, axis)
    174         if self._file[a].__contains__(name):
    175             del self._file[a + name]
--> 176         self._file[a + name] = values
    177         self._file.flush()
    178

~/miniconda3/envs/loom-env/lib/python3.6/site-packages/h5py/_hl/group.py in __setitem__(self, name, obj)
    289
    290         else:
--> 291             ds = self.create_dataset(None, data=obj, dtype=base.guess_dtype(obj))
    292             h5o.link(ds.id, self.id, name, lcpl=lcpl)
    293

~/miniconda3/envs/loom-env/lib/python3.6/site-packages/h5py/_hl/group.py in create_dataset(self, name, shape, dtype, data, **kwds)
    104         """
    105         with phil:
--> 106             dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
    107             dset = dataset.Dataset(dsid)
    108             if name is not None:

~/miniconda3/envs/loom-env/lib/python3.6/site-packages/h5py/_hl/dataset.py in make_new_dset(parent, shape, dtype, data, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times)
     98         else:
     99             dtype = numpy.dtype(dtype)
--> 100         tid = h5t.py_create(dtype, logical=1)
    101
    102     # Legacy

h5py/h5t.pyx in h5py.h5t.py_create()
h5py/h5t.pyx in h5py.h5t.py_create()
h5py/h5t.pyx in h5py.h5t.py_create()

TypeError: Object dtype dtype('O') has no native HDF5 equivalent
Yes, these are UTF8 strings. I filed a bug with xarray with a different problem but there's an example dataset there, too: https://github.com/pydata/xarray/issues/1680
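Just to be clear, these are Unicode strings but not UTF-8. A Python 3 str is a sequence of Unicode code points; CPython stores it internally as Latin-1, UCS-2, or UCS-4 depending on the characters it contains (older builds used UTF-16 or UTF-32), but those details are well hidden from the user-level API: https://stackoverflow.com/questions/3547534/what-encoding-do-normal-python-strings-use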
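To make the distinction concrete, here is a minimal sketch using only the standard library. The sample string is just an illustration; the point is that a str has no byte encoding until you explicitly encode it.

```python
# A str is a sequence of Unicode code points, not bytes.
text = "25 \u00b5l"  # "25 µl", with MICRO SIGN (U+00B5)

# UTF-8 only enters the picture when the str is encoded to bytes.
utf8_bytes = text.encode("utf-8")

print(len(text))        # number of code points
print(len(utf8_bytes))  # number of bytes; the micro sign takes two in UTF-8

# Decoding with the same codec gives back the original str.
roundtrip = utf8_bytes.decode("utf-8")
```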
I pushed a fix that does more extensive normalization of inputs during create() and set_attr(). You should now be able to pass list, tuple, np.ndarray, np.matrix or scipy.sparse, and the elements can be any kind of string, string object, or number. All will be normalized to conform to the spec.
If the input contains unicode strings, any non-ASCII characters are XML entity encoded. E.g. 25 µl
will be written as the ASCII string 25 &#181;l
to the HDF5 file. When read back, any XML entities are unescaped, so that you get the original back. This is all transparent to the Python user: you work with unicode arrays and never mind how they are stored. At the same time, it ensures interoperability with languages that do not support unicode in HDF5 (such as MATLAB).
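The round trip described above can be sketched with the standard library alone. This is an illustration of the idea, not necessarily how loompy implements it; the helper names `to_ascii_entities` and `from_ascii_entities` are made up for this example.

```python
import html


def to_ascii_entities(text: str) -> bytes:
    """Replace every non-ASCII character with a numeric XML character reference."""
    return text.encode("ascii", "xmlcharrefreplace")


def from_ascii_entities(raw: bytes) -> str:
    """Decode the ASCII bytes and turn character references back into characters."""
    return html.unescape(raw.decode("ascii"))


original = "25 \u00b5l"               # "25 µl"
stored = to_ascii_entities(original)  # what would end up in the file
restored = from_ascii_entities(stored)
```

One caveat with this kind of scheme: a string that already contains literal entity text (for example the characters `&amp;`) would be changed by the unescape step, so it only round-trips cleanly for input that does not itself look like an entity.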
You can now directly convert a pandas DataFrame to a row/col dictionary for create(), like so:
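import pandas as pd
# One DataFrame column per attribute, one row per cell (or gene)
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [0.5, 0.75, 1]}, index=['a', 'b', 'c'])
# "list" orientation maps each column name to a plain Python list of its values
col_attrs = df.to_dict("list")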
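One thing to watch for: `to_dict("list")` returns only the columns, so the index labels are not part of the result. If your cell or gene identifiers live in the index, a possible approach (an assumption about your data layout, not something loompy requires) is to move the index into a column first:

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": [1, 2, 3], "col2": [0.5, 0.75, 1]},
    index=["a", "b", "c"],
)

# Columns only: the index labels 'a', 'b', 'c' are dropped.
col_attrs = df.to_dict("list")

# reset_index() turns the unnamed index into a column called "index",
# so the labels are carried along as an attribute of their own.
col_attrs_with_index = df.reset_index().to_dict("list")

print(col_attrs)
print(col_attrs_with_index)
```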