scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License
577 stars 153 forks source link

Give better errors when writing illegal column names #321

Open flying-sheep opened 4 years ago

flying-sheep commented 4 years ago

There’s a too generic/unhelpful error when writing DataFrame column names that are

52 has more details

ivirshup commented 4 years ago

numeric

Could be better, but I think it gives the important information (what element of the AnnData, and what key) at the end:

Example traceback: ```python --------------------------------------------------------------------------- AttributeError Traceback (most recent call last) in 1 a.obsm["name/with/slashes"] = np.ones((a.shape[0], 10)) 2 a.write_h5ad("tmp.h5ad") ----> 3 b = ad.read_h5ad("tmp.h5ad") ~/github/anndata/anndata/_io/h5ad.py in read_h5ad(filename, backed, as_sparse, as_sparse_fmt, chunk_size) 427 _clean_uns(d) # backwards compat 428 --> 429 return AnnData(**d) 430 431 ~/github/anndata/anndata/_core/anndata.py in __init__(self, X, obs, var, uns, obsm, varm, layers, raw, dtype, shape, filename, filemode, asview, obsp, varp, oidx, vidx) 311 varp=varp, 312 filename=filename, --> 313 filemode=filemode, 314 ) 315 ~/github/anndata/anndata/_core/anndata.py in _init_as_actual(self, X, obs, var, uns, obsm, varm, varp, obsp, raw, layers, dtype, shape, filename, filemode) 492 493 # TODO: Think about consequences of making obsm a group in hdf --> 494 self._obsm = AxisArrays(self, 0, vals=convert_to_dict(obsm)) 495 self._varm = AxisArrays(self, 1, vals=convert_to_dict(varm)) 496 ~/github/anndata/anndata/_core/aligned_mapping.py in __init__(self, parent, axis, vals) 229 self._data = dict() 230 if vals is not None: --> 231 self.update(vals) 232 233 /usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/_collections_abc.py in update(*args, **kwds) 839 if isinstance(other, Mapping): 840 for key in other: --> 841 self[key] = other[key] 842 elif hasattr(other, "keys"): 843 for key in other.keys(): ~/github/anndata/anndata/_core/aligned_mapping.py in __setitem__(self, key, value) 148 149 def __setitem__(self, key: str, value: V): --> 150 value = self._validate_value(value, key) 151 self._data[key] = value 152 ~/github/anndata/anndata/_core/aligned_mapping.py in _validate_value(self, val, key) 212 f"value.index does not match parent’s axis {self.axes[0]} names" 213 ) --> 214 return super()._validate_value(val, key) 215 216 ~/github/anndata/anndata/_core/aligned_mapping.py in _validate_value(self, val, key) 47 """Raises an error if value is invalid""" 48 for i, axis in enumerate(self.axes): ---> 49 if self.parent.shape[axis] != val.shape[i]: 50 right_shape = tuple(self.parent.shape[a] for a in self.axes) 51 raise ValueError( AttributeError: 'dict' object has no attribute 'shape' In [4]: a = gen_adata((10, 10)) ...: a.obs[1] = 1 ...: a.write("tmp.h5ad") ...: b = ad.read_h5ad("tmp.h5ad") --------------------------------------------------------------------------- AttributeError Traceback (most recent call last) ~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs) 187 try: --> 188 return func(elem, key, val, *args, **kwargs) 189 except Exception as e: ~/github/anndata/anndata/_io/h5ad.py in write_series(group, key, series, dataset_kwargs) 280 else: --> 281 group[key] = series.values 282 /usr/local/lib/python3.7/site-packages/h5py/_hl/group.py in __setitem__(self, name, obj) 369 with phil: --> 370 name, lcpl = self._e(name, lcpl=True) 371 /usr/local/lib/python3.7/site-packages/h5py/_hl/base.py in _e(self, name, lcpl) 136 try: --> 137 name = name.encode('ascii') 138 coding = h5t.CSET_ASCII AttributeError: 'int' object has no attribute 'encode' The above exception was the direct cause of the following exception: AttributeError Traceback (most recent call last) ~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs) 187 try: --> 188 return func(elem, key, val, *args, **kwargs) 189 except Exception as e: ~/github/anndata/anndata/_io/h5ad.py in write_dataframe(f, key, df, dataset_kwargs) 254 for colname, series in df.items(): --> 255 write_series(group, colname, series, dataset_kwargs=dataset_kwargs) 256 ~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs) 194 f" from {parent}." --> 195 ) from e 196 AttributeError: 'int' object has no attribute 'encode' Above error raised while writing key 1 of from /. The above exception was the direct cause of the following exception: AttributeError Traceback (most recent call last) in 1 a = gen_adata((10, 10)) 2 a.obs[1] = 1 ----> 3 a.write("tmp.h5ad") 4 b = ad.read_h5ad("tmp.h5ad") ~/github/anndata/anndata/_core/anndata.py in write_h5ad(self, filename, compression, compression_opts, force_dense, as_dense) 2008 force_dense=force_dense, 2009 as_dense=as_dense, -> 2010 ) 2011 2012 if self.isbacked: ~/github/anndata/anndata/_io/h5ad.py in write_h5ad(filepath, adata, force_dense, as_dense, dataset_kwargs, **kwargs) 103 else: 104 write_attribute(f, "raw", adata.raw, dataset_kwargs=dataset_kwargs) --> 105 write_attribute(f, "obs", adata.obs, dataset_kwargs=dataset_kwargs) 106 write_attribute(f, "var", adata.var, dataset_kwargs=dataset_kwargs) 107 write_attribute(f, "obsm", adata.obsm, dataset_kwargs=dataset_kwargs) /usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/functools.py in wrapper(*args, **kw) 838 '1 positional argument') 839 --> 840 return dispatch(args[0].__class__)(*args, **kw) 841 842 funcname = getattr(func, '__name__', 'singledispatch function') ~/github/anndata/anndata/_io/h5ad.py in write_attribute_h5ad(f, key, value, *args, **kwargs) 124 if key in f: 125 del f[key] --> 126 _write_method(type(value))(f, key, value, *args, **kwargs) 127 128 ~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs) 193 f"Above error raised while writing key {key!r} of {type(elem)}" 194 f" from {parent}." --> 195 ) from e 196 197 return func_wrapper AttributeError: 'int' object has no attribute 'encode' Above error raised while writing key 1 of from /. Above error raised while writing key 'obs' of from /. ```

'index'

I think this one should work fine now. _index won't work, but I think we give a good error.

containing slashes (unsure if this is still the case)

To my surprise, this one works with dataframes, but not with values in mappings.

import numpy as np
import anndata as ad
from anndata.tests.helpers import gen_adata

a = gen_adata((10, 10))
a.obs["name/with/slashes"] = 1
a.write("tmp.h5ad")
b = ad.read_h5ad("tmp.h5ad")
print(b.obs.columns)
# Index(['obs_cat', 'cat_ordered', 'int64', 'float64', 'uint8',
#       'name/with/slashes'],

a.obsm["name/with/slashes"] = np.ones((a.shape[0], 10))
a.write_h5ad("tmp.h5ad")
b = ad.read_h5ad("tmp.h5ad")
Traceback ```pytb AttributeError Traceback (most recent call last) in 1 a.obsm["name/with/slashes"] = np.ones((a.shape[0], 10)) 2 a.write_h5ad("tmp.h5ad") ----> 3 b = ad.read_h5ad("tmp.h5ad") ~/github/anndata/anndata/_io/h5ad.py in read_h5ad(filename, backed, as_sparse, as_sparse_fmt, chunk_size) 427 _clean_uns(d) # backwards compat 428 --> 429 return AnnData(**d) 430 431 ~/github/anndata/anndata/_core/anndata.py in __init__(self, X, obs, var, uns, obsm, varm, layers, raw, dtype, shape, filename, filemode, asview, obsp, varp, oidx, vidx) 311 varp=varp, 312 filename=filename, --> 313 filemode=filemode, 314 ) 315 ~/github/anndata/anndata/_core/anndata.py in _init_as_actual(self, X, obs, var, uns, obsm, varm, varp, obsp, raw, layers, dtype, shape, filename, filemode) 492 493 # TODO: Think about consequences of making obsm a group in hdf --> 494 self._obsm = AxisArrays(self, 0, vals=convert_to_dict(obsm)) 495 self._varm = AxisArrays(self, 1, vals=convert_to_dict(varm)) 496 ~/github/anndata/anndata/_core/aligned_mapping.py in __init__(self, parent, axis, vals) 229 self._data = dict() 230 if vals is not None: --> 231 self.update(vals) 232 233 /usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/_collections_abc.py in update(*args, **kwds) 839 if isinstance(other, Mapping): 840 for key in other: --> 841 self[key] = other[key] 842 elif hasattr(other, "keys"): 843 for key in other.keys(): ~/github/anndata/anndata/_core/aligned_mapping.py in __setitem__(self, key, value) 148 149 def __setitem__(self, key: str, value: V): --> 150 value = self._validate_value(value, key) 151 self._data[key] = value 152 ~/github/anndata/anndata/_core/aligned_mapping.py in _validate_value(self, val, key) 212 f"value.index does not match parent’s axis {self.axes[0]} names" 213 ) --> 214 return super()._validate_value(val, key) 215 216 ~/github/anndata/anndata/_core/aligned_mapping.py in _validate_value(self, val, key) 47 """Raises an error if value is invalid""" 48 for i, axis in enumerate(self.axes): ---> 49 if self.parent.shape[axis] != val.shape[i]: 50 right_shape = tuple(self.parent.shape[a] for a in self.axes) 51 raise ValueError( AttributeError: 'dict' object has no attribute 'shape' ```

non-unicode:

I don't get an error for that one.

ivirshup commented 4 years ago

Update on name/with/slashes in dataframes:

It works, but I'm not sure it's doing the right thing. Zarr does something similar. Truncated output of h5ls -r tmp.h5ad:

/obs                     Group
/obs/__categories        Group
/obs/__categories/cat_ordered Dataset {10}
/obs/__categories/obs_cat Dataset {7}
/obs/_index              Dataset {10}
/obs/cat_ordered         Dataset {10}
/obs/float64             Dataset {10}
/obs/int64               Dataset {10}
/obs/name                Group
/obs/name/with           Group
/obs/name/with/slashes   Dataset {10}
/obs/obs_cat             Dataset {10}
/obs/uint8               Dataset {10}
zarr["obs"].tree() ``` obs ├── __categories │ ├── cat_ordered (10,) object │ └── obs_cat (7,) object ├── _index (10,) object ├── cat_ordered (10,) int8 ├── float64 (10,) float64 ├── int64 (10,) int64 ├── name │ └── with │ └── slashes (10,) int64 ├── obs_cat (10,) int8 └── uint8 (10,) uint8 ```

Should we normalize or error for these? I'm thinking normalize since I think slashes will be fairly common in column/ group names. E.g. "CD4+/CD8-".

It would be good to use some external standard for normalizing these names. I'm not sure where to find this kind of thing though.

flying-sheep commented 4 years ago
ivirshup commented 4 years ago

Option A mangle/encode the names (e.g. replacing slashes with the unicode division slash /→∕), which modifies the names, so things aren’t as the user specified) and document this mangling in the format documentation.

I was thinking we'd only mangle to and from disk, and it should be a in a completely reversible way. Basically, I was hoping we could just escape characters that filesystems treat as special.

I'm not sure if this can actually be done with hdf5. The current hdf5 user guide (which is conveniently only available as a pdf, so you get the old manual when you google) says group names must be ascii and can't a "." or "/". However, group names can contain a "." and be unicode (mention in other parts), so maybe there's a way for them to include "/'.

Option C ... fix all code that iterates over HDF5-stored dataframe columns

What are you thinking would be fixed here? Right now, I think it "works" for dataframes. It errors for values in the mapping attributes.

flying-sheep commented 4 years ago

I was thinking we'd only mangle to and from disk, and it should be a in a completely reversible way.

You mean exactly the way I described above with the division slash? We can of course also escape the division slash by doubling it in the off chance someone uses one as column name, then stuff is reversible.

The current hdf5 user guide (which is conveniently only available as a pdf, so you get the old manual when you google)

fucking amazing, their online PDF viewer doesn’t even support Ctrl+F. Do they actively want to make developers’ lives harder?

I was hoping we could just escape characters that filesystems treat as special

HAhahahaHA “just” 🤪 Oh god, if we had to support filesystems we could either invest days into researching all the legacy crap that went into the 3 OS’s main file systems during decades of computing history or restrict ourselves to [A-Za-z_-] (or so! no idea if dashes and underscores are allowed everywhere! I mean, colons aren’t, so everything is possible 😵)

What are you thinking would be fixed here? Right now, I think it "works" for dataframes. It errors for values in the mapping attributes.

It does? We don’t have code that does for column_dataset in columns_group without checking if column_dataset is a group itself?

ivirshup commented 4 years ago

You mean exactly the way I described above with the division slash?

Not that, because it becomes non-obvious how to read the group. It would be really annoying for f["gene+/gene-"] to not work when it looks like it should. I was thinking more of escaping like \/.

Do they actively want to make developers’ lives harder?

Well, they have pivoted to emphasize consulting...

HAhahahaHA “just”

Well, I didn't want to have to do it, but figured this would be a common problem for projects like zarr, and we'd just be able to use that. There is a normalize_key argument, but it might just enforce lowercasing.

It does? We don’t have code that does for column_dataset in columns_group without checking if column_dataset is a group itself?

Since the order of the columns is important for dataframes, we save the names in the correct order in the attributes, then do {k: h5group[k] for k in colnames}. Here's the logic: https://github.com/theislab/anndata/blob/4440b90ff3dff213b4c512478e21426cf210368d/anndata/_io/h5ad.py#L460-L469

flying-sheep commented 4 years ago

I was thinking more of escaping like \/.

I’m pretty sure this won’t work. Escaping is something that depends on the interpreter. If HDF5 group names work like files system inode names in this regard it won’t interpret escape characters, it’ll simply do *group_names, value_name = value_path.split('/'), making it impossible to put a literal slash anywhere in a group name.

Not that, because it becomes non-obvious how to read the group. It would be really annoying for f["gene+/gene-"] to not work when it looks like it should.

As I said, if we go the mangling route, we should document whatever we do and de-mangle group names when reading. But yeah, it’s probably a good idea to not use lookalike characters. We should use an escape character that looks outlandishly unicody and is very unlikely to be used in an old anndata object so we don’t accidentally “unescape” something that was never escaped to begin with. Word uses some like that for their totally-not-LaTeX UnicodeMath (PDF)

Well, they have pivoted to emphasize consulting...

Well, did I ever hit the nail on the head.

Well, I didn't want to have to do it, but figured this would be a common problem for projects like zarr, and we'd just be able to use that.

Ah fuck, right, Zarr does that. Well, if we want to use it more prominently we have to stop using group names to store column names, plain and simple. The pain of figuring out and escaping all known file systems’ disallowed characters is one thing, but then there’s also device files (which on windows don’t have absolute paths and are named e.g. com1), different file systems are case sensitive while others aren’t, starting a column name with / might introduce a security risk, and I bet there’s more hidden nastiness. Just no.