scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License

No support for mixed column type #726

Closed nspies-celsiustx closed 1 year ago

nspies-celsiustx commented 2 years ago

I'm reopening https://github.com/theislab/anndata/issues/558 as I'm still having this problem with the latest anndata and h5py. It seems that anndata needs to be a bit more defensive about dtypes when writing to h5ad with h5py>=3. I know that most non-numeric columns get converted to categoricals in most situations, but I'm hitting these edge cases with some frequency, even with anndata objects that should be large enough to trigger the categorical conversion. I suppose logic along the following lines would solve the issue:

  1. perform the categorical conversion
  2. check for any remaining non-numeric, non-categorical columns and convert them to {strings/categories?} - judging from the previous issue, this is what h5py<3 used to do (see the sketch below)
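
For illustration, such a clean-up pass might look roughly like this (hypothetical helper, not an anndata API):

```python
import pandas as pd

def coerce_obs_for_writing(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical clean-up pass: make every obs/var column something h5py can write."""
    out = df.copy()
    for col in out.columns:
        s = out[col]
        # step 1: numeric and categorical columns are already writable
        if pd.api.types.is_numeric_dtype(s) or isinstance(s.dtype, pd.CategoricalDtype):
            continue
        # step 2: anything else (object dtype, mixed types, ...) is forced to plain strings,
        # then to categorical
        out[col] = s.astype(str).astype("category")
    return out
```

Something like `adata.obs = coerce_obs_for_writing(adata.obs)` right before writing would be enough for my use case.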

We're mostly running anndata<0.8, so maybe this is fixed in 0.8, but I'm also hitting this issue when writing the .uns slot; it would be great to apply the same clean-up/failsafe logic there.

Example with long traceback:

```pytb
In [1]: import anndata
   ...: import pandas as pd
   ...:
   ...: adata = anndata.AnnData(obs=pd.DataFrame({"a":["1",2,3]}))
   ...: adata.write_h5ad("temp.h5ad")
/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_core/anndata.py:121: ImplicitModificationWarning: Transforming to str index.
  warnings.warn("Transforming to str index.", ImplicitModificationWarning)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    213         try:
--> 214             return func(elem, key, val, *args, **kwargs)
    215         except Exception as e:

/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_io/specs/registry.py in write_elem(f, k, elem, modifiers, *args, **kwargs)
    171             _REGISTRY.get_writer(dest_type, (t, elem.dtype.kind), modifiers)(
--> 172                 f, k, elem, *args, **kwargs
    173             )

/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_io/specs/registry.py in wrapper(g, k, *args, **kwargs)
     23     def wrapper(g, k, *args, **kwargs):
---> 24         result = func(g, k, *args, **kwargs)
     25         g[k].attrs.setdefault("encoding-type", spec.encoding_type)

/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_io/specs/methods.py in write_vlen_string_array(f, k, elem, dataset_kwargs)
    343     str_dtype = h5py.special_dtype(vlen=str)
--> 344     f.create_dataset(k, data=elem.astype(str_dtype), dtype=str_dtype, **dataset_kwargs)
    345

/opt/conda/envs/test_adata/lib/python3.7/site-packages/h5py/_hl/group.py in create_dataset(self, name, shape, dtype, data, **kwds)
    148
--> 149         dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
    150         dset = dataset.Dataset(dsid)

/opt/conda/envs/test_adata/lib/python3.7/site-packages/h5py/_hl/dataset.py in make_new_dset(parent, shape, dtype, data, name, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times, external, track_order, dcpl, allow_unknown_filter)
    144     if (data is not None) and (not isinstance(data, Empty)):
--> 145         dset_id.write(h5s.ALL, h5s.ALL, data)
    146

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/h5d.pyx in h5py.h5d.DatasetID.write()
h5py/_proxy.pyx in h5py._proxy.dset_rw()
h5py/_conv.pyx in h5py._conv.str2vlen()
h5py/_conv.pyx in h5py._conv.generic_converter()
h5py/_conv.pyx in h5py._conv.conv_str2vlen()

TypeError: Can't implicitly convert non-string objects to strings

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
 in
      3
      4 adata = anndata.AnnData(obs=pd.DataFrame({"a":["1",2,3]}))
----> 5 adata.write_h5ad("temp.h5ad")

/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_core/anndata.py in write_h5ad(self, filename, compression, compression_opts, force_dense, as_dense)
   1922             compression_opts=compression_opts,
   1923             force_dense=force_dense,
-> 1924             as_dense=as_dense,
   1925         )
   1926

/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_io/h5ad.py in write_h5ad(filepath, adata, force_dense, as_dense, dataset_kwargs, **kwargs)
     96     elif adata.raw is not None:
     97         write_elem(f, "raw", adata.raw, dataset_kwargs=dataset_kwargs)
---> 98     write_elem(f, "obs", adata.obs, dataset_kwargs=dataset_kwargs)
     99     write_elem(f, "var", adata.var, dataset_kwargs=dataset_kwargs)
    100     write_elem(f, "obsm", dict(adata.obsm), dataset_kwargs=dataset_kwargs)

/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    212     def func_wrapper(elem, key, val, *args, **kwargs):
    213         try:
--> 214             return func(elem, key, val, *args, **kwargs)
    215         except Exception as e:
    216             if "Above error raised while writing key" in format(e):

/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_io/specs/registry.py in write_elem(f, k, elem, modifiers, *args, **kwargs)
    173             )
    174         else:
--> 175             _REGISTRY.get_writer(dest_type, t, modifiers)(f, k, elem, *args, **kwargs)
    176
    177

/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_io/specs/registry.py in wrapper(g, k, *args, **kwargs)
     22     @wraps(func)
     23     def wrapper(g, k, *args, **kwargs):
---> 24         result = func(g, k, *args, **kwargs)
     25         g[k].attrs.setdefault("encoding-type", spec.encoding_type)
     26         g[k].attrs.setdefault("encoding-version", spec.encoding_version)

/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_io/specs/methods.py in write_dataframe(f, key, df, dataset_kwargs)
    510     for colname, series in df.items():
    511         # TODO: this should write the "true" representation of the series (i.e. the underlying array or ndarray depending)
--> 512         write_elem(group, colname, series._values, dataset_kwargs=dataset_kwargs)
    513
    514

/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    222             f"Above error raised while writing key {key!r} of {type(elem)} "
    223             f"to {parent}"
--> 224         ) from e
    225
    226     return func_wrapper

TypeError: Can't implicitly convert non-string objects to strings

Above error raised while writing key 'a' of to /
```

and

```pytb
In [4]: anndata.__version__, h5py.__version__
Out[4]: ('0.8.0rc1', '3.6.0')
```
ivirshup commented 2 years ago

The check we use for converting to categorical is if there are repeated values in a string column. This wouldn't quite trigger in your case.

I think the issue here is you have a mixed type column in obs, which isn't something we generally support IO for. I'm not sure getting back a column of ["1", "2", "3"] (what h5py 2.0 would have done) is the correct behavior here – since that's not what you tried to write.
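
Roughly, that heuristic amounts to something like the following (a paraphrase for illustration, not anndata's actual code):

```python
import pandas as pd

def would_convert_to_categorical(col: pd.Series) -> bool:
    # Only all-string columns with repeated values get converted to categorical.
    return pd.api.types.infer_dtype(col) == "string" and col.nunique() < len(col)
```

For the ["1", 2, 3] column above, pd.api.types.infer_dtype reports "mixed-integer", so the conversion never kicks in.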

nspies-celsiustx commented 2 years ago

> The check we use for converting to categorical is if there are repeated values in a string column. This wouldn't quite trigger in your case.
>
> I think the issue here is you have a mixed type column in obs, which isn't something we generally support IO for. I'm not sure getting back a column of ["1", "2", "3"] (what h5py 2.0 would have done) is the correct behavior here – since that's not what you tried to write.

Yeah, I'm willing to accept that there isn't a single correct way of doing this, but at the end of the day, I'm getting burned pretty frequently by long-running code that fails in the middle because the file can't be written. I'd much prefer some sort of conversion to occur, especially if it's consistent with how the conversion was being performed previously. Some code we're running seems to return dtype: object series that have only numeric values (something like the result of pd.Series([1,2,3], dtype=object)).

Would more aggressively converting to categorical, eg for mixed types, be another option?

ivirshup commented 2 years ago

I think a "proactively convert all contents to something we can write" method could be reasonable.

Falling back to pickling could also be useful for your long running tasks. That way we won't break anything by converting it wrong.

But also, do you know why you're getting object-valued columns?

nspies-celsiustx commented 2 years ago

I think a "proactively convert all contents to something we can write" method could be reasonable.

Yeah, I think this would be great.

> Falling back to pickling could also be useful for your long running tasks. That way we won't break anything by converting it wrong.

Good suggestion!

> But also, do you know why you're getting object-valued columns?

Obviously tracking that down is the short-term solution, but these issues keep cropping up, so ideally we'd implement the convert-all-contents-to-something-we-can-write method. If you can give some pointers on the general approach, i.e. where to look, I'm happy to put together a PR.

ivirshup commented 2 years ago

> ideally we'd implement the convert-all-contents-to-something-we-can-write method

The more I think about this, the less I like this feature. If you write a file but reading it gives you back something different, I don't think the writing was very useful. For cases where there's something we don't support, I'd think interoperability gives way to just getting the data stored.

An alternative here: Maybe we can skip aggressive coercion, write all we can, and just pickle the elements we can't write?

This idea has been floating around for a bit. Comments from some old code:

```python
File ~/anaconda3/envs/scvelo2/lib/python3.8/site-packages/anndata/_io/h5ad.py:148, in write_not_implemented(f, key, value, dataset_kwargs)
    143 @report_write_key_on_error
    144 def write_not_implemented(f, key, value, dataset_kwargs=MappingProxyType({})):
    145     # If it’s not an array, try and make it an array. If that fails, pickle it.
    146     # Maybe rethink that, maybe this should just pickle,
    147     # and have explicit implementations for everything else
--> 148     raise NotImplementedError(
    149         f"Failed to write value for {key}, "
    150         f"since a writer for type {type(value)} has not been implemented yet."
    151     )
```

But if the elements breaking your write methods are like: pd.Series([1,2,3], dtype=object), I'm happy to coerce those to integers.

I think this would basically look like walking the tree of the object, and calling something like pd.api.types.infer_dtype on anything that looks like an object. This would be quite similar to pd.DataFrame.infer_objects. For how to walk the object, it would be nice if there were some reuse of the IO code, but that might be difficult at the moment.
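
As a rough illustration (a sketch only, not an anndata API; the helper names are made up):

```python
import numpy as np
import pandas as pd

def _coerce(value):
    """Sketch of the 'walk the tree and infer dtypes' idea."""
    if isinstance(value, pd.DataFrame):
        return value.infer_objects()
    if isinstance(value, dict):
        return {k: _coerce(v) for k, v in value.items()}
    if isinstance(value, np.ndarray) and value.dtype == object:
        kind = pd.api.types.infer_dtype(value)
        if kind == "integer":
            return value.astype(np.int64)
        if kind in ("floating", "mixed-integer-float"):
            return value.astype(np.float64)
    return value

def infer_objects(adata):
    """Apply pandas-style dtype inference to obs/var and (recursively) to uns."""
    adata.obs = adata.obs.infer_objects()
    adata.var = adata.var.infer_objects()
    adata.uns = _coerce(dict(adata.uns))
    return adata
```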


@Zethson and @Imipenem may have some additional insight here, as they're working with quite heterogenous medical record data over at theislab/ehrapy.

nspies-celsiustx commented 2 years ago

> The more I think about this, the less I like this feature. If you write a file but reading it gives you back something different, I don't think the writing was very useful. For cases where there's something we don't support, I'd think interoperability gives way to just getting the data stored.
>
> An alternative here: Maybe we can skip aggressive coercion, write all we can, and just pickle the elements we can't write?

I'm not entirely against pickling, but it's not a great long-term storage solution, as changes to e.g. pandas could make it impossible to deserialize. I would try as hard as possible to do faithful conversions, and as a final fallback convert to pickle but with a warning message noting which slots needed that conversion - and allow reading back the h5ad even if some fields fail to unpickle.
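
Something along these lines is what I have in mind (a sketch that assumes anndata.experimental.write_elem is available and stores the pickle as an opaque byte blob; not a proposal for the actual on-disk format):

```python
import pickle
import warnings

import h5py
import numpy as np
from anndata.experimental import write_elem

def write_elem_or_pickle(group: h5py.Group, key: str, elem):
    """Try the normal writer; fall back to an opaque pickle blob with a warning."""
    try:
        write_elem(group, key, elem)
    except Exception:
        warnings.warn(
            f"Could not write {key!r} faithfully; falling back to a pickle blob. "
            "This is not a stable long-term storage format."
        )
        group.create_dataset(key, data=np.void(pickle.dumps(elem)))
```

Reading such a blob back would be `pickle.loads(dset[()].tobytes())`, and a permissive reader could simply skip blobs it fails to unpickle.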

Would more aggressive conversion to categorical work as well?

> This idea has been floating around for a bit. But if the elements breaking your write methods are like: pd.Series([1,2,3], dtype=object), I'm happy to coerce those to integers.
>
> I think this would basically look like walking the tree of the object, and calling something like pd.api.types.infer_dtype on anything that looks like an object. This would be quite similar to pd.DataFrame.infer_objects. For how to walk the object, it would be nice if there were some reuse of the IO code, but that might be difficult at the moment.

Irrespective of other solutions, applying pd.DataFrame.infer_objects() sounds like a great idea.

ivirshup commented 2 years ago

> I'm not entirely against pickling, but it's not a great long-term storage solution

Yeah, I would say that you have to opt in to the pickling behavior. And the documentation would say "please don't do this for anything other than ephemeral storage, and even then please don't". This could be a "permissive" writing mode.

> and allow reading back the h5ad even if some fields fail to unpickle.

I have also been wanting to have a permissive reading mode. E.g. if you don't recognize the IO spec for a single element, you can skip it.


> Would more aggressive conversion to categorical work as well?

I think this is the route ehrapy had been taking, but they have run into issues with it. pd.Series(["a", 2, 3]).astype("category") would still fail here.

> Irrespective of other solutions, applying pd.DataFrame.infer_objects() sounds like a great idea.

We could have an AnnData.infer_objects method. I'm a bit cautious about using the pandas methods directly since I don't really trust the stability of some of their extension types.

bio-la commented 1 year ago

I experience the same behaviour, "TypeError: Can't implicitly convert non-string objects to strings", when I'm attempting to write an anndata which has object, string, and float columns all containing some NA values (I need the NA to signal missing samples for multiple columns). Attempting to convert to string or categorical doesn't solve the issue: No method has been defined for writing <class 'pandas.core.arrays.string_.StringArray'> elements to <class 'h5py._hl.group.Group'>. What's the suggested workaround? Thanks!

scanpy: 1.9.1
anndata: 0.8.0
avpatel18 commented 1 year ago

I get the same error after calculating marker genes (Wilcoxon method) and trying to save the anndata object again. Is there any solution to this?

TypeError: Can't implicitly convert non-string objects to strings
Above error raised while writing key 'categories' of <class 'h5py._hl.group.Group'> to /

nroak commented 1 year ago

I have the same error as @avpatel18. Has any solution been released so far?

ivirshup commented 1 year ago

The issue reported by @bio-la is tracked in #679 and is now on track for inclusion in 0.9.0.

The specific error reported by @avpatel18 could happen for a large number of reasons, some of which are not things we can handle. It may be resolved for the cases which can now be represented as strings + null values.
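
For columns like the ones @bio-la describes, a stop-gap on 0.8 could look roughly like this (untested sketch; it replaces missing values with a sentinel string, so true missingness is lost):

```python
import pandas as pd

def stringify_nullable_columns(obs: pd.DataFrame) -> pd.DataFrame:
    """Turn nullable-string / NA-containing object columns into plain string categoricals.
    NA becomes the literal string "NA"."""
    out = obs.copy()
    for col in out.columns:
        s = out[col]
        if isinstance(s.dtype, pd.StringDtype) or (s.dtype == object and s.isna().any()):
            out[col] = s.astype(object).fillna("NA").astype(str).astype("category")
    return out

# usage: adata.obs = stringify_nullable_columns(adata.obs)
```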

Niubile001 commented 1 year ago

> I have the same error as @avpatel18. Has any solution been released so far?

I have the same error as @avpatel18.

redst4r commented 1 year ago

Same issue as @avpatel18 when writing to HDF5:

TypeError: Can't implicitly convert non-string objects to strings

It happens if you calculate marker genes with sc.tl.rank_genes_groups() and filter them with sc.tl.filter_rank_genes_groups(): some elements in adata.uns['rank_genes_groups'] end up as NaN due to filtering, and h5py complains.

Any workaround for this? It looked like there was going to be a fix in anndata 0.9, but that was postponed...

avpatel18 commented 1 year ago

@redst4r exactly, it happens after sc.tl.filter_rank_genes_groups().

So far I am just deleting the .uns key before saving the anndata object.

Could someone please update #726 if there is a workaround, or once the fix lands. Thanks!

flying-sheep commented 1 year ago

OK, then you two are running into #679. @ivirshup we should prioritize that one, because a lot of people run into it, before we address bigger fish like more aggressive coercion and so on.

redst4r commented 1 year ago

Here's a minimal example of the bug I run into:

```python
import scanpy as sc
import anndata
print("scanpy", sc.__version__)  # 1.9.3
print("anndata", anndata.__version__)  # 0.8.0
adata = sc.datasets.pbmc3k_processed()

sc.tl.rank_genes_groups(adata, groupby='louvain')
sc.tl.filter_rank_genes_groups(adata, min_fold_change=1)  # commenting this line will make the whole thing run without error
adata.write_h5ad('/tmp/pbmc_bug.h5ad')
```

```
TypeError: Can't implicitly convert non-string objects to strings
Above error raised while writing key 'names' of <class 'h5py._hl.group.Group'> to /
```

As mentioned previously, filter_rank_genes_groups() sets some elements in adata.uns['rank_genes_groups'] to NaN, causing failure when writing to hdf5.
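
One possible stop-gap until that is fixed (untested sketch; it assumes the default key 'rank_genes_groups_filtered' and keeps only the gene names):

```python
import pandas as pd

# Keep the filtered DEG names as a plain DataFrame instead of the NaN-padded recarray,
# so the object can still be written to h5ad. "" marks filtered-out genes.
rgg_filtered = adata.uns.pop("rank_genes_groups_filtered")
adata.uns["rank_genes_groups_filtered_names"] = pd.DataFrame(rgg_filtered["names"]).fillna("")

adata.write_h5ad('/tmp/pbmc_bug.h5ad')
```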

kaarg2 commented 1 year ago

> Here's a minimal example of the bug I run into:
>
> ```python
> import scanpy as sc
> import anndata
> print("scanpy", sc.__version__)  # 1.9.3
> print("anndata", anndata.__version__)  # 0.8.0
> adata = sc.datasets.pbmc3k_processed()
>
> sc.tl.rank_genes_groups(adata, groupby='louvain')
> sc.tl.filter_rank_genes_groups(adata, min_fold_change=1)  # commenting this line will make the whole thing run without error
> adata.write_h5ad('/tmp/pbmc_bug.h5ad')
> ```
>
> ```
> TypeError: Can't implicitly convert non-string objects to strings
> Above error raised while writing key 'names' of <class 'h5py._hl.group.Group'> to /
> ```
>
> As mentioned previously, filter_rank_genes_groups() sets some elements in adata.uns['rank_genes_groups'] to NaN, causing failure when writing to hdf5.

Deleting 'rank_genes_groups' and 'rank_genes_groups_filtered' resolved the issue for me. Thank you!

```python
del adata.uns['rank_genes_groups_filtered']
del adata.uns['rank_genes_groups']
```

redst4r commented 1 year ago

@kaarg2: That "solves" the problem, but of course the DEGs won't be saved to disk then, which is very inconvenient.

@flying-sheep: As you mentioned, any anndata produced by scanpy and its functions should really be writable to disk; everything else is a bug. We should really fix that!

I'd take a stab at the problem myself, but the format in which sc.tl.rank_genes_groups() saves DEGs is a bit esoteric (numpy.recarrays) and I don't want to mess with it too much. I'm not sure if there's a reason for the choice of recarray over plain pandas.DataFrames (one per DE group), where filtering out rows would be a bit easier, but that's another issue.

In fact the recarray is a bit of a problem here: it forces each DE group to have the same number of genes present (as its shape is genes x groups). The filtering would leave a different number of DEGs per group, which can't be stored in the recarray. I guess that's why it gets filled up with NaN entries instead of just deleting those genes.

Looks like the best options at this point would be:

  1. change the storage format of DEGs from the (rather unnatural) np.recarray to pd.DataFrames. Lots of downstream adjustments needed, breaking changes... something for scanpy 2.0...
  2. or add an is_filtered column to adata.uns['rank_genes_groups_filtered'], marking the filtered genes rather than setting things to NaN (see the sketch below). We just have to make sure downstream functions respect that adata.uns['rank_genes_groups_filtered']['is_filtered'] flag.
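
For option 2, the stored structure could look roughly like this (purely hypothetical, not how scanpy stores it today; assumes the default .uns keys):

```python
import pandas as pd

# Keep the full, NaN-free ranking plus a boolean mask recording what was filtered.
names = pd.DataFrame(adata.uns["rank_genes_groups"]["names"])
is_filtered = pd.DataFrame(adata.uns["rank_genes_groups_filtered"]["names"]).isna()

adata.uns["rank_genes_groups_filtered"] = {
    "names": names,              # unfiltered gene names per group
    "is_filtered": is_filtered,  # True where filter_rank_genes_groups removed the gene
}
```
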
flying-sheep commented 1 year ago

> print("anndata", anndata.__version__)  # 0.8.0

that’s not the newest anndata version.

NaNs in string columns are supported. I can’t reproduce your example above with the main version.

Since we’re going to release 0.10 in a little bit, I won’t bother figuring out when this was fixed. It’s either fixed or will be later today.

vkartha commented 11 months ago

I'm still dealing with this issue. At the very least, is it possible to flag which obs column is the problematic one? Running obs.dtypes shows I have a mix of int/float32/float64/category fields in the pandas object (nothing seems special here), but it is really hard to know which column is causing the error, since it's such a generic message.

I know it's obs-related (and not uns) because I tried deleting the entire uns layer and it still throws this error; plus the only lead I have in the error points to an obs key:

```
     98 write_elem(f, "obs", adata.obs, dataset_kwargs=dataset_kwargs)
--> 617 write_elem(g, "categories", v.categories._values, dataset_kwargs=dataset_kwargs)

TypeError: Can't implicitly convert non-string objects to strings

Above error raised while writing key 'categories' of <class 'h5py._hl.group.Group'> to /
```

kaarg2 commented 11 months ago

There might be some entries in adata.obs, or the names of some obs columns, that contain an illegal character such as "/". You can try changing those characters in your adata.obs.

Alternatively, you can use pickle. An example can be found here.
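
A minimal version of the pickle route looks like this (this is not the example linked above, just the general idea; keep in mind pickle is not a stable long-term storage format):

```python
import pickle

# dump the whole AnnData object to a pickle file instead of writing h5ad
with open("adata.pkl", "wb") as handle:
    pickle.dump(adata, handle)

# and load it back later
with open("adata.pkl", "rb") as handle:
    adata = pickle.load(handle)
```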

flying-sheep commented 11 months ago

@vkartha you’re right, the way the written key is reported is not super helpful right now.

I filed https://github.com/scverse/anndata/issues/1272

Rafael-Silva-Oliveira commented 10 months ago

Also getting the same issue. The suggestions from @redst4r make sense and should be implemented.

flying-sheep commented 10 months ago

OK, this issue was closed since there are multiple specific sub-issues, and any further conversation here is just confusing.

To avoid further “me too”s, I’m going to link to the related issues a final time and then lock this thread.

Feel free to open new issues if you think something is not covered by the following: