Closed nspies-celsiustx closed 1 year ago
The check we use for converting to categorical is if there are repeated values in a string column. This wouldn't quite trigger in your case.
I think the issue here is you have a mixed type column in obs
, which isn't something we generally support IO for. I'm not sure getting back a column of ["1", "2", "3"]
(what h5py 2.0 would have done) is the correct behavior here – since that's not what you tried to write.
The check we use for converting to categorical is if there are repeated values in a string column. This wouldn't quite trigger in your case.
I think the issue here is you have a mixed type column in
obs
, which isn't something we generally support IO for. I'm not sure getting back a column of["1", "2", "3"]
(what h5py 2.0 would have done) is the correct behavior here – since that's not what you tried to write.
Yeah, I'm willing to accept that there isn't a single correct way of doing this, but at the end of the day, I'm getting burned pretty frequently by long-running code that fails in the middle because the file can't be written. I'd much prefer some sort of conversion to occur, especially if it's consistent with how the conversion was being performed previously. Some code we're running seems to return dtype: object series that have only numeric values (something like the result of pd.Series([1,2,3], dtype=object)
).
Would more aggressively converting to categorical, eg for mixed types, be another option?
I think a "proactively convert all contents to something we can write" method could be reasonable.
Falling back to pickling could also be useful for your long running tasks. That way we won't break anything by converting it wrong.
But also, do you know why your getting object valued columns?
I think a "proactively convert all contents to something we can write" method could be reasonable.
Yeah, I think this would be great.
Falling back to pickling could also be useful for your long running tasks. That way we won't break anything by converting it wrong.
Good suggestion!
But also, do you know why your getting object valued columns?
Obviously this is the short-term solution, but these issues keep on cropping up so ideal to implement the convert-all-contents-to-something-we-can-write method. If you give some pointers on general approach ie where to look, I'm happy to put together a PR.
ideal to implement the convert-all-contents-to-something-we-can-write method.
The more I think about this, the less I like it feature. If you write a file, but reading it gives you back something different I don't think the writing was very useful. For cases where there's something we don't support, I'd think interoperability gives way to just getting the data stored.
An alternative here: Maybe we can skip aggressive coercion, write all we can, and just pickle the elements we can't write?
But if the elements breaking your write methods are like: pd.Series([1,2,3], dtype=object)
, I'm happy to coerce those to integers.
I think this would basically look like walking the tree of the object, and calling something like pd.api.types.infer_dtype
on anything that looks like an object. This would be quite similar to pd.DataFrame.infer_objects
. For how to walk the object, it would be nice if there was some reuse of IO code, but that might be difficult at the moment.
@Zethson and @Imipenem may have some additional insight here, as they're working with quite heterogenous medical record data over at theislab/ehrapy.
The more I think about this, the less I like it feature. If you write a file, but reading it gives you back something different I don't think the writing was very useful. For cases where there's something we don't support, I'd think interoperability gives way to just getting the data stored.
An alternative here: Maybe we can skip aggressive coercion, write all we can, and just pickle the elements we can't write?
I'm not entirely against pickling, but it's not a great long-term storage solution as changes to eg pandas could make it impossible to deserialize. I would try as hard as possible to do faithful conversions and as a final fallback convert to pickle but with a warning message noting which slots needed that conversion - and allow reading back the h5ad even if some fields fail to unpickle.
Would more aggressive conversion to categorical work as well?
This idea has been floating around for a bit. But if the elements breaking your write methods are like:
pd.Series([1,2,3], dtype=object)
, I'm happy to coerce those to integers.I think this would basically look like walking the tree of the object, and calling something like
pd.api.types.infer_dtype
on anything that looks like an object. This would be quite similar topd.DataFrame.infer_objects
. For how to walk the object, it would be nice if there was some reuse of IO code, but that might be difficult at the moment.
Irrespective of other solutions, applying pd.DataFrame.infer_objects()
sounds like a great idea.
I'm not entirely against pickling, but it's not a great long-term storage solution
Yeah, I would say that you have to opt-in to the pickling behavior. And the documentation would say "please don't do this for anything other than ephemeral storage, and even then please don't". This could be a "permissive" writing mode.
and allow reading back the h5ad even if some fields fail to unpickle.
I have also been wanting to have a permissive reading mode. E.g. if you don't recognize the IO spec for a single element, you can skip it.
Would more aggressive conversion to categorical work as well?
I think this is the route ehrapy had been taking, but they have run into issues with it.pd.Series(["a", 2, 3]).astype("category")
would still fail here.
Irrespective of other solutions, applying pd.DataFrame.infer_objects() sounds like a great idea.
We could have an AnnData.infer_objects
method. I'm a bit cautious about using the pandas methods directly since I don't really trust the stability of some of their extension types.
i experience the same behaviour, "TypeError: Can't implicitly convert non-string objects to strings" when I'm attempting to write an anndata which has object
, string
, float
columns all containing some NA values ( i need the NA to signal missing samples for multiple columns )
attempting to convert to string or categorical doesn't solve the issue. No method has been defined for writing <class 'pandas.core.arrays.string_.StringArray'> elements to <class 'h5py._hl.group.Group'>
what's the suggested workaround?
thanks!
scanpy: 1.9.1
anndata: 0.8.0
I get the same error after calculating Marker genes (wilcoxon method) and trying to save anndata object again Is there any solution to this?
_TypeError: Can't implicitly convert non-string objects to strings Above error raised while writing key 'categories' of <class 'h5py._hl.group.Group'> to /_
I have the same error as @Avaptel18 Has any solution been release so far?
The issue reported by @bio-la is tracked in #679 and now is on tracked for inclusion in 0.9.0.
The specific error reported by @avpatel18 could happen for a large number of reasons, some of which are not things we can handle. It may be resolved for some cases which can now be resolved to strings + null values.
I have the same error as @Avaptel18 Has any solution been release so far?
I have the same error as @Avaptel18
Same issue as @Avaptel18 on writing to hdf5:
TypeError: Can't implicitly convert non-string objects to strings
It happens if you calculate marker genes with .rank_genes_groups()
and filter them (sc.tl.filter_rank_genes_groups
): Some elements in adata.uns['rank_genes_groups']
end up as nan
due to filtering and h5py complains.
Any workaround for this? Looked like there's going to be a fix in anndata 0.9., but that was postponed...
@redst4r exactly, it happens after (sc.tl.filter_rank_genes_groups).
so far I am just deleting the .uns key layer before saving the anndata object.
Could someone please update #726 if there is any workaround or once the update comes. Thanks!
OK, then you two run into #679. @ivirshup we should prioritize that one because a lot of people run into it, before we address bigger fish like more aggressive coercion and so on.
Here's a minimal example of the bug I run into:
import scanpy as sc
import anndata
print("scanpy", sc.__version__) # 1.9.3
print("anndata", anndata.__version__) # 0.8.0
adata = sc.datasets.pbmc3k_processed()
sc.tl.rank_genes_groups(adata, groupby='louvain')
sc.tl.filter_rank_genes_groups(adata, min_fold_change=1) # commenting this line will make the whole thing run without error
adata.write_h5ad('/tmp/pbmc_bug.h5ad')
TypeError: Can't implicitly convert non-string objects to strings
Above error raised while writing key 'names' of <class 'h5py._hl.group.Group'> to /
As mentioned previously, filter_rank_genes_groups()
sets some elements in adata.uns['rank_genes_groups']
to NaN
, causing failure when writing to hdf5.
Here's a minimal example of the bug I run into:
import scanpy as sc import anndata print("scanpy", sc.__version__) # 1.9.3 print("anndata", anndata.__version__) # 0.8.0 adata = sc.datasets.pbmc3k_processed() sc.tl.rank_genes_groups(adata, groupby='louvain') sc.tl.filter_rank_genes_groups(adata, min_fold_change=1) # commenting this line will make the whole thing run without error adata.write_h5ad('/tmp/pbmc_bug.h5ad')
TypeError: Can't implicitly convert non-string objects to strings Above error raised while writing key 'names' of <class 'h5py._hl.group.Group'> to /
As mentioned previously,
filter_rank_genes_groups()
sets some elements inadata.uns['rank_genes_groups']
toNaN
, causing failure when writing to hdf5.
Deleting 'rank_genes_groups' and 'rank_genes_groups_filtered' resolved the issue for me. Thank you!
del adata.uns['rank_genes_groups_filtered'] del adata.uns['rank_genes_groups']
@kaarg2: That "solves" the problem, but of course the DEG won't be saved to disk then, which is very inconvenient.
@flying-sheep: As you mentioned, any anndata produced by scanpy and its functions should really be writable to disk, everything else is a bug. We should really fix that!
I'd take a stab at the problem myself, but the format that sc.tl.rank_genes_groups()
saves DEGs is a bit esoteric (numpy.recarray
s) and I dont want to mess with it too much. Not sure if there's a reason for the choice of recarray over just pandas.DataFrame
s (one per DE-group) (where filtering out rows would be a bit easier), but that's another issue.
In fact the recarray
is a bit of a problem here, it forces each DE-group to have the same amount of genes present (as it's shape is genes x groups
). The filtering would cause different amount of DEGs per group which cant be stored in the recarray now. I guess that's why it gets filled up with NaN entries instead of just deleting those genes.
Looks like the best options at this point would be:
np.recarray
to pd.DataFrames
. Lots of downstream adjustments needed, breaking changes... Something for scanpy 2.0 ...is_filtered
to the adata.uns['rank_genes_groups_filtered']
, marking the filtered genes (rather than setting things to NaN). We just have to make sure downstream functions respect that adata.uns['rank_genes_groups_filtered']['is_filtered']
flag.print("anndata", anndata.__version__) # 0.8.0
that’s not the newest anndata version.
nan
s in string columns are supported. I can’t reproduce your example above with the main version.
Since we’re going to release 0.10 in a little bit, I won’t bother figuring out when this was fixed. It’s either fixed or will be later today.
I'm still dealing with this issue, at the very least is it possible to flag what the problematic obs column is? Running obs.dtypes() shows I have a mix of int/float32/float64/category fields in the pandas object (nothing seems special here), but it is really hard to know which column is the one causing the error since it's such a generic message.
I know it's obs related (and not uns) cause I tried to delete the entire uns layer and it still throw this error, plus the only lead I have in the error points to an obs key error:
98 write_elem(f, "obs", adata.obs, dataset_kwargs=dataset_kwargs)
--> 617 write_elem(g, "categories", v.categories._values, dataset_kwargs=dataset_kwargs)
TypeError: Can't implicitly convert non-string objects to strings
Above error raised while writing key 'categories' of <class 'h5py._hl.group.Group'> to /
There might be some entries in adata.obs or the title of some obs column contain illegal character such as "/". You can try changing those characters in your adata.obs.
Alternatively, you can use pickle. An example can be found here.
@vkartha you’re right, the way the written key is reported is not super helpful right now.
Also getting the same issue. Suggestions from @redst4r make sense and should be implemented
OK, this issue was closed since there are multiple specific sub-issues, and any further conversation here is just confusing.
To avoid further “me too”s, I’m going to link to the related issues a final time and then lock this thread.
Feel free to open new issues if you think something is not covered by the following:
better write errors is https://github.com/scverse/anndata/issues/1272 and will go into the next patch release today or tomorrow
So you’ll see at least the following starting from tomorrow:
Error raised while writing key 'names' of <class 'h5py._hl.group.Group'> to /uns/rank_genes_groups_filtered`
I'm reopening https://github.com/theislab/anndata/issues/558 as I'm still having this problem with latest anndata and h5py. It seems that anndata needs to be a bit more defensive about dtypes when writing to h5ad using h5py>=3. I know that most non-numeric columns get converted to categoricals in most situations, but I'm hitting these edge cases with some frequency even with anndata objects that should be large enough to produce categoricals. I suppose some sort of the following logic would solve the issue:
We're mostly running anndata<0.8 so maybe this is fixed there, but I'm also hitting this issue when writing the
.uns
slot, would be great to apply the same clean-up/failsafe logic there.Example with long traceback
```pytb In [1]: import anndata ...: import pandas as pd ...: ...: adata = anndata.AnnData(obs=pd.DataFrame({"a":["1",2,3]})) ...: adata.write_h5ad("temp.h5ad") /opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_core/anndata.py:121: ImplicitModificationWarning: Transforming to str index. warnings.warn("Transforming to str index.", ImplicitModificationWarning) --------------------------------------------------------------------------- TypeError Traceback (most recent call last) /opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs) 213 try: --> 214 return func(elem, key, val, *args, **kwargs) 215 except Exception as e: /opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_io/specs/registry.py in write_elem(f, k, elem, modifiers, *args, **kwargs) 171 _REGISTRY.get_writer(dest_type, (t, elem.dtype.kind), modifiers)( --> 172 f, k, elem, *args, **kwargs 173 ) /opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_io/specs/registry.py in wrapper(g, k, *args, **kwargs) 23 def wrapper(g, k, *args, **kwargs): ---> 24 result = func(g, k, *args, **kwargs) 25 g[k].attrs.setdefault("encoding-type", spec.encoding_type) /opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_io/specs/methods.py in write_vlen_string_array(f, k, elem, dataset_kwargs) 343 str_dtype = h5py.special_dtype(vlen=str) --> 344 f.create_dataset(k, data=elem.astype(str_dtype), dtype=str_dtype, **dataset_kwargs) 345 /opt/conda/envs/test_adata/lib/python3.7/site-packages/h5py/_hl/group.py in create_dataset(self, name, shape, dtype, data, **kwds) 148 --> 149 dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds) 150 dset = dataset.Dataset(dsid) /opt/conda/envs/test_adata/lib/python3.7/site-packages/h5py/_hl/dataset.py in make_new_dset(parent, shape, dtype, data, name, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times, external, track_order, dcpl, allow_unknown_filter) 144 if (data is not None) and (not isinstance(data, Empty)): --> 145 dset_id.write(h5s.ALL, h5s.ALL, data) 146 h5py/_objects.pyx in h5py._objects.with_phil.wrapper() h5py/_objects.pyx in h5py._objects.with_phil.wrapper() h5py/h5d.pyx in h5py.h5d.DatasetID.write() h5py/_proxy.pyx in h5py._proxy.dset_rw() h5py/_conv.pyx in h5py._conv.str2vlen() h5py/_conv.pyx in h5py._conv.generic_converter() h5py/_conv.pyx in h5py._conv.conv_str2vlen() TypeError: Can't implicitly convert non-string objects to strings The above exception was the direct cause of the following exception: TypeError Traceback (most recent call last)and