scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Unclear error message from `write_h5ad` for non-string object arrays #567

Open Munfred opened 3 years ago

Munfred commented 3 years ago

Hello, I hit a confusing error when saving my anndata object with a layer of ints; the full error output is copied below. It turns out the error is triggered by a layer taken directly from a DataFrame's values, and explicitly converting it to int solves the problem for some reason.

I copied the relevant sections of the notebook I was working on to reproduce the error. The notebook is here:

https://colab.research.google.com/gist/Munfred/2d01a63332c09b4f4c6f649305cc4aeb/weird-anndata-error-when-saving-with-layer-of-dataframe-values.ipynb

The specific line that causes the problem is:

gene_histogram_adata.layers[gene_id] = gene_histogram_df.values

And just converting it to int solves the problem:

gene_histogram_adata.layers[gene_id] = gene_histogram_df.values.astype(int)

I have no idea why this is happening, and I couldn't reproduce it without actually running the code the way I wrote it, so pardon the verbose notebook.

The full error is copied below. It is quite cryptic.

/usr/local/lib/python3.7/dist-packages/anndata/_core/anndata.py:120: ImplicitModificationWarning: Transforming to str index.
  warnings.warn("Transforming to str index.", ImplicitModificationWarning)
  0%|          | 0/1094 [00:00<?, ?it/s]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    208         try:
--> 209             return func(elem, key, val, *args, **kwargs)
    210         except Exception as e:

11 frames
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5d.pyx in h5py.h5d.DatasetID.write()

h5py/_proxy.pyx in h5py._proxy.dset_rw()

h5py/_conv.pyx in h5py._conv.str2vlen()

h5py/_conv.pyx in h5py._conv.generic_converter()

h5py/_conv.pyx in h5py._conv.conv_str2vlen()

TypeError: Can't implicitly convert non-string objects to strings

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    214                 f"Above error raised while writing key {key!r} of {type(elem)}"
    215                 f" from {parent}."
--> 216             ) from e
    217 
    218     return func_wrapper

TypeError: Can't implicitly convert non-string objects to strings

Above error raised while writing key 'layers/WBGene00010957' of <class 'h5py._hl.files.File'> from /.

ivirshup commented 3 years ago

What's causing the error is that the dataframe you're using only has object dtypes in its columns. I believe this is because pandas stores dtypes per column, and you're initializing a dataframe with columns but no values, so it defaults to the most permissive type.
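A minimal sketch of that behaviour, in case it helps to see it in isolation. The labels below are made up for illustration and are not the ones from the notebook; the point is only that a frame created with labels but no values keeps object-dtype columns even after it is filled with integers.

```python
import pandas as pd

# Hypothetical labels; the notebook's real index and columns differ.
cell_types = ["neuron", "muscle"]
bins = ["bin_0", "bin_1", "bin_2"]

# A frame created with labels but no values holds NaN in object-dtype columns.
df = pd.DataFrame(index=cell_types, columns=bins)
empty_dtypes = set(df.dtypes.astype(str))

# Filling it cell by cell with integers does not change the column dtype.
for i, row in enumerate(cell_types):
    for j, col in enumerate(bins):
        df.at[row, col] = i + j
filled_dtype = df.values.dtype  # still object, although every value is an int

# Two ways to get a numeric array back before assigning it to a layer.
as_int = df.values.astype(int)          # explicit cast, as in the workaround
inferred = df.infer_objects().values    # let pandas re-infer the column dtypes
```

Either `as_int` or `inferred` should be a plain integer array, which is the kind of value the writer can handle.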

You can get around this entirely by calculating the histogram in one pass then adding labels to the dataframe. Something like:

labels = adata.obs["cell_type"]
hist, _, _ = np.histogram2d(
    x=labels.cat.codes,  # Integer coded cell_types
    y=log10_normalized_expression_in_gene,
    bins=(len(labels.cat.categories), 100),
    range=[
        (0, len(labels.cat.categories)),
        (-10, 0)
    ],
)
gene_histogram_df = pd.DataFrame(hist, index=labels.cat.categories, columns=bin_intervals)
gene_histogram_adata.layers[gene_id] = gene_histogram_df

I do think we should at least give a better error on our end, but I'm not sure what we should do exactly. Right now, we're assuming an object array is strings, which we pass on to hdf5, which makes a similar assumption and errors. We could try harder to infer the dtype, but that could be error-prone. The minimum we could do is confirm that strings are strings with pd.api.types.infer_dtype, and error for any object dtypes that aren't strings. If we wanted to infer dtypes, then we would probably also use this method, but pandas dtypes aren't quite numpy or hdf5 dtypes.
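A rough sketch of what that minimum check might look like. The function name `check_object_array_is_strings` and the message wording are hypothetical and this is not anndata's actual implementation; it only assumes `pd.api.types.infer_dtype`, which reports labels such as "string" or "integer" for object arrays.

```python
import numpy as np
import pandas as pd


def check_object_array_is_strings(arr, key):
    """Hypothetical pre-write check, not anndata's actual implementation."""
    if arr.dtype != object:
        return  # numeric and other native dtypes are left to the writer
    # Flatten first so the inference sees every element of a 2D layer.
    inferred = pd.api.types.infer_dtype(arr.ravel(), skipna=False)
    if inferred not in ("string", "empty"):
        raise TypeError(
            f"Cannot write key {key!r}: object array holds {inferred!r} values, "
            "but only strings can be written from object dtype. "
            "Convert it first, for example with `arr.astype(int)`."
        )


# Strings in an object array pass, as do arrays with a native numeric dtype.
check_object_array_is_strings(np.array([["a", "b"]], dtype=object), "layers/x")
check_object_array_is_strings(np.array([[1.0, 2.0]]), "layers/x")

# Integers stored in an object array are rejected with a conversion hint.
try:
    check_object_array_is_strings(
        np.array([[1, 2], [3, 4]], dtype=object), "layers/g"
    )
    raised, message = False, ""
except TypeError as err:
    raised, message = True, str(err)
```

Erroring early like this keeps the failure on the anndata side, where the message can name the key and the inferred type, instead of surfacing hdf5's string-conversion error.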

Munfred commented 3 years ago

I see, thanks for the follow-up. Perhaps just including in the error message a suggestion on how to convert the datatype manually would be sufficient. I don't think it's a big deal that users need to manually convert to acceptable dtypes; this just wasn't clear to me from the message.