Open Munfred opened 3 years ago
What's causing the error is that the dataframe you're using only has object dtypes in its columns. I believe this is because pandas stores dtypes per column, and you're initializing a dataframe with columns but no values, so it defaults to the most permissive type.
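A minimal sketch of that behaviour, using a small made-up frame (the row and column names here are illustrative, not taken from the notebook). It shows that a frame created with labels but no values starts out with object columns, that filling it cell by cell with ints leaves the dtype unchanged, and that an explicit cast is what produces a numeric array:

```python
import pandas as pd

# A frame created with row/column labels but no values:
# every column defaults to object dtype (the cells hold NaN).
df = pd.DataFrame(index=["a", "b"], columns=["bin_0", "bin_1"])
dtypes_before = df.dtypes.astype(str).tolist()

# Filling the cells one by one with ints does not change the column dtype,
# because an object column can already hold Python ints.
for row in df.index:
    for col in df.columns:
        df.loc[row, col] = 1
values_dtype_filled = df.to_numpy().dtype

# An explicit cast is what turns the values into a numeric array.
values_dtype_cast = df.astype(int).to_numpy().dtype

print(dtypes_before, values_dtype_filled, values_dtype_cast)
```

The values look like ints when printed, which is probably why the object dtype is easy to miss until something downstream tries to write the array.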
You can get around this entirely by calculating the histogram in one pass then adding labels to the dataframe. Something like:
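import numpy as np
import pandas as pd

labels = adata.obs["cell_type"]  # needs to be a categorical column for .cat to work
hist, _, _ = np.histogram2d(
    x=labels.cat.codes,  # Integer-coded cell types
    y=log10_normalized_expression_in_gene,
    bins=(len(labels.cat.categories), 100),  # one bin per cell type, 100 expression bins
    range=[
        (0, len(labels.cat.categories)),
        (-10, 0),
    ],
)
# bin_intervals: labels for the 100 expression bins, assumed to be defined earlier in your notebook
gene_histogram_df = pd.DataFrame(hist, index=labels.cat.categories, columns=bin_intervals)
gene_histogram_adata.layers[gene_id] = gene_histogram_df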
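For anyone who wants to try the one-pass approach without the notebook, here is a self-contained sketch on synthetic data. The cell type names, the sample size, and the use of the histogram's own bin edges as column labels are assumptions made for illustration; they are not taken from the original notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for adata.obs["cell_type"] and the per-gene expression values.
labels = pd.Series(pd.Categorical(rng.choice(["B cell", "NK cell", "T cell"], size=500)))
log10_expr = rng.uniform(-10, 0, size=500)

n_types = len(labels.cat.categories)
hist, _, expr_edges = np.histogram2d(
    labels.cat.codes.to_numpy(),  # integer-coded cell types, 0 .. n_types - 1
    log10_expr,
    bins=(n_types, 100),
    range=[(0, n_types), (-10, 0)],
)

# Label rows by cell type and columns by the expression intervals the bins cover.
hist_df = pd.DataFrame(
    hist,
    index=labels.cat.categories,
    columns=pd.IntervalIndex.from_breaks(expr_edges),
)
print(hist_df.shape, hist_df.to_numpy().dtype)
```

Because the frame is built directly from the array returned by np.histogram2d, its values are float64 from the start and never pass through an object column.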
I do think we should at least give a better error on our end, but I'm not sure exactly what we should do. Right now, we assume an object array holds strings and pass it on to hdf5, which makes a similar assumption and errors. We could try harder to infer the dtype, but that could be error prone. The minimum we could do is confirm that the strings really are strings with pd.api.types.infer_dtype, and error for any object dtypes that aren't strings. If we wanted to infer dtypes, we would probably also use this method, but pandas dtypes aren't quite numpy or hdf5 dtypes.
I see, thanks for the follow-up. Perhaps just including in the error message a suggestion for how to convert the datatype manually would be sufficient. I don't think it's a big deal that users need to manually convert to acceptable dtypes; this just wasn't clear to me from the message.
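A rough sketch of what such a check could look like, combining the pd.api.types.infer_dtype idea with an error message that suggests a manual conversion. The helper name check_object_array_is_strings and the message wording are hypothetical; this is not anndata's actual implementation, only an illustration of the proposal.

```python
import numpy as np
import pandas as pd


def check_object_array_is_strings(arr, name="array"):
    """Hypothetical pre-write check: raise if an object array does not hold strings."""
    inferred = pd.api.types.infer_dtype(arr, skipna=True)
    if inferred != "string":
        raise TypeError(
            f"Cannot write {name}: object-dtype array holds '{inferred}' values, "
            "not strings. Convert it to a concrete dtype first, "
            "e.g. with arr.astype(int)."
        )


# An object array of strings passes silently.
check_object_array_is_strings(np.array(["a", "b"], dtype=object))

# An object array of ints is rejected with an actionable message.
try:
    check_object_array_is_strings(np.array([1, 2], dtype=object), name="layer")
    error_message = None
except TypeError as err:
    error_message = str(err)

print(error_message)
```

A real check would need to decide how to treat edge cases such as empty arrays or arrays with missing values, which infer_dtype reports under separate labels.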
Hello, I found a weird error when saving my anndata with a layer of ints; the full error output is copied below. It turns out that the error is caused by that layer of ints, and explicitly converting it to int solves the problem for some reason.
I copied the relevant sections of the notebook I was working on to reproduce the error, the notebook is here:
https://colab.research.google.com/gist/Munfred/2d01a63332c09b4f4c6f649305cc4aeb/weird-anndata-error-when-saving-with-layer-of-dataframe-values.ipynb
The specific line that causes the problem is
And just converting it to int solves the problem:
I have no idea why this is happening, and I couldn't reproduce it without actually running the code the way I wrote it, so pardon me for the verbose notebook.
The full error is copied below. It is quite cryptic.