scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License
535 stars 150 forks source link

`/` in column names makes AnnData Zarr object unreadable on windows #1447

Open LucaMarconato opened 3 months ago

LucaMarconato commented 3 months ago

Please make sure these conditions are met

Report

Reported by a spatialdata Windows user and reproduced on a Windows machine https://github.com/scverse/spatialdata-io/issues/129. I can't reproduce on my macOS machine or on a Linux machine.

When / is part of a name of a var column, the data is written to disk in a subfolder (also in macOS, see screenshot) and can still be read correctly. In Windows the column can't be read, probably because of the difference between / and \ for paths.

In spatialdata, we are considering checking all the element names and their respective element columns (e.g. GeoDataFrame column names, AnnData obs/var/... column names, etc) and allowing only strings with alphanumeric or the '-_.` symbols. The check would be performed when instantiating an object and in particular before writing, prompting the user for a name change.

What are you opinion on this, in particular on restricting the names?

Please see the code and traceback in the attached SpatialData issue, as I can't reproduce on my machine: https://github.com/scverse/spatialdata-io/issues/129.

Versions

-----
anndata             0.10.4
session_info        1.0.0
-----
asciitree           NA
cloudpickle         3.0.0
cython_runtime      NA
cytoolz             0.12.2
dask                2023.12.1
dateutil            2.8.2
dill                0.3.7
exceptiongroup      1.2.0
fasteners           0.17.3
gmpy2               2.1.2
h5py                3.9.0
importlib_metadata  NA
jinja2              3.1.2
markupsafe          2.1.3
mpmath              1.3.0
msgpack             1.0.7
natsort             8.4.0
numcodecs           0.12.1
numpy               1.26.3
packaging           23.2
pandas              2.1.4
psutil              5.9.7
pyarrow             14.0.1
pytz                2023.3.post1
scipy               1.11.4
six                 1.16.0
sphinxcontrib       NA
sympy               1.12
tblib               3.0.0
tlz                 0.12.2
toolz               0.12.0
torch               2.1.2
torchgen            NA
tqdm                4.66.1
typing_extensions   NA
xxhash              NA
yaml                6.0.1
zarr                2.16.1
zipp                NA
zoneinfo            NA
-----
Python 3.10.13 | packaged by conda-forge | (main, Oct 26 2023, 18:09:17) [Clang 16.0.6 ]
macOS-13.4.1-arm64-arm-64bit
-----
Session information updated at 2024-03-27 16:54
ivirshup commented 3 months ago

@melonora, can you write a column with a / in it?


When / is part of a name of a var column, the data is written to disk in a subfolder (also in macOS, see screenshot) and can still be read correctly.

Basically this just happens to work with how we do columns. The specific group creation behavior could very well be considered a bug.

In spatialdata, we are considering checking all the element names and their respective element columns (e.g. GeoDataFrame column names, AnnData obs/var/... column names, etc) and allowing only strings with alphanumeric or the '-_.` symbols.

I would argue against this. It hurts accessibility for non-English users. Examples I've seen include Chinese characters used in columns for medical data. I also think it's not terribly uncommon for english speakers to want to use greek letters in names for things ( αβ/ γδ T cells, for instance).


I recall topics like this being discussed at length in some zarr community calls/ channels. I'd recommend checking over there for any recommendations/ solutions.

melonora commented 3 months ago

I did, the zarr store with current version of spatialdata and anndata version 0.10.3 creates an _index zarr array in the var zgroup. This is different from what we observe with the steinbock data. In the case when manually adjusting var names to include a / in the name did not lead to any problems.

melonora commented 3 months ago

@LucaMarconato is the uploaded steinbock date perhaps outdated?

LucaMarconato commented 3 months ago

I would argue against this. It hurts accessibility for non-English users. Examples I've seen include Chinese characters used in columns for medical data. I also think it's not terribly uncommon for english speakers to want to use greek letters in names for things ( αβ/ γδ T cells, for instance).

I agree with you, I would be up for allowing all the characters for this reason. The only reason why I wanted to restrict the name is that I would like to minimize the risk of weird behaviors if the name of the element is interpreted as a path. Things like /, ../, C:\\path\ etc. Maybe a solution is to disallow certain characters, like the ones that the OS prevents you to use in filenames.

LucaMarconato commented 3 months ago

@melonora, I have just verified. The Steinbock example is up-to-date and it's using the latest anndata 0.10.6.

Did I understand correctly: you can reproduce the bug only when the anndata io is handled by spatialdata, but not when using anndata directly? In that case it could be, maytbe, that somewhere we pass a Zarr group instead of a path, and maybe this causes the Zarr library to create the subpath when creating the Zarr group.

melonora commented 3 months ago

This is the anndata io by spatialdata. With the merfish example when I adjust the var names I get a var zarr group with this: afbeelding Reading it back in and checking table.var in the sdata object gives me: afbeelding

So in short when adjusting the table in the merfish example I am able to write to zarr and then read back in.

LucaMarconato commented 3 months ago

Thanks for clarifying. But really not sure what's going on here then.

melonora commented 3 months ago

Yeah the behaviour is really different. The encoding_type of var in this case is a dataframe. This is specified in .zattrs in var.

ivirshup commented 3 months ago

I'm a little confused here. I thought the issue was with column names, not row names?

@melonora, if you do:

from anndata.experimental import read_elem, write_elem

import pandas as pd
import zarr

g = zarr.open("test_df.zarr", "w+")
df = pd.DataFrame({"col / with/ slashes": [1,2,3]})
write_elem(g, "df", df)
from_disk = read_elem(g["df"])

from_disk

what do you get?

melonora commented 3 months ago

This gives an error because of the / being interpreted as a path separator. The thing that I find weird is that the issue arises with the Steinbock dataset where we have this for table.var. When I try to replicate this issue by simply adjusting one of the indices in var to include / I don't seem to have the problem while for the Steinbock dataset I do.

melonora commented 3 months ago

To be specific this raises a FileNotFoundError as a result of a path with forward slashes not being found: afbeelding

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. Please add a comment if you want to keep the issue open. Thank you for your contributions!