Open LucaMarconato opened 3 months ago
@melonora, can you write a column with a `/` in it?
When / is part of a name of a var column, the data is written to disk in a subfolder (also in macOS, see screenshot) and can still be read correctly.
Basically this just happens to work with how we do columns. The specific group creation behavior could very well be considered a bug.
In `spatialdata`, we are considering checking all the element names and their respective element columns (e.g. `GeoDataFrame` column names, `AnnData` obs/var/... column names, etc) and allowing only strings with alphanumeric or the `-_.` symbols.
I would argue against this. It hurts accessibility for non-English users. Examples I've seen include Chinese characters used in columns for medical data. I also think it's not terribly uncommon for English speakers to want to use Greek letters in names for things (αβ/γδ T cells, for instance).
I recall topics like this being discussed at length in some zarr community calls/channels. I'd recommend checking over there for any recommendations/solutions.
I did. The zarr store written with the current version of `spatialdata` and `anndata` version 0.10.3 contains an `_index` zarr array in the `var` zgroup. This is different from what we observe with the steinbock data. In my case, manually adjusting var names to include a `/` did not lead to any problems.
@LucaMarconato is the uploaded steinbock data perhaps outdated?
I would argue against this. It hurts accessibility for non-English users. Examples I've seen include Chinese characters used in columns for medical data. I also think it's not terribly uncommon for English speakers to want to use Greek letters in names for things (αβ/γδ T cells, for instance).
I agree with you, and for this reason I would be up for allowing all characters. The only reason I wanted to restrict the names is to minimize the risk of weird behavior if the name of an element is interpreted as a path: things like `/`, `../`, `C:\\path\`, etc. Maybe a solution is to disallow only certain characters, like the ones that the OS prevents you from using in filenames.
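To make the "disallow only certain characters" idea concrete, here is a minimal sketch of a denylist check. The function name `is_safe_name` and the exact denylist are assumptions for illustration and are not part of `spatialdata` or `anndata`. The denylist uses the characters Windows forbids in file names plus ASCII control characters, and additionally rejects the special names `.` and `..`. Unicode letters such as Chinese characters or Greek letters pass.

```python
import re

# Hypothetical denylist (an assumption, not an existing spatialdata rule):
# characters Windows forbids in file names, plus ASCII control characters.
_FORBIDDEN = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def is_safe_name(name: str) -> bool:
    """Return True if `name` contains none of the denylisted characters."""
    if name in {"", ".", ".."}:
        # Empty names and relative-path components are rejected outright.
        return False
    return _FORBIDDEN.search(name) is None


if __name__ == "__main__":
    for candidate in ["γδ T cells", "基因", "αβ/γδ", "..", "C:\\path\\"]:
        print(repr(candidate), is_safe_name(candidate))
```

This sketch does not cover every Windows restriction (for example reserved device names such as `CON`, or names ending in a dot or space), so it should be read as a starting point rather than a complete rule.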
@melonora, I have just verified. The Steinbock example is up-to-date and it's using the latest `anndata` 0.10.6.
Did I understand correctly: you can reproduce the bug only when the `anndata` io is handled by `spatialdata`, but not when using `anndata` directly? In that case it could be that somewhere we pass a Zarr group instead of a path, and maybe this causes the Zarr library to create the subpath when creating the Zarr group.
This is the anndata io by spatialdata. With the merfish example when I adjust the var names I get a var zarr group with this:
Reading it back in and checking `table.var` in the sdata object gives me:
So in short when adjusting the table in the merfish example I am able to write to zarr and then read back in.
Thanks for clarifying. But really not sure what's going on here then.
Yeah the behaviour is really different. The encoding_type of var in this case is a dataframe. This is specified in .zattrs in var.
I'm a little confused here. I thought the issue was with column names, not row names?
@melonora, if you do:

```python
from anndata.experimental import read_elem, write_elem
import pandas as pd
import zarr

g = zarr.open("test_df.zarr", "w+")
# A DataFrame whose only column name contains forward slashes
df = pd.DataFrame({"col / with/ slashes": [1, 2, 3]})
write_elem(g, "df", df)
# Read the element back from the same store
from_disk = read_elem(g["df"])
from_disk
```

what do you get?
This gives an error because the `/` is interpreted as a path separator. What I find weird is that the issue arises with the Steinbock dataset, where we have this for `table.var`. When I try to replicate the issue by simply adjusting one of the indices in `var` to include `/`, I don't have the problem, while for the Steinbock dataset I do.
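As a small standard-library illustration of the path separator point, the sketch below shows how a name containing `/` splits into several path components once it is joined onto a parent path. It uses only `pathlib` and does not reproduce what zarr or `anndata` do internally, which is an assumption left unverified here. The parent name `var` and the column name are taken from the example in this thread.

```python
from pathlib import PurePosixPath

# Column name from the example above; "var" stands in for the parent group.
column_name = "col / with/ slashes"

# Joining the name onto a parent path treats each "/" as a separator,
# so one logical name becomes several nested components.
joined = PurePosixPath("var") / column_name
parts = joined.parts

print(parts)
```

Each component after `var` would correspond to a nested folder in a directory-backed store, which matches the subfolders seen on disk.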
To be specific, this raises a `FileNotFoundError` as a result of a path with forward slashes not being found:
Report
Reported by a `spatialdata` Windows user and reproduced on a Windows machine https://github.com/scverse/spatialdata-io/issues/129. I can't reproduce on my macOS machine or on a Linux machine.

When `/` is part of the name of a `var` column, the data is written to disk in a subfolder (also in macOS, see screenshot) and can still be read correctly. In Windows the column can't be read, probably because of the difference between `/` and `\` for paths.

In `spatialdata`, we are considering checking all the element names and their respective element columns (e.g. `GeoDataFrame` column names, `AnnData` obs/var/... column names, etc) and allowing only strings with alphanumeric or the `-_.` symbols. The check would be performed when instantiating an object and in particular before writing, prompting the user for a name change.

What is your opinion on this, in particular on restricting the names?
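For reference, the strict check described above could look roughly like the sketch below. The function name `is_allowed_name` is hypothetical, and "alphanumeric" is assumed here to mean ASCII letters and digits only. Under that assumption, names containing Greek letters, spaces, or `/` are rejected.

```python
import re

# Assumed reading of the proposal: ASCII letters, digits, and the symbols - _ .
_ALLOWED = re.compile(r"[A-Za-z0-9_.\-]+")


def is_allowed_name(name: str) -> bool:
    """Return True if `name` consists only of the allowed characters."""
    return _ALLOWED.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["CD3_total.v2", "gamma-delta", "γδ", "a/b", "T cells"]:
        print(repr(candidate), is_allowed_name(candidate))
```

If non-ASCII letters should be accepted, the character class would need to change (for example by using `\w`, which matches Unicode word characters in Python 3 string patterns).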
Please see the code and traceback in the attached SpatialData issue, as I can't reproduce on my machine: https://github.com/scverse/spatialdata-io/issues/129.
Versions