Open aeisenbarth opened 2 months ago
First, links to some related topics:
_check_valid_name()
The reason why we removed the support for "non a-zA-Z0-9_- characters" is because we wanted to avoid altogether edge cases that are OS dependent or due to the libraries we are using for IO.
In particular we removed the reason why we removed /
and \
is to avoid potential security risks (we rely on third-party libraries for IO, and I expect a name starting with /
not to lead to security risks, but I wanted to be extra safe and avoid the case in which a bug in such libraries uncovers an unexpected security risk).
.
Instead, the reason why I removed .
is because I saw it (rarely) used as a separator for Zarr groups in some datasets. Not a security risk, but I things may break in such scenario. We could consider re-enabling it since usually Zarr uses /
.
I agree, neither _
nor -
suggest "separation". I'd opt for :
or |
but IIRC they are forbidden in path for Windows. Maybe we could do some research on forbidden paths and see if allowing ,
?
Here are my current thoughts about this issue. To solve it, I can investigate the third point, and then we can decide whether there is something to do/change in the code. Maybe @ivirshup's experience can be valuable?
[x] Decision between allow list (a-zA-Z0-9_-
) or deny list (allowing all Unicode except characters known to be forbidden by various operating systems).
_
(and causes name collisions).�
, ö
). Not SpatialData's fault, but could lead to support requests.[x] Decision whether to extend the current allow list with further characters (candidates to be collected and discusses).
.
is very common (also in double file extensions), so that the approach of replacing it by a less conventional character like ,
seems weird. On the other side, your argument of .
as Zarr dimension separator (alternative to /
) is a very important point.[x] Investigate whether Zarr dimension separators used unexpectedly in file names cause problems for Zarr. Does Zarr parse file names to extract coordinates, or lookup files by name? Or are the separators only used for ensuring uniqueness of names?
/
dimension separator, not nested) or any possible Zarr. In its current state, the user constructs SpatialData through the API and all data is written into the SpatialData datastructure, using its defaults. I think a problem could only occur if users can use existing Zarr stores as-is without writing them again.Thanks @aeisenbarth for summing up the status quo. My vote would be to go for an allow list (eventually extending the current one), I'd ask for the feedback to others and see what is the consensus. In fact, I could imagine also going for a deny list, by combining together the forbid lists for Zarr, Parquet and for filenames in Windows, macOS and UNIX in general.
Regarding the last point, in my experience I have found some bugs sometimes (example 1, example 2) around the dimension separator, so I would be for disallowing .
(on top of /
).
Kindly asking some other folks to join the discussion. This message summarizes the discussion so you don't to read all of the above.
In your experience with Zarr, AnnData etc, do you think that we should put a constraint on the characters that can be used for the names of the images etc (e.g. [a-zA-Z0-9_-]
) or should we have a list of forbidden characters (=better for non-English users)? In the second case, could you share your experience in which characters to forbid?
We want to avoid possible bugs due to clashes between special characters and how internal libraries we depend on deal with them (for instance having a /
in a string may lead to problems with AnnData
when saving to Zarr
as discussed here https://github.com/scverse/anndata/issues/1447). My main motivation is to avoid possible security bugs as we can't test all the scenarios nor predict how each special character may behave with dependent libraries.
Thank you! @giovp @kevinyamauchi @joshmoore @d-v-b @will-moore
My research on the third point:
Zarr distinguishes between "storage keys" (group name or array name) and "chunk keys". The storage key always produces a directory in which chunks are stored. Chunk keys are determined by Zarr and contain the dimension delimiter (/
, .
). That means storage keys and chunk keys do not occur together in a file name (not key.0.0.0
).
Dimension delimiter .
can be used in storage keys (also /
, but it is mapped to the file system's directory delimiter and produces nested groups, which SpatialData does not support).
Zarr v3 allows Unicode with some characters disallowed (character /
, prefix __
, full names .
, ..
, empty string). It recommends a limited character set a-z
, A-Z
, 0-9
, -
, _
, .
(including dot). Back-slash \
is not mentioned, but it is mapped to forward slash through key normalization. v3#Node names
The v3 spec originally started with a limited character set. → They have a very informative discussion: zarr-spec#56
The v2 spec was full ASCII with some characters implicitly disallowed (/
, \
).
It does not take over file system restrictions (besides back/forward slash). It relies on the underlying file system to validate and throw an error.
These are disallowed on Windows: :
<
>
"
|
?
*
It does not take over URL restrictions.
These are disallowed in URLs (RFC3986 2.2): :
/
?
#
[
]
@
!
$
&
'
(
)
*
+
,
;
=
and must be percent-encoded.
As far as I tested with Python's http.server
, Zarr storage keys containing these characters are correctly encoded and served, but the receiving Zarr client does not decode but shows percent-encoded keys.
For SpatialData, I see problems with:
/
, \
For my original issue, it would be enough to allow .
, no matter which other character sets or restrictions are chosen.
Zarr allows names containing "." characters, and from a purely technical perspective there should be no interference between the names of arrays / groups and the names of chunks, since the encoding of chunk keys in zarr is unambiguously specified in a manner that is completely independent of node names.
Thanks for sharing the above information; I read through the Zarr discussion and it is indeed very informative. My take on this is that we could proceed as Zarr does in https://github.com/zarr-developers/zarr-specs/pull/196/files (which allows unicode characters, provides a short deny list, recommends a normalization approach and recommends to use common characters, such as a-z
, A-Z
, 0-9
, -
, _
, .
); but I think this may create problems downstream.
I would take a more conservative approach since I think that, at least for the moment, we should aim at improving simplicity and robustness, and that allowing unicode would fix some bugs related to datasets that have already been written using now forbidden characters, but would probably lead to new ones because of datasets that can be written and read by one OS/language but not from another, which I think is worse.
Let's wait for feedback from the others and happy to discuss and agree on a path forward together; meanwhile here is my proposal:
a-z
, A-Z
, 0-9
, -
, _
, .
would be allowed and these would not: empty strings ""
, these two strings .
, ..
, things starting with __
. models.py
to validate at construction time against strings that are invalid. For instance we currently don't check the various obs and var names for AnnData
objects.var_name
that contains a /
. In the future such dataset would trigger an error when writing, prompting the user to remove the /
manually, but now we would need to explain the user how to fix what has already been written on disk. Practically, I would address this point by closing this issue: https://github.com/scverse/spatialdata-notebooks/issues/106, and write a discussion post only if those datasets lead to problems.For my original issue, it would be enough to allow ., no matter which other character sets or restrictions are chosen.
In particular the proposal above would allow for the .
in the names, fixing the original issue.
Thanks for the discussion, just went through this, and I agree that .
should be allowed, since indeed while it's used for chunk keys, they don't occur with storage keys. I also agree with this point above
go with the allow-list approach: allow only the characters recommended by the Zarr specs and in addition disallow what Zarr disallows. a-z, A-Z, 0-9, -, _, . would be allowed and these would not: empty strings "", these two strings ., .., things starting with __.
Thanks @giovp. Since the approach I described is conservative and if in the future we end up allowing unicode etc we can always relax, I would implement the approach and close this issue. @aeisenbarth are you free to work on this task (the 4 checkboxes of my post above)? Otherwise I can also work on this.
Yes, I can do it (sorry, I didn't receive the notification for some reasons).
There is another complication: The current proposal allows different character cases (as in abc
, ABC
, Abc
) which collide on case-insensitive file systems. Since the chosen approach prefers robustness, we should not leave it to whatever the writer's file system supports, but what any potential reader's file system supports.
Possible solutions would be:
abc
}, reject {ABC
}Abc
}, but reject {Abc
, abc
, ABC
}I couldn't find any mention in the specs, especially the table proposal ngff#64. For AnnData, there is some implementation we could make use of, or further extend instead of implementing on SpatialData side (anndata#321).
I will go for solution 2 because it has the least impact on existing datasets (or table colums imported from CSV). Also I will test to what extent we can rely on existing name validation in AnnData so that we don't have to duplicate that functionality.
I noticed the previous implementation was not restricted to a-zA-Z0-9
(+_-
), but isalnum which considers Unicode letters like é
and 福
as alphanumeric. Now changing to more restrictive a-zA-Z0-9
, following Zarr v3 recommendation.
(Still it would be an annoyance if users want to add columns containing µm
or Chinese names and have to rename them first.)
_check_valid_name
implemented stricter naming constraints in https://github.com/scverse/spatialdata/commit/137e1e06c946800599d55c45f18fe8a6a1fb06eb. We already have existing SpatialData datasets where.
is used as separator for naming components with different meanings, likeSlide1.A2.0.pre_maldi
. When loading them with spatialdata 0.2.1, we get the error "Name must contain only alphanumeric characters, underscores, and hyphens". I would understand for Unicode characters, or characters reserved for SpatialData itself (/
) or characters problematic on a subset of supported platforms (:
,?
,\
…).Is there a reason for not allowing
.
anymore? Will it be used by SpatialData itself, or the NGFF specification?What would be a solution, can
.
be allowed again or should we change the way how we create new datasets and adjust all existing datasets to the new naming constraints?_
feels more like a replacement for space-
feels more fitting for joining parts of a naming component, not separating them.