Naming constraints break compatibility with existing datasets

scverse / spatialdata

An open and interoperable data framework for spatial omics data

https://spatialdata.scverse.org/

BSD 3-Clause "New" or "Revised" License

212 stars 40 forks source link

Naming constraints break compatibility with existing datasets #624

Open aeisenbarth opened 2 months ago

aeisenbarth commented 2 months ago

_check_valid_name implemented stricter naming constraints in https://github.com/scverse/spatialdata/commit/137e1e06c946800599d55c45f18fe8a6a1fb06eb. We already have existing SpatialData datasets where . is used as separator for naming components with different meanings, like Slide1.A2.0.pre_maldi. When loading them with spatialdata 0.2.1, we get the error "Name must contain only alphanumeric characters, underscores, and hyphens". I would understand for Unicode characters, or characters reserved for SpatialData itself (/) or characters problematic on a subset of supported platforms (:, ?, \…).

Is there a reason for not allowing . anymore? Will it be used by SpatialData itself, or the NGFF specification?
What would be a solution, can . be allowed again or should we change the way how we create new datasets and adjust all existing datasets to the new naming constraints?
- _ feels more like a replacement for space within a naming component
- - feels more fitting for joining parts of a naming component, not separating them.
- What allowed character would convey the meaning of separating components of a name?

LucaMarconato commented 1 month ago

Related discussions

First, links to some related topics:

https://github.com/scverse/anndata/issues/1447#issuecomment-2023790500 (comment from @ivirshup that suggests to allow for unicode characters)
https://github.com/scverse/anndata/issues/321 discussion on which names and characters should be allowed, in particular please see the last two messages form @flying-sheep

Rationale behind `_check_valid_name()`

The reason why we removed the support for "non a-zA-Z0-9_- characters" is because we wanted to avoid altogether edge cases that are OS dependent or due to the libraries we are using for IO.

In particular we removed the reason why we removed / and \ is to avoid potential security risks (we rely on third-party libraries for IO, and I expect a name starting with / not to lead to security risks, but I wanted to be extra safe and avoid the case in which a bug in such libraries uncovers an unexpected security risk).

The case of `.`

Instead, the reason why I removed . is because I saw it (rarely) used as a separator for Zarr groups in some datasets. Not a security risk, but I things may break in such scenario. We could consider re-enabling it since usually Zarr uses /.

Symbols for separation

I agree, neither _ nor - suggest "separation". I'd opt for : or | but IIRC they are forbidden in path for Windows. Maybe we could do some research on forbidden paths and see if allowing ,?

aeisenbarth commented 4 weeks ago

Here are my current thoughts about this issue. To solve it, I can investigate the third point, and then we can decide whether there is something to do/change in the code. Maybe @ivirshup's experience can be valuable?

[x] Decision between allow list (a-zA-Z0-9_-) or deny list (allowing all Unicode except characters known to be forbidden by various operating systems).
- As you said, an allow list avoids any unforeseen possibility that a character might cause problems.
- Users who import existing tabular data with mostly Unicode characters (Chinese) would face a huge problem and would have to manually rewrite all column names. There is no automation, information is lost if you replace every character by _ (and causes name collisions).
- On the other side, allowing most of Unicode (except what we forbid) would give users more freedom (and avoid future feature requests by users who need some character). After all, we are in the Unicode age and should have left behind character encoding issues. Of course, best practices for file names apply here just like elsewhere (no sequences of spaces, no control characters, no line breaks…).
- Unicode would still expose users to text encodings issues on misconfigured file systems or servers, causing mangled names (�, Ã¶). Not SpatialData's fault, but could lead to support requests.
[x] Decision whether to extend the current allow list with further characters (candidates to be collected and discusses).
- As an example, the dot . is very common (also in double file extensions), so that the approach of replacing it by a less conventional character like , seems weird. On the other side, your argument of . as Zarr dimension separator (alternative to /) is a very important point.
[x] Investigate whether Zarr dimension separators used unexpectedly in file names cause problems for Zarr. Does Zarr parse file names to extract coordinates, or lookup files by name? Or are the separators only used for ensuring uniqueness of names?
- I don't know whether SpatialData aims to support a specified subset of Zarr (e.g. only / dimension separator, not nested) or any possible Zarr. In its current state, the user constructs SpatialData through the API and all data is written into the SpatialData datastructure, using its defaults. I think a problem could only occur if users can use existing Zarr stores as-is without writing them again.

LucaMarconato commented 4 weeks ago

Thanks @aeisenbarth for summing up the status quo. My vote would be to go for an allow list (eventually extending the current one), I'd ask for the feedback to others and see what is the consensus. In fact, I could imagine also going for a deny list, by combining together the forbid lists for Zarr, Parquet and for filenames in Windows, macOS and UNIX in general.

Regarding the last point, in my experience I have found some bugs sometimes (example 1, example 2) around the dimension separator, so I would be for disallowing . (on top of /).

LucaMarconato commented 4 weeks ago

Kindly asking some other folks to join the discussion. This message summarizes the discussion so you don't to read all of the above.

In your experience with Zarr, AnnData etc, do you think that we should put a constraint on the characters that can be used for the names of the images etc (e.g. [a-zA-Z0-9_-]) or should we have a list of forbidden characters (=better for non-English users)? In the second case, could you share your experience in which characters to forbid?

We want to avoid possible bugs due to clashes between special characters and how internal libraries we depend on deal with them (for instance having a / in a string may lead to problems with AnnData when saving to Zarr as discussed here https://github.com/scverse/anndata/issues/1447). My main motivation is to avoid possible security bugs as we can't test all the scenarios nor predict how each special character may behave with dependent libraries.

Thank you! @giovp @kevinyamauchi @joshmoore @d-v-b @will-moore

aeisenbarth commented 3 weeks ago

My research on the third point:

Zarr distinguishes between "storage keys" (group name or array name) and "chunk keys". The storage key always produces a directory in which chunks are stored. Chunk keys are determined by Zarr and contain the dimension delimiter (/, .). That means storage keys and chunk keys do not occur together in a file name (not key.0.0.0). Dimension delimiter . can be used in storage keys (also /, but it is mapped to the file system's directory delimiter and produces nested groups, which SpatialData does not support).
Zarr v3 allows Unicode with some characters disallowed (character /, prefix __, full names ., .., empty string). It recommends a limited character set a-z, A-Z, 0-9, -, _, . (including dot). Back-slash \ is not mentioned, but it is mapped to forward slash through key normalization. v3#Node names

The v3 spec originally started with a limited character set. → They have a very informative discussion: zarr-spec#56

The v2 spec was full ASCII with some characters implicitly disallowed (/, \).
It does not take over file system restrictions (besides back/forward slash). It relies on the underlying file system to validate and throw an error. These are disallowed on Windows: : < > " | ? *
It does not take over URL restrictions. These are disallowed in URLs (RFC3986 2.2): : / ? # [ ] @ ! $ & ' ( ) * + , ; = and must be percent-encoded. As far as I tested with Python's http.server, Zarr storage keys containing these characters are correctly encoded and served, but the receiving Zarr client does not decode but shows percent-encoded keys.

For SpatialData, I see problems with:

Directory delimitors /, \
Portability between file systems, if users use characters disallowed on another file system. This may go unnoticed when importing long tables e.g. from CSV with such characters in column names.
Read errors when SpatialData is served over the network and contains percent-encoded characters. Names listed in JSON attributes (labels, table column names) will not match with the percent-encoded file names. As a consequence, tables cannot be matched to some regions and an warning is emitted like "The table is annotating 'label1+2', which is not present in the SpatialData object.". When the opened dataset is from a URL, file names need to be decoded. (https://github.com/zarr-developers/zarr-python/issues/2076)

aeisenbarth commented 3 weeks ago

For my original issue, it would be enough to allow ., no matter which other character sets or restrictions are chosen.

d-v-b commented 3 weeks ago

Zarr allows names containing "." characters, and from a purely technical perspective there should be no interference between the names of arrays / groups and the names of chunks, since the encoding of chunk keys in zarr is unambiguously specified in a manner that is completely independent of node names.

LucaMarconato commented 3 weeks ago

Thanks for sharing the above information; I read through the Zarr discussion and it is indeed very informative. My take on this is that we could proceed as Zarr does in https://github.com/zarr-developers/zarr-specs/pull/196/files (which allows unicode characters, provides a short deny list, recommends a normalization approach and recommends to use common characters, such as a-z, A-Z, 0-9, -, _, .); but I think this may create problems downstream.

I would take a more conservative approach since I think that, at least for the moment, we should aim at improving simplicity and robustness, and that allowing unicode would fix some bugs related to datasets that have already been written using now forbidden characters, but would probably lead to new ones because of datasets that can be written and read by one OS/language but not from another, which I think is worse.

Let's wait for feedback from the others and happy to discuss and agree on a path forward together; meanwhile here is my proposal:

[ ] go with the allow-list approach: allow only the characters recommended by the Zarr specs and in addition disallow what Zarr disallows. a-z, A-Z, 0-9, -, _, . would be allowed and these would not: empty strings "", these two strings ., .., things starting with __.
[ ] I would then improve the classes in models.py to validate at construction time against strings that are invalid. For instance we currently don't check the various obs and var names for AnnData objects.
[ ] I would call all the validation functions before writing so we are sure that we don't end up with invalid objects.
[ ] Finally, I would write a post in the discussions explaining how to fix datasets that can't be read (here is an example of one of such discussions: https://github.com/scverse/spatialdata/discussions/657). For instance this dataset here: https://github.com/scverse/spatialdata-io/issues/129 has a var_name that contains a /. In the future such dataset would trigger an error when writing, prompting the user to remove the / manually, but now we would need to explain the user how to fix what has already been written on disk. Practically, I would address this point by closing this issue: https://github.com/scverse/spatialdata-notebooks/issues/106, and write a discussion post only if those datasets lead to problems.

LucaMarconato commented 3 weeks ago

For my original issue, it would be enough to allow ., no matter which other character sets or restrictions are chosen.

In particular the proposal above would allow for the . in the names, fixing the original issue.

giovp commented 2 weeks ago

Thanks for the discussion, just went through this, and I agree that . should be allowed, since indeed while it's used for chunk keys, they don't occur with storage keys. I also agree with this point above

go with the allow-list approach: allow only the characters recommended by the Zarr specs and in addition disallow what Zarr disallows. a-z, A-Z, 0-9, -, _, . would be allowed and these would not: empty strings "", these two strings ., .., things starting with __.

LucaMarconato commented 2 weeks ago

Thanks @giovp. Since the approach I described is conservative and if in the future we end up allowing unicode etc we can always relax, I would implement the approach and close this issue. @aeisenbarth are you free to work on this task (the 4 checkboxes of my post above)? Otherwise I can also work on this.

aeisenbarth commented 6 days ago

Yes, I can do it (sorry, I didn't receive the notification for some reasons).

aeisenbarth commented 3 days ago

There is another complication: The current proposal allows different character cases (as in abc, ABC, Abc) which collide on case-insensitive file systems. Since the chosen approach prefers robustness, we should not leave it to whatever the writer's file system supports, but what any potential reader's file system supports.

Possible solutions would be:

Further restrict the allowed character set to lower-case, e.g. allow {abc}, reject {ABC}
Just require case-insensitive uniqueness of keys, e.g. allow {Abc}, but reject {Abc, abc, ABC}

I couldn't find any mention in the specs, especially the table proposal ngff#64. For AnnData, there is some implementation we could make use of, or further extend instead of implementing on SpatialData side (anndata#321).

I will go for solution 2 because it has the least impact on existing datasets (or table colums imported from CSV). Also I will test to what extent we can rely on existing name validation in AnnData so that we don't have to duplicate that functionality.