scverse / spatialdata

An open and interoperable data framework for spatial omics data
https://spatialdata.scverse.org/
BSD 3-Clause "New" or "Revised" License
212 stars 40 forks source link

Naming constraints break compatibility with existing datasets #624

Open aeisenbarth opened 2 months ago

aeisenbarth commented 2 months ago

_check_valid_name implemented stricter naming constraints in https://github.com/scverse/spatialdata/commit/137e1e06c946800599d55c45f18fe8a6a1fb06eb. We already have existing SpatialData datasets where . is used as separator for naming components with different meanings, like Slide1.A2.0.pre_maldi. When loading them with spatialdata 0.2.1, we get the error "Name must contain only alphanumeric characters, underscores, and hyphens". I would understand for Unicode characters, or characters reserved for SpatialData itself (/) or characters problematic on a subset of supported platforms (:, ?, \…).

  1. Is there a reason for not allowing . anymore? Will it be used by SpatialData itself, or the NGFF specification?

  2. What would be a solution, can . be allowed again or should we change the way how we create new datasets and adjust all existing datasets to the new naming constraints?

    • _ feels more like a replacement for space within a naming component
    • - feels more fitting for joining parts of a naming component, not separating them.
    • What allowed character would convey the meaning of separating components of a name?
LucaMarconato commented 1 month ago

Related discussions

First, links to some related topics:

Rationale behind _check_valid_name()

The reason why we removed the support for "non a-zA-Z0-9_- characters" is because we wanted to avoid altogether edge cases that are OS dependent or due to the libraries we are using for IO.

In particular we removed the reason why we removed / and \ is to avoid potential security risks (we rely on third-party libraries for IO, and I expect a name starting with / not to lead to security risks, but I wanted to be extra safe and avoid the case in which a bug in such libraries uncovers an unexpected security risk).

The case of .

Instead, the reason why I removed . is because I saw it (rarely) used as a separator for Zarr groups in some datasets. Not a security risk, but I things may break in such scenario. We could consider re-enabling it since usually Zarr uses /.

Symbols for separation

I agree, neither _ nor - suggest "separation". I'd opt for : or | but IIRC they are forbidden in path for Windows. Maybe we could do some research on forbidden paths and see if allowing ,?

aeisenbarth commented 4 weeks ago

Here are my current thoughts about this issue. To solve it, I can investigate the third point, and then we can decide whether there is something to do/change in the code. Maybe @ivirshup's experience can be valuable?

LucaMarconato commented 4 weeks ago

Thanks @aeisenbarth for summing up the status quo. My vote would be to go for an allow list (eventually extending the current one), I'd ask for the feedback to others and see what is the consensus. In fact, I could imagine also going for a deny list, by combining together the forbid lists for Zarr, Parquet and for filenames in Windows, macOS and UNIX in general.

Regarding the last point, in my experience I have found some bugs sometimes (example 1, example 2) around the dimension separator, so I would be for disallowing . (on top of /).

LucaMarconato commented 4 weeks ago

Kindly asking some other folks to join the discussion. This message summarizes the discussion so you don't to read all of the above.

In your experience with Zarr, AnnData etc, do you think that we should put a constraint on the characters that can be used for the names of the images etc (e.g. [a-zA-Z0-9_-]) or should we have a list of forbidden characters (=better for non-English users)? In the second case, could you share your experience in which characters to forbid?

We want to avoid possible bugs due to clashes between special characters and how internal libraries we depend on deal with them (for instance having a / in a string may lead to problems with AnnData when saving to Zarr as discussed here https://github.com/scverse/anndata/issues/1447). My main motivation is to avoid possible security bugs as we can't test all the scenarios nor predict how each special character may behave with dependent libraries.

Thank you! @giovp @kevinyamauchi @joshmoore @d-v-b @will-moore

aeisenbarth commented 3 weeks ago

My research on the third point:

For SpatialData, I see problems with:

aeisenbarth commented 3 weeks ago

For my original issue, it would be enough to allow ., no matter which other character sets or restrictions are chosen.

d-v-b commented 3 weeks ago

Zarr allows names containing "." characters, and from a purely technical perspective there should be no interference between the names of arrays / groups and the names of chunks, since the encoding of chunk keys in zarr is unambiguously specified in a manner that is completely independent of node names.

LucaMarconato commented 3 weeks ago

Thanks for sharing the above information; I read through the Zarr discussion and it is indeed very informative. My take on this is that we could proceed as Zarr does in https://github.com/zarr-developers/zarr-specs/pull/196/files (which allows unicode characters, provides a short deny list, recommends a normalization approach and recommends to use common characters, such as a-z, A-Z, 0-9, -, _, .); but I think this may create problems downstream.

I would take a more conservative approach since I think that, at least for the moment, we should aim at improving simplicity and robustness, and that allowing unicode would fix some bugs related to datasets that have already been written using now forbidden characters, but would probably lead to new ones because of datasets that can be written and read by one OS/language but not from another, which I think is worse.

Let's wait for feedback from the others and happy to discuss and agree on a path forward together; meanwhile here is my proposal:

LucaMarconato commented 3 weeks ago

For my original issue, it would be enough to allow ., no matter which other character sets or restrictions are chosen.

In particular the proposal above would allow for the . in the names, fixing the original issue.

giovp commented 2 weeks ago

Thanks for the discussion, just went through this, and I agree that . should be allowed, since indeed while it's used for chunk keys, they don't occur with storage keys. I also agree with this point above

go with the allow-list approach: allow only the characters recommended by the Zarr specs and in addition disallow what Zarr disallows. a-z, A-Z, 0-9, -, _, . would be allowed and these would not: empty strings "", these two strings ., .., things starting with __.

LucaMarconato commented 2 weeks ago

Thanks @giovp. Since the approach I described is conservative and if in the future we end up allowing unicode etc we can always relax, I would implement the approach and close this issue. @aeisenbarth are you free to work on this task (the 4 checkboxes of my post above)? Otherwise I can also work on this.

aeisenbarth commented 6 days ago

Yes, I can do it (sorry, I didn't receive the notification for some reasons).

aeisenbarth commented 3 days ago

There is another complication: The current proposal allows different character cases (as in abc, ABC, Abc) which collide on case-insensitive file systems. Since the chosen approach prefers robustness, we should not leave it to whatever the writer's file system supports, but what any potential reader's file system supports.

Possible solutions would be:

  1. Further restrict the allowed character set to lower-case, e.g. allow {abc}, reject {ABC}
  2. Just require case-insensitive uniqueness of keys, e.g. allow {Abc}, but reject {Abc, abc, ABC}

I couldn't find any mention in the specs, especially the table proposal ngff#64. For AnnData, there is some implementation we could make use of, or further extend instead of implementing on SpatialData side (anndata#321).

I will go for solution 2 because it has the least impact on existing datasets (or table colums imported from CSV). Also I will test to what extent we can rely on existing name validation in AnnData so that we don't have to duplicate that functionality.

aeisenbarth commented 3 days ago

I noticed the previous implementation was not restricted to a-zA-Z0-9 (+_-), but isalnum which considers Unicode letters like é and as alphanumeric. Now changing to more restrictive a-zA-Z0-9, following Zarr v3 recommendation. (Still it would be an annoyance if users want to add columns containing µm or Chinese names and have to rename them first.)