Case folding of the file names

xfq commented 3 years ago

https://w3c.github.io/epub-specs/epub33/core/#sec-container-filenames

> All File Names within the same directory MUST be unique following case normalization as described in section 3.13 of [Unicode].

This sentence is not very clear. Could you please clarify which "case normalization" algorithm is meant here? Is it Unicode full or simple case folding?

Please refer to "Case Mapping and Case Folding" and "Additional Considerations for Case Folding" in charmod-norm.

mattgarrish commented 3 years ago

I can't find evidence of an intention one way or the other. The bullet goes back to OCF 2.0.1 where it also says the same thing with the only difference being a reference to TR21.

It says in charmod that simple folding isn't appropriate for the web, so is that effectively a recommendation we should specify full case folding?

aphillips commented 3 years ago

A likely source of the need for case folding is that some file systems (FAT32, classically) are not case sensitive. In these file systems, names that differ only by case can overwrite one another, causing problems.

You might want to specify Unicode canonical case-fold matching from charmod-norm rather than specifying case folding and normalization separately. I note that you currently have only a "should" for normalization but a "must" for case folding. Is there a reason why you don't require uniqueness across normalization? For that you would need to choose a normalization form (we recommend NFC). Note too that this is a question of uniqueness checking, not a requirement that the normalized form actually be stored.
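To make the suggestion concrete, here is a minimal sketch of a uniqueness check based on canonical case-fold matching, which charmod-norm describes as NFD(toCasefold(NFD(X))). It assumes Python's `str.casefold()` (full case folding) and `unicodedata.normalize()`. The helper names `canonical_casefold_key` and `find_collisions` and the sample file names are hypothetical. This is an illustration only, not what the EPUB specification requires.

```python
import unicodedata
from collections import defaultdict


def canonical_casefold_key(name: str) -> str:
    """Return a comparison key: NFD(toCasefold(NFD(name))).

    The key is only used for comparison; the stored file name is unchanged.
    """
    decomposed = unicodedata.normalize("NFD", name)
    # str.casefold() applies Unicode full case folding.
    return unicodedata.normalize("NFD", decomposed.casefold())


def find_collisions(names):
    """Group names from one directory that share the same comparison key."""
    groups = defaultdict(list)
    for name in names:
        groups[canonical_casefold_key(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical directory listing: two pairs differ only by case
# and/or by precomposed versus decomposed accents.
names = ["Cover.xhtml", "cover.XHTML", "caf\u00e9.png", "CAFE\u0301.png", "toc.ncx"]
collisions = find_collisions(names)
for group in collisions:
    print("collision:", group)
```

Because only the keys are compared, the names themselves never need to be stored in folded or normalized form.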

> so is that effectively a recommendation we should specify full case folding?

Yes. Full case folds lose less information than simple case folds, at the cost of potentially altering the length of the string in code points. Among other things, this probably means that the case fold (and normalization, if applied) needs to happen before the length limit is checked, although I note that the case-folded and normalized forms are not required to be stored. The names just need to be unique across the operations.
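As a small illustration of the length change, the sketch below compares code point counts before and after full case folding with Python's `str.casefold()`. The helper `length_change` and the sample names are hypothetical, chosen only because they contain characters whose full case fold expands.

```python
def length_change(name: str):
    """Return (original length, folded length), both counted in code points."""
    return len(name), len(name.casefold())


# "ß" folds to "ss" and the ligature U+FB01 folds to "fi",
# so each of these names grows by one code point.
samples = ["Stra\u00dfe.xhtml", "\ufb01le.css", "plain.txt"]
for sample in samples:
    before, after = length_change(sample)
    print(f"{sample!r}: {before} -> {after} code points")
```

A check that measured the stored name but compared the folded name could therefore disagree about whether a limit is exceeded.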

mattgarrish commented 3 years ago

> Is there a reason why you don't require uniqueness across normalization

I don't have an answer to that, unfortunately, but maybe someone else in the group can chime in (that particular change goes back over a decade and I can't find a discussion about it).

iherman commented 3 years ago

The issue was discussed in a meeting on 2021-04-23.

List of resolutions:

Resolution No. 3: Change requirement for case normalization to a MUST and specify which algorithm we will use


#### 1.3. Case folding of the file names

_See github issue [#1631](https://github.com/w3c/epub-specs/issues/1631)._

**Dave Cramer:** "file names must be unique following case normalization"

**Matt Garrish:** i think Addison's note (in issue) about case folding is fine

… he also asked why we have a SHOULD for normalization, but a MUST for case folding

… not sure how to deal with these

**Dave Cramer:** i think we can have MUST for both?

… would we have something about which algorithm we're using?

> **Proposed resolution: Change requirement for case normalization to a MUST and specify which algorithm we will use** *(Wendy Reid)*

> *Brady Duga:* +1
> *Dave Cramer:* +1
> *Wendy Reid:* +1
> *Ivan Herman:* +1
> *Toshiaki Koike:* +1
> *Matt Garrish:* +1
> *Deborah Kaplan:* +1
> *Masakazu Kitahara:* +1
> *Ben Schroeter:* 0
> *Garth Conboy:* +1

> ***Resolution #3: Change requirement for case normalization to a MUST and specify which algorithm we will use***
