w3c / epub-specs

Shared workspace for EPUB 3 specifications.

Permissible file names in OCF #125

GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
3.4 File Names of OCF tries to specify permissible file names in OCF.  However, 
the list of disallowed characters appears to be incomplete. Are C0 and C1 control 
functions allowed? It is probably a good idea to use the intersection of what 
RFC 3987 and ZIP allow. Note that OOXML also tries to create a list of 
forbidden characters.

Original issue reported on code.google.com by eb2m...@gmail.com on 10 Apr 2011 at 6:55

GoogleCodeExporter commented 9 years ago
Assigning to Garth as he probably knows the history of this list. 

Original comment by markus.g...@gmail.com on 11 Apr 2011 at 8:15

GoogleCodeExporter commented 9 years ago
That list (of character restrictions in ZIP-encoded file names) dates back to the 
first incarnation of OCF, and was really meant as a set of prohibitions on 
characters that one might reasonably want to use in filenames.  The ZIP 
specification (http://www.pkware.com/documents/casestudies/APPNOTE.TXT), by my 
reading, doesn't really place any limits on which characters can be used 
in UTF-8 encoded filenames.  RFC 3987 covers IRI encoding, and doesn't seem 
to bear on which characters ZIP filenames may contain.

I propose that the following could be done without breaking backwards 
compatibility.  Expand the list to include:

-- C0 (0x00 - 0x1F)
-- DEL (0x7F)
-- C1 (0x80 - 0x9F)

Original comment by ga...@google.com on 12 Apr 2011 at 12:41
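
As an illustration of the proposal above, here is a minimal Python sketch of a check for the three added ranges. The function name `find_control_characters` is hypothetical, and the sketch assumes the ranges are meant as Unicode code points of the decoded name rather than raw bytes (in UTF-8, bytes 0x80 - 0x9F also occur as parts of ordinary multi-byte characters).

```python
def find_control_characters(name: str) -> list:
    """Return (index, 'U+XXXX') for each C0, DEL, or C1 character in name.

    Assumption: `name` is the already-decoded file name, so the ranges
    are compared against code points, not against UTF-8 bytes.
    """
    hits = []
    for index, char in enumerate(name):
        code = ord(char)
        is_c0 = code <= 0x1F            # C0 controls
        is_del = code == 0x7F           # DEL
        is_c1 = 0x80 <= code <= 0x9F    # C1 controls
        if is_c0 or is_del or is_c1:
            hits.append((index, f"U+{code:04X}"))
    return hits
```

A name such as "cover\tpage.xhtml" (with a literal tab) would be reported at index 5, while an accented name such as "café.xhtml" passes, because U+00E9 lies outside all three ranges.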

GoogleCodeExporter commented 9 years ago
Yesterday, I learned some interesting-but-horrible facts about UTF-8 zip 
item names.

1) JAR uses UTF-8M, which (illegally) represents a surrogate pair by 6 bytes.

2) Apple ZIP uses UTF-8 after applying the NFD normalization of Unicode.

3) The MS implementation of ZIP on Windows (even now!) uses locale-dependent 
   encoding (such as Code Page 932) rather than UTF-8.

4) The use of UTF-8 is announced by PKWARE's "General Purpose Flags Bit 11" 
    indicator or Info-ZIP's new "up" unicode path extra field.

Given this hopeless situation, it might be a good idea to disallow the 
use of non-ASCII characters in EPUB3 as a tentative solution.

I do believe that RFC 3987 is relevant, since some fragment identifier scheme 
might expose zip item names.

Original comment by eb2m...@gmail.com on 12 Apr 2011 at 1:40
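
To see some of these differences in a concrete archive, the following is a rough sketch using Python's standard `zipfile` and `unicodedata` modules. It reports, per entry, whether general purpose bit 11 is set, whether the name is pure ASCII, and whether it is in NFC (an NFD name of the kind described in point 2 would show up as not NFC). The helper names `describe_names` and `build_demo_archive` are made up for this example, and the sketch does not look at Info-ZIP's "up" extra field or try to detect 6-byte surrogate encodings.

```python
import io
import unicodedata
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: file name is UTF-8


def describe_names(zip_bytes: bytes) -> list:
    """Return one dict per ZIP entry describing how its name is stored."""
    report = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            name = info.filename
            report.append({
                "name": name,
                "utf8_flag": bool(info.flag_bits & UTF8_FLAG),
                "ascii_only": name.isascii(),
                "is_nfc": unicodedata.is_normalized("NFC", name),
            })
    return report


def build_demo_archive() -> bytes:
    """Build a small in-memory ZIP with ASCII, NFC, and NFD names."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("plain.txt", "a")
        archive.writestr("caf\u00e9.txt", "b")    # precomposed e-acute (NFC)
        archive.writestr("cafe\u0301.txt", "c")   # e + combining acute (NFD)
    return buffer.getvalue()


if __name__ == "__main__":
    for row in describe_names(build_demo_archive()):
        print(row)
```

Note that the two "café" names are different strings to a ZIP reader even though they render identically, which is the practical problem with archives produced after NFD normalization.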

GoogleCodeExporter commented 9 years ago
I don't think we can go further than the above proposed resolution and retain 
backwards compatibility, and I don't think this has proven an issue in practice. 
  And, the additional proposed restrictions do seem to have some value.

Original comment by ga...@google.com on 12 Apr 2011 at 3:04

GoogleCodeExporter commented 9 years ago
Later this week, the following will be sent to the working group as the "Chairs 
recommended resolution."

In OCF section 2.4, the ZIP file name character restrictions will be augmented 
to include:

-- DEL (ASCII 0x7F)
-- Unicode C0 range (0x00 - 0x1F)
-- Unicode C1 range (0x80 - 0x9F)

Original comment by ga...@google.com on 19 Apr 2011 at 11:47

GoogleCodeExporter commented 9 years ago
Even if we allow UTF-8, I propose to disallow characters as below.
They are not allowed as part of ipchar in RFC 3987.

- Private Use Area (E000-F8FF)
- Noncharacters in Arabic Presentation Forms-A (FDD0-FDEF)
- Specials (FFF0-FFFF)
- Tags and Variation Selectors Supplement (E0000-E0FFF)
- Supplementary Private Use Area-A (F0000-FFFFF)
- Supplementary Private Use Area-B (100000-10FFFF)
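
For illustration only, the ranges in this list could be checked with a small Python sketch like the one below. The table `EXCLUDED_RANGES` and the function `excluded_characters` are hypothetical names, the range boundaries are copied from the list above, and the second range is assumed to start at U+FDD0, where the noncharacter block begins.

```python
# (first code point, last code point, label), copied from the list above.
EXCLUDED_RANGES = [
    (0xE000, 0xF8FF, "Private Use Area"),
    (0xFDD0, 0xFDEF, "Noncharacters in Arabic Presentation Forms-A"),
    (0xFFF0, 0xFFFF, "Specials"),
    (0xE0000, 0xE0FFF, "Tags and Variation Selectors Supplement"),
    (0xF0000, 0xFFFFF, "Supplementary Private Use Area-A"),
    (0x100000, 0x10FFFF, "Supplementary Private Use Area-B"),
]


def excluded_characters(name: str) -> list:
    """Return ('U+XXXX', label) for each character in an excluded range."""
    hits = []
    for char in name:
        code = ord(char)
        for low, high, label in EXCLUDED_RANGES:
            if low <= code <= high:
                hits.append((f"U+{code:04X}", label))
                break  # ranges do not overlap, so one match is enough
    return hits
```

Ordinary non-ASCII names (for example, Japanese file names) produce an empty result, which fits the intent of excluding only private-use and special ranges.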

Likewise, the following characters are not allowed as part of ipchar
in RFC 3987.  If we allow them in file names in ZIP packages, we have
to use %HH for referencing file names in fragment identifiers.  This
is certainly doable but do we really need these characters (except
0020)?

0020 SPACE
0023 NUMBER SIGN
0025 PERCENT SIGN
003B SEMICOLON
005B LEFT SQUARE BRACKET
005D RIGHT SQUARE BRACKET
005E CIRCUMFLEX ACCENT
0060 GRAVE ACCENT
007B LEFT CURLY BRACKET
007C VERTICAL LINE
007D RIGHT CURLY BRACKET
007E TILDE
007F DEL (already covered by Garth's proposal)

Original comment by eb2m...@gmail.com on 25 Apr 2011 at 10:22
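
As a sketch of what the %HH escaping mentioned above would look like in practice, the following Python function percent-encodes exactly the characters in the second list. The names `LISTED_CODE_POINTS` and `percent_encode_listed` are invented for this example, and the set of code points is taken as given from the list above rather than derived from the RFC 3987 grammar. Non-ASCII characters are left untouched, on the assumption that the reference is an IRI.

```python
# Code points copied from the list in the comment above (0020 through 007F).
LISTED_CODE_POINTS = {
    0x20, 0x23, 0x25, 0x3B, 0x5B, 0x5D, 0x5E,
    0x60, 0x7B, 0x7C, 0x7D, 0x7E, 0x7F,
}


def percent_encode_listed(name: str) -> str:
    """Replace each listed ASCII character in name with its %HH escape."""
    encoded = []
    for char in name:
        code = ord(char)
        if code in LISTED_CODE_POINTS:
            encoded.append(f"%{code:02X}")  # all listed characters are single-byte
        else:
            encoded.append(char)
    return "".join(encoded)
```

With this sketch, a file named "my file#1.xhtml" would be referenced as "my%20file%231.xhtml". PERCENT SIGN itself has to be escaped as %25, otherwise a literal "%" in a file name could be misread as the start of an escape.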

GoogleCodeExporter commented 9 years ago
The goal with my proposal was to tighten up the file naming restrictions but 
with very low odds of invalidating any names that would currently be encoded 
in OCF 2.0.1 packages.  It seems the esoteric private-use/special ranges that 
Murata proposes above could fall under that blanket, and could thus be excluded 
for EPUB 3 (if desired). 

As for the other characters in the 0020 to 007E range, I wouldn't think these 
should be excluded.  Many (most?) OCFs contain filenames that use SPACE, and 
if we have one common character that needs to be URL encoded for references, I 
don't see any additional harm in allowing the others that are likely currently 
in use (though not as frequently as SPACE).

Comments from folks?  Revise 72-hour notice?  Stay the course?  Discuss on 
Wednesday?

Original comment by ga...@google.com on 25 Apr 2011 at 3:35

GoogleCodeExporter commented 9 years ago
Resolved as per 
http://groups.google.com/group/epub-working-group/browse_thread/thread/03df9a1eff7efcb5#

Original comment by markus.g...@gmail.com on 30 Apr 2011 at 2:17