Closed GoogleCodeExporter closed 9 years ago
Assigning to Garth as he probably knows the history of this list.
Original comment by markus.g...@gmail.com
on 11 Apr 2011 at 8:15
That list (of character restrictions in ZIP-encoded file names) dates back to the
first incarnation of OCF, and was really meant to be prohibitions against use
of characters that one might reasonably want to use in filenames. The ZIP
specification (http://www.pkware.com/documents/casestudies/APPNOTE.TXT), by my
reading, doesn't really place any limits on which characters can be used
in UTF-8 encoded filenames. RFC 3987 covers IRI encoding, and doesn't seem
to bear directly on ZIP filename character sets.
I propose that the following could be done without breaking backwards
compatibility. Expand the list to include:
-- C0 (0x00 - 0x1F)
-- DEL (0x7F)
-- C1 (0x80 - 0x9F)
Original comment by ga...@google.com
on 12 Apr 2011 at 12:41
Yesterday, I learned some interesting-but-horrible facts about UTF-8 zip
item names.
1) JAR uses UTF-8M, which (illegally) represents a surrogate pair by 6 bytes.
2) Apple ZIP uses UTF-8 after applying the NFD normalization of Unicode.
3) The MS implementation of ZIP on Windows (even now!) uses locale-dependent
encoding (such as Code Page 932) rather than UTF-8.
4) The use of UTF-8 is announced by PKWARE's "General Purpose Flags Bit 11"
indicator or Info-ZIP's new "up" unicode path extra field.
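For anyone who wants to poke at points 2 and 4, here is a minimal Python sketch using only the standard library. It reads general purpose bit 11 for each entry and compares names after NFC normalization. The helper names (utf8_flag_by_name, build_sample_zip, same_name_after_nfc) are made up for illustration, and the sample archive is written by Python's own zipfile module, which as far as I know sets bit 11 only when a name is not pure ASCII; other ZIP writers may behave differently.

```python
import io
import unicodedata
import zipfile

UTF8_NAME_FLAG = 0x800  # general purpose bit 11 (2**11)


def utf8_flag_by_name(zip_bytes):
    """Map each item name to True when general purpose bit 11 is set."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {info.filename: bool(info.flag_bits & UTF8_NAME_FLAG)
                for info in zf.infolist()}


def build_sample_zip():
    """Build a tiny in-memory archive with one ASCII and one non-ASCII name."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("plain.txt", b"a")
        zf.writestr("caf\u00e9.txt", b"b")
    return buf.getvalue()


def same_name_after_nfc(a, b):
    """Compare two item names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    print(utf8_flag_by_name(build_sample_zip()))
    nfd_name = unicodedata.normalize("NFD", "caf\u00e9.txt")
    print(nfd_name == "caf\u00e9.txt", same_name_after_nfc(nfd_name, "caf\u00e9.txt"))
```

The NFD form of a name compares unequal to the precomposed form code point by code point, which is why a reference written against one form can fail to match an item stored in the other unless both sides are normalized first.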
Given this hopeless situation, it might be a good idea to disallow the
use of non-ASCII characters in EPUB3 as a tentative solution.
I do believe that RFC 3987 is relevant, since some fragment identifier scheme
might expose zip item names.
Original comment by eb2m...@gmail.com
on 12 Apr 2011 at 1:40
I don't think we can go further than the above proposed resolution and retain
backwards compatibility, and I don't think this has proven an issue in practice.
And, the additional proposed restrictions do seem to have some value.
Original comment by ga...@google.com
on 12 Apr 2011 at 3:04
Later this week, the following will be sent to the working group as the "Chairs'
recommended resolution."
In OCF section 2.4, the ZIP file name character restrictions will be augmented
to include:
-- DEL (ASCII 0x7F)
-- Unicode C0 range (0x00 - 0x1F)
-- Unicode C1 range (0x80 - 0x9F)
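A rough sketch of how a checker might test a file name against these three additions follows. This is illustrative only, not spec text; the function name find_control_chars is hypothetical.

```python
def find_control_chars(name):
    """Return (index, 'U+XXXX') for each C0, DEL, or C1 character in name."""
    hits = []
    for index, ch in enumerate(name):
        cp = ord(ch)
        # C0 is 0x00-0x1F, DEL is 0x7F, C1 is 0x80-0x9F.
        if cp <= 0x1F or cp == 0x7F or 0x80 <= cp <= 0x9F:
            hits.append((index, "U+%04X" % cp))
    return hits


if __name__ == "__main__":
    print(find_control_chars("chapter1.xhtml"))
    print(find_control_chars("bad\x7fname.xhtml"))
```

The check works on decoded code points rather than raw bytes, so a UTF-8 continuation byte in the 0x80 - 0x9F range is not mistaken for a C1 control.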
Original comment by ga...@google.com
on 19 Apr 2011 at 11:47
Even if we allow UTF-8, I propose to disallow characters as below.
They are not allowed as part of ipchar in RFC 3987.
- Private Use Area (E000-F8FF)
- Noncharacters in Arabic Presentation Forms-A (FDD0-FDEF)
- Specials (FFF0-FFFF)
- Tags and Variation Selectors Supplement (E0000-E0FFF)
- Supplementary Private Use Area-A (F0000-FFFFF)
- Supplementary Private Use Area-B (100000-10FFFF)
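To make the ranges concrete, a small Python sketch of the test is below. The range table simply transcribes the list above (with the noncharacter range taken as FDD0-FDEF); the names PROPOSED_EXCLUDED_RANGES and first_excluded are hypothetical.

```python
# (low, high) code point pairs, inclusive, transcribed from the list above.
PROPOSED_EXCLUDED_RANGES = [
    (0xE000, 0xF8FF),      # Private Use Area
    (0xFDD0, 0xFDEF),      # noncharacters in Arabic Presentation Forms-A
    (0xFFF0, 0xFFFF),      # Specials
    (0xE0000, 0xE0FFF),    # Tags and Variation Selectors Supplement
    (0xF0000, 0xFFFFF),    # Supplementary Private Use Area-A
    (0x100000, 0x10FFFF),  # Supplementary Private Use Area-B
]


def first_excluded(name):
    """Return the first code point of name in an excluded range, else None."""
    for ch in name:
        cp = ord(ch)
        for low, high in PROPOSED_EXCLUDED_RANGES:
            if low <= cp <= high:
                return cp
    return None


if __name__ == "__main__":
    print(first_excluded("cover.png"))
    print(hex(first_excluded("logo\ue000.png")))
```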
Likewise, the following characters are not allowed as part of ipchar
in RFC 3987. If we allow them as file names in ZIP packages, we have
to use %HH for referencing file names in fragment identifiers. This
is certainly doable but do we really need these characters (except
0020)?
0020 SPACE
0023 NUMBER SIGN
0025 PERCENT SIGN
003B SEMICOLON
005B LEFT SQUARE BRACKET
005D RIGHT SQUARE BRACKET
005E CIRCUMFLEX ACCENT
0060 GRAVE ACCENT
007B LEFT CURLY BRACKET
007C VERTICAL LINE
007D RIGHT CURLY BRACKET
007E TILDE
007F DEL (already covered by Garth's proposal)
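For reference, the %HH escaping is mechanical. A minimal Python sketch using urllib.parse.quote is below; the wrapper name reference_for is hypothetical. Note that quote follows the RFC 3986 unreserved set, so the characters it escapes do not line up exactly with the list above (for example, it leaves "~" alone), and it also escapes non-ASCII characters as UTF-8 bytes, giving a URI rather than an IRI.

```python
from urllib.parse import quote, unquote


def reference_for(item_name):
    """Percent-encode a ZIP item name, keeping "/" as the path separator."""
    return quote(item_name, safe="/")


if __name__ == "__main__":
    print(reference_for("OEBPS/my file #1.xhtml"))  # OEBPS/my%20file%20%231.xhtml
    print(reference_for("50%[draft].txt"))          # 50%25%5Bdraft%5D.txt
    print(unquote(reference_for("50%[draft].txt")))  # 50%[draft].txt
```

Decoding with unquote recovers the original item name, so the escaping loses nothing; the cost is only that every reference has to be written in escaped form.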
Original comment by eb2m...@gmail.com
on 25 Apr 2011 at 10:22
The goal with my proposal was to tighten up the file naming restrictions but
with very low odds of invalidating any names that are currently encoded in
OCF 2.0.1 packages. It seems the esoteric private-use/special ranges that
Murata proposes above could fall under that blanket, and could thus be excluded
for EPUB 3 (if desired).
As for the other characters in the 0020 to 007E range, I wouldn't think these
should be excluded. Many (most?) OCFs contain filenames that use SPACE, and
if we have one common character that needs to be URL encoded for references, I
don't see any additional harm in allowing the others that are likely currently
in use (though not as frequently as SPACE).
Comments from folks? Revise 72-hour notice? Stay the course? Discuss on
Wednesday?
Original comment by ga...@google.com
on 25 Apr 2011 at 3:35
Resolved as per
http://groups.google.com/group/epub-working-group/browse_thread/thread/03df9a1eff7efcb5#
Original comment by markus.g...@gmail.com
on 30 Apr 2011 at 2:17
Original issue reported on code.google.com by
eb2m...@gmail.com
on 10 Apr 2011 at 6:55