openpreserve / odf-validator

Open source Open Document Format (ODF) validation
http://odf.openpreservation.org/
BSD 3-Clause "New" or "Revised" License

PKG-5 definition #46

Closed dewhattens closed 9 months ago

dewhattens commented 11 months ago

Looking at the declaration of all mimetype files, I can see they are of type UTF-8.

I don't understand the definition of PKG-5, which states they should be ASCII. Unicode is a superset of ASCII.

carlwilson commented 10 months ago

This criterion exists to ensure easy "magic" identification of ODF packages; see the further note in section 3.3:

Note: The purpose is to allow the type of document represented by the package to be discovered through 'magic number' mechanisms, such as Unix's file/magic utility. If a Zip file contains a file at the beginning of the file that is uncompressed, and has no extra data in the header, then its file name and data can be found at fixed positions from the beginning of the package. More specifically, one will find:

  • the string 'PK' at position 0 of all zip files
  • the string 'mimetype' beginning at position 30
  • the media type itself beginning at position 38.

This mechanism can only work IF the mimetype file is ASCII/UTF-8/UTF-7 encoded with no Byte Order Mark (BOM). If the mimetype file is encoded in UTF-16 or UTF-32, or has a BOM, then the magic number identification will fail.
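As a rough illustration of the fixed offsets described in the note, here is a minimal Python sketch. It is not the validator's code: `build_package` and `sniff_media_type` are hypothetical helpers, and the example assumes the mimetype entry is the first entry, stored uncompressed, with no extra field in its local header.

```python
import io
import zipfile

# Example media type for an ODF text document.
ODT = "application/vnd.oasis.opendocument.text"


def build_package(mimetype_bytes):
    """Hypothetical helper: build an in-memory zip whose first entry is an
    uncompressed 'mimetype' file with no extra field in its local header."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        info = zipfile.ZipInfo("mimetype")
        info.compress_type = zipfile.ZIP_STORED  # stored, not deflated
        zf.writestr(info, mimetype_bytes)
    return buf.getvalue()


def sniff_media_type(package, known_types):
    """Hypothetical 'magic number' check using only fixed byte offsets."""
    if package[0:2] != b"PK":            # zip signature at position 0
        return None
    if package[30:38] != b"mimetype":    # entry name at position 30
        return None
    for media_type in known_types:       # media type starts at position 38
        expected = media_type.encode("ascii")
        if package[38:38 + len(expected)] == expected:
            return media_type
    return None


good = build_package(ODT.encode("ascii"))
with_bom = build_package(b"\xef\xbb\xbf" + ODT.encode("utf-8"))
utf16 = build_package(ODT.encode("utf-16"))

print(sniff_media_type(good, [ODT]))      # the ODT media type
print(sniff_media_type(with_bom, [ODT]))  # None: BOM shifts the bytes
print(sniff_media_type(utf16, [ODT]))     # None: different byte sequence
```

The BOM and UTF-16 cases fail simply because the bytes at position 38 no longer match the expected ASCII sequence, which is the situation the criterion is meant to rule out.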

Detecting the encoding would be imperfect and feels unnecessary. The file's contents are always read on the assumption that they are UTF-8. If the mimetype file is encoded as anything other than ASCII/UTF-8, other errors would be raised, because the value read would be incorrect and would not match the value read from the manifest.
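To make that last point concrete, a small sketch of the "always decode as UTF-8, then compare" approach follows. This is an assumption about the general shape of such a check, not the validator's actual implementation; `mimetype_matches` is a hypothetical name.

```python
def mimetype_matches(raw, manifest_media_type):
    """Hypothetical check: decode the mimetype bytes as UTF-8 without any
    encoding detection and compare against the manifest's media type."""
    # errors="replace" keeps the comparison from raising on invalid bytes;
    # a wrongly encoded file simply produces a non-matching string.
    text = raw.decode("utf-8", errors="replace")
    return text == manifest_media_type


media_type = "application/vnd.oasis.opendocument.text"

print(mimetype_matches(media_type.encode("ascii"), media_type))    # True
print(mimetype_matches(media_type.encode("utf-16"), media_type))   # False
print(mimetype_matches(b"\xef\xbb\xbf" + media_type.encode("utf-8"),
                       media_type))                                # False
```

A pure ASCII file decodes identically under UTF-8, so it passes; UTF-16 or BOM-prefixed content decodes to a different string and surfaces as a mismatch without any explicit encoding detection.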