rhdunn / cainteoir-engine

The Cainteoir Text-to-Speech core engine
http://reecedunn.co.uk/cainteoir/
GNU General Public License v3.0

improve support for archive/container formats #23

Closed rhdunn closed 12 years ago

rhdunn commented 12 years ago

The ePub and ODT/ODF formats are examples of archived/compressed formats. These consist of:

  1. a filesystem format (zip);
  2. an uncompressed mimetype file containing the mimetype of the underlying format;
  3. a META-INF folder containing metadata about the document;
  4. the format contents.

Multi-part documents can be grouped together under the same folder name without any formal overarching metadata/organisation (aside from the names being sequentially ordered on the file system).

The API will look something like this:

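#include <memory>  // std::shared_ptr
#include <utility> // std::pair

struct filesystem
{
    // Reads the named file from the container.
    virtual std::pair<rdf::uri, std::shared_ptr<buffer>>
    read(const char *filename) = 0;

    virtual ~filesystem();
};

// Creates a filesystem object for the directory or zip file at path.
std::shared_ptr<filesystem> create_filesystem(const char *path);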

If path is a directory, it will create a filesystem object that reads files from that directory on the local file system.

If path is a zip file, it will create a filesystem object that reads files in the zip file.
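As a rough sketch of that dispatch (not the engine's actual code), the example below picks an implementation from the kind of path it is given. It assumes C++17 and uses std::string as a stand-in for both rdf::uri and buffer. The names directory_filesystem and looks_like_zip are hypothetical, and the zip reader itself is left out.

```cpp
#include <filesystem>
#include <fstream>
#include <iterator>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-ins for the engine's rdf::uri and buffer types.
using uri_t = std::string;
using buffer_t = std::string;

struct filesystem
{
    virtual std::pair<uri_t, std::shared_ptr<buffer_t>>
    read(const char *filename) = 0;

    virtual ~filesystem() = default;
};

// Hypothetical implementation that reads files relative to a directory.
struct directory_filesystem : filesystem
{
    explicit directory_filesystem(std::filesystem::path root)
        : root_(std::move(root))
    {
    }

    std::pair<uri_t, std::shared_ptr<buffer_t>>
    read(const char *filename) override
    {
        const std::filesystem::path full = root_ / filename;
        std::ifstream in(full, std::ios::binary);
        if (!in)
            return { full.string(), nullptr }; // missing file: null buffer
        auto data = std::make_shared<buffer_t>(
            std::istreambuf_iterator<char>(in),
            std::istreambuf_iterator<char>());
        return { full.string(), data };
    }

private:
    std::filesystem::path root_;
};

// True if the file starts with the zip local file header signature "PK\3\4".
bool looks_like_zip(const std::filesystem::path &path)
{
    std::ifstream in(path, std::ios::binary);
    char magic[4] = {};
    in.read(magic, 4);
    return in.gcount() == 4 && magic[0] == 'P' && magic[1] == 'K'
        && magic[2] == 3 && magic[3] == 4;
}

std::shared_ptr<filesystem> create_filesystem(const char *path)
{
    if (std::filesystem::is_directory(path))
        return std::make_shared<directory_filesystem>(path);
    if (looks_like_zip(path))
        throw std::runtime_error("zip reader is not part of this sketch");
    return nullptr; // neither a directory nor a zip file
}
```

Callers only see the filesystem interface, so a content loader does not need to know whether it is reading from a directory or an archive.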

The filesystem logic will then look for a mimetype file. If that file exists, its contents will be used as the mimetype to look up the content loader for that file type, and the filesystem object will be passed to that loader. These loaders will be the ePub, ODF and other content processors.
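The mimetype lookup could be sketched along these lines. This is only an illustration under assumptions: std::string stands in for rdf::uri and buffer, and memory_filesystem, loader_t, trim and load_container are hypothetical names. The loaders return a label rather than parsing anything.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-ins for the engine's rdf::uri and buffer types.
using uri_t = std::string;
using buffer_t = std::string;

struct filesystem
{
    virtual std::pair<uri_t, std::shared_ptr<buffer_t>>
    read(const char *filename) = 0;

    virtual ~filesystem() = default;
};

// In-memory filesystem, used here only to exercise the lookup.
struct memory_filesystem : filesystem
{
    std::map<std::string, std::string> files;

    std::pair<uri_t, std::shared_ptr<buffer_t>>
    read(const char *filename) override
    {
        auto it = files.find(filename);
        if (it == files.end())
            return { filename, nullptr };
        return { filename, std::make_shared<buffer_t>(it->second) };
    }
};

// A loader receives the filesystem; here it just returns a label.
using loader_t = std::function<std::string(filesystem &)>;

// Strips leading and trailing whitespace, such as a trailing newline.
std::string trim(const std::string &s)
{
    const char *ws = " \t\r\n";
    const auto first = s.find_first_not_of(ws);
    if (first == std::string::npos)
        return {};
    return s.substr(first, s.find_last_not_of(ws) - first + 1);
}

// Chooses a loader from the contents of the mimetype file, if present.
std::string load_container(filesystem &fs,
                           const std::map<std::string, loader_t> &loaders)
{
    auto mimetype = fs.read("mimetype").second;
    if (!mimetype)
        return "multi-part"; // no mimetype file: treat as multi-part
    auto match = loaders.find(trim(*mimetype));
    if (match == loaders.end())
        return "unsupported";
    return match->second(fs);
}
```

Keeping the loaders in a map keyed by mimetype means a new container format only needs a new entry, not a change to the lookup.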

If the mimetype file does not exist, the processor will assume a multi-part document is being read. It will enumerate all items at that level, attempting to parse each one as a document and ignoring any failures. Each item will be reported as a TOC entry.
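A minimal sketch of the multi-part case, assuming C++17, might look like the following. The proposed filesystem interface has no enumeration call, so the item names are passed in directly. toc_entry, parser_t and read_multipart are hypothetical names, and sorting the names lexicographically is an assumption about what "sequentially ordered" means.

```cpp
#include <algorithm>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Hypothetical TOC entry: the item name plus a title from the parser.
struct toc_entry
{
    std::string name;
    std::string title;
};

// Hypothetical parser: returns a title, or std::nullopt on failure.
using parser_t =
    std::function<std::optional<std::string>(const std::string &)>;

// Parses each item in name order, skipping any that fail to parse.
std::vector<toc_entry> read_multipart(std::vector<std::string> names,
                                      const parser_t &parse)
{
    std::sort(names.begin(), names.end()); // assumed ordering rule
    std::vector<toc_entry> toc;
    for (const auto &name : names)
    {
        if (auto title = parse(name))
            toc.push_back({ name, *title });
    }
    return toc;
}
```

Items that fail to parse, such as images in the same folder, simply do not appear in the resulting TOC.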

rhdunn commented 12 years ago

This has been implemented for zip-based containers. Filesystem-based containers will be supported later.