rhdunn / cainteoir-engine

The Cainteoir Text-to-Speech core engine
http://reecedunn.co.uk/cainteoir/
GNU General Public License v3.0

support character encodings in supported documents #13

Closed rhdunn closed 12 years ago

rhdunn commented 12 years ago

MIME headers (mhtml, email, http) can specify a character encoding. That encoding should be honoured when processing the body content (e.g. when it is passed to the XML reader API).
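As a rough illustration of the header case, here is a minimal sketch that pulls the charset parameter out of a Content-Type header value. The function name charset_from_content_type and the utf-8 fallback are assumptions for this example, not part of the Cainteoir API, and the matching is deliberately simplistic (it looks for the first "charset=" substring rather than doing full MIME parameter parsing).

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not a Cainteoir function): returns the lowercased
// charset parameter of a Content-Type header value, or `fallback` if the
// header does not carry one.
std::string charset_from_content_type(const std::string &value,
                                      const std::string &fallback = "utf-8")
{
    // Lowercase a copy so that "Charset=" and "CHARSET=" also match.
    std::string lower(value);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    const std::string key = "charset=";
    std::string::size_type pos = lower.find(key);
    if (pos == std::string::npos)
        return fallback;
    pos += key.size();

    // The value may be quoted, e.g. charset="ISO-8859-1".
    if (pos < lower.size() && lower[pos] == '"')
    {
        std::string::size_type end = lower.find('"', pos + 1);
        if (end == std::string::npos || end == pos + 1)
            return fallback;
        return lower.substr(pos + 1, end - pos - 1);
    }

    // Otherwise the value runs up to the next ';' or whitespace.
    std::string::size_type end = lower.find_first_of("; \t\r\n", pos);
    std::string name = lower.substr(pos, end == std::string::npos ? end : end - pos);
    return name.empty() ? fallback : name;
}
```

With this sketch, a value such as "text/html; charset=ISO-8859-1" yields "iso-8859-1", and a header without a charset parameter yields the fallback.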

Also, text documents can start with a byte-order mark (BOM) that identifies which UTF encoding they use.
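A minimal sketch of BOM detection, assuming the caller has the first few bytes of the document available. The names bom_info and detect_bom are hypothetical and only used for illustration; the byte sequences themselves are the standard Unicode BOMs.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical result type: the detected encoding name (empty if there is
// no BOM) and the number of BOM bytes to skip before decoding.
struct bom_info
{
    std::string encoding;
    std::size_t length;
};

bom_info detect_bom(const unsigned char *data, std::size_t size)
{
    struct bom_entry
    {
        const char *encoding;
        unsigned char bytes[4];
        std::size_t length;
    };

    // The UTF-32 entries come first: the UTF-32LE BOM (FF FE 00 00) starts
    // with the UTF-16LE BOM (FF FE), so the longer match has to win. A
    // UTF-16LE document whose first character is U+0000 would be
    // misdetected; that ambiguity is inherent to BOM sniffing.
    static const bom_entry boms[] = {
        { "utf-32le", { 0xFF, 0xFE, 0x00, 0x00 }, 4 },
        { "utf-32be", { 0x00, 0x00, 0xFE, 0xFF }, 4 },
        { "utf-8",    { 0xEF, 0xBB, 0xBF, 0x00 }, 3 },
        { "utf-16le", { 0xFF, 0xFE, 0x00, 0x00 }, 2 },
        { "utf-16be", { 0xFE, 0xFF, 0x00, 0x00 }, 2 },
    };

    for (const bom_entry &bom : boms)
    {
        if (size >= bom.length && std::memcmp(data, bom.bytes, bom.length) == 0)
            return { bom.encoding, bom.length };
    }
    return { std::string(), 0 }; // no BOM: the caller falls back to its default
}
```

The returned length lets the caller skip the BOM before handing the rest of the buffer to a decoder.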

RTF documents handle their own encoding for escaped characters.

For text documents, it would be useful to be able to override the default encoding: default to utf-8, and override with e.g. cainteoir --encoding=latin-1 or with a preference/setting in cainteoir-gtk.

rhdunn commented 12 years ago

In addition to this work, the encoding APIs need improving:

  1. refactor set_encoding(int codepage) to only call set_encoding(string encoding) once;
  2. provide access to the current encoding name/codepage;
  3. add tests for the encoding APIs (aside from what is tested by the usage of the encoding API);
  4. don't reset the encoder if changing to the same encoding;
  5. don't pass utf-8 encoded strings through iconv -- return the buffer directly (utf-8 => utf-8 should be a no-op);
  6. provide some mechanism to signal to things like the xml reader whether it is safe to parse the original buffer without decoding it first (utf-8, latin-1 or another single/multi-byte encoding with an ascii base), or whether the entire buffer needs to be decoded first (minus any byte-order mark; this applies to utf-16/32 le/be and any strange codepages that don't have an ascii base).
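
The bookkeeping side of items 1, 2, 4, 5 and 6 could look roughly like the sketch below. Only the two set_encoding overloads are named in this issue; the class name encoding_state and every other member are hypothetical, and the conversion itself (e.g. via iconv) is left out. The "cp" + number naming and the ASCII-base check are simplifying assumptions: a real implementation would need a proper table, since some codepages (EBCDIC ones, for instance) are not ASCII-based.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical sketch; this is not the Cainteoir encoder class.
class encoding_state
{
public:
    // Returns true if the converter had to be reset, false if the
    // requested encoding is already active (item 4).
    bool set_encoding(const std::string &name)
    {
        std::string normalized(name);
        std::transform(normalized.begin(), normalized.end(), normalized.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
        if (normalized == m_name)
            return false;
        m_name = normalized;
        ++m_resets; // stands in for re-creating the converter
        return true;
    }

    // Item 1: the codepage overload delegates to the string overload once.
    // Codepage 65001 is UTF-8 on Windows; "cp<N>" is an assumed naming scheme.
    bool set_encoding(int codepage)
    {
        return set_encoding(codepage == 65001 ? std::string("utf-8")
                                              : "cp" + std::to_string(codepage));
    }

    // Item 2: expose the current encoding name.
    const std::string &encoding() const { return m_name; }

    // Item 5: utf-8 input can be returned as-is without conversion.
    bool needs_conversion() const { return m_name != "utf-8"; }

    // Item 6: can a parser scan the raw buffer for ASCII markup safely?
    // Assumption: everything except utf-16/utf-32 has an ASCII base.
    bool is_ascii_compatible() const
    {
        return !starts_with("utf-16") && !starts_with("utf-32");
    }

    int reset_count() const { return m_resets; }

private:
    bool starts_with(const std::string &prefix) const
    {
        return m_name.compare(0, prefix.size(), prefix) == 0;
    }

    std::string m_name = "utf-8"; // default encoding
    int m_resets = 0;
};
```

Here calling set_encoding("UTF-8") on a fresh object is a no-op, while set_encoding(1252) followed by set_encoding("CP1252") resets the converter only once.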

rhdunn commented 12 years ago

The encoding API work (with tests) is now in place.

rhdunn commented 12 years ago

This is done except for BOM handling.