[Feature] Unicode characters in text file [sf#3]

teambob / docfrac

DocFrac is a document converter that can convert between RTF, HTML and ASCII text, including RTF to HTML and HTML to RTF. It supports text formatting (e.g. bold), tables, and most European languages. Available for Windows and Linux, and as an ActiveX control and DLL.

GNU General Public License v2.0


[Feature] Unicode characters in text file [sf#3] #47

Open teambob opened 9 years ago

teambob commented 9 years ago

Reported by andrewpunch on 2004-06-09 03:25 UTC

Non-ASCII characters are thrown away when writing to a text file.

There is no way around this while we write to an ASCII file.

There are some other options for file formats:

mapping to a code page (e.g. ANSI)
quoted printable (same as email)
UTF8
Straight Unicode
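To make the options above concrete, here is a rough Python sketch that encodes the same text each way. It is only an illustration, not DocFrac code: `encode_options` is a hypothetical helper, cp1252 stands in for "ANSI", quoted-printable is applied on top of UTF-8 bytes, and "straight Unicode" is assumed to mean UTF-16 little endian.

```python
import quopri

def encode_options(text: str) -> dict:
    """Encode the same text under each output option (hypothetical helper)."""
    utf8 = text.encode("utf-8")
    return {
        # Code page: one byte per character; unmappable characters become "?".
        "cp1252": text.encode("cp1252", errors="replace"),
        # Quoted-printable is a transfer encoding over bytes (UTF-8 here, by assumption).
        "quoted-printable": quopri.encodestring(utf8),
        # UTF-8: ASCII stays one byte, other characters take two to four bytes.
        "utf-8": utf8,
        # "Straight Unicode" is read here as UTF-16 little endian (an assumption).
        "utf-16-le": text.encode("utf-16-le"),
    }

for name, data in encode_options("café").items():
    print(name, data)
```

Only the code page option can lose characters here; the other three keep every character at the cost of a larger or less readable file.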

From a design perspective, this could be achieved by creating maps from a single Unicode character to one or more bytes.

There could be a map for:

ASCII
ANSI
iso8859 standards
UTF8
straight unicode

The map need not be static; it may be dynamic. For example, the ASCII map may allow through all characters with a Unicode value less than 0x0080.
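A minimal sketch of the map idea, written in Python for brevity rather than in DocFrac's own code. Every name here (`CharMap`, `ascii_map`, `latin1_map`, `utf8_map`, `MAPS`) is hypothetical. Each map takes one Unicode character and returns its bytes, or `None` if the character cannot be mapped.

```python
from typing import Callable, Optional

# A character map takes one Unicode character and returns its byte
# sequence, or None when the character is not mappable.
CharMap = Callable[[str], Optional[bytes]]

def ascii_map(ch: str) -> Optional[bytes]:
    # Dynamic rule rather than a table: pass through code points below 0x0080.
    cp = ord(ch)
    return bytes([cp]) if cp < 0x80 else None

def latin1_map(ch: str) -> Optional[bytes]:
    # ISO 8859-1 coincides with the first 256 Unicode code points.
    cp = ord(ch)
    return bytes([cp]) if cp < 0x100 else None

def utf8_map(ch: str) -> Optional[bytes]:
    # One character becomes one to four bytes; lone surrogates are unmappable.
    try:
        return ch.encode("utf-8")
    except UnicodeEncodeError:
        return None

# Registry of available maps, keyed by a hypothetical option name.
MAPS = {"ascii": ascii_map, "iso8859-1": latin1_map, "utf8": utf8_map}
```

With this shape the writer only needs to look up one map and call it per character, so adding another encoding means adding one more function.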

There must be a defined process for handling a Unicode character that is not mappable using the current map.
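As an illustration of what such a process could look like, Python's built-in codec error handlers already cover the usual policies: drop the character, substitute a placeholder, escape it, or fail. The helper names below are hypothetical, and this is only an analogy, not a proposal for DocFrac's actual behaviour.

```python
def encode_with_policies(text: str) -> dict:
    """Encode to ASCII under several standard policies for unmappable characters."""
    policies = ("ignore", "replace", "backslashreplace", "xmlcharrefreplace")
    return {policy: text.encode("ascii", errors=policy) for policy in policies}

def encode_or_fail(text: str) -> bytes:
    """Strict policy: refuse to write rather than silently lose a character."""
    try:
        return text.encode("ascii", errors="strict")
    except UnicodeEncodeError as exc:
        bad = text[exc.start]
        raise ValueError(f"cannot map {bad!r} at position {exc.start}") from exc

for policy, data in encode_with_policies("café").items():
    print(policy, data)
```

The current behaviour described in this report corresponds to the "ignore" policy; "replace" at least leaves a visible marker where a character was lost.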

Created on behalf of David at Nutmeg.

teambob commented 9 years ago

Commented by andrewpunch on 2004-06-10 10:50 UTC

Another quick possibility is to use UTF-32 (little endian) encoding. This allows access to all the characters in a document without loss of information.

The technical specification is here: http://www.unicode.org/faq/specifications-jda.html
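For reference, a small Python sketch of what UTF-32 little endian output looks like. The function names are hypothetical. Every code point becomes exactly four bytes, least significant byte first, so nothing is lost, but plain ASCII text grows to four times its size.

```python
def to_utf32_le(text: str) -> bytes:
    # Each code point is written as a 4-byte little-endian integer, with no BOM.
    return text.encode("utf-32-le")

def from_utf32_le(data: bytes) -> str:
    # Decoding restores the original text exactly.
    return data.decode("utf-32-le")

sample = "A\u20ac"  # "A" followed by the euro sign (U+20AC)
encoded = to_utf32_le(sample)
print(encoded.hex(" ", 4))
assert from_utf32_le(encoded) == sample
```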

teambob commented 9 years ago

Commented by andrewpunch on 2004-10-11 06:43 UTC

Determination: The text writer will output UTF-8 by default in the next version. This will be compatible with ASCII for English characters, but will keep other characters.

ASCII, UTF16/32 and other mappings will be available as options in later versions.
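The compatibility claim can be checked with a short Python sketch. `write_text_utf8` is a hypothetical stand-in for the text writer, not the real DocFrac function. Pure ASCII input produces byte-for-byte the same output as an ASCII writer, while other characters are kept as multi-byte sequences.

```python
import io

def write_text_utf8(stream, text: str) -> int:
    """Write text to a binary stream as UTF-8; returns the number of bytes written."""
    return stream.write(text.encode("utf-8"))

buf = io.BytesIO()
write_text_utf8(buf, "plain text, then caf\u00e9")

# ASCII-only text is identical under both encodings.
assert "plain text".encode("utf-8") == "plain text".encode("ascii")
# The non-ASCII character survives as two bytes instead of being dropped.
assert buf.getvalue().endswith(b"caf\xc3\xa9")
```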

teambob commented 9 years ago

Updated by andrewpunch on 2004-10-11 06:43 UTC

summary: Non-ASCII characters in text file --> Unicode characters in text file

teambob commented 9 years ago

Updated by andrewpunch on 2004-10-11 06:45 UTC

priority: 5 --> 8

teambob commented 9 years ago

Commented by andrewpunch on 2005-04-11 12:41 UTC

This is scheduled for inclusion in 3.2.0 as UTF8 output for "text" files.