Closed tallforasmurf closed 12 years ago
Four-pronged approach: Input (Load) or Output (Save), and encoding Inferred or Specified.
Input.
Encoding Specified:
provide a menu item File > Open With Encoding with a sub-menu listing supported input encodings, e.g.
File > Open With Encoding >
Latin-1 (ISO-8859-1)
UTF-8
Windows (CP1252)
MacRoman
UTF-16
(more?)
Set the chosen encoding in the IMC, then call the standard file picker dialog as for File > Load
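A minimal sketch of how the sub-menu labels could map to Python codec names. The table `MENU_ENCODINGS` and the helper `codec_for_menu_label` are hypothetical names used only for illustration. The codec names themselves are standard Python codecs.

```python
import codecs

# Hypothetical table: sub-menu label -> Python codec name.
MENU_ENCODINGS = {
    "Latin-1 (ISO-8859-1)": "latin-1",
    "UTF-8": "utf-8",
    "Windows (CP1252)": "cp1252",
    "MacRoman": "mac_roman",
    "UTF-16": "utf-16",
}

def codec_for_menu_label(label):
    """Return the codec name for a menu label, checking that Python knows it."""
    name = MENU_ENCODINGS[label]
    codecs.lookup(name)  # raises LookupError if the codec is unavailable
    return name
```

The value returned here is what would be noted in the IMC before the file picker is called.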
Encoding Inferred:
File > Load
Call the standard file picker to get a path
Does .meta file exist? If so,
read the .meta file and from it, get the DOCUMENT_ENCODING value
note it in the IMC
break
# No .meta or .meta lacks DOCUMENT_ENCODING:
if the file extension is .win
set CP1252 in the IMC
break
if the file extension is .mac
set MACROMAN in the IMC
break
if the file extension is `.utf` or `.utx` or `.utf8`
set UTF-8 in the IMC
break
if the file name ends in `-u` or `-utf` or `-utf8`
set UTF-8 in the IMC
break
if the extension is .htm or .html
open the file as LATIN-1
read to a line that contains (charset=)|(</head)|(<body)|(</html)
close the file
if the match was to charset=
parse out the argument and set it in the IMC
else set LATIN-1 in the IMC
break
Open the file as raw bytes in Python (not Qt)
read 1000 bytes
close the file
Apply the chardet detector package to that sample
set what it finds in the IMC
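The file-name and HTML steps of the cascade above could be sketched roughly as follows. This is an illustration only: the function and table names are hypothetical, the `.meta` lookup is left out because its format is not described here, and a return value of `None` stands for "fall through to the raw-byte sample and chardet".

```python
import os
import re

# Extensions that imply an encoding, per the steps above.
EXTENSION_ENCODINGS = {
    ".win": "cp1252",
    ".mac": "mac_roman",
    ".utf": "utf-8",
    ".utx": "utf-8",
    ".utf8": "utf-8",
}
NAME_SUFFIXES = ("-u", "-utf", "-utf8")
# Stop at a charset declaration or at the first sign the head section is over.
HTML_STOP = re.compile(
    r"charset=[\"']?([\w.:-]+)|</head|<body|</html", re.IGNORECASE
)

def infer_from_name(path):
    """Return an encoding implied by the file name, or None."""
    stem, ext = os.path.splitext(os.path.basename(path))
    ext = ext.lower()
    if ext in EXTENSION_ENCODINGS:
        return EXTENSION_ENCODINGS[ext]
    if stem.lower().endswith(NAME_SUFFIXES):
        return "utf-8"
    return None

def infer_from_html(path):
    """Scan an HTML file, decoded as Latin-1, for a charset declaration."""
    with open(path, "r", encoding="latin-1") as handle:
        for line in handle:
            match = HTML_STOP.search(line)
            if match:
                # group(1) is None when a tag, not charset=, stopped the scan.
                return match.group(1) or "latin-1"
    return "latin-1"

def infer_encoding(path):
    """Apply the name and HTML checks in order; None means 'sample the bytes'."""
    encoding = infer_from_name(path)
    if encoding is None and path.lower().endswith((".htm", ".html")):
        encoding = infer_from_html(path)
    return encoding
```

Decoding as Latin-1 for the scan is safe because every byte value maps to some character, so the read cannot fail before the charset declaration is found.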
Output
File > Save or File > Save As, either way you have a path
Encoding Inferred:
if the encoding in the IMC is LATIN-1
check the character census for count of items over xFF
if nonzero, put up a dialog:
File contains non-Latin-1 characters:
Save as UTF-8?
[Cancel] [Okay, UTF-8]
If cancel, exit
Set UTF-8 in the IMC
Write the .meta file including DOCUMENT_ENCODING value from the IMC
Write the file with that encoding
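A rough sketch of the save-time decision. It assumes that "non-Latin-1" means code points above 0xFF, since Latin-1 itself covers 0x80 through 0xFF. The `confirm_utf8` callback is a hypothetical stand-in for the dialog, and counting over the whole text stands in for the character census.

```python
LATIN1_NAMES = ("latin-1", "latin1", "iso-8859-1")

def count_non_latin1(text):
    """Count characters Latin-1 cannot represent (code points above 0xFF)."""
    return sum(1 for ch in text if ord(ch) > 0xFF)

def choose_save_encoding(text, current_encoding, confirm_utf8):
    """Return the encoding to save with, or None if the user cancels.

    confirm_utf8 is a hypothetical callback for the dialog: it returns
    True for [Okay, UTF-8] and False for [Cancel].
    """
    if current_encoding.lower() in LATIN1_NAMES and count_non_latin1(text) > 0:
        if not confirm_utf8():
            return None  # user cancelled, so nothing is written
        return "utf-8"
    return current_encoding
```

A caller would store a non-`None` result in the IMC, then write the .meta file and the document with it.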
File > New
Put LATIN-1 in the IMC
File > New With Encoding
Same submenu as Open with Encoding
write the chosen encoding in the IMC
commit 0a4c8375
Per Lucy24, `.utf` is not a valid file suffix. See http://file-extension.net/seeker/program_extension_unicode for a list of extensions in use. However, she also says that "it has to be .txt", which implies doing the BOM thing on .txt, i.e. open the path as raw bytes, read a few, check for the BOM (http://en.wikipedia.org/wiki/Byte_order_mark), then close the stream and reopen it with either the latin-1 or utf8 encode flag accordingly.
Awaiting further suggestions/thoughts.
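For reference, the BOM check could look something like this in Python. It is a sketch, not a settled design: the function name `encoding_from_bom` is hypothetical, and only the UTF-8 BOM is checked, matching the latin-1 or utf8 choice described above.

```python
import codecs

def encoding_from_bom(path, default="latin-1"):
    """Return 'utf-8-sig' if the file starts with the UTF-8 BOM, else default."""
    with open(path, "rb") as handle:
        head = handle.read(len(codecs.BOM_UTF8))
    if head == codecs.BOM_UTF8:
        # The utf-8-sig codec decodes UTF-8 and drops the leading BOM.
        return "utf-8-sig"
    return default
```

The file would then be reopened with `open(path, encoding=encoding_from_bom(path))`. One limitation: a UTF-8 file saved without a BOM would still be treated as Latin-1 by this check.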