tallforasmurf / PPQT

A post-processing tool for PGDP written in Python, PyQt4, and Qt
GNU General Public License v3.0

Wrong use of file extensions esp. utf #76

tallforasmurf closed 12 years ago

tallforasmurf commented 12 years ago

Per Lucy24, .utf is not a valid file suffix. See http://file-extension.net/seeker/program_extension_unicode for a list of extensions in use. However, she also says that "it has to be .txt", which implies doing the BOM check on .txt files: open the path as raw bytes, read the first few, check for the BOM (http://en.wikipedia.org/wiki/Byte_order_mark), then close the stream and reopen it with either the Latin-1 or the UTF-8 encoding accordingly.
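
A minimal sketch of that BOM check, written for Python 3 rather than the Python 2/PyQt4 code in the repository. The function names `sniff_txt_encoding` and `read_txt` are made up for illustration, and falling back to Latin-1 when no BOM is found is an assumption taken from the description above.

```python
import codecs

def sniff_txt_encoding(path, default="latin-1"):
    """Return 'utf-8-sig' if the file starts with a UTF-8 BOM, else the default."""
    with open(path, "rb") as raw:               # raw bytes, nothing is decoded yet
        head = raw.read(len(codecs.BOM_UTF8))   # the UTF-8 BOM is three bytes
    return "utf-8-sig" if head == codecs.BOM_UTF8 else default

def read_txt(path):
    """Reopen the file as text with whichever encoding the BOM check chose."""
    # 'utf-8-sig' strips the BOM so it does not show up in the document text.
    with open(path, "r", encoding=sniff_txt_encoding(path)) as f:
        return f.read()
```

A UTF-8 file saved without a BOM would still be read as Latin-1 here, which is the weakness of relying on the BOM alone.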

Awaiting further suggestions/thoughts.

tallforasmurf commented 12 years ago

Four-pronged approach: Input (Load) or Output (Save), and encoding Inferred or Specified.

Input.

Encoding Specified:

provide a menu item File > Open With Encoding with a sub-menu listing supported input encodings, e.g.

File > Open With Encoding >
    Latin-1 (ISO-8859-1)
    UTF-8
    Windows (CP1252)
    MacRoman
    UTF-16
    (more?)

Set the chosen encoding in the IMC, then call the standard file picker dialog as for File > Load.
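
One way the sub-menu could be backed by data, sketched in plain Python without any Qt code. The labels are the ones listed above; `INPUT_ENCODINGS` and `codec_for_label` are hypothetical names, and the codec strings are standard Python codec aliases.

```python
import codecs

# Menu label -> Python codec name, in menu order.
INPUT_ENCODINGS = [
    ("Latin-1 (ISO-8859-1)", "latin-1"),
    ("UTF-8", "utf-8"),
    ("Windows (CP1252)", "cp1252"),
    ("MacRoman", "mac_roman"),
    ("UTF-16", "utf-16"),
]

def codec_for_label(label):
    """Return the codec name for a menu label, checking that Python knows it."""
    for menu_label, codec_name in INPUT_ENCODINGS:
        if menu_label == label:
            codecs.lookup(codec_name)   # raises LookupError for an unknown codec
            return codec_name
    raise ValueError("unsupported encoding: %r" % label)
```

Keeping the list in one place would let Open With Encoding and New With Encoding build the same sub-menu.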

Encoding Inferred:

File > Load
Call the standard file picker to get a path
Does a .meta file exist? If so,
    read the .meta file and look for a DOCUMENT_ENCODING value
    if found,
        note it in the IMC
        break
# No .meta, or .meta lacks DOCUMENT_ENCODING:
if the file extension is .win
    set CP1252 in the IMC
    break
if the file extension is .mac
    set MACROMAN in the IMC
    break
if the file extension is `.utf` or `.utx` or `.utf8`
    set UTF-8 in the IMC
    break
if the file name ends in `-u` or `-utf` or `-utf8`
    set UTF-8 in the IMC
    break
if the extension is .htm or .html
    open the file as LATIN-1
    read to a line that contains (charset=)|(</head)|(<body)|(</html)
    close the file
    if the match was to charset=
        parse out the argument and set it in the IMC
    else set LATIN-1 in the IMC
    break
Open the file as raw bytes in Python (not Qt)
read 1000 bytes
close the file
Apply the chardet detector package to that sample
set what it finds in the IMC
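
The chain of checks above might look roughly like this in Python 3. This is a sketch under several assumptions: the .meta file format is not described here, so the caller passes in any DOCUMENT_ENCODING it found as `meta_encoding`; the names `infer_encoding`, `html_charset`, and the constants are invented; and falling back to Latin-1 when chardet is missing or returns no guess is my addition, not part of the plan.

```python
import os
import re

try:
    import chardet  # third-party; only the final fallback needs it
except ImportError:
    chardet = None

EXTENSION_ENCODINGS = {
    ".win": "cp1252",
    ".mac": "mac_roman",
    ".utf": "utf-8", ".utx": "utf-8", ".utf8": "utf-8",
}
UTF8_NAME_SUFFIXES = ("-u", "-utf", "-utf8")
HTML_STOP = re.compile(r"charset=|</head|<body|</html", re.IGNORECASE)
CHARSET_ARG = re.compile(r"charset=[\"']?([\w.:-]+)", re.IGNORECASE)

def html_charset(path):
    """Scan an HTML file, opened as Latin-1, for a charset= declaration."""
    with open(path, "r", encoding="latin-1") as f:
        for line in f:
            if HTML_STOP.search(line):          # stop at the first marker line
                hit = CHARSET_ARG.search(line)
                return hit.group(1).lower() if hit else "latin-1"
    return "latin-1"

def infer_encoding(path, meta_encoding=None):
    """Return a codec name for path, following the order of checks above."""
    if meta_encoding:                           # DOCUMENT_ENCODING from .meta
        return meta_encoding
    stem, ext = os.path.splitext(os.path.basename(path))
    ext = ext.lower()
    if ext in EXTENSION_ENCODINGS:
        return EXTENSION_ENCODINGS[ext]
    if stem.lower().endswith(UTF8_NAME_SUFFIXES):
        return "utf-8"
    if ext in (".htm", ".html"):
        return html_charset(path)
    with open(path, "rb") as raw:               # plain Python I/O, not Qt
        sample = raw.read(1000)
    guess = chardet.detect(sample)["encoding"] if chardet else None
    return guess or "latin-1"                   # assumed default when unsure
```

For example, `infer_encoding("book-utf8.txt")` returns `"utf-8"` without opening the file, because the name-suffix rule fires before any content is read.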

Output

File > Save or File > Save As, either way you have a path

Encoding Inferred:

if the encoding in the IMC is LATIN-1
    check the character census for count of items over xFF
    if nonzero, put up a dialog:
        File contains non-Latin-1 characters:
            Save as UTF-8?
        [Cancel]     [Okay, UTF-8]
        if Cancel, exit
        set UTF-8 in the IMC

Write the .meta file including DOCUMENT_ENCODING value from the IMC
Write the file with that encoding
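
A rough Python 3 sketch of the save-side decision. It assumes "non-Latin-1" means code points above U+00FF, scans the text directly instead of consulting the character census, and uses a hypothetical `confirm_utf8` callback in place of the dialog. Writing the .meta file is left out because its format is not given here.

```python
LATIN1_NAMES = ("latin-1", "latin1", "iso-8859-1")

def choose_save_encoding(text, current_encoding, confirm_utf8):
    """Return the encoding to save with, or None if the user cancelled.

    confirm_utf8 is a hypothetical stand-in for the dialog: it returns
    True for [Okay, UTF-8] and False for [Cancel].
    """
    if current_encoding.lower() in LATIN1_NAMES:
        # Latin-1 covers U+0000..U+00FF; anything above cannot be encoded.
        if any(ord(ch) > 0xFF for ch in text):
            if not confirm_utf8():
                return None
            return "utf-8"
    return current_encoding

def save_document(path, text, current_encoding, confirm_utf8):
    """Write text to path; return the encoding used, or None on cancel."""
    encoding = choose_save_encoding(text, current_encoding, confirm_utf8)
    if encoding is None:
        return None
    # The .meta file with DOCUMENT_ENCODING would be written here as well.
    with open(path, "w", encoding=encoding) as f:
        f.write(text)
    return encoding
```

The caller would store the returned encoding back in the IMC so that later saves do not ask again.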

File > New

Put LATIN-1 in the IMC

File > New With Encoding

Same submenu as Open With Encoding
write the chosen encoding in the IMC

tallforasmurf commented 12 years ago

commit 0a4c8375