encoding issue with guiguts import

bibimbop commented 11 years ago

I have a latin1 document with accents (èêé...). I export it to guiguts, quit ppqt and remove the meta file. I re-open ppqt, open the document and select import from guiguts.

Instead of

Le livre de ce qu'il y a dans l'Hadès. (Bibliothèque de l'Ecole

I get

Le livre de ce qu'il y a dans l'Hadиs. (Bibliothиque de l'Ecole
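For reference, è turning into и is what you would see if Latin-1 bytes were decoded as Windows-1251 (Cyrillic). That the detector picked this codec is an assumption; nothing in this thread confirms it. A minimal Python 3 sketch reproducing the symptom:

```python
# Assumption: the file is Latin-1 on disk but gets decoded as Windows-1251.
original = "Le livre de ce qu'il y a dans l'Hadès. (Bibliothèque de l'Ecole"

raw = original.encode("latin-1")   # è is stored as the single byte 0xE8
garbled = raw.decode("cp1251")     # in cp1251, 0xE8 is the Cyrillic letter и

print(garbled)
# Le livre de ce qu'il y a dans l'Hadиs. (Bibliothиque de l'Ecole
```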

tallforasmurf commented 11 years ago

I presume if you use File > Open with Encoding > Latin1 it is ok?

For some reason, inferTheCodec infers something else. Is the file suffix .htm or .html or .xml? That would default to UTF if there's no charset attribute as yet.

Otherwise it must be the chardet package picking up something. If that is the case I think I'll dump it. So you would get an error "can't infer the codec". Which is better than inferring the wrong one.

bibimbop commented 11 years ago

You should really get rid of these heuristics. They don't work. What if the HTML says latin1 but the file is actually utf8?

Load the file and let python tell you the encoding, and let python tell you the encoding to use for saving.

That code is foolproof, and fast (<1s).

# Load a file and guess its encoding
def inferTheCodec(self, filename):
    rawdata = None
    try:
        with open(filename, "rb") as f:
            rawdata = f.read()
    except Exception:
        # The file could not be read at all.
        return None

    # Try utf-8 first, then latin-1. Order is important. That
    # should cover all reasonable cases.
    for codec in [ self.utfEncoding, self.ltnEncoding ]:
        try:
            # decode() will generate an exception if the codec is
            # wrong.
            text = rawdata.decode(unicode(codec))
        except Exception:
            continue

        return codec

    # Neither codec could decode the file.
    return None

bibimbop commented 11 years ago

I rest my case:

Traceback (most recent call last):
  File "/...../pqMain.py", line 315, in <lambda>
    lambda : self.fileOpen(None),
  File "/...../pqMain.py", line 872, in fileOpen
    self.loadFile(bookname, encoding) # see next method
  File "/...../pqMain.py", line 914, in loadFile
    encoding = self.inferTheCodec(bookInfo,metaInfo,True,fallBack=None)
  File "/...../pqMain.py", line 801, in inferTheCodec
    if ('confidence' in result) : # detector is working
TypeError: argument of type 'NoneType' is not iterable

tallforasmurf commented 11 years ago

I will investigate the TypeError.

If somebody writes encoding='latin-1' into an HTML header and stores the file as UTF, that is a user error. The W3C standard is very clear: the first 1024 bytes must be Latin-1-compatible and contain a correct encoding= attribute for the rest of the file.

Also, I still feel constrained to support the documented PG/PGDP convention for filenames with -xxx.txt. Yes, somebody could rename a file mybook-utf.txt when it is really latin-1, but that would still be a user error.

I think I will probably throw away chardet; it appears to be useless. The only point of it was to be able to open UTF-16 or some weird codepage like Microsoft's Greek or Turkish. I will add to the help file that if your book isn't already utf-8 or Latin1 you must convert it to utf-8 somehow before opening it.

I investigated using python rawbytes.decode as a detector with interesting results. Last year I prepared a file char-sets-utf.txt that contains all the glyphs supported by ISO-8859-1, CP1252, and MacRoman. I'll email it to you if you like, as I can't attach it here.

From this file I made 3 other files, using BBEdit to save them using the ISO-8859-1, CP1252, and MacRoman codecs respectively. I made a similar function to yours:

def test_file(path):
  codec_list = [ 'utf-8', 'iso-8859-1', 'cp1252', 'macroman' ]
  # Read the raw bytes; the with-block closes the file afterwards.
  with open(path, "rb") as f:
    rb = f.read()
  for codec in codec_list :
    try:
      rb.decode(codec)
      print('{0} detected'.format(codec))
    except UnicodeDecodeError:
      # The bytes are not valid in this codec.
      print('{0} rejected'.format(codec))

Here are the results for the four files.

test_file(path+'char-sets-utf.txt')
utf-8 detected
iso-8859-1 detected
cp1252 rejected
macroman detected

This is clearly wrong: there are definitely glyphs in the -utf file that are in neither Latin1 nor macroman.

test_file(path+'char-sets-ltn.txt')
utf-8 rejected
iso-8859-1 detected
cp1252 detected
macroman detected

There is no binary difference between Latin-1 and the other two 8-bit codings, only user intent. So it "detects" all three.
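A short Python 3 sketch of why a successful iso-8859-1 decode proves nothing: that codec assigns a character to every one of the 256 byte values, so it never raises, and UTF-8 bytes pushed through it simply come out as the wrong characters. The sample word is only for illustration.

```python
# Every possible byte value decodes under iso-8859-1 without error.
every_byte = bytes(range(256))
as_latin1 = every_byte.decode("iso-8859-1")

# Each byte maps to the code point with the same numeric value.
round_trip = [ord(ch) for ch in as_latin1] == list(range(256))
print(round_trip)  # True

# UTF-8 bytes therefore "succeed" as Latin-1 too, but the text is garbled.
utf8_bytes = "Hadès".encode("utf-8")        # è becomes the two bytes C3 A8
mojibake = utf8_bytes.decode("iso-8859-1")
print(mojibake)  # HadÃ¨s
```

So a decode that does not raise only says the bytes are legal in that codec, not that the codec is the one the author used.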

test_file(path+'char-sets-cp1252.txt')
utf-8 rejected
iso-8859-1 detected
cp1252 detected
macroman detected

test_file(path+'char-sets-mac.txt')
utf-8 rejected
iso-8859-1 detected
cp1252 rejected
macroman detected

I wonder what it sees in the -mac text that makes it reject cp1252?
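A likely explanation, sketched in Python 3: Python's cp1252 codec leaves five byte values undefined, and MacRoman uses those same byte values for ordinary accented letters. Whether these are the bytes that tripped the -mac file is an assumption, since the test files are not attached here.

```python
# Collect the byte values that Python's cp1252 codec refuses to decode.
undefined = []
for value in range(256):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(hex(value))

print(undefined)  # ['0x81', '0x8d', '0x8f', '0x90', '0x9d']

# In MacRoman those byte values are ordinary letters, for example:
print("Å".encode("mac_roman"))  # b'\x81'
print("è".encode("mac_roman"))  # b'\x8f'
```

The same five values fall in the range of UTF-8 continuation bytes, which would also account for cp1252 rejecting the -utf file.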

Granted, if we only supported utf and ltn, taking the first successful decode would do the job. But that's just as much a heuristic as going on file suffixes.

bibimbop commented 11 years ago

PGDP creates latin1 documents and DP-IT and PG-Canada create utf-8 documents. That's the audience of PPQT. If someone is converting that to something else, it is their problem, they will notice, and there's a workaround for it (open with encoding).

Don't make it difficult for the regular case.

Also a user may rename projectIDxxxxxxxx.txt (from PGDP, so latin1) to project-name-utf.txt because eventually that file will become utf8 encoded.

In any case, I changed that behaviour in my fork, so I'm not affected anymore.

tallforasmurf commented 11 years ago

Commits b1f4a4a and f04bbda changed the action when the book codec cannot be inferred. It now shows a dialog with three choices: Open UTF-8, Open Latin-1, and Cancel. I believe all the bases are covered now!

If the PG-approved filename convention is used, we use it. If there's a .meta file, we use that. If it is HTML with a proper encoding= attribute, we use that. If none of the above, we give the user an easy one-click choice of the most common encodings. And for win/mac there's the menu choice.
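The order of checks described above could be sketched roughly as follows in Python 3. This is not the code from the commits: the function name, its parameters, and the -utf/-ltn suffix table are hypothetical, with the suffixes borrowed from filenames mentioned earlier in this thread.

```python
import os
import re

# Hypothetical suffix table; the real filename convention may use other suffixes.
SUFFIX_CODECS = {"-utf": "utf-8", "-ltn": "latin-1"}

# Matches charset=... or encoding=... in the leading bytes of a file.
DECLARATION = re.compile(rb"""(?:charset|encoding)\s*=\s*["']?([\w-]+)""")

def infer_codec(filename, meta_codec=None, head=b""):
    """Return a codec name, or None if the user has to be asked.

    meta_codec: codec recorded in the .meta file, if there is one.
    head: the first bytes of the file (1024 is enough for a declaration).
    """
    stem, ext = os.path.splitext(os.path.basename(filename))
    # 1. Filename convention.
    for suffix, codec in SUFFIX_CODECS.items():
        if stem.endswith(suffix):
            return codec
    # 2. Codec saved in the .meta file.
    if meta_codec is not None:
        return meta_codec
    # 3. Declaration inside an HTML or XML file.
    if ext.lower() in (".htm", ".html", ".xml"):
        match = DECLARATION.search(head)
        if match:
            return match.group(1).decode("ascii").lower()
    # 4. Nothing matched: the caller shows the UTF-8 / Latin-1 / Cancel dialog.
    return None
```

For example, `infer_codec("mybook-utf.txt")` gives `"utf-8"`, and `infer_codec("book.txt")` gives `None`, which is the case where the dialog would appear.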

To quote Ed China of Wheeler Dealers: "Job done, I reckon."

tallforasmurf / PPQT

encoding issue with guiguts import #153