pyg3t / poproofread

GNU General Public License v3.0

Cannot use a file that is not UTF-8 encoded #5

Closed KennethNielsen closed 8 years ago

KennethNielsen commented 9 years ago

I tried to open a file which uses latin1 (ISO-8859-1) encoding. It gave this warning message to the console:

/usr/lib/pymodules/python2.7/poproofread/poproofread_gtk.py:261: GtkWarning: gtk_text_buffer_emit_insert: assertion `g_utf8_validate (text, len, NULL)' failed
  textbuffer.insert(startiter, text)

I can go forward and back in the file with the PageUp and PageDown keys, but the diff and comment windows are all empty, except for the last one:

Number of messages: 3

which happens to be the only chunk that contains only ASCII characters.
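The symptom above is consistent with how UTF-8 validation behaves: bytes from a Latin-1 file that contain non-ASCII characters are usually not valid UTF-8, while pure ASCII bytes always are. A minimal sketch of that check follows. It is written for Python 3 (the report concerns Python 2.7), and the helper name `is_valid_utf8` and the sample strings are made up for illustration; they are not taken from poproofread.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string can be decoded as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A chunk with a Danish letter, stored as Latin-1 (ISO-8859-1) bytes.
latin1_chunk = 'msgstr "Oversættelse"'.encode("latin-1")

# A chunk with ASCII characters only, like the last chunk in the report.
ascii_chunk = b"Number of messages: 3"

print(is_valid_utf8(latin1_chunk))
print(is_valid_utf8(ascii_chunk))
```

The first chunk fails validation because the Latin-1 byte for "æ" starts a multi-byte UTF-8 sequence that is never completed; the ASCII chunk passes because ASCII is a subset of UTF-8.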


Imported from Launchpad using lp2gh.

KennethNielsen commented 9 years ago

(by askhl) Maybe the best way to fix this is to always use Python unicode objects internally in poproofread, since the text has to be handled correctly by gtk. The decode() method in gtparse can be used to do this easily (if it works right now; otherwise we should fix it first).
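The "unicode internally" idea can be sketched as decoding once when the file is read and encoding only at the boundaries (UTF-8 for gtk, the file's own charset on export). This is an illustration in Python 3, where `str` plays the role of Python 2's `unicode`; the function names `read_as_text` and `write_as_bytes` are hypothetical and do not show gtparse's actual `decode()` interface.

```python
def read_as_text(raw: bytes, encoding: str) -> str:
    """Decode once at the input boundary; everything after this is text."""
    return raw.decode(encoding)


def write_as_bytes(text: str, encoding: str) -> bytes:
    """Encode once at the output boundary, in the file's own charset."""
    return text.encode(encoding)


# Bytes as they would appear in an ISO-8859-1 encoded file.
raw = 'msgstr "Oversættelse"\n'.encode("iso-8859-1")

# Internal representation: a text object, independent of the file encoding.
text = read_as_text(raw, "iso-8859-1")

# What a UTF-8-only consumer such as a gtk text buffer would need.
utf8_for_gtk = text.encode("utf-8")

# Exporting in the original charset reproduces the original bytes.
exported = write_as_bytes(text, "iso-8859-1")
```

With this arrangement the file's encoding matters only in the two boundary functions, so the rest of the program never has to care which charset the file used.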

KennethNielsen commented 9 years ago

(by k-nielsen81) Hello Byrial

Thanks for reporting this. Character encoding was one of those things that I had deliberately not done yet, because it is tricky and not much fun :| But it definitely needs to be done, so now is as good a time as any.

@Ask. I agree that the best way to handle this is to go all Python unicode internally. So we'll decode at parse time and possibly encode back at export time. Regarding how to determine the character encoding, I'll give that a little more thought. I would like it to remain independent of pyg3t for essential functions, so my initial thought is to:

1. Look for the magic character encoding words from the po-files in the first chunk.
2. If that fails, fall back on a character encoding guessing library; I think I have read that one exists.

In both cases, do the "read, then re-read with the correct encoding" trick from the parser.

But actually, since we have just determined that podiffs will always contain a header, and that the program is designed to work on podiffs and po-files, option 1 really should cover it.
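Option 1 and the read/re-read trick could look roughly like the sketch below: scan the raw bytes for the `charset=` declaration that a PO header carries in its Content-Type line, then decode the whole file with that charset. This is an assumption-laden Python 3 illustration, not the code from revision 92; the names `detect_charset` and `decode_po`, the regular expression, and the UTF-8 default are all made up here, and the guessing-library fallback from option 2 is left out.

```python
import codecs
import re

# Matches e.g. charset=ISO-8859-1 inside the Content-Type header line.
CHARSET_RE = re.compile(rb"charset=([A-Za-z0-9_.:-]+)", re.IGNORECASE)


def detect_charset(raw: bytes, default: str = "utf-8") -> str:
    """First pass: find the declared charset without decoding the file."""
    match = CHARSET_RE.search(raw)
    if match is None:
        return default
    name = match.group(1).decode("ascii")
    try:
        # Normalise the name and reject codecs Python does not know,
        # such as the "CHARSET" placeholder found in template files.
        return codecs.lookup(name).name
    except LookupError:
        return default


def decode_po(raw: bytes) -> str:
    """Second pass: decode the whole file with the detected charset."""
    return raw.decode(detect_charset(raw))


sample = (
    b'msgid ""\n'
    b'msgstr ""\n'
    b'"Content-Type: text/plain; charset=ISO-8859-1\\n"\n'
    b"\n"
    b'msgid "Translation"\n'
    b'msgstr "Overs\xe6ttelse"\n'
)

text = decode_po(sample)
```

Searching the undecoded bytes works because the header line itself is plain ASCII, so it reads the same in every ASCII-compatible encoding.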

Regards Kenneth

KennethNielsen commented 9 years ago

(by k-nielsen81) Byrial, can you provide a test case file for this (preferably a podiff)? It has been some time since I have encountered a file in an encoding different from UTF-8.

KennethNielsen commented 9 years ago

(by byrial-t) I have attached a diff file produced by podiff with ISO-8859-1 encoding.

KennethNielsen commented 9 years ago

(by k-nielsen81) Thanks.

KennethNielsen commented 9 years ago

(by k-nielsen81) Note to self. Missing: Handle char set warnings on save in poproofread_gtk and uncomment return statements in __detect_character_encoding
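A charset warning on save is needed because text that is fine internally may not be representable in the file's original encoding (for example, a comment containing "€" cannot be written as ISO-8859-1). One possible shape for that check is sketched below in Python 3; `encode_for_save` and the wording of the warning are hypothetical and are not taken from poproofread_gtk.

```python
def encode_for_save(text: str, encoding: str):
    """Return (data, warning).

    data is the encoded bytes, or None if the text cannot be represented
    in the requested encoding; warning is None when encoding succeeded.
    """
    try:
        return text.encode(encoding), None
    except UnicodeEncodeError as error:
        # error.start and error.end delimit the offending characters.
        bad = text[error.start:error.end]
        warning = "Cannot represent %r in %s" % (bad, encoding)
        return None, warning


# Representable in ISO-8859-1: saves without a warning.
ok_data, ok_warning = encode_for_save("Oversættelse", "iso-8859-1")

# The euro sign is not part of ISO-8859-1: produces a warning instead.
bad_data, bad_warning = encode_for_save("price: 5 €", "iso-8859-1")
```

A save dialog could show the returned warning and offer to save in another encoding instead of writing a damaged file.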

KennethNielsen commented 9 years ago

(by k-nielsen81) Note to self. Missing: Trim down the codec list with invalid codecs, add comment about trying to save in the dialog and test.

KennethNielsen commented 9 years ago

(by k-nielsen81) Fixed with revision 92

KennethNielsen commented 8 years ago

Already fixed. Closing.