pombreda / txt2tags

Automatically exported from code.google.com/p/txt2tags
GNU General Public License v2.0

Saving unicode content to file on Windows #57

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
I ran into a problem running txt2tags 2.5 on Windows with Python 2.6: it 
seems unable to write unicode strings to a file using writelines.

"""
# Part of function SaveFile (approx. line 1615)
if type(contents) == type([]): doit = f.writelines
else: doit = f.write
doit(contents) ; f.close()
"""

'contents' is a list containing both unicode and non-unicode strings

I solved the problem with:

contents = [contents[i].encode('utf8') for i in range(len(contents))]

Other people might need a different encoding. Also, I don't know if this is 
available in older versions of Python

PS: This first happened with 'sample.t2t' from sample folder

Original issue reported on code.google.com by jbv...@gmail.com on 5 Nov 2010 at 4:32

GoogleCodeExporter commented 9 years ago
Joao, will your fix for this issue in Python 3 (r384, line 2182) also work in 
Python 2 on Windows?

Original comment by aureliojargas@gmail.com on 9 Nov 2010 at 8:36

GoogleCodeExporter commented 9 years ago
People say "lightning doesn’t strike twice in the same place"...

I don't think **this** error was caused by what I said it was. It can only be 
reproduced by using the --gui option and selecting the file by clicking the 
'select' button.

My Windows username is 'João Bernardo' (yes, it has a space and a tilde, and it's 
good only for provoking errors). Seriously! I've sent lots of bug reports to 
different programs because of that.

So, the error happens JUST because the script prints the command line used!!
<!-- cmdline: txt2tags -t html C:/Users/João Bernardo/Desktop/a/sample.t2t -->

By doing

contents = [i.encode('utf8') for i in contents]  # this line is better than the one written previously

we encode my username with UTF-8 and everything works (since ***ALL*** text is 
encoded in utf-8)
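
A minimal sketch of the same idea that encodes only the text items, assuming UTF-8 output. The helpers `to_bytes` and `save_lines` are hypothetical names, not part of txt2tags. The type check is there because, in Python 2, calling `.encode()` on a byte string that already holds non-ASCII bytes triggers an implicit ASCII decode and fails.

```python
import os
import tempfile

def to_bytes(item, encoding='utf-8'):
    """Return item as bytes: encode text, pass byte strings through."""
    if isinstance(item, bytes):      # already encoded (plain str in Python 2)
        return item
    return item.encode(encoding)     # text (unicode in Python 2, str in Python 3)

def save_lines(path, contents, encoding='utf-8'):
    """Write a list that may mix text and byte strings to path."""
    with open(path, 'wb') as f:      # binary mode: only bytes are written
        f.writelines(to_bytes(item, encoding) for item in contents)

# A list shaped like the one that failed: one text item among byte strings.
contents = [
    b'<html>\n',
    u'<!-- cmdline: txt2tags -t html C:/Users/Jo\xe3o Bernardo/Desktop/a/sample.t2t -->\n',
    b'</html>\n',
]
path = os.path.join(tempfile.mkdtemp(), 'out.html')  # illustrative output file
save_lines(path, contents)
```

This runs on both Python 2 and Python 3, and it assumes the byte strings in the list are already in the target encoding.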

------------------
The greatest feature of Python 3 is the use of Unicode Strings so it's not 
affected by this problem. But....

Using the 'encode' method, we lose information and accents are no longer 
possible. And, at the same time, you can't write unicode strings to binary 
files.

------------------

So what should we do??
Python 2 -> the proposed patch will only work if the t2t file is encoded in 
utf-8.... But, knowing the problem is with tkinter, doing:

newfile = askopenfilename(filetypes=ftypes).encode('utf-8')    #AT LINE ~5658

might solve the problem.... (haven't tested yet!!)

Python 3 -> Just doing "f = open(file_path, 'w')" is ok
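
For the Python 3 case, a minimal sketch of writing this kind of content in text mode. The explicit `encoding` argument is an assumption added here, not something stated in this thread: without it, `open()` falls back to the locale's preferred encoding, which on Windows is often not UTF-8. The file name `out.html` is illustrative.

```python
import os
import tempfile

# In Python 3 every str is Unicode, so the list holds only text items.
# '\xe3' is the escape for the "a with tilde" in the username.
contents = [
    '<html>\n',
    '<!-- cmdline: txt2tags -t html C:/Users/Jo\xe3o Bernardo/Desktop/a/sample.t2t -->\n',
    '</html>\n',
]

file_path = os.path.join(tempfile.mkdtemp(), 'out.html')

# Text mode: each str is encoded on write using the given encoding.
with open(file_path, 'w', encoding='utf-8') as f:
    f.writelines(contents)

# Reading back with the same encoding recovers the accented name intact.
with open(file_path, encoding='utf-8') as f:
    round_tripped = f.read()
```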

Original comment by jbv...@gmail.com on 9 Nov 2010 at 10:29

GoogleCodeExporter commented 9 years ago
Oops... askopenfilename() returns a file and not its name. :(

So, the idea is to find where the name of the file is and encode it.

Original comment by jbv...@gmail.com on 9 Nov 2010 at 10:48

GoogleCodeExporter commented 9 years ago
Could this be fixed with python's ``str.decode``? With it we could take the 
declared encoding (or maybe even a separate ``--src-encoding``) and decode it 
into Unicode without losing accent information. Then afterward we could 
``encode`` the text back into the user's declared encoding before writing the 
file. Would that work without breaking anything?

Original comment by jamisee...@gmail.com on 17 Nov 2010 at 7:59

GoogleCodeExporter commented 9 years ago
I don't have much time right now, but this seems easy to fix.
"str.decode" probably won't work. 
Doing -> u'ãáàä'.decode() <- raises an exception

After doing all file handling (things that may access dirs with accented names) 
the text should be encoded to {whatever txt2tags will be using} before 
appending to the list used in SaveFile().

The generated list (e.g. for html) looks something like this:
[ '<html>', '<head>', ... , u'<!-- cmdline: txt2tags -t html 
/folder/with/unicode/aãáä/file -->', '</body></html>' ]

>> The last but one item is a unicode string!

This is **not** a Windows-only problem. I tried Debian (w/ Python 2.5) 
and got the same message using a file in my folder "/home/jb/joão/" (attached 
image)

That means Tkinter also returns file names as unicode on Linux (probably on other 
*nix platforms too).

Original comment by jbv...@gmail.com on 17 Nov 2010 at 11:58

Attachments:

GoogleCodeExporter commented 9 years ago
Comment from Jason Seeley in txt2tags-dev
http://groups.google.com/group/txt2tags-dev/msg/9ad4b233d0061671

> "str.decode" probably won't work.
> Doing -> u'ãáàä'.decode() <- raises an exception

No, decode is used on an encoded string and (if passed the correct
encoding as an argument) returns a unicode string. It works in the
opposite direction from encode.

so (assuming a utf-8 encoding for the text):

>>> 'ãáàä'.decode('utf-8') == u'ãáàä'
True
>>> u'ãáàä'.encode('utf-8') == 'ãáàä'
True

The thought being you could write your document in your normal
encoding (which is most likely utf-8, but could just as easily be
latin1 or Shift-JIS or something else entirely). Txt2tags could use
decode as above to convert that into unicode strings, so that all
internal operations work as expected, then re-encode to the proper
encoding afterward without losing special characters or accents as
long as the target encoding supports them.
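
The round trip described above might look like the following sketch. The function `convert` and its `src_encoding` and `dst_encoding` parameters are hypothetical, and the `upper()` call is only a stand-in for txt2tags' real processing.

```python
def convert(raw, src_encoding, dst_encoding):
    """Decode input bytes, process them as text, then encode for output."""
    text = raw.decode(src_encoding)         # bytes -> text (unicode)
    processed = text.upper()                # placeholder for the real conversion
    return processed.encode(dst_encoding)   # text -> bytes, ready to write

# The same four accented characters: Latin-1 on input, UTF-8 on output.
latin1_input = u'\xe3\xe1\xe0\xe4'.encode('latin-1')
utf8_output = convert(latin1_input, 'latin-1', 'utf-8')

# Accents survive because both encodings can represent these characters.
recovered = utf8_output.decode('utf-8')
```

If the target encoding cannot represent a character, `encode` raises `UnicodeEncodeError` unless an `errors` policy such as `'replace'` is passed.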

Original comment by aureliojargas@gmail.com on 18 Nov 2010 at 2:05