isso-comments / isso

a Disqus alternative
https://isso-comments.de
MIT License

Cannot import comments from Wordpress with Unicode characters #93

Closed jgosmann closed 10 years ago

jgosmann commented 10 years ago

When trying to import comments from a WordPress XML file containing Unicode characters, I get the following exception:

Traceback (most recent call last):
  File "/Volumes/Home/blubb/Library/Python/2.7/bin/isso", line 9, in <module>
    load_entry_point('isso==0.9.dev0', 'console_scripts', 'isso')()
  File "/Volumes/Home/blubb/Documents/programming/isso/isso/__init__.py", line 228, in main
    migrate.dispatch(args.type, mydb, args.dump)
  File "/Volumes/Home/blubb/Documents/programming/isso/isso/migrate.py", line 270, in dispatch
    WordPress(db, dump).migrate()
  File "/Volumes/Home/blubb/Documents/programming/isso/isso/migrate.py", line 221, in migrate
    progress.update(i, thread.find("title").text)
  File "/Volumes/Home/blubb/Documents/programming/isso/isso/migrate.py", line 54, in update
    sys.stdout.write("\r[{0:.0%}]  {1}".format(i/self.end, message))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 15: ordinal not in range(128)

The file is UTF-8 encoded. The problematic character is a German umlaut (ä).

I was using the current master branch.

jgosmann commented 10 years ago

Might be relevant: http://drj11.wordpress.com/2007/05/14/python-how-is-sysstdoutencoding-chosen/

jgosmann commented 10 years ago

Setting LC_CTYPE as suggested in the link above does not work for me. However, the problem wasn't too hard to fix by making the encoding of the progress output failsafe.
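A minimal sketch of what a failsafe progress write could look like, written for Python 3 rather than the Python 2.7 shown in the traceback. The helper name `safe_write` and the sample title are hypothetical, and this is not necessarily how the fix in Isso was implemented; it only illustrates replacing characters the output stream cannot encode instead of raising `UnicodeEncodeError`.

```python
import io


def safe_write(stream, text):
    """Write text to stream, replacing characters the stream cannot encode."""
    # Fall back to ASCII when the stream does not report an encoding.
    encoding = getattr(stream, "encoding", None) or "ascii"
    # Round-trip through the stream's encoding with errors="replace" so that
    # unencodable characters become "?" instead of raising UnicodeEncodeError.
    stream.write(text.encode(encoding, errors="replace").decode(encoding))


# Simulate a terminal whose encoding is ASCII, as in the traceback above.
ascii_stdout = io.TextIOWrapper(io.BytesIO(), encoding="ascii", newline="")
safe_write(ascii_stdout, "\r[{0:.0%}]  {1}".format(0.5, "Erklärung"))
ascii_stdout.flush()
output = ascii_stdout.buffer.getvalue()
print(output)
```

The umlaut is lost in the progress line, but the import itself keeps running, which is the point of making only the display path lossy.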

posativ commented 10 years ago

Should be fixed now.

jgosmann commented 10 years ago

Just tested isso 0.9.2 and it seems to work. :)

pvorb commented 9 years ago

I get the following error message when trying to import comments from Disqus:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3282: ordinal not in range(128)

isso --version

tells me I'm using v0.9.8.

pvorb commented 9 years ago

Full stack trace:

Traceback (most recent call last):
  File "/usr/local/bin/isso", line 9, in <module>
    load_entry_point('isso==0.9.8', 'console_scripts', 'isso')()
  File "/usr/local/lib/isso/local/lib/python2.7/site-packages/isso/__init__.py", line 231, in main
    migrate.dispatch(args.type, mydb, args.dump)
  File "/usr/local/lib/isso/local/lib/python2.7/site-packages/isso/migrate.py", line 261, in dispatch
    peek = fp.read(io.DEFAULT_BUFFER_SIZE)
  File "/usr/local/lib/isso/lib/python2.7/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3282: ordinal not in range(128)

posativ commented 9 years ago

It is definitely a bug in Isso. Can you give me details about your environment, specifically: LANG, LANGUAGE, LC_ALL, LC_CTYPE?

pvorb commented 9 years ago

Here's the result from locale. Should I reset any of these?

LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

posativ commented 9 years ago

It should work with LANG=C.UTF-8, I think. The main issue is that Python uses the variables mentioned above to determine the file encoding when reading a file in text (non-binary) mode.
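The effect can be reproduced in a small sketch. Recent Python 3 versions may coerce a POSIX/C locale to UTF-8 on their own, so the ASCII default is emulated here by passing `encoding="ascii"` explicitly; the file content is made-up sample data, not the actual dump.

```python
import io
import locale
import os
import tempfile

# The encoding Python derives from the locale for text-mode files.
print(locale.getpreferredencoding(False))

# Write a small UTF-8 file containing a non-ASCII character (sample data).
fd, path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "wb") as fp:
    fp.write("<title>Erklärung</title>".encode("utf-8"))

# Emulate a POSIX/C locale, where the derived default is ASCII: the first
# byte of the UTF-8 encoded "ä" (0xc3) cannot be decoded.
try:
    with io.open(path, encoding="ascii") as fp:
        fp.read()
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Naming the encoding explicitly makes the read independent of the locale.
with io.open(path, encoding="utf-8") as fp:
    text = fp.read()

os.remove(path)
print(ascii_failed, text)
```

Passing the encoding explicitly is the usual way to make such a read independent of the environment, while setting LANG=C.UTF-8 changes the default for every text-mode read.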

pvorb commented 9 years ago

I tried it and it worked. Thank you.

posativ commented 9 years ago

The UnicodeDecodeError during the import is now fixed, but the issue remains for the configuration file. If your configuration file contains non-ASCII characters, Isso will fail, and I am not going to attempt to fix this, because

In the end, all you have to do is configure your locale correctly (or at least the encoding; C.UTF-8 is sufficient).
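One way to check whether the current locale yields a UTF-8 default before running an import is sketched below. The helper `encoding_is_utf8` is hypothetical and not part of Isso; it only normalizes the encoding name that Python reports for the locale.

```python
import codecs
import locale


def encoding_is_utf8(name):
    """Return True if the given encoding name is an alias of UTF-8."""
    try:
        # codecs.lookup() resolves aliases such as "UTF8" or "utf_8".
        return codecs.lookup(name).name == "utf-8"
    except LookupError:
        # Unknown encoding names are treated as "not UTF-8".
        return False


preferred = locale.getpreferredencoding(False)
if encoding_is_utf8(preferred):
    print("Locale encoding is UTF-8:", preferred)
else:
    print("Locale encoding is not UTF-8:", preferred)
    print("Consider setting LANG=C.UTF-8 before starting Isso.")
```

With a POSIX locale like the one listed earlier in this thread, Python 2.7 would be expected to report an ASCII alias here, and setting LANG=C.UTF-8 changes that.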

pvorb commented 9 years ago

This will be fine, I guess. Probably it's worth mentioning in the documentation somewhere.