Guesser fails with non-ASCII characters

ralsina / aranduka

Automatically exported from code.google.com/p/aranduka

Guesser fails with non-ASCII characters #61

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago

Trying to guess a book (or author) name with non-ASCII characters, such as "¿Por Qué Socialismo?", fails with a Unicode warning.

Original issue reported on code.google.com by algoz...@gmail.com on 20 Apr 2011 at 4:58

GoogleCodeExporter commented 9 years ago

This issue was updated by revision 1ea314350be8.

Starting to work on this

Original comment by andresgattinoni on 29 Aug 2011 at 4:16

GoogleCodeExporter commented 9 years ago

This issue was closed by revision 265ed60e3450.

Original comment by andresgattinoni on 29 Aug 2011 at 4:16

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

Ramiro, can you test it before I merge the fix to the integrate branch?

Original comment by andresgattinoni on 29 Aug 2011 at 4:18

GoogleCodeExporter commented 9 years ago

Oki doki, I'll test it tonight and let you know.

Original comment by algoz...@gmail.com on 29 Aug 2011 at 5:00

GoogleCodeExporter commented 9 years ago

Andrés,

While testing this fix, I found that Alibris isn't working with non-ASCII chars. For example, I tried "Televisión" and got a "can't decode" error.

OTOH, the Google one seems to work OK.

Original comment by algoz...@gmail.com on 5 Sep 2011 at 3:10

GoogleCodeExporter commented 9 years ago

Ok, now the problem is in another place. It's on line 71 of the Alibris plugin:
http://code.google.com/p/aranduka/source/browse/src/plugins/guess_alibris/__init__.py?name=issue61#71

When it tries to decode the title that comes from Alibris, in some cases it raises that exception. If I remove the .decode('utf-8') or use unicode(book.get('title', 'No Title')) instead, it doesn't fail, but some characters are not displayed properly.

These encoding issues are always a pain... I'm not sure what the best way to fix this would be.

Original comment by andresgattinoni on 7 Sep 2011 at 3:51

GoogleCodeExporter commented 9 years ago

Reading the API docs, I found this page for trying out queries:
http://developer.alibris.com/iodocs

I searched for "Televisión" and this is the response:

<?xml version="1.0" encoding="iso-8859-1" ?>
<ALIBRIS xmlns:dt="urn:schemas-microsoft-com:datatypes">
[...]a lot of xml[...]

Sooo... it appears that the response is encoded in iso-8859-1 instead of UTF-8 ;-)

Original comment by algoz...@gmail.com on 7 Sep 2011 at 4:13
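
One way to honor the declared encoding without decoding by hand is to pass the raw bytes to an XML parser, which reads the declaration itself. A minimal sketch in Python 3, assuming a made-up response body: only the XML declaration and the ALIBRIS root element come from the response quoted above, while the `book` and `title` elements are hypothetical and not the real Alibris schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body. Only the XML declaration and the ALIBRIS root
# element come from the thread; the <book>/<title> elements are made up.
raw = (
    '<?xml version="1.0" encoding="iso-8859-1" ?>'
    '<ALIBRIS><book><title>Televisión</title></book></ALIBRIS>'
).encode("iso-8859-1")

# Passing bytes (not text) lets the parser apply the declared encoding.
root = ET.fromstring(raw)
title = root.find("book/title").text

print(type(title).__name__, title)
```

With a parser like this, the title comes back as text that is already decoded, so a further `.decode()` call on it would be redundant. Whether the plugin obtains its titles this way is not stated in the thread.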

GoogleCodeExporter commented 9 years ago

I tried doing .decode('iso-8859-1'), but the result is the same. I get the error:
"'ascii' codec can't encode character u'\xe8' in position 2: ordinal not in range(128)"

Original comment by andresgattinoni on 7 Sep 2011 at 5:06
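
The message above is an encode error even though the call was `.decode()`. In Python 2, calling `.decode()` on a value that is already a unicode object first encodes it implicitly with the ascii codec, and that hidden step is what fails on non-ASCII characters. A rough sketch in Python 3 that makes the hidden step explicit; the helper name `py2_style_decode` and the sample title are made up for illustration, since the thread does not say which title triggered the error.

```python
def py2_style_decode(text, encoding):
    """Imitate Python 2's unicode.decode(): implicit ASCII encode, then decode."""
    return text.encode("ascii").decode(encoding)


# Hypothetical title with the character U+00E8 at position 2.
title = "Crème"

try:
    py2_style_decode(title, "iso-8859-1")
    error = None
except UnicodeEncodeError as exc:
    # The failure comes from the implicit encode, not from the decode itself.
    error = exc

print(error)
```

Changing the codec passed to `.decode()` cannot help here, because the failure happens before that codec is ever used.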

GoogleCodeExporter commented 9 years ago

If you remove the .decode('[...]') and print type(title) after that, you can see that the title is already a unicode string, so there's no need to decode it. Why some characters are not displayed correctly is a mystery; I think it is an Alibris problem. I suggest leaving it without the .decode() call.

Original comment by algoz...@gmail.com on 7 Sep 2011 at 5:42
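
A defensive variant of this suggestion is to decode only when the value is still raw bytes. The sketch below uses Python 3 types (`bytes` and `str` in place of Python 2's `str` and `unicode`); the helper name `to_text` and the sample record are hypothetical and not part of the plugin.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding only when it is still raw bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Hypothetical record whose title is already decoded text.
book = {"title": "Televisión"}
title = to_text(book.get("title", "No Title"))

print(title)
```

This keeps the call site safe whichever type the source returns, at the cost of hiding which type actually arrives.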

GoogleCodeExporter commented 9 years ago

Ok, I agree.

Original comment by andresgattinoni on 8 Sep 2011 at 1:06

GoogleCodeExporter commented 9 years ago

This issue was updated by revision 085323b97e8e.

Please review this so that I can merge the fix into the integrate branch.

Original comment by andresgattinoni on 8 Sep 2011 at 1:09