Searching words with Danish characters at DDO

antortjim commented 7 years ago

I am having a little difficulty using Download audio with Danish. I have noticed that some words cannot be downloaded by the add-on even though they are available at Den Danske Ordbog.

All of these words have something in common: they contain at least one of the special Danish characters (æ, ø, å).

Example:

får glædelig først

All three of them have audio files available at DDO, but they are not downloaded. I tried using the "standardization" (ø -> oe, å -> aa, æ -> ae), but that doesn't work either.

Could you implement this? Thank you :)

ospalh commented 7 years ago

Hmm. I’ll look into this. Thanks for the detailed report.

antortjim commented 7 years ago

Thanks for such a fast answer! I had a look at the repo and I guess the origin might be in get_data_from_url(), which is defined in https://github.com/ospalh/anki-addons/blob/develop/downloadaudio/downloaders/downloader.py#L179. I see the encoding used is ASCII, and my theory is that ASCII can't handle æøå. I'm not sure how this can be fixed without affecting the other downloaders, but I hope this helps :)

ospalh commented 7 years ago

It may be the other way around: some function that expects ASCII getting a unicode object. To find out what gets thrown from where, please look at this bit of code/comment: https://github.com/ospalh/anki-addons/blob/develop/downloadaudio/download.py#L80-L85 (I’m not doing much more with this today.) ETA: Maybe it needs a strategic unicode.encode('utf-8') someplace.

antortjim commented 7 years ago

Yes! Great help. So far I found that when the query word contains å, the corresponding URLs contain %C3%A5, which is, according to this website, due to "UTF-8 bytes being interpreted as Windows-1252 (or ISO 8859-1) bytes".

The error returned by Anki after uncommenting the raise in https://github.com/ospalh/anki-addons/blob/develop/downloadaudio/download.py#L80-L85: UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 1: ordinal not in range(128)
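
The same kind of failure can be reproduced outside Anki. Below is a minimal sketch written for Python 3 (the add-on itself runs on Python 2, where an implicit conversion to str triggers the error); the sample word is just one of the examples from the report, and the variable names are made up for illustration:

```python
# Encoding a word that contains a non-ASCII character with the ASCII codec fails.
word = "f\u00e5r"  # "får"

try:
    word.encode("ascii")
except UnicodeEncodeError as err:
    failed_at = err.start            # index of the offending character
    failed_char = word[err.start]    # the character the codec could not encode
    print(err)
else:
    failed_at = failed_char = None
```

The reported position is the index of å within the word, which is why the message for "får" points at position 1.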

antortjim commented 7 years ago

The line of code that transforms å to %C3%A5 is https://github.com/ospalh/anki-addons/blob/develop/downloadaudio/downloaders/den_danske_ordbog.py#L38

Proof:

import urllib
urllib.urlencode(dict(query="å"))

returns 'query=%C3%A5'

ospalh commented 7 years ago

It's not really any specific 8-bit encoding like Windows-1252 or ISO 8859-1. “å”, officially “LATIN SMALL LETTER A WITH RING ABOVE”, is Unicode code point U+00E5. It encodes to UTF-8 as the two bytes 0xC3 and 0xA5, and those two bytes are URL-encoded to the six-character ASCII string “%C3%A5”, which modern browsers translate back to the “å” you may see in the address bar. (…)
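
The chain described here (code point, then UTF-8 bytes, then percent-encoding, then back again) can be sketched in a few lines. This is an illustration in Python 3, where the functions live in urllib.parse rather than in Python 2's urllib; it is not code from the add-on:

```python
from urllib.parse import quote, unquote

char = "\u00e5"                     # å, LATIN SMALL LETTER A WITH RING ABOVE
utf8_bytes = char.encode("utf-8")   # the two bytes 0xC3 and 0xA5
escaped = quote(char)               # percent-encodes those UTF-8 bytes
restored = unquote(escaped)         # decodes back to the single character

print(utf8_bytes, escaped)
```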

antortjim commented 7 years ago

Alright, that line is indeed supposed to do exactly that. But I think I found the solution. If https://github.com/ospalh/anki-addons/blob/develop/downloadaudio/downloaders/den_danske_ordbog.py#L38 is replaced with

self.url + urllib.urlencode(dict(query=field_data.word.encode('UTF-8'))))

the problem is solved. As you predicted, some magical encode('UTF-8') was needed. Now I can download glædelig audio files! :)
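
For anyone who wants to try the idea in isolation, here is a minimal sketch of the same approach in Python 3 syntax. The function name build_query_url and the example.org address are placeholders invented for this illustration; they are not the add-on's real code or the real DDO endpoint:

```python
from urllib.parse import urlencode

# Placeholder endpoint for illustration only, not the add-on's real URL.
BASE_URL = "https://example.org/ordbog?"


def build_query_url(word, base_url=BASE_URL):
    """Return base_url followed by a percent-encoded query for a text word."""
    # Encode explicitly to UTF-8 bytes before percent-encoding, as in the fix above.
    return base_url + urlencode({"query": word.encode("utf-8")})


url = build_query_url("gl\u00e6delig")  # "glædelig"
print(url)
```

Python 3's urlencode would also accept the text directly; the explicit encode mirrors what the Python 2 fix needs.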

ospalh commented 7 years ago

urllib.urlencode(dict(query="å"))

You are sort of cheating there. Anki and urllib.urlencode are Python 2. This is important. What you did (and I tried yesterday) is that the "å" from your example is a two-byte string (str) with a UTF-8 å, that is, with the two bytes 0xc3 and 0xa5:

>>> type('å')
<type 'str'>
>>> len('å')
2

What happens inside Anki/the add-on is that you put in the one character unicode object u'å':

>>> type(u'å')
<type 'unicode'>
>>> len(u'å')
1
>>> urllib.urlencode(dict(query=u'å'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/urllib.py", line 1343, in urlencode
    v = quote_plus(str(v))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 0: ordinal not in range(128)
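
For readers on Python 3, where the two types are called bytes and str, the same distinction between a byte string and a one-character text object can be sketched like this (an illustration only, not code from the add-on):

```python
text = "\u00e5"               # one character of text: å
raw = text.encode("utf-8")    # its UTF-8 representation as a byte string

print(type(text).__name__, len(text))   # str 1
print(type(raw).__name__, len(raw))     # bytes 2

# Decoding the bytes with the right codec gives the original text back.
round_trip = raw.decode("utf-8")
```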

I am pretty sure that the corresponding line of code is where you should put

urllib.urlencode(dict(query=field_data.word.encode('utf-8'))))

instead of

urllib.urlencode(dict(query=field_data.word)))

but, today again, can’t find the time to test this. (…)

ospalh commented 7 years ago

self.url + urllib.urlencode(dict(query=field_data.word.encode('UTF-8'))))

… as I was typing out. Yeah, I think that should work.

antortjim commented 7 years ago

Yes exactly, it works!!

[Screenshot from 2017-01-10 23-20-27]

antortjim commented 7 years ago

Just created a pull request (my first one ever) featuring this little change. Thank you for your great help :smile:

ospalh / anki-addons

Searching words with Danish characters at DDO #97