Python 3 handling of unicode (non-ascii) text - with fix

hobbyhack commented 8 years ago

The python3 version wasn't working for me with Arabic as the input language. It seems like it might be broken for all non-ASCII writing systems. However, there is a way to first convert the text to an HTML escape code and then send it to your function, with the following code:

def convertUnicodeToHTMLEsc(text):
    htmlEsc = str(text.encode()).replace("b\'\\x", "%").replace("\\x", "%").replace("\'", '')
    return htmlEsc

Then, after the comment in your function, put this line:

    to_translate = convertUnicodeToHTMLEsc(to_translate)

I am more of a code hacker than a coder, so there might be an easier way, but this works.

hobbyhack commented 8 years ago

Actually, after I posted this I realized that this code breaks translation going from English to Arabic. So maybe the best thing to do is to use that code in a completely separate function.
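
A minimal sketch of why the helper breaks on English input, assuming it is written with the backslashes escaped (`\\x`). For plain ASCII text the bytes repr contains no `\x` escapes, so only the quotes are stripped and the leading `b` of the repr ends up in the text that gets sent.

```python
def convertUnicodeToHTMLEsc(text):
    # Same string hack as the helper posted above: rewrite the repr of the
    # UTF-8 bytes into %XX escapes.
    return str(text.encode()).replace("b'\\x", "%").replace("\\x", "%").replace("'", "")

arabic = convertUnicodeToHTMLEsc("نظام")
english = convertUnicodeToHTMLEsc("system")

print(arabic)   # %d9%86%d8%b8%d8%a7%d9%85
print(english)  # bsystem
```

The same thing happens to any text that starts with an ASCII character, which is why the helper only suits input that is entirely non-ASCII.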

mouuff commented 8 years ago

you should fork me and make a pull request :)

hobbyhack commented 8 years ago

Wow. I don't understand the lingo but this sounds like a great offer. I will see if I can figure out how to do this. However, if anyone else wants to take this and use it please do.

Shane


mouuff commented 8 years ago

I actually don't have this problem on either the python2 or the python3 version:

Hola como estas? >> Hello how are you?
Hola como estas? >> Привет, как ты?
Hola como estas? >> مرحبا كيف حالك؟
identity >> identité

mouuff commented 8 years ago

hobbyhack, GitHub allows you to copy my code (a "fork") and edit it. Once you have done that, you can put your changes here with a "pull request".

hobbyhack commented 8 years ago

The problem is going from unicode. It is related to the way python3 handles web URLs. However, this would work on the Google side if Python3 would handle it.

Here is some code that would allow you to reproduce it (if the problem is not my terminal):

print(Utilities.translate('system', "ar", "en"))
print(Utilities.translate('نظام', "ar", "en"))

On my machine, going from English to Arabic works fine. However, going from Arabic to English raises an error. Here is the result:

نظام
Traceback (most recent call last):
  File "/Users/shanegary/Library/Mobile Documents/com~apple~CloudDocs/Data/AppDev/fadal/Main.py", line 23, in <module>
    print(Utilities.translate('نظام', "ar", "en"))
  File "/Users/shanegary/Library/Mobile Documents/com~apple~CloudDocs/Data/AppDev/fadal/Utilities.py", line 57, in translate
    page = urllib.request.urlopen(request).read().decode("utf-8")
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 163, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 466, in open
    response = self._open(req, data)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 484, in _open
    '_open', req)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 444, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 1282, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 1254, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/http/client.py", line 1106, in request
    self._send_request(method, url, body, headers)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/http/client.py", line 1141, in _send_request
    self.putrequest(method, url, **skips)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/http/client.py", line 983, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 21-24: ordinal not in range(128)

Process finished with exit code 1

I used the same word to test in both directions and the English to Arabic worked.
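
The last frame of the traceback can be reproduced without any network access. This is only a sketch: the request line below is an assumption, built from the `/m?hl=..&sl=..&q=..` link format quoted later in this thread, since the traceback does not show the exact string that `http.client` tried to encode.

```python
# Assumed request line: "/m?hl=..&sl=..&q=.." follows the link format used
# by the library; the real string is not shown in the traceback.
request_line = "GET /m?hl=ar&sl=en&q=نظام HTTP/1.1"

try:
    # http.client encodes the request line as ASCII before sending it.
    request_line.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc

print(error)
```

With this request line the four Arabic characters sit at indexes 21 to 24, which matches the positions reported in the traceback.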

I have fixed this in my code by adding the functions below. I am not sure if this would work for other languages. However, I tested it with a bunch of Arabic words and it worked great:

def convertUnicodeToHTMLEsc(text):
    # Turn the repr of the UTF-8 bytes (b'\xd9\x86...') into %d9%86... escapes.
    htmlEsc = str(text.encode()).replace("b\'\\x", "%").replace("\\x", "%").replace("\'", '')
    return htmlEsc

def translateFromUnicode(to_translate, to_language="auto", language="auto"):
    # Escape the text first, then hand it to the original translate() function.
    htmlEsc = convertUnicodeToHTMLEsc(to_translate)
    translation = translate(htmlEsc, to_language, language)
    return translation

I just call translateFromUnicode() when I am translating from Arabic and call your function directly when I am translating from English. I should have some time next week to fork your code and post these new functions.

It looks to me like there is a bug open with Python. However, they seem to have a good reason not to fix it (URLs are supposed to be ASCII according to the standard Python is quoting). This is Python issue #3991.
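
Since URLs have to be ASCII, the usual approach is to percent-encode the text before it goes into the URL. A minimal sketch using the standard library's `urllib.parse.quote`; the helper name `percent_encode` is hypothetical and not part of the library discussed here.

```python
from urllib.parse import quote

def percent_encode(text):
    # Hypothetical helper: UTF-8 encode the text and escape every byte that
    # is not an unreserved URL character. safe="" also escapes "/".
    return quote(text, safe="")

encoded = percent_encode("نظام")
mixed = percent_encode("system نظام")

print(encoded)  # %D9%86%D8%B8%D8%A7%D9%85
print(mixed)    # system%20%D9%86%D8%B8%D8%A7%D9%85
```

Unlike the string-replace approach, this leaves plain ASCII letters untouched, so the same call could be used in both translation directions.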

mouuff commented 8 years ago

Try adding "&ie=UTF-8" to the link, like this:

link = "http://translate.google.com/m?hl=%s&sl=%s&q=%s&ie=UTF-8" % (to_langage, langage, to_translate.replace(" ", "+"))

(btw I should recode this part using url encoder and regex ...)
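
One way the "url encoder" idea could look, as a sketch only: the thread does not show the version that was actually pushed, and `build_link` is a hypothetical name. It uses `urllib.parse.urlencode`, which percent-encodes non-ASCII text as UTF-8 and turns spaces into `+`, so the manual `replace(" ", "+")` is no longer needed.

```python
from urllib.parse import urlencode

def build_link(to_translate, to_language="auto", language="auto"):
    # Hypothetical helper, not the library's actual code. The parameter
    # names hl, sl, q and ie come from the link format quoted above.
    query = urlencode({
        "hl": to_language,
        "sl": language,
        "q": to_translate,
        "ie": "UTF-8",
    })
    return "http://translate.google.com/m?" + query

link = build_link("نظام", "en", "ar")
link.encode("ascii")  # no UnicodeEncodeError: the link is plain ASCII now
print(link)
```

Because the resulting link contains only ASCII characters, the `request.encode('ascii')` step from the traceback would no longer fail on it.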

hobbyhack commented 8 years ago

I tried adding &ie=UTF-8 and it still doesn't work on my machine. Just like the original code, if I output the url being tried and paste it into a browser it gives me the same page as when I convert to html escape code. So I think the code should work as is.

It seems like python3 is just refusing to try the URL which would actually work. The python2 code is working fine. It might just be my machine. However, I am pretty sure this is python issue #3991.

It might be worth waiting on Python dev to add the "enhancement" back to python3 instead of accepting my pull request. My addition to the code makes things more complicated because you would use a different function based on if someone has the issue or not. And even then, you would translate one direction with your original function and the other with mine.

I haven't coded in a decade, and even a decade ago I never wrote much more than automation of daily tasks. I have shared some code but have never tried sharing how I modified other people's code. I wasn't sure how much I should actually be updating the function you wrote versus just providing a function others could use if they had the same issue.

mouuff commented 8 years ago

I pushed a new version. This should work; tell me if you still have the issue.

hobbyhack commented 8 years ago

It works great, thanks! This is much cleaner than my "fix".

mouuff / mtranslate

Python 3 handling of unicode (non-ascii) text - with fix #5