Closed marcelpaulo closed 6 years ago
I am not able to reproduce with the same steps.
Can you set your complete locale to either en_US.UTF-8 or pt_BR.UTF-8 and try? I see there's a mix.
I imagine that shouldn't be a problem.
You are right.
Hey, that was speed-lightning quick, thanks !
Here we go, locale set to en_US.UTF-8:
+paulo@monk:~$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
But the error was precisely the same:
+paulo@monk:~$ googler -n 5 hello\ [ERROR] 'utf-8' codec can't decode byte 0xe7 in position 341: invalid continuation byte
Just to clarify, I typed hello\ then the space key, and then the tab key to generate the autocompletions.
Just in case tmux might be getting in the way, I tried the same on a xfce4-terminal tab not running tmux, but the error was still the same:
+paulo@monk:~$ googler -n 5 hello\ [ERROR] 'utf-8' codec can't decode byte 0xe7 in position 335: invalid continuation byte
I'm stumped :-(
googler --debug --complete hello
Hey, @zmwangx, I didn't know that could be done, cool ! Here we go:
+paulo@monk:~$ googler --debug --complete 'hello '
[DEBUG] googler version 3.5
[DEBUG] Python version 3.6.3
Traceback (most recent call last):
File "/usr/local/bin/googler", line 2573, in <module>
main()
File "/usr/local/bin/googler", line 2506, in main
completer_run(opts.complete)
File "/usr/local/bin/googler", line 2413, in completer_run
completions = completer_fetch_completions(prefix)
File "/usr/local/bin/googler", line 2394, in completer_fetch_completions
respobj = json.loads(resp.read().decode('utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 333: invalid continuation byte
There you go, the response for https://www.google.com/complete/search?client=psy-ab&q=hello is non-UTF-8 for whatever reason.
The error happens when completing after a space. If I try to complete on hello\ w, for instance, it works:
+paulo@monk:~$ googler --debug --complete 'hello w'
[DEBUG] googler version 3.5
[DEBUG] Python version 3.6.3
hello world
hello world c
hello world java
hello world python
hello what's your name
hello wisconsin
hello world php
hello world assembly
hello world html
hello world javascript
Still, the response should not be non-UTF-8. Please upload the exact content of https://www.google.com/complete/search?client=psy-ab&q=hello%20 that you get.
["hello ",[["hello\u003cb\u003e google\u003c\/b\u003e",0],["hello\u003cb\u003e kitty\u003c\/b\u003e",0],["hello\u003cb\u003e hello\u003c\/b\u003e",0],["hello\u003cb\u003e moto\u003c\/b\u003e",0],["hello\u003cb\u003e neighbor\u003c\/b\u003e",0],["hello\u003cb\u003e adele\u003c\/b\u003e",0],["hello\u003cb\u003e how are you\u003c\/b\u003e",0],["hello\u003cb\u003e my twenties\u003c\/b\u003e",0],["hello\u003cb\u003e there\u003c\/b\u003e",0],["hello\u003cb\u003e world\u003c\/b\u003e",0]],{"q":"vkki_fDYbCj65UJbcwfXywHPK4c","t":{"bpc":false,"tlw":false}}]
That's an already decoded version. Maybe use xxd to find your 0xe7:
curl -s -A 'Python-urllib/3.6' 'https://www.google.com/complete/search?client=psy-ab&q=hello%20w' | xxd
Pay special attention to byte number 333.
Here's the hex dump:
00000000: 5b22 6865 6c6c 6f20 222c 5b5b 2268 656c ["hello ",[["hel
00000010: 6c6f 5c75 3030 3363 625c 7530 3033 6520 lo\u003cb\u003e
00000020: 6e65 6967 6862 6f72 5c75 3030 3363 5c2f neighbor\u003c\/
00000030: 625c 7530 3033 6522 2c30 5d2c 5b22 6865 b\u003e",0],["he
00000040: 6c6c 6f5c 7530 3033 6362 5c75 3030 3365 llo\u003cb\u003e
00000050: 206b 6974 7479 5c75 3030 3363 5c2f 625c kitty\u003c\/b\
00000060: 7530 3033 6522 2c30 5d2c 5b22 6865 6c6c u003e",0],["hell
00000070: 6f5c 7530 3033 6362 5c75 3030 3365 2064 o\u003cb\u003e d
00000080: 6172 6b6e 6573 7320 6d79 206f 6c64 2066 arkness my old f
00000090: 7269 656e 645c 7530 3033 635c 2f62 5c75 riend\u003c\/b\u
000000a0: 3030 3365 222c 305d 2c5b 2268 656c 6c6f 003e",0],["hello
000000b0: 5c75 3030 3363 625c 7530 3033 6520 676f \u003cb\u003e go
000000c0: 6f67 6c65 5c75 3030 3363 5c2f 625c 7530 ogle\u003c\/b\u0
000000d0: 3033 6522 2c30 5d2c 5b22 6865 6c6c 6f5c 03e",0],["hello\
000000e0: 7530 3033 6362 5c75 3030 3365 2061 6465 u003cb\u003e ade
000000f0: 6c65 5c75 3030 3363 5c2f 625c 7530 3033 le\u003c\/b\u003
00000100: 6522 2c30 5d2c 5b22 6865 6c6c 6f5c 7530 e",0],["hello\u0
00000110: 3033 6362 5c75 3030 3365 206d 6f74 6f5c 03cb\u003e moto\
00000120: 7530 3033 635c 2f62 5c75 3030 3365 222c u003c\/b\u003e",
00000130: 305d 2c5b 2268 656c 6c6f 5c75 3030 3363 0],["hello\u003c
00000140: 625c 7530 3033 6520 7472 6164 75e7 e36f b\u003e tradu..o
00000150: 5c75 3030 3363 5c2f 625c 7530 3033 6522 \u003c\/b\u003e"
00000160: 2c30 5d2c 5b22 6865 6c6c 6f5c 7530 3033 ,0],["hello\u003
00000170: 6362 5c75 3030 3365 206d 7920 7477 656e cb\u003e my twen
00000180: 7469 6573 5c75 3030 3363 5c2f 625c 7530 ties\u003c\/b\u0
00000190: 3033 6522 2c30 5d2c 5b22 6865 6c6c 6f5c 03e",0],["hello\
000001a0: 7530 3033 6362 5c75 3030 3365 2068 656c u003cb\u003e hel
000001b0: 6c6f 5c75 3030 3363 5c2f 625c 7530 3033 lo\u003c\/b\u003
000001c0: 6522 2c30 5d2c 5b22 6865 6c6c 6f5c 7530 e",0],["hello\u0
000001d0: 3033 6362 5c75 3030 3365 206c 6574 7261 03cb\u003e letra
000001e0: 5c75 3030 3363 5c2f 625c 7530 3033 6522 \u003c\/b\u003e"
000001f0: 2c30 5d5d 2c7b 2271 223a 226d 4a6c 4d5f ,0]],{"q":"mJlM_
00000200: 4a62 744e 7843 3553 5077 4c65 4776 6a46 JbtNxC5SPwLeGvjF
00000210: 5578 6435 4849 222c 2274 223a 7b22 6270 Uxd5HI","t":{"bp
00000220: 6322 3a66 616c 7365 2c22 746c 7722 3a66 c":false,"tlw":f
00000230: 616c 7365 7d7d 5d alse}}]
Oh, byte 333 (0x14d) is 0xe7 == ç in iso-8859-1. Byte 0x14e is 0xe3 == ã in iso-8859-1. The word is tradução == translation, in Portuguese.
So, google seems to be returning iso-8859-1 !
EDIT: @jarun couldn't reproduce it because he's not in Brazil (I am), so Google doesn't send him results in Portuguese. I imagine using --lang=pt should reproduce the error.
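To make the diagnosis above concrete, here is a minimal sketch that decodes the two offending bytes from the hex dump (0xe7 and 0xe3) both ways. The variable names are illustrative only and are not taken from googler.

```python
# The word as it appears in the hex dump: "tradu", then bytes 0xe7 0xe3, then "o".
raw = b"tradu\xe7\xe3o"

# UTF-8 treats 0xe7 as the lead byte of a 3-byte sequence; 0xe3 is not a
# valid continuation byte, so decoding fails.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

# ISO-8859-1 maps each byte to exactly one character.
latin1_text = raw.decode("iso-8859-1")

print(utf8_error)
print(latin1_text)
```

The position reported in the exception is relative to the start of this short byte string, not to the full response as in the traceback.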
Returning JSON as Latin-1 is really hostile. We can't add chardet, but we can use Latin-1 as fallback.
Does the server indicate the encoding it uses in any of the packets during negotiation? I am just worried that it might send some other encoding in some other country.
If that's not the case, we will have to try Latin-1 and keep discovering...
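As a rough illustration of the fallback idea discussed here, the sketch below tries UTF-8 first and falls back to Latin-1. The function name `decode_with_fallback` and its parameters are hypothetical; this is not what googler ships.

```python
def decode_with_fallback(data, primary="utf-8", fallback="iso-8859-1"):
    """Decode bytes with `primary`, retrying with `fallback` on failure.

    Hypothetical helper for illustration only.
    """
    try:
        return data.decode(primary)
    except UnicodeDecodeError:
        # ISO-8859-1 assigns a character to every byte value, so this
        # second attempt cannot raise, but it will produce wrong characters
        # if the server actually used some other encoding.
        return data.decode(fallback)


print(decode_with_fallback(b"tradu\xe7\xe3o"))
```

The caveat in the comment is the concern raised above: a blind fallback hides the problem if another country gets yet another encoding, which is why asking the server what it sent is the safer route.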
Does the server indicate the encoding it uses in any of the packets during negotiation?
How do I dump that ?
EDIT: Please forgive my ignorance !
In the HTTP response we might have:
Content-Type: text/html; charset=xxxx
Yeah, Content-Type could specify the charset too. My bad.
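For reference, a small sketch of pulling the charset out of a Content-Type value with the standard library. The helper name `charset_from_content_type` is made up for this example; `email.message.Message.get_content_charset()` is the real stdlib method, and the headers object returned by `urllib.request.urlopen()` inherits it.

```python
from email.message import Message


def charset_from_content_type(value, default=None):
    """Return the lowercased charset parameter of a Content-Type value.

    Hypothetical helper; returns `default` when no charset is declared.
    """
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset(default)


print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(charset_from_content_type("application/json", "utf-8"))
```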
Here we are:
:paulo@monk:~/tmp$ curl -s -v -A 'Python-urllib/3.6' 'https://www.google.com/complete/search?client=psy-ab&q=hello%20' | tee hello.txt
* Trying 2800:3f0:4001:80c::2004...
* TCP_NODELAY set
* Connected to www.google.com (2800:3f0:4001:80c::2004) port 443 (#0)
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
* CAfile: /etc/ssl/certs/ca-certificates.crt
CApath: /etc/ssl/certs
* TLSv1.2 (OUT), TLS header, Certificate Status (22):
} [5 bytes data]
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
* TLSv1.2 (IN), TLS handshake, Server hello (2):
{ [102 bytes data]
* TLSv1.2 (IN), TLS handshake, Certificate (11):
{ [2940 bytes data]
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
{ [149 bytes data]
* TLSv1.2 (IN), TLS handshake, Server finished (14):
{ [4 bytes data]
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
} [70 bytes data]
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
} [1 bytes data]
* TLSv1.2 (OUT), TLS handshake, Finished (20):
} [16 bytes data]
* TLSv1.2 (IN), TLS change cipher, Client hello (1):
{ [1 bytes data]
* TLSv1.2 (IN), TLS handshake, Finished (20):
{ [16 bytes data]
* SSL connection using TLSv1.2 / ECDHE-ECDSA-AES128-GCM-SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
* subject: C=US; ST=California; L=Mountain View; O=Google Inc; CN=www.google.com
* start date: Mar 20 16:53:00 2018 GMT
* expire date: Jun 12 16:53:00 2018 GMT
* subjectAltName: host "www.google.com" matched cert's "www.google.com"
* issuer: C=US; O=Google Inc; CN=Google Internet Authority G2
* SSL certificate verify ok.
} [5 bytes data]
> GET /complete/search?client=psy-ab&q=hello%20 HTTP/1.1
> Host: www.google.com
> User-Agent: Python-urllib/3.6
> Accept: */*
>
{ [5 bytes data]
< HTTP/1.1 200 OK
< Date: Tue, 10 Apr 2018 02:59:31 GMT
< Expires: Tue, 10 Apr 2018 02:59:31 GMT
< Cache-Control: private, max-age=3600
< Content-Type: application/json; charset=ISO-8859-1
< Content-Disposition: attachment; filename="f.txt"
< Server: gws
< X-XSS-Protection: 1; mode=block
< X-Frame-Options: SAMEORIGIN
< Alt-Svc: hq=":443"; ma=2592000; quic=51303432; quic=51303431; quic=51303339; quic=51303335,quic=":443"; ma=2592000; v="42,41,39,35"
< Accept-Ranges: none
< Vary: Accept-Encoding
< Transfer-Encoding: chunked
<
{ [574 bytes data]
* Connection #0 to host www.google.com left intact
["hello ",[["hello\u003cb\u003e neighbor\u003c\/b\u003e",0],["hello\u003cb\u003e kitty\u003c\/b\u003e",0],["hello\u003cb\u003e darkness my old friend\u003c\/b\u003e",0],["hello\u003cb\u003e google\u003c\/b\u003e",0],["hello\u003cb\u003e adele\u003c\/b\u003e",0],["hello\u003cb\u003e moto\u003c\/b\u003e",0],["hello\u003cb\u003e traduo\u003c\/b\u003e",0],["hello\u003cb\u003e my twenties\u003c\/b\u003e",0],["hello\u003cb\u003e hello\u003c\/b\u003e",0],["hello\u003cb\u003e letra\u003c\/b\u003e",0]],{"q":"jfwpGK6dSAvJMwbujrL05TKuRaA","t":{"bpc":false,"tlw":false}}]
Side note: You don't really need -v, -D- will give you the response headers; or if you don't care about the response body, a HEAD request with -I will do.
Content-Type: application/json; charset=ISO-8859-1
You don't really need -v, -D- will give you the response headers; or if you don't care about the response body, a HEAD request with -I will do
This has been a really fruitful learning experience for me, thank you very much for that, @zmwangx and @jarun !
@marcelpaulo now for the next learning session, would you be interested in sending over a patch? Take your time, no issues! ;)
In case you guys have missed my note, the server response specifies the correct charset:
Content-Type: application/json; charset=ISO-8859-1
It's the same here in India as well:
HTTP/1.1 200 OK
Date: Tue, 10 Apr 2018 03:11:13 GMT
Expires: Tue, 10 Apr 2018 03:11:13 GMT
Cache-Control: private, max-age=3600
Content-Type: application/json; charset=ISO-8859-1
Content-Disposition: attachment; filename="f.txt"
Server: gws
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Alt-Svc: hq=":443"; ma=2592000; quic=51303432; quic=51303431; quic=51303339; quic=51303335,quic=":443"; ma=2592000; v="42,41,39,35"
Transfer-Encoding: chunked
Accept-Ranges: none
Vary: Accept-Encoding
would you be interested in sending over a patch?
As an Argentinian writer once said about the Russian language, I could paraphrase: my ignorance of Python and REST is almost perfect, but ... I'll give it a try ! I can see the problem is here:
def completer_fetch_completions(prefix):
    import json
    import re
    import urllib.request

    # One can pass the 'hl' query param to specify the language. We
    # ignore that for now.
    api_url = ('https://www.google.com/complete/search?client=psy-ab&q=%s' %
               urllib.parse.quote(prefix, safe=''))
    # A timeout of 3 seconds seems to be overly generous already.
    resp = urllib.request.urlopen(api_url, timeout=3)
    respobj = json.loads(resp.read().decode('utf-8'))
After getting resp, I imagine we need to extract the charset, and then call decode() for that charset.
my ignorance of Python and REST is almost perfect
Language is a medium, even deaf and dumb people communicate precisely. So no worries.
After getting resp, I imagine we need to extract the charset, and the call decode for that charset.
Exactly!
It's the same here in India as well
@zmwangx what do you see there?
I have this hunch they always use charset=ISO-8859-1 in this case... ;)
@jarun, would this solve the problem ?
If that's the case, the code might be:
# A timeout of 3 seconds seems to be overly generous already.
resp = urllib.request.urlopen(api_url, timeout=3)
respobj = json.loads(resp.read().decode(resp.headers.get_content_charset()))
Now, this is puzzling: I changed that single line in the code (respobj = ...), and when I try to complete, I get the same error:
+paulo@monk:~/src/googler$ ./googler 'hello [ERROR] 'utf-8' codec can't decode byte 0xe7 in position 333: invalid continuation byte
but, if I generate the completions as @zmwangx suggested:
:paulo@monk:~/src/googler$ ./googler --debug --complete 'hello '
[DEBUG] googler version 3.5
[DEBUG] Python version 3.6.3
hello neighbor
hello kitty
hello darkness my old friend
hello google
hello adele
hello moto
hello tradução
hello my twenties
hello hello
hello letra
When I think I got it, it just slips through my fingers !
Ah, I know what's wrong ! I'm running googler from the git directory, but _googler is getting it from $PATH, which is the installed, non-edited version, which still tries to decode utf-8 !
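A quick way to check which executable a bare command name resolves to, sketched in Python with `shutil.which`. The wrapper name `resolve_command` is hypothetical, and the result depends on the local $PATH.

```python
import shutil


def resolve_command(name, search_path=None):
    """Return the full path that `name` resolves to, or None if not found.

    Hypothetical wrapper: with search_path=None, shutil.which consults the
    PATH environment variable, just as the shell does for a bare command.
    """
    return shutil.which(name, path=search_path)


# An edited copy run as ./googler is a different file from whatever this
# prints, and a completion script calling plain "googler" gets this one.
print(resolve_command("googler"))
```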
May I send a PR ? My ignorance of git is not so perfect, so I think I can do it ;-)
Please send the PR. Just print resp.headers.get_content_charset() in logger.debug mode as well.
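Putting the two requests together, one possible shape for the change is sketched below. It is not the patch that was merged; `decode_completions`, the logger name, and the UTF-8 default are assumptions made for the example, and a stand-in `Message` object replaces the real `resp.headers` so the snippet runs without network access.

```python
import json
import logging
from email.message import Message

logger = logging.getLogger("completer_sketch")  # hypothetical logger name


def decode_completions(body, headers, default="utf-8"):
    """Decode a completion response using the charset the server declared.

    `headers` is anything with get_content_charset(), such as resp.headers
    from urllib.request.urlopen(). Falling back to `default` when no
    charset is declared is an assumption of this sketch.
    """
    charset = headers.get_content_charset() or default
    logger.debug("Completion response charset: %s", charset)
    return json.loads(body.decode(charset))


# Stand-in for the real response: the header value seen in the curl output
# and a shortened body containing the ISO-8859-1 bytes 0xe7 0xe3.
headers = Message()
headers["Content-Type"] = "application/json; charset=ISO-8859-1"
body = b'["hello ",[["hello\\u003cb\\u003e tradu\xe7\xe3o\\u003c\\/b\\u003e",0]]]'

result = decode_completions(body, headers)
print(result[1][0][0])
```

If the server ever declared a codec name Python does not know, `bytes.decode` would raise `LookupError`; the sketch leaves that case unhandled.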
now for the next learning session, would you be interested in sending over a patch?
You were right, @jarun, this really rounded off nicely the learning experience ! Thank you so much, both @jarun and @zmwangx, for your patience with me: I learned a lot tonight from you guys !
Fixed by #228
Bug report
I've just installed googler from master (8253de92f8f3c7a38c6069c3e41b559066e75a74) and was working through the examples in README.md when I stumbled across an error. I wanted to show 5 results and autocomplete on hello, so I typed:
googler -d -n 5 hello\ <tab>
There's a space between \ and <tab>. Here's the result (there was no debugging output):
Here's my environment:
OS: Xubuntu 17.10
Python: 3.6.3
Terminal emulator: xfce4-terminal 0.8.6 + tmux master (b5c0b2c)
Shell: bash 4.4-5
Bash autocompletion: /home/paulo/src/googler/auto-completion/bash/googler-completion.bash
I tried using single quotes to preserve the space after hello but the error was exactly the same when I tried to autocomplete after the space:
Not sure if this might be relevant, but the default python interpreter in Xubuntu 17.10 is 2.7.14. As I mentioned earlier, the python3 interpreter is 3.6.3. As googler has the shebang #!/usr/bin/env python3, I imagine that shouldn't be a problem.
Let me know how I can help debug this further.