'utf-8' codec can't decode byte ... in position ...: invalid continuation byte

marcelpaulo commented 6 years ago

Bug report

I've just installed googler from master (8253de92f8f3c7a38c6069c3e41b559066e75a74) and was working through the examples in README.md when I stumbled across an error. I wanted to show 5 results and autocomplete on hello, so I typed:

googler -d -n 5 hello\ <tab>

There's a space between \ and <tab>. Here's the result (there was no debugging output):

+paulo@monk:~/tmp$ googler -d -n 5 hello\ [ERROR] 'utf-8' codec can't decode byte 0xe7 in position 341: invalid continuation byte

Here's my environment:

OS: Xubuntu 17.10 Python: 3.6.3 Terminal emulator: xfce4-terminal 0.8.6 + tmux master (b5c0b2c) Shell: bash 4.4-5

+paulo@monk:~/src/tmux$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE=pt_BR.UTF-8
LC_NUMERIC=pt_BR.UTF-8
LC_TIME=pt_BR.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=pt_BR.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=pt_BR.UTF-8
LC_NAME=pt_BR.UTF-8
LC_ADDRESS=pt_BR.UTF-8
LC_TELEPHONE=pt_BR.UTF-8
LC_MEASUREMENT=pt_BR.UTF-8
LC_IDENTIFICATION=pt_BR.UTF-8
LC_ALL=

Bash autocompletion: /home/paulo/src/googler/auto-completion/bash/googler-completion.bash

I tried using single quotes to preserve the space after hello but the error was exactly the same when I tried to autocomplete after the space:

+paulo@monk:~/tmp$ googler -n 5 'hello [ERROR] 'utf-8' codec can't decode byte 0xe7 in position 339: invalid continuation byte

Not sure if this might be relevant, but the default python interpreter in Xubuntu 17.10 is 2.7.14. As I mentioned earlier, the python3 interpreter is 3.6.3. As googler has the shebang #!/usr/bin/env python3, I imagine that shouldn't be a problem.

Let me know how I can help debug this further.

jarun commented 6 years ago

I am not able to reproduce with the same steps.

Can you set your complete locale to any one of en_US.UTF-8 OR pt_BR.UTF-8 and try? I see there's a mix.

I imagine that shouldn't be a problem.

You are right.

marcelpaulo commented 6 years ago

Hey, that was lightning quick, thanks !

Here we go, locale set to en_US.UTF-8:

+paulo@monk:~$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

But the error was precisely the same:

+paulo@monk:~$ googler -n 5 hello\ [ERROR] 'utf-8' codec can't decode byte 0xe7 in position 341: invalid continuation byte

Just to clarify, I typed hello\ then the space key, and then the tab key to generate the autocompletions.

marcelpaulo commented 6 years ago

Just in case tmux might be getting in the way, I tried the same on a xfce4-terminal tab not running tmux, but the error was still the same:

+paulo@monk:~$ googler -n 5 hello\ [ERROR] 'utf-8' codec can't decode byte 0xe7 in position 335: invalid continuation byte

I'm stumped :-(

zmwangx commented 6 years ago

googler --debug --complete hello

marcelpaulo commented 6 years ago

googler --debug --complete hello

Hey, @zmwangx, I didn't know that could be done, cool ! Here we go:

+paulo@monk:~$ googler --debug --complete 'hello '
[DEBUG] googler version 3.5
[DEBUG] Python version 3.6.3
Traceback (most recent call last):
  File "/usr/local/bin/googler", line 2573, in <module>
    main()
  File "/usr/local/bin/googler", line 2506, in main
    completer_run(opts.complete)
  File "/usr/local/bin/googler", line 2413, in completer_run
    completions = completer_fetch_completions(prefix)
  File "/usr/local/bin/googler", line 2394, in completer_fetch_completions
    respobj = json.loads(resp.read().decode('utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 333: invalid continuation byte

zmwangx commented 6 years ago

There you go, the response for https://www.google.com/complete/search?client=psy-ab&q=hello is non-UTF-8 for whatever reason.

marcelpaulo commented 6 years ago

The error happens when completing after a space. If I try to complete on hello\ w, for instance, it works:

+paulo@monk:~$ googler --debug --complete 'hello w'
[DEBUG] googler version 3.5
[DEBUG] Python version 3.6.3
hello world
hello world c
hello world java
hello world python
hello what&#39;s your name
hello wisconsin
hello world php
hello world assembly
hello world html
hello world javascript

zmwangx commented 6 years ago

Still, the response should not be non-UTF-8. Please upload the exact content of https://www.google.com/complete/search?client=psy-ab&q=hello%20 that you get.

marcelpaulo commented 6 years ago

Please upload the exact content of https://www.google.com/complete/search?client=psy-ab&q=hello%20

["hello ",[["hello\u003cb\u003e google\u003c\/b\u003e",0],["hello\u003cb\u003e kitty\u003c\/b\u003e",0],["hello\u003cb\u003e hello\u003c\/b\u003e",0],["hello\u003cb\u003e moto\u003c\/b\u003e",0],["hello\u003cb\u003e neighbor\u003c\/b\u003e",0],["hello\u003cb\u003e adele\u003c\/b\u003e",0],["hello\u003cb\u003e how are you\u003c\/b\u003e",0],["hello\u003cb\u003e my twenties\u003c\/b\u003e",0],["hello\u003cb\u003e there\u003c\/b\u003e",0],["hello\u003cb\u003e world\u003c\/b\u003e",0]],{"q":"vkki_fDYbCj65UJbcwfXywHPK4c","t":{"bpc":false,"tlw":false}}]

zmwangx commented 6 years ago

That's an already decoded version. Maybe use xxd to find your 0xe7:

curl -s -A 'Python-urllib/3.6' 'https://www.google.com/complete/search?client=psy-ab&q=hello%20w' | xxd

Pay special attention to byte number 333.

marcelpaulo commented 6 years ago

Here's the hex dump:

00000000: 5b22 6865 6c6c 6f20 222c 5b5b 2268 656c  ["hello ",[["hel
00000010: 6c6f 5c75 3030 3363 625c 7530 3033 6520  lo\u003cb\u003e
00000020: 6e65 6967 6862 6f72 5c75 3030 3363 5c2f  neighbor\u003c\/
00000030: 625c 7530 3033 6522 2c30 5d2c 5b22 6865  b\u003e",0],["he
00000040: 6c6c 6f5c 7530 3033 6362 5c75 3030 3365  llo\u003cb\u003e
00000050: 206b 6974 7479 5c75 3030 3363 5c2f 625c   kitty\u003c\/b\
00000060: 7530 3033 6522 2c30 5d2c 5b22 6865 6c6c  u003e",0],["hell
00000070: 6f5c 7530 3033 6362 5c75 3030 3365 2064  o\u003cb\u003e d
00000080: 6172 6b6e 6573 7320 6d79 206f 6c64 2066  arkness my old f
00000090: 7269 656e 645c 7530 3033 635c 2f62 5c75  riend\u003c\/b\u
000000a0: 3030 3365 222c 305d 2c5b 2268 656c 6c6f  003e",0],["hello
000000b0: 5c75 3030 3363 625c 7530 3033 6520 676f  \u003cb\u003e go
000000c0: 6f67 6c65 5c75 3030 3363 5c2f 625c 7530  ogle\u003c\/b\u0
000000d0: 3033 6522 2c30 5d2c 5b22 6865 6c6c 6f5c  03e",0],["hello\
000000e0: 7530 3033 6362 5c75 3030 3365 2061 6465  u003cb\u003e ade
000000f0: 6c65 5c75 3030 3363 5c2f 625c 7530 3033  le\u003c\/b\u003
00000100: 6522 2c30 5d2c 5b22 6865 6c6c 6f5c 7530  e",0],["hello\u0
00000110: 3033 6362 5c75 3030 3365 206d 6f74 6f5c  03cb\u003e moto\
00000120: 7530 3033 635c 2f62 5c75 3030 3365 222c  u003c\/b\u003e",
00000130: 305d 2c5b 2268 656c 6c6f 5c75 3030 3363  0],["hello\u003c
00000140: 625c 7530 3033 6520 7472 6164 75e7 e36f  b\u003e tradu..o
00000150: 5c75 3030 3363 5c2f 625c 7530 3033 6522  \u003c\/b\u003e"
00000160: 2c30 5d2c 5b22 6865 6c6c 6f5c 7530 3033  ,0],["hello\u003
00000170: 6362 5c75 3030 3365 206d 7920 7477 656e  cb\u003e my twen
00000180: 7469 6573 5c75 3030 3363 5c2f 625c 7530  ties\u003c\/b\u0
00000190: 3033 6522 2c30 5d2c 5b22 6865 6c6c 6f5c  03e",0],["hello\
000001a0: 7530 3033 6362 5c75 3030 3365 2068 656c  u003cb\u003e hel
000001b0: 6c6f 5c75 3030 3363 5c2f 625c 7530 3033  lo\u003c\/b\u003
000001c0: 6522 2c30 5d2c 5b22 6865 6c6c 6f5c 7530  e",0],["hello\u0
000001d0: 3033 6362 5c75 3030 3365 206c 6574 7261  03cb\u003e letra
000001e0: 5c75 3030 3363 5c2f 625c 7530 3033 6522  \u003c\/b\u003e"
000001f0: 2c30 5d5d 2c7b 2271 223a 226d 4a6c 4d5f  ,0]],{"q":"mJlM_
00000200: 4a62 744e 7843 3553 5077 4c65 4776 6a46  JbtNxC5SPwLeGvjF
00000210: 5578 6435 4849 222c 2274 223a 7b22 6270  Uxd5HI","t":{"bp
00000220: 6322 3a66 616c 7365 2c22 746c 7722 3a66  c":false,"tlw":f
00000230: 616c 7365 7d7d 5d                        alse}}]

Oh, byte 333 (0x14d) is 0xe7 == ç in iso-8859-1. Byte 0x14e is 0xe3 == ã in iso-8859-1. The word is tradução == translation, in Portuguese.

So, google seems to be returning iso-8859-1 !

EDIT: @jarun couldn't reproduce it because he's not in Brazil (I am), so google doesn't send him results in Portuguese. I imagine using --lang=pt should reproduce the error.
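
As a minimal sketch of what the hex dump shows, the snippet below takes the eight bytes at offsets 0x148-0x14f ("tradução" in ISO-8859-1), tries to decode them as UTF-8, and reports where decoding fails. The helper name describe_decode_failure is made up for illustration and is not part of googler.

```python
# Bytes copied from offsets 0x148-0x14f of the hex dump above.
raw = bytes.fromhex("74726164 75e7e36f")


def describe_decode_failure(data, encoding="utf-8"):
    """Return (position, offending byte, reason), or None if decoding works.

    Hypothetical helper, for illustration only.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        return err.start, data[err.start], err.reason
    return None


# 0xe7 opens a three-byte UTF-8 sequence, but the next byte (0xe3) is not a
# continuation byte, hence "invalid continuation byte".
print(describe_decode_failure(raw))

# The same bytes decode cleanly as ISO-8859-1.
print(raw.decode("iso-8859-1"))
```

In the full response the same failure is reported at position 333 because the word starts further into the body.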

zmwangx commented 6 years ago

Returning JSON as Latin-1 is really hostile. We can't add chardet, but we can use Latin-1 as fallback.

jarun commented 6 years ago

Returning JSON as Latin-1 is really hostile. We can't add chardet, but we can use Latin-1 as fallback.

Does the server indicate the encoding it uses in any of the packets during negotiation? I am just worried that it might send some other encoding in some other country.

If that's not the case, we will have to try Latin-1 and keep discovering...

marcelpaulo commented 6 years ago

Does the server indicate the encoding it uses in any of the packets during negotiation?

How do I dump that ?

EDIT: Please forgive my ignorance !

jarun commented 6 years ago

In the HTTP response we might have:

Content-Type: text/html; charset=xxxx

zmwangx commented 6 years ago

Yeah, Content-Type could specify the charset too. My bad.

marcelpaulo commented 6 years ago

Here we are:

:paulo@monk:~/tmp$ curl -s -v -A 'Python-urllib/3.6' 'https://www.google.com/complete/search?client=psy-ab&q=hello%20' | tee hello.txt
*   Trying 2800:3f0:4001:80c::2004...
* TCP_NODELAY set
* Connected to www.google.com (2800:3f0:4001:80c::2004) port 443 (#0)
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.2 (OUT), TLS header, Certificate Status (22):
} [5 bytes data]
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
* TLSv1.2 (IN), TLS handshake, Server hello (2):
{ [102 bytes data]
* TLSv1.2 (IN), TLS handshake, Certificate (11):
{ [2940 bytes data]
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
{ [149 bytes data]
* TLSv1.2 (IN), TLS handshake, Server finished (14):
{ [4 bytes data]
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
} [70 bytes data]
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
} [1 bytes data]
* TLSv1.2 (OUT), TLS handshake, Finished (20):
} [16 bytes data]
* TLSv1.2 (IN), TLS change cipher, Client hello (1):
{ [1 bytes data]
* TLSv1.2 (IN), TLS handshake, Finished (20):
{ [16 bytes data]
* SSL connection using TLSv1.2 / ECDHE-ECDSA-AES128-GCM-SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: C=US; ST=California; L=Mountain View; O=Google Inc; CN=www.google.com
*  start date: Mar 20 16:53:00 2018 GMT
*  expire date: Jun 12 16:53:00 2018 GMT
*  subjectAltName: host "www.google.com" matched cert's "www.google.com"
*  issuer: C=US; O=Google Inc; CN=Google Internet Authority G2
*  SSL certificate verify ok.
} [5 bytes data]
> GET /complete/search?client=psy-ab&q=hello%20 HTTP/1.1
> Host: www.google.com
> User-Agent: Python-urllib/3.6
> Accept: */*
>
{ [5 bytes data]
< HTTP/1.1 200 OK
< Date: Tue, 10 Apr 2018 02:59:31 GMT
< Expires: Tue, 10 Apr 2018 02:59:31 GMT
< Cache-Control: private, max-age=3600
< Content-Type: application/json; charset=ISO-8859-1
< Content-Disposition: attachment; filename="f.txt"
< Server: gws
< X-XSS-Protection: 1; mode=block
< X-Frame-Options: SAMEORIGIN
< Alt-Svc: hq=":443"; ma=2592000; quic=51303432; quic=51303431; quic=51303339; quic=51303335,quic=":443"; ma=2592000; v="42,41,39,35"
< Accept-Ranges: none
< Vary: Accept-Encoding
< Transfer-Encoding: chunked
<
{ [574 bytes data]
* Connection #0 to host www.google.com left intact
["hello ",[["hello\u003cb\u003e neighbor\u003c\/b\u003e",0],["hello\u003cb\u003e kitty\u003c\/b\u003e",0],["hello\u003cb\u003e darkness my old friend\u003c\/b\u003e",0],["hello\u003cb\u003e google\u003c\/b\u003e",0],["hello\u003cb\u003e adele\u003c\/b\u003e",0],["hello\u003cb\u003e moto\u003c\/b\u003e",0],["hello\u003cb\u003e traduo\u003c\/b\u003e",0],["hello\u003cb\u003e my twenties\u003c\/b\u003e",0],["hello\u003cb\u003e hello\u003c\/b\u003e",0],["hello\u003cb\u003e letra\u003c\/b\u003e",0]],{"q":"jfwpGK6dSAvJMwbujrL05TKuRaA","t":{"bpc":false,"tlw":false}}]

zmwangx commented 6 years ago

Side note: You don't really need -v, -D- will give you the response headers; or if you don't care about the response body, a HEAD request with -I will do.

jarun commented 6 years ago

Content-Type: application/json; charset=ISO-8859-1

marcelpaulo commented 6 years ago

You don't really need -v, -D- will give you the response headers; or if you don't care about the response body, a HEAD request with -I will do

This has been a really fruitful learning experience for me, thank you very much for that, @zmwangx and @jarun !

jarun commented 6 years ago

@marcelpaulo now for the next learning session, would you be interested in sending over a patch? Take your time, no issues! ;)

jarun commented 6 years ago

In case you guys have missed my note, the server response specifies the correct charset:

Content-Type: application/json; charset=ISO-8859-1
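
For reference, here is a small sketch of how the charset can be pulled out of such a header value with the standard library alone. The function name charset_from_content_type is an assumption for illustration. The response headers returned by urllib.request.urlopen() are an http.client.HTTPMessage, which is a subclass of email.message.Message, so the same get_content_charset() method is available there.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lowercased charset named in a Content-Type value, or None.

    Hypothetical helper, for illustration only.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


print(charset_from_content_type("application/json; charset=ISO-8859-1"))
print(charset_from_content_type("application/json"))  # no charset given
```

Note that the method returns None when the header names no charset, so a caller probably wants a default.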

jarun commented 6 years ago

It's the same here in India as well:

HTTP/1.1 200 OK
Date: Tue, 10 Apr 2018 03:11:13 GMT
Expires: Tue, 10 Apr 2018 03:11:13 GMT
Cache-Control: private, max-age=3600
Content-Type: application/json; charset=ISO-8859-1
Content-Disposition: attachment; filename="f.txt"
Server: gws
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Alt-Svc: hq=":443"; ma=2592000; quic=51303432; quic=51303431; quic=51303339; quic=51303335,quic=":443"; ma=2592000; v="42,41,39,35"
Transfer-Encoding: chunked
Accept-Ranges: none
Vary: Accept-Encoding

marcelpaulo commented 6 years ago

would you be interested in sending over a patch?

As an Argentinian writer once said about the Russian language, I could paraphrase: my ignorance of Python and REST is almost perfect, but ... I'll give it a try ! I can see the problem is here:

def completer_fetch_completions(prefix):
    import json
    import re
    import urllib.request

    # One can pass the 'hl' query param to specify the language. We
    # ignore that for now.
    api_url = ('https://www.google.com/complete/search?client=psy-ab&q=%s' %
               urllib.parse.quote(prefix, safe=''))
    # A timeout of 3 seconds seems to be overly generous already.
    resp = urllib.request.urlopen(api_url, timeout=3)
    respobj = json.loads(resp.read().decode('utf-8'))

After getting resp, I imagine we need to extract the charset, and then call decode() for that charset.

jarun commented 6 years ago

my ignorance of Python and REST is almost perfect

Language is a medium, even deaf and dumb people communicate precisely. So no worries.

After getting resp, I imagine we need to extract the charset, and then call decode() for that charset.

Exactly!

jarun commented 6 years ago

It's the same here in India as well

@zmwangx what do you see there?

I have this hunch they always use charset=ISO-8859-1 in this case... ;)

marcelpaulo commented 6 years ago

@jarun, would this solve the problem ?

If that's the case, the code might be:

    # A timeout of 3 seconds seems to be overly generous already.
    resp = urllib.request.urlopen(api_url, timeout=3)
    respobj = json.loads(resp.read().decode(resp.headers.get_content_charset()))
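
One caveat with decoding by the declared charset: get_content_charset() returns None when the header names no charset, and bytes.decode(None) raises a TypeError. A possible defensive variant, sketched here as an assumption rather than as what googler actually does, tries the declared charset first, then UTF-8, then Latin-1 as the last resort mentioned earlier in the thread. The name decode_body and the sample body are hypothetical.

```python
import json


def decode_body(raw, declared_charset=None):
    """Decode raw bytes with the declared charset, then UTF-8, then Latin-1.

    Hypothetical helper, for illustration only.
    """
    for charset in (declared_charset, "utf-8"):
        if not charset:
            continue
        try:
            return raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: fall through to the next candidate.
            continue
    # ISO-8859-1 maps every byte value to a code point, so this cannot fail.
    return raw.decode("iso-8859-1")


# Made-up sample body, encoded the way the server encoded the real one.
body = '["hello ",[["hello tradução",0]]]'.encode("iso-8859-1")

print(json.loads(decode_body(body, "iso-8859-1")))
print(json.loads(decode_body(body)))  # header without a charset
```

The Latin-1 fallback never raises, but it can silently produce wrong characters if the body is in yet another encoding, so the declared charset should win whenever it is present.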

marcelpaulo commented 6 years ago

Now, this is puzzling: I changed that single line in the code (respobj = ...), and when I try to complete, I get the same error:

+paulo@monk:~/src/googler$ ./googler 'hello [ERROR] 'utf-8' codec can't decode byte 0xe7 in position 333: invalid continuation byte

but, if I generate the completions as @zmwangx suggested:

:paulo@monk:~/src/googler$ ./googler --debug --complete 'hello '
[DEBUG] googler version 3.5
[DEBUG] Python version 3.6.3
hello neighbor
hello kitty
hello darkness my old friend
hello google
hello adele
hello moto
hello tradução
hello my twenties
hello hello
hello letra

When I think I got it, it just slips through my fingers !

marcelpaulo commented 6 years ago

Ah, I know what's wrong ! I'm running googler from the git directory, but _googler is getting it from $PATH, which is the installed, non-edited version, which still tries to decode utf-8 !

May I send a PR ? My ignorance of git is not so perfect, so I think I can do it ;-)
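
A quick way to check which executable a bare command name resolves to on $PATH, sketched with the standard library. The helper name resolve_command is made up for illustration.

```python
import os
import shutil


def resolve_command(name, path=None):
    """Return the absolute path of the executable `name` resolves to, or None.

    Hypothetical helper, for illustration only. `path` defaults to $PATH.
    """
    found = shutil.which(name, path=path)
    return os.path.abspath(found) if found else None


# Shows the installed copy that a completion script would pick up, which may
# differ from ./googler in the git checkout.
print(resolve_command("googler"))
```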

jarun commented 6 years ago

Please send the PR. Just print resp.headers.get_content_charset() in logger.debug mode as well.

marcelpaulo commented 6 years ago

now for the next learning session, would you be interested in sending over a patch?

You were right, @jarun, this really rounded off the learning experience nicely ! Thank you so much, both @jarun and @zmwangx, for your patience with me: I learned a lot tonight from you guys !

marcelpaulo commented 6 years ago

Fixed by #228

jarun / googler

'utf-8' codec can't decode byte ... in position ...: invalid continuation byte #227
