Closed: rpanah closed this issue 7 years ago
I've been seeing a similar, but slightly different, error. Seems to manifest as 0-byte results files.
2016-07-18 20:26:56,721 client.py(line 402) ERROR: Error saving results for baseline to file: 'utf8' codec can't decode byte 0xe1 in position 5: unexpected end of data
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/centinel/client.py", line 398, in run_exp
json.dump(results, result_file, indent=2, separators=(',', ': '))
File "/usr/lib/python2.7/json/__init__.py", line 189, in dump
for chunk in iterable:
File "/usr/lib/python2.7/json/encoder.py", line 434, in _iterencode
for chunk in _iterencode_dict(o, _current_indent_level):
File "/usr/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
for chunk in chunks:
File "/usr/lib/python2.7/json/encoder.py", line 332, in _iterencode_list
for chunk in chunks:
File "/usr/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
for chunk in chunks:
File "/usr/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
for chunk in chunks:
File "/usr/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
for chunk in chunks:
File "/usr/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
for chunk in chunks:
File "/usr/lib/python2.7/json/encoder.py", line 390, in _iterencode_dict
yield _encoder(value)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 5: unexpected end of data
There seems to be a problem in the HTTP primitive that fails to encode the results properly. I'm looking into that.
This problem was particularly hard to reproduce since it is fairly uncommon (1 in ~2000 websites) to have non-UTF-8 characters in the headers (they're usually simple ASCII), and even harder to diagnose since the encoder will not say exactly where in the huge JSON object the error occurs (the example below was taken out of a 90MB file with ~0.5 million lines of data).
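One way to make this easier to diagnose is a small helper that walks the results object and yields the path of every byte string that fails to decode, since the encoder's traceback doesn't say which key is at fault. This is a rough sketch: `find_bad_strings` is a made-up name, the sample `results` dict is illustrative, and it's written in Python 3 syntax (in Python 2.7, `str` is already bytes and `yield from` would need a loop):

```python
# Hypothetical helper: recursively walk a nested results structure and
# report the path of any byte string that is not valid UTF-8.
def find_bad_strings(obj, path=""):
    if isinstance(obj, bytes):
        try:
            obj.decode("utf-8")
        except UnicodeDecodeError:
            yield path
    elif isinstance(obj, dict):
        for key, value in obj.items():
            yield from find_bad_strings(value, "%s/%s" % (path, key))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            yield from find_bad_strings(value, "%s[%d]" % (path, i))

# Illustrative data shaped like the example below; the cookie bytes are made up.
results = {"response": {"headers": {"Set-Cookie": b"dw=AB\xe1; path=/"}}}
print(list(find_bad_strings(results)))  # ['/response/headers/Set-Cookie']
```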
The encoding issue here stems from non-UTF-8 bytes in HTTP request/response headers. The UTF-8 codec can't decode those bytes, so writing the entire result file to disk fails.
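The failure is easy to reproduce in isolation. 0xe1 opens a three-byte UTF-8 sequence, so a lone 0xe1 at the end of a byte string dies with exactly the "unexpected end of data" message from the log above (Python 3 shown; the cookie-like bytes are made up):

```python
# Minimal reproduction: 0xe1 is a 3-byte UTF-8 lead byte with no
# continuation bytes following it, so decoding fails at position 5.
raw = b"dw=AB\xe1"
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print(e.start, e.reason)  # 5 unexpected end of data
```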
The solution that seems to work is to pass ensure_ascii=False when writing the JSON object to file, so byte strings are written as-is instead of being decoded to UTF-8 first:
json.dump(results, result_file, indent=2, separators=(',', ': '), ensure_ascii=False)
Doing this will break server-side analysis for results that contain non-UTF-8 bytes (JSON decoding will fail), so we will need to handle (e.g. replace) non-decodable characters on the server when analyzing results.
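The server-side handling could look like the following sketch: decode with errors="replace" so each undecodable byte sequence becomes U+FFFD instead of aborting the whole analysis (the cookie bytes are made up; this is one option, not the committed fix):

```python
# Sketch of server-side handling: replace undecodable bytes with the
# Unicode replacement character U+FFFD rather than raising.
raw = b"dw=AB\xe1; path=/"
cleaned = raw.decode("utf-8", errors="replace")
print(cleaned)  # dw=AB\ufffd; path=/
```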
Here's an example with a non-UTF-8 character in a cookie:
"http://www.zoomshare.com": {
"request": {
"headers": {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36"
},
"path": "/",
"host": "www.zoomshare.com",
"method": "GET",
"ssl": false
},
"response": {
"status": 200,
"headers": {
"Transfer-Encoding": "chunked",
"Set-Cookie": "dw=†IÃ; path=/",
"Server": "Apache/1.3.41 (Unix) mod_perl/1.30 mod_fastcgi/2.4.6",
"Last-Modified": "Thu, 04 Sep 2014 19:54:14 GMT",
"Connection": "close",
"Date": "Sat, 03 Sep 2016 20:45:10 GMT",
"Content-Type": "text/html; charset=iso-8859-1"
},
"reason": "OK",
"body": "[...]"
}
}