iclab / centinel

http://iclab.org/
MIT License

encoding exception when storing results #257

Closed rpanah closed 7 years ago

rpanah commented 8 years ago
2016-07-14 18:34:29,043 client.py(line 428) ERROR: Error saving results for baseline to file: 'utf8' codec can't decode byte 0xe0 in position 3: invalid continuation byte
Traceback (most recent call last):
  File "/Users/abbas/iclab/centinel/centinel/client.py", line 420, in run_exp
    json.dump(results, result_file, indent=2, separators=(',', ': '))
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 189, in dump
    for chunk in iterable:
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 434, in _iterencode
    for chunk in _iterencode_dict(o, _current_indent_level):
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
    for chunk in chunks:
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 332, in _iterencode_list
    for chunk in chunks:
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
    for chunk in chunks:
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
    for chunk in chunks:
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
    for chunk in chunks:
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
    for chunk in chunks:
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 390, in _iterencode_dict
    yield _encoder(value)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 3: invalid continuation byte
sneft commented 8 years ago

I've been seeing a similar, but slightly different, error. Seems to manifest as 0-byte results files.

2016-07-18 20:26:56,721 client.py(line 402) ERROR: Error saving results for baseline to file: 'utf8' codec can't decode byte 0xe1 in position 5: unexpected end of data
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/centinel/client.py", line 398, in run_exp
    json.dump(results, result_file, indent=2, separators=(',', ': '))
  File "/usr/lib/python2.7/json/__init__.py", line 189, in dump
    for chunk in iterable:
  File "/usr/lib/python2.7/json/encoder.py", line 434, in _iterencode
    for chunk in _iterencode_dict(o, _current_indent_level):
  File "/usr/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
    for chunk in chunks:
  File "/usr/lib/python2.7/json/encoder.py", line 332, in _iterencode_list
    for chunk in chunks:
  File "/usr/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
    for chunk in chunks:
  File "/usr/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
    for chunk in chunks:
  File "/usr/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
    for chunk in chunks:
  File "/usr/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
    for chunk in chunks:
  File "/usr/lib/python2.7/json/encoder.py", line 390, in _iterencode_dict
    yield _encoder(value)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 5: unexpected end of data
rpanah commented 8 years ago

There seems to be a problem in the HTTP primitive that fails to encode the results properly. I'm looking into that.

rpanah commented 7 years ago

This problem was particularly hard to reproduce since it is fairly uncommon (about 1 in 2,000 websites) for HTTP headers to contain non-UTF-8 bytes (they're usually plain ASCII), and even harder to diagnose since the encoder does not say where in the huge JSON object the error occurs (the example below was taken from a 90 MB file with roughly half a million lines of data).

The encoding issue here stems from non-UTF-8 bytes in HTTP request/response headers. When json.dump tries to decode these byte strings as UTF-8 it fails, which causes the entire result file to fail to write to disk.
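The failure mode can be reproduced in isolation. This is a minimal sketch in Python 3 syntax (the original code ran on Python 2, where json.dump raised the same UnicodeDecodeError while implicitly decoding header byte strings); the header value is a made-up example, not taken from the actual results:

```python
# 0xe0 starts a 3-byte UTF-8 sequence, so it must be followed by two
# continuation bytes (0x80-0xBF). Here it is followed by ASCII 'I',
# which triggers "invalid continuation byte".
raw_header = b"dw=\xe0IC; path=/"

try:
    raw_header.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)
```

The same decode happens implicitly inside the JSON encoder when it escapes a byte string for output, which is why the error surfaces deep in encoder.py rather than in the experiment code.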

The solution that seems to work is to skip the ASCII-escaping step (which is where the implicit UTF-8 decode happens) by passing ensure_ascii=False when writing the JSON object to file:

    json.dump(results, result_file, indent=2, separators=(',', ': '), ensure_ascii=False)
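An alternative approach (not the fix adopted here, just a sketch for comparison) is to sanitize the results on the client before dumping, replacing undecodable bytes with U+FFFD so the output is always valid UTF-8 JSON. The sanitize helper below is hypothetical, not part of centinel:

```python
import json

def sanitize(obj):
    # Illustrative helper: recursively decode byte strings,
    # substituting U+FFFD for bytes that are not valid UTF-8.
    if isinstance(obj, bytes):
        return obj.decode("utf-8", errors="replace")
    if isinstance(obj, dict):
        return {sanitize(k): sanitize(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [sanitize(v) for v in obj]
    return obj

results = {"headers": {b"Set-Cookie": b"dw=\xe0IC; path=/"}}
clean = json.dumps(sanitize(results), indent=2, ensure_ascii=False)
```

The trade-off is that sanitizing loses the original bytes, whereas ensure_ascii=False preserves them in the file and defers the problem to whoever reads it.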

Doing this will break server-side analysis for results that contain non-UTF-8 bytes (JSON decoding will fail), so we will need to handle (e.g. replace) non-decodable bytes on the server when analyzing results.
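Server-side handling could look something like the following sketch. The load_results helper is hypothetical (not centinel's actual API); it assumes the result file was written with ensure_ascii=False and may therefore contain raw non-UTF-8 bytes inside string values:

```python
import json

def load_results(raw_bytes):
    # Hypothetical server-side loader: replace undecodable bytes with
    # U+FFFD before parsing, so json.loads never sees invalid UTF-8.
    text = raw_bytes.decode("utf-8", errors="replace")
    return json.loads(text)

# A result file written with ensure_ascii=False may contain raw bytes
# inside a string value, e.g.:
raw = b'{"Set-Cookie": "dw=\xe0IC; path=/"}'
parsed = load_results(raw)
```

Replacement is lossy but keeps the rest of the 90 MB result file analyzable instead of failing the whole parse on one bad cookie.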

Here's an example with a non-UTF-8 character in a cookie:

        "http://www.zoomshare.com": {
          "request": {
            "headers": {
              "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36"
            },
            "path": "/",
            "host": "www.zoomshare.com",
            "method": "GET",
            "ssl": false
          },
          "response": {
            "status": 200,
            "headers": {
              "Transfer-Encoding": "chunked",
              "Set-Cookie": "dw=†IÃ; path=/",
              "Server": "Apache/1.3.41 (Unix) mod_perl/1.30 mod_fastcgi/2.4.6",
              "Last-Modified": "Thu, 04 Sep 2014 19:54:14 GMT",
              "Connection": "close",
              "Date": "Sat, 03 Sep 2016 20:45:10 GMT",
              "Content-Type": "text/html; charset=iso-8859-1"
            },
            "reason": "OK",
            "body": "[...]"
          }
        }