matomo-org / matomo-log-analytics

Import any kind of server logs in Matomo for powerful log analytics. Universal log file parsing and reporting.
https://matomo.org/log-analytics/
GNU General Public License v3.0

import_logs.py - 'utf-8' codec can't encode character '\udcbf' in position 0: surrogates not allowed #334

Open mctunes opened 2 years ago

mctunes commented 2 years ago

While importing standard IIS log files using import_logs.py, the following exception was thrown when processing one particular file:

Exception in thread Thread-4:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/var/www/html/piwik/misc/log-analytics/import_logs.py", line 1864, in _run_bulk
    self._record_hits(hits)
  File "/var/www/html/piwik/misc/log-analytics/import_logs.py", line 2010, in _record_hits
    'requests': [self._get_hit_args(hit) for hit in hits]
  File "/var/www/html/piwik/misc/log-analytics/import_logs.py", line 2010, in <listcomp>
    'requests': [self._get_hit_args(hit) for hit in hits]
  File "/var/www/html/piwik/misc/log-analytics/import_logs.py", line 1971, in _get_hit_args
    urllib.parse.quote(args['urlref'], '')
  File "/usr/lib/python3.8/urllib/parse.py", line 853, in quote
    string = string.encode(encoding, errors)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcbf' in position 0: surrogates not allowed
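For anyone triaging this: '\udcbf' is the lone surrogate that Python's "surrogateescape" error handler produces for an undecodable 0xBF byte, so the referrer field in that log line most likely contained a byte that is not valid UTF-8. The sketch below reproduces the same error outside the importer. It assumes the value reached `urllib.parse.quote` as a surrogate-escaped string; the issue does not confirm how the file was decoded. `quote_strict` and `quote_tolerant` are hypothetical helper names, not functions in import_logs.py.

```python
import urllib.parse

# Assumption: the referrer field held a raw 0xBF byte, which is not valid UTF-8.
raw_referrer = b"\xbf"

# Decoding with errors="surrogateescape" maps the bad byte to the lone surrogate U+DCBF.
referrer = raw_referrer.decode("utf-8", errors="surrogateescape")


def quote_strict(value):
    """Same call shape as the failing line in the traceback."""
    return urllib.parse.quote(value, '')


def quote_tolerant(value):
    """Hypothetical workaround: turn the escaped byte back into %BF instead of raising."""
    return urllib.parse.quote(value, '', errors="surrogateescape")


try:
    quote_strict(referrer)
    strict_error = None
except UnicodeEncodeError as exc:
    # This is the exception reported in the traceback above.
    strict_error = exc

tolerant_result = quote_tolerant(referrer)
print(tolerant_result)
```

Passing `errors="replace"` instead would substitute a question mark for the bad byte rather than preserving it. Which behavior is preferable for the importer is a design choice for the maintainers.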

Expected Behavior

The log file should be imported successfully.

Current Behavior

The exception above is thrown and the recorder thread stops.

Possible Solution

Steps to Reproduce (for Bugs)

  1. Execute import_logs.py for one particular file:
/var/www/html/piwik/misc/log-analytics/import_logs.py /logs/u_ex220127.log \
    --url=https://analytics.example.com \
    --idsite=1 \
    --recorders=4 \
    --accept-invalid-ssl-certificate \
    --enable-http-errors \
    --enable-bots \
    --exclude-path="/cf_scripts/*" \
    --exclude-path="/tz_json/*" \
    --exclude-path="/*/assets/*" \
    --exclude-path="/*/cache/*" \
    --exclude-path="/*/css/*" \
    --exclude-path="/*/images/*"

Context

This has only happened once, on one particular file. We worked around it by removing the file from the batch, and processing then continued as normal.
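Rather than dropping the whole file, it may be possible to narrow the problem down to the offending lines. The sketch below is not part of import_logs.py: `find_non_utf8_lines` is a hypothetical helper that reads a log in binary mode and reports every line that fails strict UTF-8 decoding. The two sample log lines are made up for illustration.

```python
import os
import tempfile


def find_non_utf8_lines(path):
    """Return (line_number, raw_bytes) for each line that is not valid UTF-8."""
    bad = []
    with open(path, "rb") as handle:
        for number, raw in enumerate(handle, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError:
                bad.append((number, raw.rstrip(b"\r\n")))
    return bad


# Made-up sample data: the second line carries a stray 0xBF byte.
sample = (
    b"2022-01-27 GET /index.html http://example.com/\n"
    b"2022-01-27 GET /page.html \xbfref\n"
)

with tempfile.NamedTemporaryFile(delete=False, suffix=".log") as tmp:
    tmp.write(sample)
    sample_path = tmp.name

try:
    bad_lines = find_non_utf8_lines(sample_path)
finally:
    os.remove(sample_path)

for number, raw in bad_lines:
    print(number, raw)
```

Running the same function against /logs/u_ex220127.log should list the lines that need inspecting, assuming the failure really does come from non-UTF-8 bytes in that file.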

Please let me know if there is any other information you need.

Your Environment