matomo-org / matomo-log-analytics

Import any kind of server logs in Matomo for powerful log analytics. Universal log file parsing and reporting.
https://matomo.org/log-analytics/
GNU General Public License v3.0

import_logs.py - Import logs errors with UnicodeEncodeError - Matomo 4.1.0 + Python 3.9.1 #299

Closed: user908348 closed this issue 3 years ago

user908348 commented 3 years ago

We recently upgraded Matomo to 4.0 and then 4.1.0, and we also upgraded to Python 3.9.1 so that we can run import_logs.py.

It seems some of our server logs (recent ones, by the looks of it) contain a Unicode character which prevents our nightly import job from working correctly. See the error below.

The problem seems to surface in parse.py, and as far as we can tell the problematic character is an ellipsis (dot-dot-dot). If we remove or replace this character in the log files, the import works.

Is there a setting/option we can invoke in Matomo to "bypass" these errors, or should we resort to stripping the character from the logs before starting the nightly job?

Just to note, we didn't have this issue with Matomo 3 + Python 2.

Error:

Exception in thread Thread-2:
Traceback (most recent call last):
  File "c:\Python391\lib\threading.py", line 954, in _bootstrap_inner
    self.run()
  File "c:\Python391\lib\threading.py", line 892, in run
    self._target(*self._args, **self._kwargs)
  File "e:\www\matomo.domain\misc\log-analytics\import_logs.py", line 1849, in _run_bulk
    self._record_hits(hits)
  File "e:\www\matomo.domain\misc\log-analytics\import_logs.py", line 1995, in _record_hits
    'requests': [self._get_hit_args(hit) for hit in hits]
  File "e:\www\matomo.domain\misc\log-analytics\import_logs.py", line 1995, in <listcomp>
    'requests': [self._get_hit_args(hit) for hit in hits]
  File "e:\www\matomo.domain\misc\log-analytics\import_logs.py", line 1953, in _get_hit_args
    urllib.parse.quote(args['url'], ''),
  File "c:\Python391\lib\urllib\parse.py", line 847, in quote
    string = string.encode(encoding, errors)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc85' in position 58: surrogates not allowed
39199 lines parsed, 10200 lines recorded, 139 records/sec (avg), 200 records/sec (current)
39199 lines parsed, 10200 lines recorded, 137 records/sec (avg), 0 records/sec (current)
39199 lines parsed, 10200 lines recorded, 135 records/sec (avg), 0 records/sec (current)
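For anyone trying to reproduce this outside Matomo, here is a minimal sketch of how an error like the one above can arise. It assumes the log line contains the raw byte 0x85 (the ellipsis in Windows-1252) and that the text was decoded as UTF-8 with Python's `surrogateescape` error handler. That would explain the `'\udc85'` in the message, but it is a guess about how import_logs.py reads the file, not something confirmed in this thread. The sample URL and the `try_quote` helper are made up for illustration.

```python
import urllib.parse

# Assumption: a request URL in the log ends with the single byte 0x85,
# which is an ellipsis in Windows-1252 but is not valid UTF-8 on its own.
raw = b"/search?q=more\x85"

# surrogateescape keeps the undecodable byte as the lone surrogate U+DC85
# instead of raising at decode time.
url = raw.decode("utf-8", errors="surrogateescape")
print(ascii(url))


def try_quote(value):
    """Hypothetical helper: return the quoted value, or the error reason."""
    try:
        return urllib.parse.quote(value, "")
    except UnicodeEncodeError as exc:
        return f"UnicodeEncodeError: {exc.reason}"


# quote() encodes the string as strict UTF-8, and lone surrogates
# cannot be encoded that way.
print(try_quote(url))
```

The failure only shows up when the URL is re-encoded for sending, which would fit the traceback pointing at `urllib.parse.quote` rather than at the code that reads the log.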

Possibly related:

https://github.com/matomo-org/matomo-log-analytics/issues/278
https://github.com/matomo-org/matomo/pull/15618

parse.py - https://github.com/python/cpython/blob/3.9/Lib/urllib/parse.py#L847

Edit: we use W3C-format server logs.

user908348 commented 3 years ago

Using the option --encoding=ansi resolved the issue for us.
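A rough sketch of why a different encoding can make the error go away, under some assumptions. As far as I know, `ansi` is a Python alias for the Windows-only `mbcs` codec (the system ANSI code page), so this example uses `cp1252` as a portable stand-in. It also assumes the offending byte is 0x85 and that the value of `--encoding` ends up as the codec used to decode the log file. The sample URL is made up.

```python
import urllib.parse

# Assumption: same kind of log content, a URL ending in the byte 0x85.
raw = b"/search?q=more\x85"

# cp1252 (stand-in for the Windows ANSI code page) maps 0x85 to U+2026,
# a real ellipsis character, so no lone surrogate is produced.
url = raw.decode("cp1252", errors="surrogateescape")

# A real character encodes to UTF-8 without trouble, so quote() succeeds.
quoted = urllib.parse.quote(url, "")
print(quoted)  # %2Fsearch%3Fq%3Dmore%E2%80%A6
```

If the logs really are written in the server's ANSI code page, decoding them that way keeps the original character instead of dropping it.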

mwithheld commented 3 years ago

For future travelers, try --encoding=ascii