splunk / splunk-sdk-python

Splunk Software Development Kit for Python
http://dev.splunk.com
Apache License 2.0

JSONResultsReader exception when record contains invalid UTF-8 characters #540

Open ericatdropzone opened 1 year ago

ericatdropzone commented 1 year ago

I'm using the botsv3 dataset and running code similar to this:

from splunklib.client import connect
from splunklib.results import JSONResultsReader
from time import sleep

spl_query = "search index=botsv3 sourcetype=stream:udp earliest=0"
connection = connect(host=host, port=port, username=user, password=password, autologin=True)
job = connection.jobs.create(spl_query)

# Wait for the job to complete
sleep(5)

reader = JSONResultsReader(job.results(output_mode="json", earliest_time=earliest_time, count=max_results))
for result in reader: # This throws an exception
    ...

I'm seeing this exception:

Traceback (most recent call last):
  File "/app/splunk_scanner/splunk_connection.py", line 54, in query
    for result in reader:
  File "/usr/local/lib/python3.11/site-packages/splunklib/results.py", line 352, in next
    return next(self._gen)
           ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/splunklib/results.py", line 361, in _parse_results
    parsed_line = json_loads(strip_line)
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/json/__init__.py", line 341, in loads
    s = s.decode(detect_encoding(s), 'surrogatepass')
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8e in position 89348: invalid start byte

This looks similar to this issue, but running version 1.7.4 didn't fix this instance of the problem. I also noticed that this pull request appears to fix the issue, but I'm not sure if that's the approach you'd want to take.

kleptog commented 1 year ago

Well, this is disappointing. We've had to work around Splunk returning incorrectly encoded XML in the case of binary data, and I was hoping that in the switch to JSON they would have fixed that. So now we can look ahead to working around this in JSON as well (yay!)
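
For anyone hitting the same traceback, here is a minimal sketch of what such a workaround could look like. It is not the SDK's API and not necessarily what the linked pull request does. It assumes the response body is newline-delimited JSON read as bytes, and the helper name `iter_json_lines_lenient` is hypothetical. The idea is to decode each line yourself with `errors="replace"` so an invalid byte becomes U+FFFD instead of raising `UnicodeDecodeError` inside `json.loads`.

```python
import io
import json


def iter_json_lines_lenient(stream):
    """Yield parsed JSON objects from a binary stream of newline-delimited JSON.

    Hypothetical helper, not part of splunklib. Bytes that are not valid
    UTF-8 are replaced with U+FFFD rather than raising UnicodeDecodeError.
    """
    for raw_line in stream:
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines between records
        # Decode leniently first; json.loads on raw bytes decodes strictly.
        yield json.loads(line.decode("utf-8", errors="replace"))


# Stand-in for a response body: the second record carries a stray 0x8e byte,
# the same byte value reported in the traceback above.
body = b'{"result": {"_raw": "ok"}}\n{"result": {"_raw": "bad \x8e byte"}}\n'
records = list(iter_json_lines_lenient(io.BytesIO(body)))
```

The trade-off is that the original byte is lost in the replacement, so this only suits cases where a readable record matters more than a byte-exact one.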