Adding results.JSONReader

yaleman commented 2 years ago

results.ResultsReader is slow because it's iterating byte-by-byte through the stream to parse the XML in a way the chosen parser will be happy. I've added JSONResultsReader to provide a much more performant option.

Benefits:

Less data transferred over the wire.
~99% faster on most things I can test.

Other changes in this commit

Removed unnecessary imports.
Fixed some places where indenting was 2 spaces, not 4.
Renamed some input variables which aren't used by name, to match Python standards.

Test Data

Running the tests from https://github.com/yaleman/splunk-sdk-games/

(you can just clone the repo, configure it and run ./run_tests.sh)

Generating the local files, so that we're not testing the response of my Splunk instance:

➜ ./run_tests.sh
Generating local JSON data

real    0m25.155s
user    0m0.532s
sys 0m0.661s

Generating local XML data

real    0m29.173s
user    0m0.991s
sys 0m1.462s

File sizes

-rw-r--r--  1 yaleman  staff   161M 12 Feb 13:17 outputfile.json
-rw-r--r--  1 yaleman  staff   374M 12 Feb 13:17 outputfile.xml

Using results.ResultsReader and the job in XML output format:

Running test_file_resultsreader.py
message:  "INFO: Your timerange was substituted based on your search string"
Results: 113625
Preview results: 9963

real    2m2.416s
user    2m0.997s
sys 0m0.405s

Using results.JSONResultsReader and the job in JSON output format:

Running test_file_jsonreader.py
Results: 113624
Preview results: 12500

real    0m1.023s
user    0m0.856s
sys 0m0.090s

The 1 "missing" result is the Message that the JSON export endpoint doesn't return.

AndyXan commented 2 years ago

Hmm. Just for clarification:

If I'm just issuing (non-export, just regular) querys via the splunk sdk and getting a job back

job = self.service.jobs.create(search_query, **search_params)
current_results = job.results(output_mode='json', count=50000, offset=0)
reader = results.JSONResultsReader(current_results)

The API returns a dict containing messages = [...] and results [...], is_preview and so on.

Your implementation of

    def _parse_results(self, stream: BufferedReader):
        """Parse results and messages out of *stream*."""
        for line in stream.readlines():

            event = json_loads(line)
            if "preview" in event:
                self.is_preview = event["preview"]
            if "msg" in event:
                msg_type = event.get("type", "Uknown Message Type")
                text = event.get("text")
                yield Message(msg_type, text)
            yield event

Would not return single events. The "single" event would consist of arrays (results) containing the 50.000 events. This is rather different from

current_results = job.results(output_mode='xml', count=50000, offset=0)
reader = results.ResultsReader(io.BufferedReader(ResponseReaderWrapper(current_results)))

for result in reader:
    if isinstance(result, dict):
        log.info("Result: %s" % result)
    elif isinstance(result, results.Message):
        log.info("Message: %s" % result)
log.info("is_preview = %s " % reader.is_preview)

which would print 50.000 log lines. Above solution would log 1 time, giant dict with key results containing 50.000 events

AndyXan commented 2 years ago

This is more of what I would expect:

    def _parse_results(self, stream: BufferedReader):
        """Parse results and messages out of *stream*."""
        for line in stream.readlines():

            bulk = json_loads(line)
            if "preview" in bulk:
                self.is_preview = bulk["preview"]

            for message in bulk.get('messages', []):
                msg_type = message.get("type", "Uknown Message Type")
                text = message.get("text")
                yield Message(msg_type, text)

            for event in bulk.get('results', []):
                yield event

yaleman commented 2 years ago

I think 78079ae updates the code to something that should work.

I've updated my test code over in https://github.com/yaleman/splunk-sdk-games so that seems to work too.

Running test_file_jsonreader.py
RESULT_COUNT=113624
MESSAGE_COUNT=0
PREVIEW_COUNT=12500

real    0m1.034s
user    0m0.873s
sys 0m0.089s

Running test_file_jsonreader_create.py
RESULT_COUNT=113624
MESSAGE_COUNT=12
PREVIEW_COUNT=0

real    0m1.035s
user    0m0.837s
sys 0m0.140s

Running test_file_resultsreader.py
RESULT_COUNT=113624
MESSAGE_COUNT=1
PREVIEW_COUNT=25000

real    2m17.480s
user    2m16.443s
sys 0m0.706s

If someone can provide a way of testing it in docker or something that'd be cool but I think it works.

I can't run the built-in tests because python 3.7 on an m1 macBook doesn't have _ctypes and I really don't want to spend more hours building VMs to test a library I don't even use.

ashah-splunk commented 2 years ago

@yaleman thanks for the PR. We are considering the provided suggestion to use JSONResultsReader and exploring more on the same.

ashah-splunk commented 2 years ago

@yaleman we have considered your changes for JSONResultsReader and have added some modifications as well which are now available in the latest Python SDK release 1.6.19.

splunk / splunk-sdk-python

Adding results.JSONReader #435

Other changes in this commit

Test Data