splunk / splunk-sdk-python

Splunk Software Development Kit for Python
http://dev.splunk.com
Apache License 2.0

MemoryError thrown from results.py while retrieving large export job #363

Closed crumpetcrusher closed 2 years ago

crumpetcrusher commented 3 years ago

Describe the bug: Long-running, large export jobs that write results to a file throw a MemoryError from within results.py.

To Reproduce: Run an export job with the latest SDK release for a result set larger than 45 GB (in my case). During the export, the traceback below is raised from results.py.

Expected behavior: The export job yields all results without throwing a MemoryError.

Logs or Screenshots:

  File "C:\dev\code\export_logs.py", line 51, in <module>
    for result in rr:
  File "C:\dev\Python39\lib\site-packages\splunklib\results.py", line 210, in next
    return next(self._gen)
  File "C:\dev\Python39\lib\site-packages\splunklib\results.py", line 219, in _parse_results
    for event, elem in et.iterparse(stream, events=('start', 'end')):
  File "C:\dev\Python39\lib\xml\etree\ElementTree.py", line 1256, in iterator
    data = source.read(16 * 1024)
  File "C:\dev\Python39\lib\site-packages\splunklib\results.py", line 105, in read
    txt = self.streams[0].read(n)
  File "C:\dev\Python39\lib\site-packages\splunklib\results.py", line 151, in read
    response += c
MemoryError

Code example being executed:

import csv

from splunklib import client, results

...
service = client.connect(...)

query = "search index IN ... src IN ..."

kwargs = {
    "earliest_time": start,
    "latest_time": end,
    "search_mode": "normal",
    "enable_lookups": "false",
    "timeout": "604800",
}

rr = results.ResultsReader(service.jobs.export(query, **kwargs))

with open('results.csv', 'w+', newline="") as output:
    csvwriter = csv.DictWriter(output, fieldnames=['_time', '_raw'])
    csvwriter.writeheader()
    for result in rr:
        if isinstance(result, dict):
            csvwriter.writerow(result)
        elif isinstance(result, results.Message):
            # Diagnostic messages might be returned in the results
            print('%s: %s' % (result.type, result.message))
assert not rr.is_preview

Splunk:

SDK:

ashah-splunk commented 2 years ago

@crumpetcrusher In our latest Python SDK release, 1.6.19, we added JSONResultsReader, which is an improvement over ResultsReader. We suggest switching to JSONResultsReader, as it performs better and uses less memory. Please let us know if this helps in your application. Note: JSONResultsReader requires the query parameter "output_mode" to be set to 'json'.
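
A minimal sketch of what that switch might look like, adapted from the reporter's script above. The helper names write_rows and export_to_csv are hypothetical and not part of the SDK. The sketch also assumes that extra fields in each result should be dropped (extrasaction="ignore") so that only _time and _raw are written. It has not been verified against a 45 GB export.

```python
import csv


def write_rows(reader, output, fieldnames=("_time", "_raw")):
    """Stream results from `reader` into CSV, one row at a time.

    `reader` is any iterable yielding dicts (results) and other objects
    (diagnostic messages, e.g. results.Message in the SDK).
    Hypothetical helper, not part of splunklib.
    """
    # extrasaction="ignore" is an assumption: fields other than the
    # requested ones are silently dropped instead of raising ValueError.
    writer = csv.DictWriter(output, fieldnames=list(fieldnames),
                            extrasaction="ignore")
    writer.writeheader()
    rows = 0
    for item in reader:
        if isinstance(item, dict):
            writer.writerow(item)
            rows += 1
        else:
            # Diagnostic messages might be returned in the results
            print("%s: %s" % (getattr(item, "type", "?"),
                              getattr(item, "message", item)))
    return rows


def export_to_csv(service, query, path, **kwargs):
    """Run an export search and write it to `path` via JSONResultsReader."""
    # Imported lazily so write_rows can be used without the SDK installed.
    from splunklib import results

    # JSONResultsReader only works when output_mode is set to "json".
    stream = service.jobs.export(query, output_mode="json", **kwargs)
    reader = results.JSONResultsReader(stream)
    with open(path, "w", newline="") as output:
        return write_rows(reader, output)
```

With the reporter's variables, the call would be along the lines of export_to_csv(service, query, 'results.csv', **kwargs), where kwargs is the same dictionary shown in the original report.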

ashah-splunk commented 2 years ago

Closing the issue as we haven't received a response. @crumpetcrusher, please reopen it if you still see the error.