splunk / splunk-sdk-python

Splunk Software Development Kit for Python
http://dev.splunk.com
Apache License 2.0

Bug Report: Error Handling Large CSV Fields in Splunk SDK #561

Open Catsofsuffering opened 6 months ago

Catsofsuffering commented 6 months ago

The bug I found and how to repair it

While developing a threat-hunting application, I encountered a bug at line 948 of splunklib\searchcommands\search_command.py. The relevant code snippet is as follows:

def _read_csv_records(self, ifile):
    # Read the records sent to the command as CSV; the reader honors
    # csv.field_size_limit(), which defaults to 131072.
    reader = csv.reader(ifile, dialect=CsvDialect)

    try:
        # The first row carries the field names for the records that follow.
        fieldnames = next(reader)
    except StopIteration:
        return

The bug arises from the use of the standard-library csv module's reader, which enforces a default field size limit. When processing large amounts of data, any field that exceeds that limit causes the following error:

Error: field larger than field limit (131072)
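For reference, the failure is easy to reproduce outside the SDK. The sketch below uses only the standard library and assumes the default field size limit of 131072:

import csv
import io

# The csv module's default field size limit is 131072.
print(csv.field_size_limit())   # typically prints 131072

# A single field just over the limit triggers the same error the SDK hits.
oversized = "x" * (131072 + 1)
reader = csv.reader(io.StringIO(oversized + "\n"))
try:
    next(reader)
except csv.Error as err:
    print(err)                  # field larger than field limit (131072)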

After conducting a Google search, I found a solution on Stack Overflow (https://stackoverflow.com/questions/15063936/csv-error-field-larger-than-field-limit-131072). Implementing the following code snippet resolved the issue:

csv.field_size_limit(sys.maxsize)
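One caveat, also noted in the same Stack Overflow thread: csv.field_size_limit() stores the limit in a C long, so passing sys.maxsize directly can raise OverflowError on platforms where a C long is 32 bits (for example, 64-bit CPython on Windows). A sketch of the back-off variant from that thread:

import csv
import sys

# Back off from sys.maxsize until csv.field_size_limit() accepts the value.
limit = sys.maxsize
while True:
    try:
        csv.field_size_limit(limit)
        break
    except OverflowError:
        limit = int(limit / 10)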

Splunk:

SDK:

Additional context

Are there any risks or issues associated with my approach?

ashah-splunk commented 6 months ago

@Catsofsuffering Can you please provide the steps to reproduce this issue?

Catsofsuffering commented 6 months ago

> @Catsofsuffering Can you please provide the steps to reproduce this issue?

Due to our company's data security policy, I am unable to directly provide screenshots or logs to you. However, I can briefly describe the background and cause of this bug:

As mentioned earlier, I have developed an app that matches IOC threat intelligence. It sends a large amount of URL data to the corresponding API and then imports the returned results into Splunk. During this process, the error Error: field larger than field limit (131072) occurred because the returned data (roughly 1 billion events) produced fields that exceeded csv.reader's default field size limit.

Although only about 40,000 records are ultimately filtered on average, the issue still occurs. I'm not sure whether this is a bug or an area that needs optimization, so I would like to consult your team. Thank you.
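If it helps, one possible way to apply the workaround without patching the SDK is to raise the limit at the top of the custom command script, since csv.field_size_limit() is process-wide and should therefore also apply to the SDK's internal reader. The sketch below is illustrative only: IocMatchCommand and its stream logic are hypothetical stand-ins, while the dispatch pattern follows the splunklib.searchcommands examples.

import csv
import sys

from splunklib.searchcommands import dispatch, StreamingCommand, Configuration

# Raise the csv field size limit before the SDK starts reading records.
# sys.maxsize can overflow a C long on some platforms, so fall back to the
# largest 32-bit signed value if that happens.
try:
    csv.field_size_limit(sys.maxsize)
except OverflowError:
    csv.field_size_limit(2 ** 31 - 1)


@Configuration()
class IocMatchCommand(StreamingCommand):
    # Hypothetical command: enrich each event with IOC lookup results.
    def stream(self, records):
        for record in records:
            # ... call the threat-intelligence API and add fields here ...
            yield record


if __name__ == '__main__':
    dispatch(IocMatchCommand, sys.argv, sys.stdin, sys.stdout, __name__)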