cui134 opened this issue (closed 4 years ago)
When the previous output contains non-ASCII characters, this StreamingCommand appears to block on ifile.read(body_length) in the following code:
try:
    if body_length > 0:
        body = ifile.read(body_length)
except Exception as error:
    raise RuntimeError('Failed to read body of length {}: {}'.format(body_length, error))
This code is in the function "_read_chunk" in class SearchCommand. When I use search command protocol version 1, the issue does not reproduce. (Protocol v2, the "chunked" protocol, is selected with chunked = true in commands.conf.) My commands.conf stanza:
[testcommand]
filename=testcommand.py
enableheader = true
outputheader = true
requires_srinfo = true
stderr_dest = message
supports_getinfo = true
supports_rawargs = true
supports_multivalues = true
I actually JUST encountered this issue! It took me forever to narrow down what was happening. I'm not sure if it's a problem with the sdk so much as it is with splunkd.
I have a command I'm doing a POC with, and I tested it with:
| makeresults | eval string="hey guys" | formatasjson string
which comes through stdin as:
chunked 1.0,36,66
{"action":"execute","finished":true}
"_time",string,"__mv__time","__mv_string"\n1590694186,"hey guys",,
But when I run the same search with a non-ASCII character substituted:
| makeresults | eval string="héy guys" | formatasjson string
I get a body length in the header that is one greater than the actual character count:
chunked 1.0,36,67
{"action":"execute","finished":true}
"_time",string,"__mv__time","__mv_string"\n1590694057,"héy guys",,
The sdk (quite naturally) hangs at:
splunklib/searchcommands/search_command.py:
888. try:
889.     if body_length > 0:
890.         body = ifile.read(body_length)
Because it's waiting for that 67th character that will never come.
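The mismatch is easy to demonstrate outside Splunk, since a TextIOWrapper counts characters while the header counts bytes. A minimal sketch, with io.BytesIO standing in for stdin:

import io

payload = u'"héy guys"'.encode('utf-8')  # 10 characters, but 11 bytes (é is 2 bytes in UTF-8)
stream = io.TextIOWrapper(io.BytesIO(payload), encoding='utf-8')

data = stream.read(11)  # ask for 11 "units": the header counted BYTES, read() counts CHARACTERS
print(len(data))        # 10 -- BytesIO hits EOF, so the read returns short
# On the real stdin the pipe stays open, so instead of returning short,
# read(11) blocks waiting for an 11th character that never arrives.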
Right now I'm just trying to build a band-aid workaround that monitors for a "deceptive header." I'm not sure yet if it's off by one for every non-ASCII character, or just off by one, period. This is just an uninformed guess, but I wonder if it has to do with Splunk using Python 2 (which requires the "u" prefix before a unicode string literal). That's the only single-character difference between ASCII and unicode strings I can think of at the moment, but I don't know why Splunk would be running a repr operation on the string.
--EDIT-- Unfortunately there does seem to be +1 on the body length for each unicode character...
0 Unicode Characters:
| makeresults | eval string="hey guys" | formatasjson string
Returns an accurate 66 character body length:
chunked 1.0,36,66
{"action":"execute","finished":true}
"_time",string,"__mv__time","__mv_string"\n1590694186,"hey guys",,
1 Unicode Character:
| makeresults | eval string="héy guys" | formatasjson string
Returns a +1 character (67) body length:
chunked 1.0,36,67
{"action":"execute","finished":true}
"_time",string,"__mv__time","__mv_string"\n1590694057,"héy guys",,
2 Unicode Characters:
| makeresults | eval string="héy güys" | formatasjson string
Returns a +2 character (68) body length:
chunked 1.0,36,68
{"action":"execute","finished":true}
"_time",string,"__mv__time","__mv_string"\n1590695753,"héy güys",,
> I actually JUST encountered this issue! ... I'm just trying to make a band-aid approach ...
This worked for me:
splunklib/searchcommands/search_command.py:
  888. try:
- 889.     if body_length > 0:
- 890.         body = ifile.read(body_length)
+ 889.     body_saved = body_length
+ 890.     while body_saved > 0:
+ 891.         char = ifile.read(1)
+ 892.         body_saved -= 2 if char.encode('ascii', errors='ignore') == b'' else 1
+ 893.         body += char
Sorry that's not very copy/paste friendly. Basically I just changed the ifile.read(body_length) line (and the preceding line) to:
<EDITED>
body_saved = body_length
while body_saved > 0:
    char = ifile.read(1)
    body_saved -= len(bytes(char, encoding='utf-8'))
    body += char
</EDITED>
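For anyone who wants to try that loop outside Splunk, here is a hypothetical self-contained version (io.BytesIO stands in for stdin; body must already be initialized at this point in the real file, since the original code only assigns it when body_length > 0):

import io

payload = u'"héy güys"'.encode('utf-8')  # 10 characters, 12 bytes
ifile = io.TextIOWrapper(io.BytesIO(payload), encoding='utf-8')

body_length = len(payload)  # what splunkd reports in the header: BYTES
body = u''                  # initialized earlier in the real _read_chunk
body_saved = body_length
while body_saved > 0:
    char = ifile.read(1)
    body_saved -= len(bytes(char, encoding='utf-8'))
    body += char

print(body)  # "héy güys" -- terminates instead of hanging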
After some testing, I was able to determine that Splunk will consistently add +1 to the body_length value in the header for every non-ASCII character (ascii char = 1, unicode char = 2). When I made this fix it became stable and functional.
I can't think of an approach that doesn't involve reading the body in character by character. I played around with a timeout, but even that needed to read one character at a time; there's no one-size-fits-all timeout that ends up more performant than just reading it that way. (And a read(length) with a timeout that fails would have to fall back on character-by-character reading at that point regardless.)
My personal next step is to open a question with Splunk itself about fixing their header bug, but with this workaround I can at least move on with developing my app.
EDITED (NOTE)
More research and more tests, and I realized that the length discrepancy comes from the number of bytes required per character in the body. Simple ASCII characters are obviously a single byte, whereas non-ASCII unicode characters are typically 2 bytes; there are, however, characters like 基 that are three bytes. I changed the code to evaluate the byte length of each character coming through in the body. This will (hopefully) be an unnecessary step when Splunk moves to Python 3, but this particular tweak makes me feel better because it attacks the true cause, so it should be the most stable strategy for now.
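A quick illustration of those byte widths under UTF-8:

for ch in (u'a', u'é', u'基'):
    print(ch, len(ch.encode('utf-8')))
# a 1
# é 2
# 基 3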
Just spent way too long debugging this myself.
The issue is that the Splunk platform sends the body length in the header in BYTES, but splunklib uses it to read from a TextIOWrapper, which counts CHARACTERS.
So any character wider than 1 byte (2, 3, or more bytes) makes the body_length sent to splunklib larger in BYTES than the body is in CHARACTERS. When splunklib then calls .read(), it ought to hit EOF and return, but stdin stays open, so it just hangs forever.
Sample string that triggers the bug: "Hi 接". It's 4 characters but 6 bytes, so Splunk sends a body_length of 6. Splunklib tries to read 6 CHARACTERS from stdin (a TextIOWrapper) and hangs, because only 4 characters ever arrive. The SPL pipeline hangs and holds onto its memory and disk space forever.
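For what it's worth, a fix that attacks that mismatch directly would honor the header's unit and read BYTES, then decode. A minimal sketch of that idea (my own, not splunklib's actual patch; it assumes ifile is a TextIOWrapper whose underlying binary stream is reachable as ifile.buffer, and that nothing has been read through the text layer first, since mixing text-layer reads with buffer reads would desynchronize the stream):

import io

def read_body(ifile, body_length):
    # body_length is in BYTES (the unit splunkd uses), so read bytes and decode.
    raw = ifile.buffer.read(body_length)
    return raw.decode('utf-8')

# Demo with "Hi 接": 4 characters, 6 bytes, so the header would say 6.
payload = u'Hi \u63a5'.encode('utf-8')
ifile = io.TextIOWrapper(io.BytesIO(payload), encoding='utf-8')
print(read_body(ifile, 6))  # 'Hi 接' -- returns instead of hanging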
Environment:
When I create the StreamingCommand "testcommand" with the following code:
testcommand hangs when I use the following SPL: sourcetype=XXX | search url = "http://例子.卷筒纸" | testcommand
The following error log appears in splunkd.log: