Custom search command support for multibyte characters in Python 3

amysutedja commented 4 years ago

Fixes https://github.com/splunk/splunk-sdk-python/issues/288

`SearchCommand` supports multibyte characters in Python 3

Previously, SearchCommand in Python 3 would read directly from the incoming ifile stream, typically sys.stdin. In Python 2, sys.stdin is a file-like byte stream, whereas in Python 3 it is an io.TextIOWrapper around an underlying byte buffer. Because the wrapper reads by character rather than by byte, multibyte characters would cause the command to read past the data's boundary. This could corrupt the data read (if early in the stream) or hang indefinitely (if at the end of the stream).
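To illustrate the over-read, here is a minimal Python 3 sketch. The payload and variable names are made up for the demonstration and are not taken from the SDK; it only shows how a byte length passed to a text stream's `read` pulls in too much data.

```python
import io

# Hypothetical payload: a 3-byte body followed by the start of the next chunk.
body = "Ａ".encode("utf-8")      # fullwidth A (U+FF21) is 3 bytes in UTF-8
stream_bytes = body + b"NEXT"
declared_length = len(body)      # length measured in bytes

# Text stream: read(n) counts characters, so it consumes part of the next chunk.
text_stream = io.TextIOWrapper(io.BytesIO(stream_bytes), encoding="utf-8")
over_read = text_stream.read(declared_length)

# Byte stream: read(n) counts bytes, so it stops exactly at the boundary.
byte_stream = io.BytesIO(stream_bytes)
exact = byte_stream.read(declared_length).decode("utf-8")

print(repr(over_read))  # 'ＡNE'
print(repr(exact))      # 'Ａ'
```

If the multibyte character sits at the end of the stream instead, the text-mode read asks for more characters than remain, which on a live pipe such as stdin would block waiting for input.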

We now retrieve the underlying buffer and read from it when in Python 3. The bytes read are then converted to strings for parsing.
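A rough sketch of that approach follows. The helper names `as_binary_stream` and `read_exact` are hypothetical and the UTF-8 decoding is an assumption; this is not the code from the PR, only one way the idea could look.

```python
import io

def as_binary_stream(ifile):
    """Return a byte-oriented stream for ifile (hypothetical helper)."""
    # Text wrappers such as Python 3's sys.stdin expose their raw bytes
    # through the .buffer attribute; streams without it are returned as-is.
    return getattr(ifile, "buffer", ifile)

def read_exact(ifile, size):
    """Read `size` bytes and decode them to str (UTF-8 is assumed here)."""
    raw = as_binary_stream(ifile).read(size)
    return raw.decode("utf-8")

# Usage: the same call works for a text wrapper and for a plain byte stream.
payload = "Ａ".encode("utf-8") + b"rest"
wrapped = io.TextIOWrapper(io.BytesIO(payload), encoding="utf-8")
from_wrapper = read_exact(wrapped, 3)
from_bytes = read_exact(io.BytesIO(payload), 3)
print(repr(from_wrapper), repr(from_bytes))  # 'Ａ' 'Ａ'
```

Reading from the buffer keeps the length arithmetic in bytes throughout, and decoding happens only after the boundary has been respected.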

Tests ensure underlying byte stream

Previously, the tests defined a metadata stream with the Ａ character in it (not to be confused with A). In Python 2, this character caused its containing string to become unicode, which in turn caused StringIO to take on that encoding. As a result, the size of the metadata stream was always incorrectly measured in Unicode characters rather than bytes, and under test the read logic would always be handed a Unicode character stream rather than a byte stream.
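The gap between the two measurements is easy to see in Python 3. The metadata below is an invented example rather than the fixture used in the tests; the point is only that a test stream should be built from encoded bytes and sized with the byte count.

```python
import io
import json

# Hypothetical metadata containing the fullwidth Ａ (U+FF21).
metadata = json.dumps({"field": "Ａ"}, ensure_ascii=False)
encoded = metadata.encode("utf-8")

char_count = len(metadata)   # number of Unicode characters
byte_count = len(encoded)    # number of bytes on the wire

# Ａ is one character but three bytes, so the counts differ by two.
print(byte_count - char_count)  # 2

# Build the test stream from bytes so the read logic sees what stdin delivers.
test_stream = io.BytesIO(encoded)
round_trip = test_stream.read(byte_count).decode("utf-8")
```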

This has been fixed.

Multibyte test fixture

We now have a new test, `test_multibyte_chunked`, which contains a multibyte character test fixture.

ichaer-splunk commented 4 years ago

Hah, I've been working since yesterday on a change that is nearly identical to this. Nice fix, if I say so myself =D

I'll retire my fork, this is really good.

tristan-splunk commented 4 years ago

FYI we are waiting on this change to upgrade the pythonsdk in all our apps ahead of conf. We'd rather not have to do a monkey patch. Please advise when you can merge and release an official build.
