Closed colinsurprenant closed 6 years ago
My current analysis indicates that the method I used to add support for field_split_pattern and value_split_pattern is at fault. Previously we used character classes to match separators and negated character classes to capture characters in a key or value, which is more performant than my implementation. Mine uses a single pattern for each that is either used directly or in a negative lookahead followed by a single-character match.
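To make the two capture styles concrete, here is a minimal sketch in Python (not the plugin's Ruby code). The names `SEP`, `char_class` and `lookahead` are made up for illustration, and a single tab separator is assumed, as in the config from this issue. It only shows that both styles capture the same text; it does not measure the performance difference.

```python
import re

SEP = "\t"  # assumption: one single-character field separator, like field_split => "\t"

# Style 1: a negated character class captures a run of non-separator characters.
char_class = re.compile("[^%s]+" % re.escape(SEP))

# Style 2: the separator pattern is reused inside a negative lookahead,
# followed by a single-character match, and the pair is repeated.
# DOTALL makes "." accept newlines so both styles cover the same characters.
lookahead = re.compile("(?:(?!%s).)+" % re.escape(SEP), re.DOTALL)

line = "src=1.2.3.4\tproto=tcp\tdisp=Allow"

fields_class = char_class.findall(line)
fields_lookahead = lookahead.findall(line)
```

The lookahead form accepts any separator pattern, including multi-character ones, which is what the `*_pattern` options need. The trade-off is that the regex engine evaluates the lookahead once per captured character.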
I have a local branch that builds separate positive and negative patterns for the field_split and value_split cases. It boosts performance on this particular input by ~1.7x, back into the ballpark of the pre-regression code, but drops support for field_split_pattern and value_split_pattern.
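As a rough illustration of building separate positive and negative patterns from a list of separator characters, here is a Python sketch. It is not the code in the local branch; `char_classes`, `make_scanner` and `parse_kv` are hypothetical helpers, and the separator lists are assumed to be non-empty.

```python
import re

def char_classes(chars):
    """Return (positive, negative) character-class sources for the given characters."""
    escaped = "".join(re.escape(c) for c in chars)
    return "[%s]" % escaped, "[^%s]" % escaped

def make_scanner(field_split, value_split):
    """Compile a key/value scanner from plain separator character lists."""
    _, not_field = char_classes(field_split)
    is_value, _ = char_classes(value_split)
    # A key may contain neither kind of separator.
    _, not_any = char_classes(field_split + value_split)
    # key run, one value separator, then a value that runs to the next field separator
    return re.compile("(%s+)%s(%s*)" % (not_any, is_value, not_field))

def parse_kv(text, field_split=" ", value_split="="):
    """Parse text into a dict; a later duplicate key overwrites an earlier one."""
    return dict(make_scanner(field_split, value_split).findall(text))

event = parse_kv("dstPort=123\tapp=HTTP Protocol over TLS SSL\tdisp=Allow",
                 field_split="\t")
```

Because the classes are built from individual characters, this approach cannot express a multi-character or regex separator. That is the limitation that costs the `*_pattern` options in this scheme.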
I'm currently working on finding an integrated approach that will continue to allow us to cleanly support the new field_split_pattern and value_split_pattern without this performance regression on simple field_split and value_split use-cases.
There is a performance regression starting at version 4.1.2.
Using the config and sample data below I am getting these throughput numbers (run on a 2.9 GHz Core i7 MBP). The tests were run on LS 6.3.2.
How to reproduce:
sample.conf
filter { kv { field_split => "\t" include_brackets => false source => "message" } }
output { stdout { codec => dots }}
sample.txt
{"message":"host_name=Member1\tsys_name=AA-BB-CCCC\tdevTimeFormat=MMM dd yyyy HH:mm:ss Z\tdevTime=Aug 16 2018 09:10:20 +0300\tpolicy=Default_https-proxy-00\tdisp=Allow\tin_if=000-AAA_Internal\tout_if=111-BBB_External\tgeo_dst=USA\tip_len=314\tip_TTL=64\tproto=tcp\tsrc=1.2.3.4\tsrcPort=12345\tsrcPostNAT=1.2.3.4\tdst=1.2.3.4\tdstPort=123\ttcp_offset=5\ttcp_flag=A\ttcp_seq=123456789\ttcp_window=1281\tapp=HTTP Protocol over TLS SSL\tapp_cat=Network protocols\tapp_behavior=Access\tmsg=Application identified"}
$ yes `cat sample.txt` | bin/logstash -f sample.conf | pv -bart > /dev/null