logstash-plugins / logstash-filter-kv

Apache License 2.0

performance regression since 4.1.2 #70

Closed colinsurprenant closed 6 years ago

colinsurprenant commented 6 years ago

There is a performance regression starting at version 4.1.2.

Using the config and sample data below, I am getting these throughput numbers (run on a 2.9 GHz Core i7 MBP). The tests were run on LS 6.3.2.

| Version | EPS |
|---------|-----|
| 4.0.3   | 41k |
| 4.1.2   | 27k |
| 4.2.0   | 27k |

How to reproduce:

- `sample.conf`

filter { kv { field_split => "\t" include_brackets => false source => "message" } }

output { stdout { codec => dots }}


- `sample.txt`

{"message":"host_name=Member1\tsys_name=AA-BB-CCCC\tdevTimeFormat=MMM dd yyyy HH:mm:ss Z\tdevTime=Aug 16 2018 09:10:20 +0300\tpolicy=Default_https-proxy-00\tdisp=Allow\tin_if=000-AAA_Internal\tout_if=111-BBB_External\tgeo_dst=USA\tip_len=314\tip_TTL=64\tproto=tcp\tsrc=1.2.3.4\tsrcPort=12345\tsrcPostNAT=1.2.3.4\tdst=1.2.3.4\tdstPort=123\ttcp_offset=5\ttcp_flag=A\ttcp_seq=123456789\ttcp_window=1281\tapp=HTTP Protocol over TLS SSL\tapp_cat=Network protocols\tapp_behavior=Access\tmsg=Application identified"}


- Command

$ yes "$(cat sample.txt)" | bin/logstash -f sample.conf


I used the tool in https://github.com/elastic/logstash-benchmark-tools/tree/master/pq_blog to measure EPS.

Alternatively, the `pv` command can be used to measure EPS from the dots codec output (one dot per event, so the byte rate equals the event rate) with:

$ yes "$(cat sample.txt)" | bin/logstash -f sample.conf | pv -bart > /dev/null
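If `pv` is not available, the same measurement can be approximated with a few lines of Python. This is only a sketch: it assumes the dots codec writes exactly one `.` per event, and the names `eps_from_dots` and `main` are made up for this example.

```python
import sys
import time


def eps_from_dots(chunks, clock=time.monotonic):
    """Yield (total_events, average_eps) after each chunk of dots output.

    Assumes every event is rendered as a single "." byte.
    """
    start = clock()
    total = 0
    for chunk in chunks:
        total += chunk.count(b".")
        elapsed = clock() - start
        yield total, (total / elapsed if elapsed > 0 else 0.0)


def main():
    # Read whatever is available from stdin, up to 64 KiB at a time, until EOF.
    chunks = iter(lambda: sys.stdin.buffer.read1(65536), b"")
    for total, eps in eps_from_dots(chunks):
        sys.stderr.write(f"\r{total} events, {eps:,.0f} EPS (average)")
    sys.stderr.write("\n")
```

To use it, save the block to a file, add a call to `main()` at the bottom, and pipe the Logstash output into it in place of `pv -bart > /dev/null`.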

yaauie commented 6 years ago

My current analysis indicates that the approach I used to add support for field_split_pattern and value_split_pattern is at fault. Previously, we used character classes to match separators and negated character classes to capture the characters of a key or value. My implementation instead builds a single pattern for each separator and uses it either directly (to match the separator) or inside a negative lookahead followed by a single-character match (to capture key or value characters), which is less performant.

I have a local branch that builds separate positive and negative patterns for the field_split and value_split cases. It boosts performance on this particular input by ~1.7x, back into the ballpark of the pre-regression code, but drops support for field_split_pattern and value_split_pattern.

I'm currently working on finding an integrated approach that will continue to allow us to cleanly support the new field_split_pattern and value_split_pattern without this performance regression on simple field_split and value_split use-cases.