logstash-plugins / logstash-filter-grok

Grok plugin to parse unstructured (log) data into something structured.
https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html
Apache License 2.0

Why is this filter "slow"? A brief analysis #76

Open jsvd opened 8 years ago

jsvd commented 8 years ago

Total time to process the same Apache log line 1_000_000 times with %{COMMONAPACHELOG}: 93.88s (~10k events/s).

The six biggest time offenders account for 70.46s, roughly 75% of total time:

| Action | Time (s) | % of total | Calls per event |
|---|---|---|---|
| Grok#handle (self) | 21.16 | 22% | 10 |
| Regexp#match | 19.48 | 20% | 1 |
| Event#[]= | 15.58 | 16% | 9 |
| Event#[] | 8.51 | 9% | 10 |
| Logger.debug? | 5.73 | 6% | 3 |
| Proc#call | 4.54 | 4% | 11 |

Note: this is using:


Ruby script used to test:

# encoding: utf-8
require 'jruby/profiler'

module LogStash
end

module LogStash::Environment
  # running the grok code outside a logstash package means
  # LOGSTASH_HOME will not be defined, so let's set it here
  # before requiring the grok filter
  unless self.const_defined?(:LOGSTASH_HOME)
    LOGSTASH_HOME = File.expand_path("../../../", __FILE__)
  end

  # also :pattern_path method must exist so we define it too
  unless self.method_defined?(:pattern_path)
    def pattern_path(path)
      ::File.join(LOGSTASH_HOME, "patterns", path)
    end
  end
end

require 'logstash/event'
require 'logstash/environment'
require 'logstash/filters/grok'
require 'logstash/codecs/base'

filter = LogStash::Filters::Grok.new(
  "match" => { "message" => "%{COMMONAPACHELOG}" },
)
filter.register

print "generating data.."
data = 1_000_000.times.map do
  message = '198.46.149.143 - - [04/Jun/2015:02:29:31 +0000] "GET /blog/geekery/solving-good-or-bad-problems.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+semicomplete%2Fmain+%28semicomplete.com+-+Jordan+Sissel%29 HTTP/1.1" 200 10756 "-" "Tiny Tiny RSS/1.11 (http://tt-rss.org/)"'
  LogStash::Event.new("message" => message)
end
puts "done. benchmarking..."

profile_data = JRuby::Profiler.profile do
  data.each do |event|
    filter.filter(event)
  end
end
profile_printer = JRuby::Profiler::GraphProfilePrinter.new(profile_data)
profile_printer.printProfile(STDOUT)

The full profile.graph output can be found here
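For context on the `Regexp#match` numbers, a standalone micro-benchmark of the matching step alone can be useful. This is a sketch, not the harness above: the regex below is a hand-written, simplified stand-in for what `%{COMMONAPACHELOG}` compiles to, since the real grok-generated pattern is considerably larger.

```ruby
require 'benchmark'

# Hypothetical simplified stand-in for the regex that %{COMMONAPACHELOG}
# compiles to; the real grok pattern is much larger.
APACHE_RE = /\A(?<clientip>\S+) \S+ \S+ \[(?<timestamp>[^\]]+)\] "(?<verb>\S+) (?<request>\S+)(?: HTTP\/(?<httpversion>[\d.]+))?" (?<response>\d+) (?<bytes>\d+|-)/

line = '198.46.149.143 - - [04/Jun/2015:02:29:31 +0000] "GET /blog/geekery/solving-good-or-bad-problems.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+semicomplete%2Fmain+%28semicomplete.com+-+Jordan+Sissel%29 HTTP/1.1" 200 10756 "-" "Tiny Tiny RSS/1.11 (http://tt-rss.org/)"'

n = 100_000
secs = Benchmark.realtime { n.times { APACHE_RE.match(line) } }
puts format("%.0f matches/s for the regex alone", n / secs)
```

If the regex alone runs well above 10k matches/s, that is consistent with `Regexp#match` accounting for only ~20% of the total above: most of the time is spent around the match, not in it.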

jsvd commented 8 years ago

Of note: JRuby 9.0.5.0 performs similarly, if not slightly faster (after working around https://github.com/jruby/jruby/issues/3715):

main profile results:
Total time: 88.77

     total        self    children       calls  method
----------------------------------------------------------------
     88.77        0.46       88.31     4000001  Array#each
     88.31        1.11       87.20     1000000  LogStash::Filters::Grok#filter
     80.29        1.50       78.79     2000000  Hash#each
     78.79        1.41       77.39     1000000  LogStash::Filters::Grok#match
     75.69        1.08       74.62     1000000  LogStash::Filters::Grok#match_against_groks
     72.80        0.81       71.99     1000000  Grok#match_and_capture
     39.77        7.88       31.89    11000000  Proc#call
     30.60       30.60        0.00     1000000  Regexp#match
     25.34       11.93       13.41    10000000  LogStash::Filters::Grok#handle
      6.84        6.84        0.00     9000000  LogStash::Event#[]=
      5.17        5.17        0.00    10000000  LogStash::Event#[]
      4.32        3.81        0.51     3000000  Cabin::Mixins::Logger.debug?
      4.28        2.49        1.79     1000000  LogStash::Filters::Base#filter_matched
      1.10        1.10        0.00    10000000  MatchData#[]
      0.74        0.74        0.00    10000000  Kernel.kind_of?
...
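The `Proc#call` row (11 calls per event, ~40s total here) reflects that each capture is dispatched through a stored handler proc. A rough, hypothetical illustration of per-dispatch overhead, comparing a stored proc against an inline block (the handler and counts are made up for illustration):

```ruby
require 'benchmark'

def via_yield
  yield 42
end

def via_proc(handler)
  handler.call(42)
end

# Hypothetical stand-in for a grok capture handler.
handler = proc { |v| v + 1 }
n = 1_000_000

t_yield = Benchmark.realtime { n.times { via_yield { |v| v + 1 } } }
t_proc  = Benchmark.realtime { n.times { via_proc(handler) } }

puts format("yield: %.3fs  proc.call: %.3fs for %d calls", t_yield, t_proc, n)
```

Absolute numbers vary a lot by runtime and version; the point is only that eleven extra dispatches per event multiply into real time at 1M events.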
jsvd commented 8 years ago

On a side note, big props to @colinsurprenant for logstash-core-event-java. In my tests, using the old Ruby version of logstash-core-event made the []/[]= methods take a much bigger slice of the total, leaving Regexp#match with only ~10-12% of the time.
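To get a feel for why 9 `[]=` and 10 `[]` calls per event matter at this scale, here is a minimal sketch using a plain Ruby Hash as a hypothetical stand-in for the event's field store (the real `LogStash::Event` does more work per access, e.g. field-reference parsing, so this is a lower bound):

```ruby
require 'benchmark'

# Hypothetical field names mirroring COMMONAPACHELOG captures.
FIELDS = %w[clientip ident auth timestamp verb request httpversion response bytes]
n = 100_000  # scaled down from the 1_000_000 events above

event = {}
secs = Benchmark.realtime do
  n.times do
    FIELDS.each { |f| event[f] = "captured" }  # ~9 writes per event
    FIELDS.each { |f| event[f] }               # ~9 reads per event
  end
end
puts format("%.3fs for %d field accesses on a bare Hash", secs, n * FIELDS.size * 2)
```

Even a bare Hash costs measurable time at this call volume; a heavier event implementation multiplies it, which fits the observation above that the Java event shrank the []/[]= slice of the profile.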