logstash-plugins / logstash-filter-grok

Grok plugin to parse unstructured (log) data into something structured.
https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html
Apache License 2.0

Timeout executing grok '%{PATTERN}' against field 'message' with value 'Value too large to output (37951 bytes)...' #99

Closed: magicdude4eva closed this issue 7 years ago

magicdude4eva commented 7 years ago

Receiving the following warning:

[2016-11-08T12:04:55,186][WARN ][logstash.filters.grok    ] Timeout executing grok '%{TRADE_ITEM_LOG}' against field 'message' with value 'Value too large to output (37951 bytes)! First 255 chars are: L/L 08/11/16 10:47:30 tcbobe-17 Solr QueryResponse : {responseHeader={status=0,QTime=6,params={facet.field=[CategoryId, Type, MajorLocationId, ConditionId, VerifiedUser, UserStore, PaymentOption],fl=*,score,start=0,facet.missing=true,fq=[Status:1, Category'!

Grok-pattern causing the timeout (pattern file attached=patternfile.txt)

        grok {
            patterns_dir => [ "/etc/logstash/patterns.d" ]
            match => { message => [
                "%{TRADE_ITEM_LOG}",
                "%{SOLR_QUERY_STATS}",
                "%{SOLR_SLOW_QUERY}",
                "%{TRADE_ITEM_TRACE}",
                "%{BOBE_TOMCATLOG}",
                "%{BOBE_CATALINALOG}",
                "%{BOBE_LOG_WITHIP}",
                "%{BOBE_LOG_WITHOUTIP}"
            ] }
        }

Log message causing the issue: logmessage.txt
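
As an aside on the multi-pattern match above: with break_on_match left at its default of true, grok stops at the first pattern that matches, so putting the most frequently matching patterns first keeps the slower patterns from running against every event. A minimal sketch (the ordering shown is hypothetical, not a recommendation for this particular log mix):

        grok {
            patterns_dir   => [ "/etc/logstash/patterns.d" ]
            break_on_match => true   # default; grok stops after the first matching pattern
            match => { "message" => [
                "%{SOLR_QUERY_STATS}",   # hypothetically the most frequent message type
                "%{TRADE_ITEM_LOG}"      # rarer/slower patterns later in the list
            ] }
        }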

magicdude4eva commented 7 years ago

I have a simplified example for the above:

Configuration in Logstash 5.0

              grok {
                patterns_dir => [ "/etc/logstash/patterns.d" ]
                add_tag => [ "drop_event" ]
                tag_on_failure => [ ]
                tag_on_timeout => [ "drop_event" ]
                match => { message => [
                  "%{NOTSPACE}/%{NOTSPACE} %{DATESTAMP:timestamp} %{DATA:threadname} oracle.jdbc.driver.OraclePreparedStatementWrapper\@",
                  "%{NOTSPACE}/%{NOTSPACE} %{DATESTAMP:timestamp} %{DATA:threadname} (Solr QueryResponse :|BobeSolrQuery :) "
                ] }
              }

I am puzzled about the timeout for the message logmessage-simple.txt

[2016-11-08T14:44:39,331][WARN ][logstash.filters.grok    ] Timeout executing grok '%{NOTSPACE}/%{NOTSPACE} %{DATESTAMP:timestamp} %{DATA:threadname} oracle.jdbc.driver.OraclePreparedStatementWrapper\@' against field 'message' with value 'Value too large to output (12099 bytes)! First 255 chars are: L/L 08/11/16 14:44:01 tcbobe-45 Solr QueryResponse : {responseHeader={status=0,QTime=7,params={facet.field=[CategoryId, Type, MajorLocationId, ConditionId, VerifiedUser, UserStore, PaymentOption],fl=*,score,start=0,facet.missing=true,fq=[Status:1, Category'!

as the regex is not even time-consuming and the message itself is only 12K.
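
For completeness, the drop_event tag set by tag_on_timeout above only marks the event; to actually discard such events, a conditional later in the filter block is needed. A minimal sketch (the conditional is not part of the original configuration):

              if "drop_event" in [tags] {
                # Discard events that grok tagged via tag_on_timeout above.
                drop { }
              }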

andrewvc commented 7 years ago

@magicdude4eva can you post the full line that's being timed out? I'll need to be able to repro it to see what's going on.

magicdude4eva commented 7 years ago

It is in the attached logmessage-simple.txt. Messages similar to my example occur frequently, but I have not noticed the timeout message with every log message.

Perhaps there is some threading issue with Logstash/Grok which could cause this. I don't know how Logstash/Grok works internally, but perhaps there is some work queue/mutex which times out a filtered message even though the message itself is not yet being processed by grok (i.e. it is still sitting in the work queue for grok or Logstash).

andrewvc commented 7 years ago

@magicdude4eva didn't time out for me. There's no work queue, but this could be caused by a busy server where the OS is keeping grokking threads idle for too long. Is the machine this happens on sometimes slammed with the CPUs saturated for extended periods?

I'm thinking that, due to the imprecision of time-based limits, we should have a much higher default than 2 seconds, maybe as much as 60s.

andrewvc commented 7 years ago

@magicdude4eva have you tried upgrading the grok filter? You may have hit https://github.com/logstash-plugins/logstash-filter-grok/pull/98

magicdude4eva commented 7 years ago

@andrewvc I installed Logstash yesterday and did an update of all plugins. Grok is on 3.2.4:

logstash-filter-grok-3.2.4

The server in question is our pre-production server, so load will be minimal. Some of our messages (such as the one causing the timeout) are quite verbose and can be up to 50K in size. Since yesterday I have added the Ranges plugin to drop any message over 15K.

With that, the number of errors has dropped (only 4 in the last 8 hours). If I remove the Ranges plugin I get errors more frequently.
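
For reference, a size guard of that kind can be expressed with the range filter; a minimal sketch, assuming the filter's drop action and a roughly 15K character threshold (the exact configuration used here was not posted):

        range {
          # Drop any event whose message is longer than ~15K characters before grok sees it.
          ranges => [ "message", 15361, 1e1000, "drop" ]
        }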

I cannot take Logstash 5 into production until the LS_NICE issue (https://github.com/elastic/logstash/issues/6216) is fixed, so right now I do not know whether more load will increase the error rate.

andrewvc commented 7 years ago

@magicdude4eva on your pre-production server, is the CPU often slammed? I'm thinking we should maybe raise the default if this is a timing issue. I don't see a smoking gun here for this being a logic issue. Have you seen any other grok-related errors in your Logstash logs?

For now I would raise the timeout to something higher than 2000, maybe 60000. The main point of the timeout is to prevent ReDoS (regular expression denial of service).
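
A minimal sketch of that change, using the grok filter's timeout_millis option on the configuration from above:

        grok {
          patterns_dir   => [ "/etc/logstash/patterns.d" ]
          timeout_millis => 60000   # raised from the 2000 ms default discussed in this thread
          match => { "message" => [ "%{TRADE_ITEM_LOG}" ] }
        }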

magicdude4eva commented 7 years ago

I do not believe it is a resource constraint or a peak in CPU utilisation. To me it looks like some sort of race-condition.

When you look at the graphs below, I cannot see anything sticking out which could explain the timeout.

We had 5 errors yesterday. When errors occur, they seem to happen during the same time-frame - i.e. in this case between 9:17-9:19 and 11:43-11:44:

[2016-11-09T09:17:49,021][WARN ][logstash.filters.grok    ] Timeout executing grok '%{NOTSPACE}/%{NOTSPACE} %{DATESTAMP:timestamp} %{DATA:threadname} oracle.jdbc.driver.OraclePreparedStatementWrapper\@'
[2016-11-09T09:19:09,238][WARN ][logstash.filters.grok    ] Timeout executing grok '%{NOTSPACE}/%{NOTSPACE} %{DATESTAMP:timestamp} %{DATA:threadname} oracle.jdbc.driver.OraclePreparedStatementWrapper\@'
[2016-11-09T11:43:09,034][WARN ][logstash.filters.grok    ] Timeout executing grok '%{NOTSPACE}/%{NOTSPACE} %{DATESTAMP:timestamp} %{DATA:threadname} oracle.jdbc.driver.OraclePreparedStatementWrapper\@'
[2016-11-09T11:43:59,576][WARN ][logstash.filters.grok    ] Timeout executing grok '%{NOTSPACE}/%{NOTSPACE} %{DATESTAMP:timestamp} %{DATA:threadname} oracle.jdbc.driver.OraclePreparedStatementWrapper\@'
[2016-11-09T11:44:11,854][WARN ][logstash.filters.grok    ] Timeout executing grok '%{NOTSPACE}/%{NOTSPACE} %{DATESTAMP:timestamp} %{DATA:threadname} oracle.jdbc.driver.OraclePreparedStatementWrapper\@'

Looking at Graylog, we do not have any increase of messages during that time: (screenshot attached)

There is no abnormal CPU utilisation during that time: (screenshot attached)

Nothing wrong with memory or CPU on the process during the time either: (screenshot attached)

andrewvc commented 7 years ago

@magicdude4eva can you reproduce it? Have you seen it multiple times?

I agree that it might be a race, but it's hard to spot a race without a smoking gun.

andrewvc commented 7 years ago

My other thought is that there could be GC pauses or the like holding up execution.

andrewvc commented 7 years ago

@magicdude4eva have you tried increasing the timeout, by the way? What is it currently set to? I really do think that our default of 2s was poorly chosen. Threads can be paused for all sorts of reasons.

I'll review the code again today and see if I can spot a race.

thekofimensah commented 7 years ago

Hey guys, I'm having a similar issue, so I thought I'd post my error log as well. I have the same specs as @magicdude4eva and am running Logstash as a service. This is my error:

[2016-11-10T11:35:23,304][ERROR][logstash.filters.grok ] Error while attempting to check/cancel excessively long grok patterns {:message=>"Mutex relocking by same thread", :class=>"ThreadError", :backtrace=>["org/jruby/ext/thread/Mutex.java:90:in `lock'", "org/jruby/ext/thread/Mutex.java:147:in `synchronize'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-grok-3.2.3/lib/logstash/filters/grok/timeout_enforcer.rb:38:in `stop_thread_groking'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-grok-3.2.3/lib/logstash/filters/grok/timeout_enforcer.rb:53:in `cancel_timed_out!'", "org/jruby/RubyHash.java:1342:in `each'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-grok-3.2.3/lib/logstash/filters/grok/timeout_enforcer.rb:45:in `cancel_timed_out!'", "org/jruby/ext/thread/Mutex.java:149:in `synchronize'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-grok-3.2.3/lib/logstash/filters/grok/timeout_enforcer.rb:44:in `cancel_timed_out!'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-grok-3.2.3/lib/logstash/filters/grok/timeout_enforcer.rb:63:in `start!'"]}

[2016-11-10T11:35:23,307][WARN ][logstash.filters.grok ] Timeout executing grok '%{EC}%{CODE:ErrorCode}%{SPACE}%{SEV}%{WORD:Severity}%{SPACE}%{CAT}%{GREEDYDATA:Category}%{SPACE}%{TIM}%{TIMESTAMP_ISO8601:Time}%{SPACE}%{MES}%{SPACE}%{MESSAGE:Summary}%{SPACE}%{EXC}%{MULTILINE:exceptionmess}%{SPACE}%{MACH}%{SERVER:Server}%{SPACE}%{APD}%{NOTSPACE:App Domain}%{SPACE}%{PID}%{INT:PID}%{SPACE}%{PRN}%{NOTSPACE:Process Name}%{SPACE}%{ATL}%{APPTYPE:App Type for Log}%{SPACE}%{WTI}%{NUMBER:ThreadID}%{SPACE}%{THN}%{SPACE}%{EXP}' against field 'message' with value 'Value too large to output (1669 bytes)! First 255 chars are: (I left out the first 255 chars)

From the second error you can see my long grok pattern: %{EC}%{CODE:ErrorCode}%{SPACE}%{SEV}%{WORD:Severity}%{SPACE}%{CAT}%{GREEDYDATA:Category}%{SPACE}%{TIM}%{TIMESTAMP_ISO8601:Time}%{SPACE}%{MES}%{SPACE}%{MESSAGE:Summary}%{SPACE}%{EXC}%{MULTILINE:exceptionmess}%{SPACE}%{MACH}%{SERVER:Server}%{SPACE}%{APD}%{NOTSPACE:App Domain}%{SPACE}%{PID}%{INT:PID}%{SPACE}%{PRN}%{NOTSPACE:Process Name}%{SPACE}%{ATL}%{APPTYPE:App Type for Log}%{SPACE}%{WTI}%{NUMBER:ThreadID}%{SPACE}%{THN}%{SPACE}%{EXP}

The logs that I am inputting, though, are not nearly as big as magicdude's, only about 1669 characters, and I am still getting that error. When I raise my timeout to 60000 as you mentioned, however, the error goes away.
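
As a general grok-performance aside (not a diagnosis of this particular pattern): anchoring an expression lets the regex engine give up on non-matching lines quickly instead of re-attempting the match at every offset, which can help patterns that lead with loose matchers like %{GREEDYDATA}. A sketch using an abbreviated form of the pattern above:

    grok {
      # Anchoring with ^ (and $ where the end of line is predictable) prevents the
      # engine from retrying the match at every position in the string.
      match => { "message" => [ "^%{EC}%{CODE:ErrorCode}%{SPACE}%{SEV}%{WORD:Severity}%{SPACE}%{CAT}%{GREEDYDATA:Category}" ] }
    }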

magicdude4eva commented 7 years ago

@andrewvc would increasing the timeout on the thread reduce throughput or have any other negative impact? I am just concerned that this issue occurs on our staging environment and once we push into production it will just fall apart.

I am not able to reproduce the error on demand, but I do see it happening frequently. It is completely unpredictable; as said above, there seems to be no load issue, GC pause, or anything else that looks like the root cause.

andrewvc commented 7 years ago

@magicdude4eva that's up to you in terms of whether it will decrease throughput. It doesn't make regexps slower, it just cuts them off sooner. It will only decrease throughput when you have very slow regexps.

andrewvc commented 7 years ago

@seanziee I think you have a different bug, the one from #98, please upgrade the grok plugin and see if that fixes it.

magicdude4eva commented 7 years ago

@andrewvc but then I think we can agree that the sample data provided and the regex running over it should never exceed a few milliseconds of execution time. Running the regex via Java regex against a log file takes a few ms per regex, so I am a bit puzzled that it would take more than 2 seconds (unless within Logstash/grok there is some work-queue mechanism which incorrectly times out grok patterns before they actually start processing).

Is there perhaps a way to add more logging? I just don't understand how a thread could take that long as the regex is not complex and the data is very small.

jordansissel commented 7 years ago

I think @andrewvc mentioned this earlier, but it's possible that Logstash isn't getting enough CPU resources to spend that few-ms on grok. This could cause 2 seconds to pass by without grok having completed its work for a single event.

There could be a bug, still, but we haven't yet ruled out a capacity problem on your system, have we?

magicdude4eva commented 7 years ago

No capacity problem - today one event at 10:48:28:

[2016-11-11T10:48:28,174][WARN ][logstash.filters.grok    ] Timeout executing grok

Graylog was not busy: (screenshot attached)

No resource issues during the time: (screenshot attached)

No issue with the Logstash process: (screenshot attached)

andrewvc commented 7 years ago

@magicdude4eva this is something I'd like to fix, but I won't be able to get to it till next week. In the interim could you package this stuff up as a reproducible test case?

magicdude4eva commented 7 years ago

I have just upgraded to logstash-filter-grok 3.3.0, and I think the only way to get something reproducible is to package up the Logstash config and one of the log files and then see if it can be reproduced.

I am not sure if it would make sense to enable debug logging?
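
A self-contained repro could take roughly this shape, with the generator input replaying one of the offending lines (the message value below is a placeholder, not real data):

    input {
      generator {
        # Placeholder: paste one complete line from logmessage-simple.txt here.
        message => "L/L 08/11/16 14:44:01 tcbobe-45 Solr QueryResponse : ..."
        count   => 100000
      }
    }
    filter {
      grok {
        patterns_dir => [ "/etc/logstash/patterns.d" ]
        match => { "message" => [ "%{NOTSPACE}/%{NOTSPACE} %{DATESTAMP:timestamp} %{DATA:threadname} oracle.jdbc.driver.OraclePreparedStatementWrapper\@" ] }
      }
    }
    output {
      stdout { codec => dots }
    }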

adyrcz commented 7 years ago

I was able to get the same error. I posted my question on Stack Overflow: http://stackoverflow.com/questions/40618564/logstash-value-too-large-to-output. I'm running 5.0.0-1.

magicdude4eva commented 7 years ago

@adyrcz your issue seems to be related to the mutex issue in https://github.com/logstash-plugins/logstash-filter-grok/pull/98

andrewvc commented 7 years ago

@magicdude4eva if you could pack stuff up that way that would be amazing.

magicdude4eva commented 7 years ago

@andrewvc I have upgraded to 5.0.1 and have not noticed the timeouts yet. Last night I started upgrading Logstash on our low-volume production servers, and so far there have been no issues.

This morning I will upgrade one of our high-volume nodes to 5.0.1 and keep you updated.

magicdude4eva commented 7 years ago

I think this issue can be closed. It seems that something in the upgrade to 5.0.1 fixed the problem. Over the last 24 hours we processed 15m messages across 6 servers and none of them threw the error.

andrewvc commented 7 years ago

Thanks for the feedback @magicdude4eva

dilchenko commented 6 years ago

Still observing this problem.

logstash-6.2.2-1.noarch logstash-filter-grok (4.0.2)

Breaks on something relatively simple:

# grep 'Timeout executing grok' /var/log/logstash/logstash-plain-2018-04-*.log  | grep 'r\.suppressed'
/var/log/logstash/logstash-plain-2018-04-18.log:[2018-04-18T01:07:54,937][WARN ][logstash.filters.grok    ] Timeout executing grok '^%{ES_TRANSPORTPUTMAPPINGACTION}' against field 'message' with value 'Value too large to output (17701 bytes)! First 255 chars are: [2018-04-18T01:02:57,064][WARN ][r.suppressed             ] path: /, params: {}

The telemetry data leaves some doubt about whether really short-lived CPU spikes could be the cause. I am also curious what happens to the log message in this case: is it just not parsed, or dropped altogether?

I will try disabling the timeout.
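
Regarding the questions above, assuming the plugin's documented defaults still hold in 4.0.2: a timed-out event is not dropped by grok itself; the match is abandoned, no fields are added, and the event is tagged (the default tag_on_timeout is _groktimeout) before continuing down the pipeline. Setting timeout_millis to 0 disables the enforcement entirely. A hedged sketch of both, reusing the pattern from the log line above:

    grok {
      match => { "message" => [ "^%{ES_TRANSPORTPUTMAPPINGACTION}" ] }
      timeout_millis => 0   # 0 disables the timeout enforcer entirely
      # tag_on_timeout defaults to ["_groktimeout"] when the timeout is left enabled
    }
    # With the timeout enabled instead, timed-out events can be routed or dropped explicitly:
    if "_groktimeout" in [tags] {
      drop { }
    }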