graylog-labs / graylog-plugin-riemann

Graylog2 output plugin to send events to a Riemann instance
Apache License 2.0

Plugin Slowing Graylog Message Processing #3

Open peteGS opened 9 years ago

peteGS commented 9 years ago

I started using the Riemann output plugin to overcome some limitations with Graylog's alerting mechanism and while it works spectacularly for what I need, it also appears to significantly impact the message processing on all Graylog server nodes.

Each of my nodes drops from a maximum of ~2500 messages/second down to at best about 300 per second.

I've tried both TCP and UDP protocols in case there were TCP connections slowing it down, but it doesn't seem to matter at all.

I am still successfully using the plugin in my test environment, but the messages per second there are only a tiny proportion of what we see in Production (around 100 vs. a sustained ~10,000). I first noticed the issue when an individual host was generating far more messages than normal, peaking at almost 40,000 messages per second, and I noticed the journals growing.

I had the plugin associated with a "catch all" stream so I could send all output to Riemann while I determine what needs to be alerted on. This stream has a single rule that checks for the presence of the "message" field. I use this stream to assign read privileges to other staff on all messages.

I have 10 Graylog servers on CentOS 6.6 running Graylog 1.1.3. The Java in use is OpenJDK 1.8.0_45 (64 bit). Each Graylog server is running as a VM on top of ESXi 5.5 with 8 vCPU's and 16GB RAM. Java heap size is set to 8GB. I'm using the Riemann plugin version 1.0.2.

When the plugin is enabled I see no load increase or errors logged anywhere; the message processing simply slows dramatically. Increasing the number of processors makes no difference.

There's quite a varied number of messages sent to Graylog including Windows Event logs, IIS and FTP logs, ESXi and other application logs via Syslog, vCenter logs, and so forth.

I have attached some screen dumps of a couple of sample messages, apologies if that's not a good format as I couldn't think of any other way to get them easily.

Let me know any other information you require.

Cheers, Pete

[Attached screenshots: gl message 1, gl message 2 - part 1, gl message 2 - part 2, gl message 2 - part 3]

mariussturm commented 9 years ago

Hi Pete, could you please try 1.1.1 of the plugin: https://github.com/Graylog2/graylog2-plugin-output-riemann/releases You can now set a batch size for TCP transfers; this should speed up event writes.

peteGS commented 9 years ago

Thanks Marius, I've downloaded the new plugin and will get it installed and the output configured again later today.

I'll let you know how it looks.

Cheers, Pete


peteGS commented 9 years ago

Hi Marius,

I got the new plugin installed today and there are definite improvements in performance, however most nodes have still dropped in the number of messages per second processed. A few are still up around ~1500, which seems about right for now, but most are in the low hundreds, which is definitely a drop since I enabled the output again.

Also, I had to switch the output to TCP this time, as enabling UDP gives me a transport error on the Riemann server (I'd prefer to use UDP for this):

```
WARN [2015-07-17 11:59:42,106] defaultEventExecutorGroup-2-1 - riemann.transport.udp - UDP handler caught
io.netty.handler.codec.DecoderException: com.google.protobuf.InvalidProtocolBufferException: While parsing a protocol message, the input ended unexpectedly in the middle of a field. This could mean either that the input has been truncated or that an embedded message misreported its own length.
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:99)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
	at riemann.transport.proxy$io.netty.handler.codec.MessageToMessageDecoder$ff19274a.channelRead(Unknown Source)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
	at io.netty.channel.nio.AbstractNioMessageChannel$NioMessageUnsafe.read(AbstractNioMessageChannel.java:93)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
	at java.lang.Thread.run(Thread.java:745)
```

Cheers, Pete


mariussturm commented 9 years ago

The UDP problem can happen when the UDP kernel receive buffer overflows. That's another kind of problem; let's focus on the TCP throughput for now. Could you please post your Riemann config so that I can reproduce your setup? Another thing to try would be to remove everything from the (streams ...) configuration to see whether Riemann or the plugin is the limiting factor. Please also run this with different TCP batch sizes like 2500, 5000, 10000; maybe we can get some information from that.
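
For that isolation test, a stripped-down riemann.config could look roughly like this (just a sketch using standard Riemann config functions: a TCP listener and nothing else, so Riemann accepts events and does no stream processing):

```clojure
; Minimal riemann.config for the isolation test (a sketch, not a verified config).
(logging/init {:file "/var/log/riemann/riemann.log"})

; TCP listener only, on the default port.
(let [host "0.0.0.0"]
  (tcp-server {:host host}))

; Deliberately no (streams ...) forms: events are accepted and dropped.
; If Graylog throughput is still low against this config, the bottleneck is
; on the plugin/Graylog side rather than in Riemann's stream processing.
```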

mariussturm commented 9 years ago

In the meantime there was also an update to the underlying Riemann client; it is integrated in v1.1.2: https://github.com/Graylog2/graylog2-plugin-output-riemann/releases/tag/v1.1.2

peteGS commented 9 years ago

Thanks Marius, I'll try to get the new plugin installed today and do some further testing.

In the meantime, here's my Riemann config:

/etc/riemann/riemann.config (just censored the mail server/address)

```clojure
; -*- mode: clojure; -*-
; vim: filetype=clojure

(logging/init {:file "/var/log/riemann/riemann.log"})

; Listen on the local interface over TCP (5555), UDP (5555), and websockets
; (5556)
(let [host "0.0.0.0"]
  (tcp-server {:host host})
  ; (udp-server {:host host})
  (ws-server {:host host}))

; Expire old events from the index every 5 seconds.
(periodically-expire 5)

; Mailer definition for Riemann
(def email (mailer {:host "xxxxxx" :from "xxxxxx"}))

; Graphite definition for Riemann to point to local Graphite instance
(def graph (graphite {:host "localhost"}))

(include "/etc/riemann/conf.d")
```

/etc/riemann/conf.d/graylog_metrics.clj

```clojure
; This one groups by host to simply graph messages per second for each host
(let [index (index)]

  (streams
    (smap (fn [event] (assoc event :time (unix-time)))
      (by :host
        (with {:service "host_messages_per_second" :metric 1}
          (rate 1 index graph)))))

  ; This one simply graphs total messages per second
  (streams
    (smap (fn [event] (assoc event :time (unix-time)))
      (with {:service "total_messages_per_second" :host "mulcher" :metric 1}
        (rate 1 index graph
          #(info (:host %) (:service %) (:metric %)))))))
```

/etc/riemann/conf.d/windows_event_monitor.clj

```clojure
; Calculates the rate over 5 minutes of Windows events that are either not INFO
; or are AUDIT_FAILURE
(let [index (index)]
  (streams
    (where (and (= (:SourceModuleType event))
                (or (= (:EventType event) "AUDIT_FAILURE")
                    (not= (:Severity event) "INFO")))
      (by [:host :EventID]
        (smap (fn [event] (assoc event :service (:EventID event)))
          (with {:metric 1}
            (rate 300
              ;#(info (:host %) (:service %) (:EventID %) (:metric %))
              ;(over 50 (email))
              index graph)))))))
```

Cheers, Pete


peteGS commented 9 years ago

Hi Marius,

I commented out all my streams lines but still see the same behaviour: message processing drops to roughly half.

I tried batch sizes of 500, 1000, and 2000 with the same results.

I still see no difference in load average on the Graylog servers so the plugin doesn't appear to be adding extra CPU load.

Cheers, Pete


peteGS commented 8 years ago

Hi Marius,

Any further ideas on this at all?

Cheers, Pete


mariussturm commented 8 years ago

We have a high interest in solving this, but currently I don't see an obvious solution. The plugin is not doing that much, basically just transforming a log message into a Riemann event. This should be super fast. All the networking options are already set for maximum speed, like batched writes and skipping the additional acknowledgements in the protocol.
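
(For illustration only: the plugin itself is written in Java, but conceptually the per-message work is just a field mapping along these lines; the field names below are assumptions for the sketch, not the plugin's actual ones.)

```clojure
;; Conceptual sketch of "log message -> Riemann event"; not the plugin's code.
(defn message->event
  "Map a Graylog-style message (a map of fields) to a Riemann event map."
  [msg]
  {:host        (:source msg)        ; assumed field names, for illustration
   :service     (:facility msg)
   :time        (:timestamp msg)
   :description (:message msg)
   :metric      1
   :state       "ok"})

;; Example:
;; (message->event {:source "web01" :facility "IIS" :timestamp 1438934400
;;                  :message "GET /index.html 200"})
```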

Just to be super clear, there is only one jar file of the plugin in your server's plugin dir and that's the 1.1.2 release, right?

peteGS commented 8 years ago

Thanks Marius, and yes I could see no reason for it to reduce the processing so drastically, especially when I see no load increase on the servers.

And yes, only the one jar in the plugin directory and it's version 1.1.2.

The only thing I haven't done recently is upgrade to Graylog 1.1.6; I'm still on 1.1.4 at present. I'm hoping to have time to do that this week at some point.

Cheers, Pete


mariussturm commented 8 years ago

Hi Pete, could you please give this SNAPSHOT a try: https://github.com/Graylog2/graylog-plugin-output-riemann/releases/tag/v1.1.3-SNAPSHOT

A bug in the riemann-java client was fixed that could explain your problem.

Thanks, Marius

peteGS commented 8 years ago

Hi Marius,

That could be good news! It'll be a week or so before I'll get to test as I'm taking a week off work next week, however I'm excited to try it and I will let you know as soon as I've been able to.

Cheers, Pete


dch commented 7 years ago

@peteGS any feedback on whether this issue is resolved with the newer release?

peteGS commented 7 years ago

Hi Dave, my apologies but I never got to revisit this as work got way too busy with other things.

I do hope to get back to it soon though, as Riemann is the one tool I could see meeting a few requirements that we have.

Cheers, Pete


peteGS commented 7 years ago

Sincerest apologies for letting this issue lapse for so long, but I have now finally had some bandwidth to be able to revisit Riemann with Graylog.

We're averaging around 12,000 messages per second with peaks up to 20,000 at the moment.

Graylog is now version 2.1.2 with Java 1.8.0_111, still running on CentOS 6 unfortunately.

I'm now running the version 1.1.3 Riemann plugin.

I re-enabled it today for the first time since then and see exactly the same symptoms. Message processing drops to around 4,000 messages per second, so this is significant.

Once I remove the output from the stream in Graylog, processing once again returns to normal.

What else can I provide to help with this issue? In my lab environment it works perfectly, but we're talking maybe 100 messages per second tops.

Cheers, Pete

mariussturm commented 7 years ago

@peteGS did you try setting a bigger batch size, e.g. 500 or 1000? Keep in mind that batched sending is only done when using TCP.

peteGS commented 7 years ago

Hi Marius,

Yes, my batch size is set to 1000; here are the output configuration settings from Graylog:

I've set up Grafana to graph various metrics on the Graylog and Elasticsearch nodes to see if there's any correlation with load etc. at the same time, but nothing stands out. Load drops off with the throughput; it doesn't increase in a way that would indicate a bottleneck or anything along those lines. Elasticsearch load and disk I/O drop off as well (as expected with decreased throughput).

Cheers, Pete


mariussturm commented 7 years ago

Do you see the same degradation when you use the IP of the Riemann server instead of the FQDN?

peteGS commented 7 years ago

I haven't tried that Marius, will give it a shot today and see if there's a difference.

Cheers, Pete


peteGS commented 7 years ago

Unfortunately there's no difference using the IP rather than the FQDN, Marius.

Cheers, Pete


peteGS commented 7 years ago

Hi Marius, any further developments on this? I had another scenario yesterday afternoon where someone directed an excess of data at my Graylog instance, which filled my ElasticSearch nodes and then the Graylog journal overnight. I was hoping this particular plugin could help me identify these scenarios and alert early based on an increased rate of messages from a particular source.

I suspect there may be some increased interest from the rest of our business in the near future (Splunk is not a cheap option for us), so I'd really like to get some maturity into monitoring my Graylog/ElasticSearch platform, and I think this would help greatly.

If there are things I can do to help narrow down and identify the issue, please let me know. I'm more than happy to put effort into assisting with a resolution; I'm just unable to help on the code side due to a lack of skills/knowledge.

mariussturm commented 7 years ago

Hi @peteGS, currently I don't see how we can improve the performance of the Riemann output. With batched transfers and DNS lookup time eliminated, we should be at maximum performance. Upcoming Graylog versions will have better output management in general, so one slow output will not block the others, but there is no ETA for that feature yet.

peteGS commented 7 years ago

No worries Marius.

Do you think the issue is essentially a Graylog processing/output issue, or could it still be a Riemann client issue?

If it's possible it's the latter, do you think it's worth me raising an issue on the Riemann side of things? Happy to do so if there's a chance it will help.

Things have changed a bit here so I may be looking at moving our Graylog/ElasticSearch environment to AWS and if this is the case I'll be deploying it with CentOS 7. It'll be interesting to see if there's a difference in performance with that vs. CentOS 6.

Cheers, Pete


mariussturm commented 7 years ago

You would need a benchmark to identify the actual bottleneck: a small program that sends Riemann events as fast as possible. If you get better performance with that, you know the problem is on the Graylog side. Maybe that's a good idea anyway for debugging your specific environment.
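
As a starting point, something along these lines could work; this is only a rough sketch assuming the riemann-clojure-client library and its documented tcp-client / send-event / close! API, with the host name and event count as placeholders:

```clojure
;; Rough benchmark sketch: sends n events to Riemann as fast as possible and
;; reports the achieved rate. Assumes the riemann-clojure-client dependency.
(require '[riemann.client :as r])

(defn bench [host n]
  (let [c     (r/tcp-client {:host host :port 5555})
        start (System/nanoTime)]
    (dotimes [i n]
      (let [p (r/send-event c {:service "benchmark" :metric i :state "ok"})]
        ;; Only block on the final send, so we know everything was flushed
        ;; without paying a round trip per event.
        (when (= i (dec n))
          (deref p 5000 ::timeout))))
    (r/close! c)
    (let [secs (/ (- (System/nanoTime) start) 1e9)]
      (println (format "%d events in %.1fs (~%.0f events/sec)"
                       n secs (double (/ n secs)))))))

;; e.g. (bench "riemann.example.com" 100000)
```

If this little sender hits a much higher rate against the same Riemann server than Graylog does, the slowdown is most likely on the Graylog/plugin side; if it tops out at a similar rate, the ceiling is in Riemann or the network path.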

peteGS commented 7 years ago

Great idea Marius, I'll have to look into that, sounds like a worthwhile test for sure.

Cheers, Pete
