openshift / origin-aggregated-logging


[RFE] Enable fluentd to correctly parse multi-line json log entries such as java stacktraces. #28

Closed ivanthelad closed 8 years ago

ivanthelad commented 9 years ago

If an exception is thrown from a Java application, fluentd sends a new event to Elasticsearch for each line of the stack trace. This is undesired and produces a lot of noise in ES, making it hard to find the real errors.

The parser plugin should be enabled and preconfigured for the above case (and I'm sure there are others): http://docs.fluentd.org/articles/parser-plugin-overview

From docs

One more example: you can parse Java-like stacktrace logs with multiline. Here is a configuration example.

format multiline
format_firstline /\d{4}-\d{1,2}-\d{1,2}/
format1 /^(?<time>\d{4}-\d{1,2}-\d{1,2} \d{1,2}:\d{1,2}:\d{1,2}) \[(?<thread>.*)\] (?<level>[^\s]+)(?<message>.*)/

If you have the following log:

2013-3-03 14:27:33 [main] INFO Main - Start
2013-3-03 14:27:33 [main] ERROR Main - Exception
javax.management.RuntimeErrorException: null
    at Main.main(Main.java:16) ~[bin/:na]
2013-3-03 14:27:33 [main] INFO Main - End

It will be parsed as:

2013-03-03 14:27:33 +0900 zimbra.mailbox: {"thread":"main","level":"INFO","message":" Main - Start"}
2013-03-03 14:27:33 +0900 zimbra.mailbox: {"thread":"main","level":"ERROR","message":" Main - Exception\njavax.management.RuntimeErrorException: null\n at Main.main(Main.java:16) ~[bin/:na]"}
2013-03-03 14:27:33 +0900 zimbra.mailbox: {"thread":"main","level":"INFO","message":" Main - End"}

ewolinetz commented 8 years ago

Hi @ivanthelad, our application logs come from Docker containers in the form of JSON, so return characters should already be included and encapsulated in the message payload. Please let us know if you are seeing otherwise, though!

We have put the above example to use for our syslog entries: some log messages matched the scenario where a line did not begin with a timestamp and was actually part of a previous message.

ivanthelad commented 8 years ago

@ewolinetz Unfortunately we are seeing each stack trace split across multiple JSON messages. It should be noted that this output is coming from a Java process. I don't have an example to show you at the moment, but I will try to organise one.

ivanthelad commented 8 years ago

Hi @ewolinetz,

I've come across this issue again. Here is an example of a Java exception (the exception itself is irrelevant) coming from the OpenShift-provided Elasticsearch. I only included two entries, but it is clear from the message attribute that each line of the stack trace is being treated as a separate log message.

{ "_index": "logging.2016.02.10", "_type": "fluentd", "_id": "AVLK0xzq-Z6rwixlM5xV", "_score": null, "_source": { "time": "2016-02-10T10:56:12.778694887Z", "docker_container_id": "37873a3c3db8893b3828ab0b4d7a348b0b52a94c7f38ffea7f46c0f52ccf792b", "kubernetes_namespace_name": "logging", "kubernetes_pod_id": "06829218-cfe2-11e5-8126-2cc2600091c2", "kubernetes_pod_name": "logging-es-x7ywtsk5-2-h8xej", "kubernetes_container_name": "elasticsearch", "kubernetes_labels_component": "es", "kubernetes_labels_deployment": "logging-es-x7ywtsk5-2", "kubernetes_labels_deploymentconfig": "logging-es-x7ywtsk5", "kubernetes_labels_provider": "openshift", "kubernetes_host": "node2.example.com", "hostname": "node2.example.com", "message": "\tat org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)\n", "version": "1.0.6" }, "fields": { "time": [ 1455101772778 ] }, "sort": [ 1455101772778 ] }

{ "_index": "logging.2016.02.10", "_type": "fluentd", "_id": "AVLK0xzq-Z6rwixlM5xM", "_score": null, "_source": { "time": "2016-02-10T10:56:12.778151295Z", "docker_container_id": "37873a3c3db8893b3828ab0b4d7a348b0b52a94c7f38ffea7f46c0f52ccf792b", "kubernetes_namespace_name": "logging", "kubernetes_pod_id": "06829218-cfe2-11e5-8126-2cc2600091c2", "kubernetes_pod_name": "logging-es-x7ywtsk5-2-h8xej", "kubernetes_container_name": "elasticsearch", "kubernetes_labels_component": "es", "kubernetes_labels_deployment": "logging-es-x7ywtsk5-2", "kubernetes_labels_deploymentconfig": "logging-es-x7ywtsk5", "kubernetes_labels_provider": "openshift", "kubernetes_host": "node2.example.com", "hostname": "node2.example.com", "message": "\tat org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)\n", "version": "1.0.6" }, "fields": { "time": [ 1455101772778 ] }, "sort": [ 1455101772778 ] }

ewolinetz commented 8 years ago

Hi @ivanthelad

I was able to recreate this. However, the issue to overcome is finding a way to process multiline log messages that can be in different formats, since we could have log messages from many different platforms. And since a container name cannot be matched to the platform used, nor, to my knowledge, can the platform currently be looked up at the time we are processing the logs, we have no way to know what type of format to expect.

We will continue investigation on how this might be accomplished.

For example, Elasticsearch logs (stack trace provided) are structured like this prior to being read in:

{"log":"[2016-02-10 15:45:51,100][ERROR][com.floragunn.searchguard.service.SearchGuardConfigService] [Infamnia] Try to refresh security configuration but it failed due to org.elasticsearch.action.NoShardAvailableActionException: [.searchguard.logging-es-0buwfnfm-2-qfoor][4] null\n","stream":"stdout","time":"2016-02-10T15:45:51.105650005Z"}
{"log":"org.elasticsearch.action.NoShardAvailableActionException: [.searchguard.logging-es-0buwfnfm-2-qfoor][4] null\n","stream":"stdout","time":"2016-02-10T15:45:51.105689399Z"}
{"log":"\u0009at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction$AsyncSingleAction.perform(TransportShardSingleOperationAction.java:175)\n","stream":"stdout","time":"2016-02-10T15:45:51.105702191Z"}
{"log":"\u0009at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction$AsyncSingleAction.start(TransportShardSingleOperationAction.java:155)\n","stream":"stdout","time":"2016-02-10T15:45:51.105710445Z"}
{"log":"\u0009at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction.doExecute(TransportShardSingleOperationAction.java:89)\n","stream":"stdout","time":"2016-02-10T15:45:51.10572886Z"}
{"log":"\u0009at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction.doExecute(TransportShardSingleOperationAction.java:55)\n","stream":"stdout","time":"2016-02-10T15:45:51.105737729Z"}
{"log":"\u0009at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:167)\n","stream":"stdout","time":"2016-02-10T15:45:51.10574562Z"}
{"log":"\u0009at io.fabric8.elasticsearch.plugin.KibanaUserReindexAction.apply(KibanaUserReindexAction.java:74)\n","stream":"stdout","time":"2016-02-10T15:45:51.105753291Z"}
{"log":"\u0009at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:165)\n","stream":"stdout","time":"2016-02-10T15:45:51.105761842Z"}
{"log":"\u0009at com.floragunn.searchguard.filter.SearchGuardActionFilter.apply0(SearchGuardActionFilter.java:141)\n","stream":"stdout","time":"2016-02-10T15:45:51.105769461Z"}
{"log":"\u0009at com.floragunn.searchguard.filter.SearchGuardActionFilter.apply(SearchGuardActionFilter.java:89)\n","stream":"stdout","time":"2016-02-10T15:45:51.105777254Z"}
{"log":"\u0009at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:165)\n","stream":"stdout","time":"2016-02-10T15:45:51.105785258Z"}
{"log":"\u0009at org.elasticsearch.action.support.ActionFilter$Simple.apply(ActionFilter.java:64)\n","stream":"stdout","time":"2016-02-10T15:45:51.10579554Z"}
{"log":"\u0009at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:165)\n","stream":"stdout","time":"2016-02-10T15:45:51.105802579Z"}
{"log":"\u0009at io.fabric8.elasticsearch.plugin.ActionForbiddenActionFilter.apply(ActionForbiddenActionFilter.java:48)\n","stream":"stdout","time":"2016-02-10T15:45:51.105810007Z"}
{"log":"\u0009at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:165)\n","stream":"stdout","time":"2016-02-10T15:45:51.105817065Z"}
{"log":"\u0009at com.floragunn.searchguard.filter.AbstractActionFilter.apply(AbstractActionFilter.java:105)\n","stream":"stdout","time":"2016-02-10T15:45:51.105825581Z"}
{"log":"\u0009at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:165)\n","stream":"stdout","time":"2016-02-10T15:45:51.105833839Z"}
{"log":"\u0009at com.floragunn.searchguard.filter.AbstractActionFilter.apply(AbstractActionFilter.java:105)\n","stream":"stdout","time":"2016-02-10T15:45:51.105841452Z"}
{"log":"\u0009at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:165)\n","stream":"stdout","time":"2016-02-10T15:45:51.105848921Z"}
{"log":"\u0009at com.floragunn.searchguard.filter.AbstractActionFilter.apply(AbstractActionFilter.java:105)\n","stream":"stdout","time":"2016-02-10T15:45:51.105856733Z"}
{"log":"\u0009at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:165)\n","stream":"stdout","time":"2016-02-10T15:45:51.10586437Z"}
{"log":"\u0009at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:82)\n","stream":"stdout","time":"2016-02-10T15:45:51.105872478Z"}
{"log":"\u0009at org.elasticsearch.client.node.NodeClient.execute(NodeClient.java:98)\n","stream":"stdout","time":"2016-02-10T15:45:51.105880233Z"}
{"log":"\u0009at org.elasticsearch.client.support.AbstractClient.get(AbstractClient.java:193)\n","stream":"stdout","time":"2016-02-10T15:45:51.105887709Z"}
{"log":"\u0009at org.elasticsearch.action.get.GetRequestBuilder.doExecute(GetRequestBuilder.java:201)\n","stream":"stdout","time":"2016-02-10T15:45:51.105895308Z"}
{"log":"\u0009at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:91)\n","stream":"stdout","time":"2016-02-10T15:45:51.105902476Z"}
{"log":"\u0009at com.floragunn.searchguard.service.SearchGuardConfigService.reloadConfig(SearchGuardConfigService.java:81)\n","stream":"stdout","time":"2016-02-10T15:45:51.105918907Z"}
{"log":"\u0009at com.floragunn.searchguard.service.SearchGuardConfigService.access$500(SearchGuardConfigService.java:41)\n","stream":"stdout","time":"2016-02-10T15:45:51.105928144Z"}
{"log":"\u0009at com.floragunn.searchguard.service.SearchGuardConfigService$Reload.run(SearchGuardConfigService.java:111)\n","stream":"stdout","time":"2016-02-10T15:45:51.105935956Z"}
{"log":"\u0009at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n","stream":"stdout","time":"2016-02-10T15:45:51.105943798Z"}
{"log":"\u0009at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)\n","stream":"stdout","time":"2016-02-10T15:45:51.105951198Z"}
{"log":"\u0009at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)\n","stream":"stdout","time":"2016-02-10T15:45:51.105959662Z"}
{"log":"\u0009at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)\n","stream":"stdout","time":"2016-02-10T15:45:51.105967395Z"}
{"log":"\u0009at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)\n","stream":"stdout","time":"2016-02-10T15:45:51.105974421Z"}
{"log":"\u0009at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)\n","stream":"stdout","time":"2016-02-10T15:45:51.105981317Z"}
{"log":"\u0009at java.lang.Thread.run(Thread.java:745)\n","stream":"stdout","time":"2016-02-10T15:45:51.105988245Z"}

and Kibana logs are structured like this before being read in:

{"log":"{\"name\":\"Kibana\",\"hostname\":\"logging-kibana-1-1nayr\",\"pid\":8,\"level\":50,\"err\":{\"message\":\"Request Timeout after 5000ms\",\"name\":\"Error\",\"stack\":\"Error: Request Timeout after 5000ms\\n    at null.\u003canonymous\u003e (/opt/app-root/src/src/node_modules/elasticsearch/src/lib/transport.js:282:15)\\n    at Timer.listOnTimeout [as ontimeout] (timers.js:112:15)\"},\"msg\":\"\",\"time\":\"2016-02-10T15:35:58.563Z\",\"v\":0}\n","stream":"stdout","time":"2016-02-10T15:35:58.565416267Z"}
sosiouxme commented 8 years ago

https://bugzilla.redhat.com/show_bug.cgi?id=1294168 is a corresponding RFE

donovanmuller commented 8 years ago

As part of the implementation of this, would it be possible to allow custom filters etc. to be passed into the deployer, or to provide an easy way of pointing to a different fluentd image (i.e. one that has custom filters etc. for a specific environment)?

Also, any idea when the implementation of this is planned? The stack is not as useful as it could be in a primarily Java-based environment as things stand...

sosiouxme commented 8 years ago

It's not specifically planned yet. It will probably require substantial changes to the logging infrastructure to get from line-based log entries to a more semantic approach -- perhaps enabling something beyond stdin/stdout for docker logs.

ullgren commented 8 years ago

Agree with @donovanmuller; it would be useful if each deployment configuration could provide its own custom filters and/or fluentd config.

sosiouxme commented 8 years ago

I think in the near term we will be separating the fluentd config out of the image into a ConfigMap, which will make it pretty easy to customize the config (as long as it doesn't require new plugins, etc.). Then if you know your environment is all-Java and want to customize the filters, you can do so.
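Roughly, the workflow would then be to edit the ConfigMap and restart fluentd, e.g. (the ConfigMap name, namespace, and label below are assumptions about what the deployer will create, not final):

# Assumed names: ConfigMap "logging-fluentd" in the "logging" namespace,
# fluentd pods labeled component=fluentd -- adjust to the actual deployment.
oc -n logging edit configmap/logging-fluentd     # edit the fluent.conf data key
oc -n logging delete pods -l component=fluentd   # recreate the pods so they pick up the new config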

Customizing filters per-DC is a whole different problem. I think what we might do for that use case is simply provide a logging endpoint that accepts a common logging format and rely on apps to customize their config to log whatever you like within the constraints of that format. There are security concerns to resolve with that approach. It is not quite as near-term but something like this is clearly desired and may be prioritized soon.

jcantrill commented 8 years ago

@sosiouxme @ewolinetz any reason not to close this issue in favor of the outstanding trello card we have: https://trello.com/c/k617Elhi

ewolinetz commented 8 years ago

@jcantrill I don't see a reason not to; closing this issue out in favor of tracking it in Trello. From our observations it is simpler to update the log appenders to print out stack traces as JSONL rather than the current format that fluentd is reading in.
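For illustration, the idea is that the appender itself emits the whole event, stack trace included, as a single JSON line; the field names here are only hypothetical, not a schema this project defines:

{"time":"2016-02-10T15:45:51.105Z","level":"ERROR","logger":"com.floragunn.searchguard.service.SearchGuardConfigService","message":"Try to refresh security configuration but it failed","stack_trace":"org.elasticsearch.action.NoShardAvailableActionException: null\n\tat org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction$AsyncSingleAction.perform(TransportShardSingleOperationAction.java:175)\n\t..."}

fluentd would then forward one record per event instead of one record per stack trace line.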

pmehra06 commented 7 years ago

Even after reading the whole issue and the Trello card, I am still not able to find the solution for how to correctly parse a Java stack trace. It would be helpful if you could share a document for this.

StevenACoffman commented 6 years ago

@pmehra06 fluent-plugin-detect-exceptions works well for most languages for fluentd. Is that an option in openshift?

jcantrill commented 6 years ago

This is an option for openshift. It will require some testing; I will add it to the card. Alternatively, if you have control over the log library, you can configure it to output the stack trace in a single JSON blob.
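A minimal sketch of wiring that plugin in, based on its documented options (the tag pattern and values are illustrative, not a tested OpenShift config):

<match raw.**>
  @type detect_exceptions
  remove_tag_prefix raw
  message log
  languages java, python
  multiline_flush_interval 0.5
  max_lines 1000
</match>

Records whose message field looks like part of a known-language stack trace get buffered and re-emitted as one combined record; everything else passes through unchanged.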

sosiouxme commented 6 years ago

So while that's a cool capability, it really feels to me like the wrong place to do it. You're going to spend a lot of processing power trying to parse log streams to look for exceptions on logs that aren't even java or python.

The base problem is we don't have a reliable way to know what's in a log. We don't know which parser to apply. That seems like something that can be solved, just not at the logging solution -- it's too late to parse it out of a wall of text with no metadata. It's possible, obviously, it just seems like it would be so much better to attach some metadata at the image/container/pod definition to inform potential log parsers. It could be optional until it's a de facto standard. Just thinking out loud here, this is hardly the place for the discussion :)

StevenACoffman commented 6 years ago

@sosiouxme For the upcoming fluent-bit release, fluent-bit will read Kubernetes pod annotations to suggest the parser to use (e.g. logging.parser = apache, or logging.parser = java-multiline). It's a nice model.

However, at present, for fluentd, the only solution in widespread use (e.g. all of Google Cloud platform) is fluent-plugin-detect-exceptions.

addozhang commented 6 years ago

I gave fluent-plugin-detect-exceptions a try, but ran into some gotchas.

fluent-plugin-detect-exceptions combines only the stack trace lines and drops the primary log content, so I made some changes to work around this.

Perhaps someone can work out a more official solution. The ConfigMap is attached below.

fluent-plugin-detect-exceptions#41

# This file is the fluentd configuration entrypoint. Edit with care.
@include configs.d/openshift/system.conf
# In each section below, pre- and post- includes don't include anything initially;
# they exist to enable future additions to openshift conf as needed.

## sources
## ordered so that syslog always runs last...
@include configs.d/openshift/input-pre-*.conf
@include configs.d/dynamic/input-docker-*.conf
@include configs.d/dynamic/input-syslog-*.conf
@include configs.d/openshift/input-post-*.conf
##
<label @INGRESS>
## filters
  <match time.**>
    @type detect_exceptions
    @label @INGRESS
    remove_tag_prefix time
    message log
    languages time
    multiline_flush_interval 0.5
  </match>
  @include configs.d/openshift/filter-pre-*.conf
  @include configs.d/openshift/filter-retag-journal.conf
  @include configs.d/openshift/filter-k8s-meta.conf
  @include configs.d/openshift/filter-kibana-transform.conf
  @include configs.d/openshift/filter-k8s-flatten-hash.conf
  @include configs.d/openshift/filter-k8s-record-transform.conf
  @include configs.d/openshift/filter-syslog-record-transform.conf
  @include configs.d/openshift/filter-viaq-data-model.conf
  @include configs.d/openshift/filter-post-*.conf
##
</label>

<label @OUTPUT>
## matches
  @include configs.d/openshift/output-pre-*.conf
  @include configs.d/openshift/output-operations.conf
  @include configs.d/openshift/output-applications.conf
  # no post - applications.conf matches everything left
##
</label>
bszeti commented 6 years ago

Did you consider adding the fluent-plugin-concat plugin to the image? Then we could configure a filter to handle multiline logs if needed. The performance implication is a question, of course... https://github.com/fluent-plugins-nursery/fluent-plugin-concat
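For instance, a filter along these lines might do it (a sketch only, assuming the Docker json-file log field and timestamp-first Java log lines; the tag pattern and regex are illustrative):

<filter kubernetes.**>
  @type concat
  key log
  # Start a new record when a line begins with a timestamp;
  # continuation lines such as "\tat ..." are appended to the previous record.
  multiline_start_regexp /^\d{4}-\d{2}-\d{2}/
  separator ""
  flush_interval 5
</filter>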

richm commented 6 years ago

@bszeti https://github.com/openshift/origin-aggregated-logging/pull/1302