Skip pattern for single line entries that are mixed in with multiline

ppf2 commented 8 years ago

Say you have a log file that mixes multiline and single-line entries. Here is an example where multiline logs (a Java application stack trace) and single-line logs (an access log) appear in the same file.

2016-01-14T11:10:03.09-0500 [App/0]      OUT 16:10:03.087 [pool-13-thread-1] ERROR not unable to send heartbeat!
2016-01-14T11:10:03.09-0500 [App/0]      OUT Caused by: java.net.ConnectException: Connection refused
2016-01-14T11:10:03.09-0500 [App/0]      OUT    at java.net.PlainSocketImpl.socketConnect(Native Method) ~[na:1.8.0_51-]
2016-01-14T11:10:03.09-0500 [App/0]      OUT    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345) ~[na:1.8.0_51-]
2016-01-14T11:10:03.09-0500 [App/0]      OUT    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[na:1.8.0_51-]
2016-01-14T11:10:03.09-0500 [App/0]      OUT    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[na:1.8.0_51-]
2016-01-14T11:10:03.09-0500 [App/0]      OUT    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[na:1.8.0_51-]
2016-01-14T11:10:03.09-0500 [App/0]      OUT    at java.net.Socket.connect(Socket.java:589) ~[na:1.8.0_51-]
2016-01-14T11:10:03.09-0500 [App/0]      OUT    at org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:117) ~[httpclient-4.5.1.jar!/:4.5.1]
2016-01-14T11:10:03.09-0500 [App/0]      OUT    at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177) ~[httpclient-4.5.1.jar!/:4.5.1]
2016-01-14T11:10:03.09-0500 [App/0]      OUT    at org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:144) ~[httpclient-4.5.1.jar!/:4.5.1]
2016-01-14T11:10:03.09-0500 [App/0]      OUT    at org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:131) ~[httpclient-4.5.1.jar!/:4.5.1]
2016-01-14T11:10:03.09-0500 [App/0]      OUT    at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611) ~[httpclient-4.5.1.jar!/:4.5.1]
2016-01-14T11:10:03.09-0500 [App/0]      OUT    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446) ~[httpclient-4.5.1.jar!/:4.5.1]
2016-01-14T11:10:03.09-0500 [App/0]      OUT    at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882) ~[httpclient-4.5.1.jar!/:4.5.1]
2016-01-14T11:10:03.09-0500 [App/0]      OUT    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:117) ~[httpclient-4.5.1.jar!/:4.5.1]
2016-01-14T11:10:03.09-0500 [App/0]      OUT    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55) ~[httpclient-4.5.1.jar!/:4.5.1]
2016-01-14T11:23:47.36-0500 [Access/2]      OUT host_name - [14/01/2016:16:23:46 +0000] "GET /app-path/health HTTP/1.1" 200 456 "-" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/10.4.39.1 ... and so on
2016-01-14T11:10:03.09-0500 [App/0]      OUT 16:12:03.087 [pool-14-thread-1] ERROR not unable to send heartbeat!
2016-01-14T11:10:03.09-0500 [App/0]      OUT Caused by: java.net.ConnectException: Connection refused
2016-01-14T11:10:03.09-0500 [App/0]      OUT    at java.net.PlainSocketImpl.socketConnect(Native Method) ~[na:1.8.0_51-] ....

If you use a multiline pattern to match the multiline entries, the codec will correctly emit the multiline events, but it will skip the single-line entries entirely.

One workaround today is to use the multiline filter in the LS pipeline instead of the multiline codec. But with the multiline filter due to be deprecated soon, it would be better to have a solution that uses the multiline codec. Maybe the codec could provide a skip_pattern setting, so users can define a second pattern that matches the lines (the single lines) to be skipped over. Those lines would then still be emitted as events even when the multiline codec is used.
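To make the request concrete, here is a minimal Python sketch of the proposed behaviour. It is not the codec's implementation: `merge_lines`, the `continuation` regex, and the `skip_pattern` argument are all hypothetical names chosen for illustration, and the sketch assumes that continuation lines are appended to the previous event.

```python
import re


def merge_lines(lines, continuation, skip_pattern=None):
    """Group lines into events (illustrative only, not the codec's logic).

    A line matching `continuation` is appended to the pending event.
    A line matching the hypothetical `skip_pattern` is emitted at once
    as its own event and leaves the pending buffer untouched.
    """
    cont = re.compile(continuation)
    skip = re.compile(skip_pattern) if skip_pattern else None
    events, pending = [], []
    for line in lines:
        if skip and skip.search(line):
            events.append(line)  # single-line entry passes straight through
            continue
        if cont.search(line) and pending:
            pending.append(line)  # continuation of the pending event
        else:
            if pending:
                events.append("\n".join(pending))
            pending = [line]  # this line starts a new event
    if pending:
        events.append("\n".join(pending))
    return events
```

With a continuation pattern such as `OUT\s+(at |Caused by:)` and a skip pattern such as `\[Access/\d+\]`, an access line that lands in the middle of a stack trace is emitted on its own and the trace stays in one piece. Note that in this sketch the skipped line is emitted before the event it interrupted, because that event is only flushed once its next start line arrives.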

guyboertje commented 8 years ago

@ppf2 - so I guess what you are saying is:

Within any one identity-based stream of lines, there exist identifiable sub-streams.

Clarification: ATM, the file input maps a codec instance to a path (identity). A sub-stream pattern (from your example, "\s[.+]\s\s") could be applied to each line to extract a sub-stream identity, so that different buffers/logic can be used for each sub-stream.

WDYT?

Would each sub-stream need its own pattern, what and negate settings?

I have been experimenting with a finite state machine class that could be used for each sub-stream.
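A rough Python sketch of the sub-stream idea described above, under stated assumptions: the `IDENTITY` regex, the `SubStream` class, and the `demux` function are hypothetical names, not part of any Logstash API, and each sub-stream is given only an optional continuation pattern rather than full pattern/what/negate settings.

```python
import re

# Hypothetical identity pattern: captures the bracketed tag, e.g. "App/0".
IDENTITY = re.compile(r"\s\[(.+?)\]\s+")


class SubStream:
    """One buffer per sub-stream; pattern=None means single-line events."""

    def __init__(self, pattern=None):
        self.pattern = re.compile(pattern) if pattern else None
        self.buffer = []

    def feed(self, line):
        """Accept one line and return the list of events it completes."""
        if self.pattern is None:
            return [line]
        if self.pattern.search(line) and self.buffer:
            self.buffer.append(line)  # continuation of the buffered event
            return []
        done = ["\n".join(self.buffer)] if self.buffer else []
        self.buffer = [line]  # this line starts a new event
        return done

    def flush(self):
        """Return whatever is still buffered and clear the buffer."""
        done = ["\n".join(self.buffer)] if self.buffer else []
        self.buffer = []
        return done


def demux(lines, settings):
    """Route each line to its sub-stream.

    `settings` maps an identity prefix (e.g. "App") to a continuation
    pattern, or to None for single-line sub-streams.
    """
    streams, events = {}, []
    for line in lines:
        match = IDENTITY.search(line)
        identity = match.group(1) if match else ""
        if identity not in streams:
            kind = identity.split("/")[0]
            streams[identity] = SubStream(settings.get(kind))
        events.extend(streams[identity].feed(line))
    for stream in streams.values():
        events.extend(stream.flush())
    return events
```

For example, `demux(lines, {"App": r"OUT\s+(at |Caused by:)", "Access": None})` would buffer `[App/0]` lines into multiline events while passing `[Access/2]` lines through one at a time, because each identity keeps its own buffer.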

/cc @jordansissel

johnarnold commented 8 years ago

More specific (sub-)stream identity patterns (similar to the multiline filter) would definitely be useful for me.

logstash-plugins / logstash-codec-multiline

Skip pattern for single line entries that are mixed in with multiline #22