Stream identity support to properly handle multiple sources

andrewvc commented 9 years ago

For the case where the input comes in pre-tagged (like Lumberjack) the stream identity option from the multiline would be useful here.

Affected input issues:

colinsurprenant commented 9 years ago

I think there are a few issues at stake here. Two classic problems are with the lumberjack and file inputs.

in the file input case, the stream identity would need to include the file path and currently the file path is set in the event after the codec decoding, see https://github.com/logstash-plugins/logstash-input-file/blob/57d3d9306ddf067a5281db1eac594409a86526e2/lib/logstash/inputs/file.rb#L194-L200
I think the same is probably true with the lumberjack input where the codec does not know about the lumberjack context see https://github.com/logstash-plugins/logstash-input-lumberjack/blob/342cb177b4cbb4bf054f1aeb1f74e197c8aa1ba5/lib/logstash/inputs/lumberjack.rb#L79-L80
a preferable option would probably to instantiate a separate codec per established stream identity so that the codec itself does not have to handle multiple streams etc.

colinsurprenant commented 9 years ago

also relates to: logstash-plugins/logstash-input-file#44 logstash-plugins/logstash-input-lumberjack#5

andrewvc commented 9 years ago

@colinsurprenant ++ your codec per stream identity idea

ph commented 9 years ago

This shouldn't be too hard to implement in the lumberjack since we already clone the codec per connection, its just a matter to define the stream_identity and use that instead.

colinsurprenant commented 9 years ago

Right. I think this is what it basically boils down to: having the necessary properties exposed to the codec, for a given input context, to establish a stream identity. this is maybe an api change we should introduce at the codec level? thoughts?

andrewvc commented 9 years ago

@colinsurprenant to clarify, you're saying we should just implement stream_identity as the filter does?

colinsurprenant commented 9 years ago

@andrewvc well, we can't because the codec does not have the event context since the event is not yet decoded, the stream identity must be established/provided before entering the codec no? for example, the file input can provide the path information.

andrewvc commented 9 years ago

@colinsurprenant ah, are you saying we should decorate it before its encoded? Instead of passing the codec the raw line maybe we could pass it a LogStash::Event with 'message' set and the correct tag?

Or are you talking about parameterizing it some other way.

colinsurprenant commented 9 years ago

@andrewvc I don't think this would work, think about non-line oriented inputs like tcp/udp where we do a sysread(16384) for example, the message field will not be the sysread result, the codec first need to perform its decode logic before assigning the message field.

I think we should parametrize the codecs with maybe a stream_id field or something around these lines?

andrewvc commented 9 years ago

@colinsurprenant works for me, plus it'll be transparent to the user which is nice.

jordansissel commented 9 years ago

I don't thik we need an API change for this. The multiline filter uses a Hash of string_identity => pending_events. For the codec, we can achieve the same with a Hash of stream_identity => codec_instance.

For lumberjack, we can use the 'file' field that lsf/lumberjack provides for the identity. The input can be responsible for this mapping of stream to codec.

As a way of gathering state, I have confirmed the following:

lsf + lumberjack+multiline - works as expected if and only if each lsf follows exactly 1 file. (because no stream identification is done today).
if lsf follows 2 or more files, the multiline codec will treat them as the same stream (oops).

Test case:

# Each command run roughly 1 second apart.
% echo "hello" >> /tmp/a
% echo "  b hello" >> /tmp/b
% echo "  world" >> /tmp/a
% echo "blah" >> /tmp/a

LSF config:

  "files": [ { "paths": [ "/tmp/a", "/tmp/b" ] } ]

Logstash run showing the oops:

% bin/logstash -e 'input { lumberjack { codec => multiline { pattern => "^\s+" what => previous } port => 4444 ssl_certificate => "/users/jls/projects/logstash/lumberjack.crt" ssl_key => "/users/jls/projects/logstash/lumberjack.key" } }'
...
  "message"    => "hello\n  b hello\n  world",
...

You can see b hello which was written to /tmp/b was merged in the same stream as /tmp/a.

ph commented 9 years ago

oups I should have seen this when doing work on this plugin :(

Yes for using the "file" field from the LSF.

I don't think the logic of the stream_identity belong in the codec, I think our codecs should stay simple and take input A and produce B, we can create a simple class that wrap the codec and do the routing.

# Terrible naming ahead, sorry it's late.
multiplexer = CodecMultiplexer.new(@codec)
multiplexer.decode(message["file"], :stream_identity => message["file"]) do |event|
end

ph commented 9 years ago

Another syntax.

multiplexer = CodecMultiplexer.new(@codec, :stream_identity => :file)
  multiplexer.decode(message) do |event|
end

allenmchan commented 9 years ago

Is there a release that this improvement is targeted for implementation? For example, are you guys going to try to get it into 2.0?

colinsurprenant commented 9 years ago

Let me recap:

specific input plugins have specific domain knowledge of their stream_identity, like the path information of the file input plugin identifies different streams.
if we agree that the stream_identity demultiplexing does not belong in the codec then separate codec instances must be created per stream.
stream demultiplexing should be generic and usable by any plugin+codec combination.

Could we add a base input plugin option to enable specifying an arbitrary stream_identity, then rework & decouple the codec instantiation code to automatically handle stream demultiplexing using multiple codecs instances, as needed, upon having this stream_identity option set in the input plugin?

@jordansissel @ph @andrewvc Thoughts?

ph commented 9 years ago

specific input plugins have specific domain knowledge of their stream_identity, like the path information of the file input plugin identifies different streams.

Correct

if we agree that the stream_identity demultiplexing does not belong in the codec then separate codec instances must be created per stream.

Correct too, we could proxy the call to multiples codec instances.

But let me add this:

Codecs know if they care about stream identity or not, What I mean by that is multiline requires stream_identity to correctly does his job, json codec doesn't, so we need some kind of @codec.require_stream_identity? at the codec level.

colinsurprenant commented 9 years ago

didn't we just say that codecs don't have to know about stream_identity if we handle demultiplexing outside the codecs and simply instantiate multiple codecs? In this case codecs don't have to care about stream identity, they can just assume they only deal with a single consistent stream.

If we do that, we only need to refactor the way we instantiate codecs for the input plugins and have a way for the input plugins to address the right codec depending on its specific knowledge of the stream_identity.

ph commented 9 years ago

So what you are suggesting if even if the codec could operate without the need for stream identity (like the json codec) we would always create one if the input can be affected by it. This will provide some kind of "safety". I think this is fair enough. +1

andrewvc commented 9 years ago

@colinsurprenant to make sure I understand, we're just adding a default argument to the codec instantiation, and are backing it with a default config setting on the the base codec class yes?

jordansissel commented 9 years ago

+1 on keeping the "stream awareness" outside of the codec and within the input plugin itself. The TCP input does this today, iirc, and we can do the same for lumberjack and file inputs.

colinsurprenant commented 9 years ago

Here's my idea, similarly to what @ph suggested, and I think we could even keep backward compatibility on plugins:

replace the codec instantiation in the config to instead instantiate a new "codec multiplexer" class, which will ultimately expose a "backward compatible" decode method like decode(data, stream_identity = nil)
this new multiplexer decode method would take care of doing the bookkeeping of the stream identities vs actual codec instances.

This would allow to not touch/update all plugins where dealing with stream identities in codecs is not important and simply update the ones like file & lumberjack inputs to pass the values which makes sense as stream_identity to the decode method. for example for file input, the current code:

    @tail.subscribe do |path, line|
      @codec.decode(line) do |event|

would become:

    @tail.subscribe do |path, line|
      @codec.decode(line, path) do |event|

For the codecs nothing would change and we can all assume a single codec instance will always deal with a single consistent stream.

Does that make sense?

ph commented 9 years ago

+1, this would make it backward compatible.

colinsurprenant commented 9 years ago

moving discussion into elastic/logstash#4074

suyograo commented 8 years ago

Fixed in 2.1

Teudimundo commented 8 years ago

@suyograo I don't understand how it was fixed. In my case I have a gelf input (getting logs from docker containers) on which I would like to apply the multiline codec where my ideal stream identity would be the container id. Is it possible to have a proper identity mapping in this case?

ph commented 8 years ago

@Teudimundo The input need to be to be fixed to add support for stream identity, the identity is just more information for disambiguation but each plugin need to implement it.

I see you have created https://github.com/logstash-plugins/logstash-input-gelf/issues/37 so we can continue the discussion there.

logstash-plugins / logstash-codec-multiline

Stream identity support to properly handle multiple sources #10