Closed andrewvc closed 8 years ago
I think there are a few issues at stake here. Two classic problems are with the lumberjack and file inputs.
also relates to: logstash-plugins/logstash-input-file#44 logstash-plugins/logstash-input-lumberjack#5
@colinsurprenant ++ your codec per stream identity idea
This shouldn't be too hard to implement in the lumberjack input since we already clone the codec per connection; it's just a matter of defining the stream_identity and using that instead.
Right. I think this is basically what it boils down to: having the necessary properties exposed to the codec, for a given input context, to establish a stream identity. This is maybe an API change we should introduce at the codec level? Thoughts?
@colinsurprenant to clarify, you're saying we should just implement stream_identity as the filter does?
@andrewvc well, we can't because the codec does not have the event context since the event is not yet decoded, the stream identity must be established/provided before entering the codec no? for example, the file input can provide the path information.
@colinsurprenant ah, are you saying we should decorate it before it's decoded? Instead of passing the codec the raw line, maybe we could pass it a LogStash::Event with 'message' set and the correct tag?
Or are you talking about parameterizing it some other way.
@andrewvc I don't think this would work; think about non-line-oriented inputs like tcp/udp where we do a sysread(16384).
For example, the message field will not be the sysread result; the codec first needs to perform its decode logic before assigning the message field.
I think we should parametrize the codecs with maybe a stream_id field or something along those lines?
@colinsurprenant works for me, plus it'll be transparent to the user which is nice.
I don't think we need an API change for this. The multiline filter uses a Hash of stream_identity => pending_events. For the codec, we can achieve the same with a Hash of stream_identity => codec_instance.
For lumberjack, we can use the 'file' field that lsf/lumberjack provides for the identity. The input can be responsible for this mapping of stream to codec.
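The stream_identity => codec_instance mapping can be sketched in a few lines (stub codec and names are illustrative only, not the actual Logstash classes):

```ruby
# Stub codec standing in for a real Logstash codec (illustrative only).
class StubCodec
  attr_reader :lines
  def initialize
    @lines = []
  end
  def decode(data)
    @lines << data
  end
end

# One codec instance per stream identity, created lazily on first use.
codecs = Hash.new { |h, identity| h[identity] = StubCodec.new }

codecs["/tmp/a"].decode("hello")
codecs["/tmp/b"].decode(" b hello")
codecs["/tmp/a"].decode(" world")

# Each stream's buffered lines stay separate:
codecs["/tmp/a"].lines  # => ["hello", " world"]
codecs["/tmp/b"].lines  # => [" b hello"]
```

With this, the " b hello" line from /tmp/b can no longer leak into /tmp/a's multiline buffer.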
As a way of gathering state, I have confirmed the following:
Test case:
# Each command run roughly 1 second apart.
% echo "hello" >> /tmp/a
% echo " b hello" >> /tmp/b
% echo " world" >> /tmp/a
% echo "blah" >> /tmp/a
LSF config:
"files": [ { "paths": [ "/tmp/a", "/tmp/b" ] } ]
Logstash run showing the oops:
% bin/logstash -e 'input { lumberjack { codec => multiline { pattern => "^\s+" what => previous } port => 4444 ssl_certificate => "/users/jls/projects/logstash/lumberjack.crt" ssl_key => "/users/jls/projects/logstash/lumberjack.key" } }'
...
"message" => "hello\n b hello\n world",
...
You can see that " b hello", which was written to /tmp/b, was merged into the same stream as /tmp/a.
Oops, I should have seen this when doing work on this plugin :(
Yes for using the "file" field from the LSF.
I don't think the logic of the stream_identity belongs in the codec; I think our codecs should stay simple and take input A and produce B. We can create a simple class that wraps the codec and does the routing.
# Terrible naming ahead, sorry it's late.
multiplexer = CodecMultiplexer.new(@codec)
multiplexer.decode(message["file"], :stream_identity => message["file"]) do |event|
end
Another syntax.
multiplexer = CodecMultiplexer.new(@codec, :stream_identity => :file)
multiplexer.decode(message) do |event|
end
Is there a release that this improvement is targeted for implementation? For example, are you guys going to try to get it into 2.0?
Let me recap:
- specific input plugins have specific domain knowledge of their stream_identity, like the path information of the file input plugin identifies different streams.
- if we agree that the stream_identity demultiplexing does not belong in the codec then separate codec instances must be created per stream.

Could we add a base input plugin option to enable specifying an arbitrary stream_identity, then rework & decouple the codec instantiation code to automatically handle stream demultiplexing using multiple codec instances, as needed, upon having this stream_identity option set in the input plugin?
@jordansissel @ph @andrewvc Thoughts?
specific input plugins have specific domain knowledge of their stream_identity, like the path information of the file input plugin identifies different streams.
Correct
if we agree that the stream_identity demultiplexing does not belong in the codec then separate codec instances must be created per stream.
Correct too, we could proxy the call to multiples codec instances.
But let me add this:
Codecs know whether they care about stream identity or not. What I mean by that is: multiline requires stream_identity to correctly do its job, the json codec doesn't, so we need some kind of @codec.require_stream_identity? at the codec level.
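One way to sketch that predicate (hypothetical names, not actual Logstash code):

```ruby
# Hypothetical predicate: stateless codecs (like json) don't need
# per-stream instances, stateful ones (like multiline) do.
class JsonLikeCodec
  def require_stream_identity?
    false
  end
end

class MultilineLikeCodec
  def require_stream_identity?
    true
  end
end
```

An input could then only pay the per-stream bookkeeping cost when the configured codec actually requires it.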
didn't we just say that codecs don't have to know about stream_identity if we handle demultiplexing outside the codecs and simply instantiate multiple codecs? In this case codecs don't have to care about stream identity; they can just assume they only deal with a single consistent stream.
If we do that, we only need to refactor the way we instantiate codecs for the input plugins and have a way for the input plugins to address the right codec depending on their specific knowledge of the stream_identity.
So what you are suggesting is that even if the codec could operate without the need for stream identity (like the json codec), we would always create one if the input can be affected by it. This will provide some kind of "safety". I think this is fair enough. +1
@colinsurprenant to make sure I understand, we're just adding a default argument to the codec instantiation, and are backing it with a default config setting on the base codec class, yes?
+1 on keeping the "stream awareness" outside of the codec and within the input plugin itself. The TCP input does this today, iirc, and we can do the same for lumberjack and file inputs.
Here's my idea, similarly to what @ph suggested, and I think we could even keep backward compatibility on plugins:
- change the codec decode method signature to decode(data, stream_identity = nil)
- the decode method would take care of doing the bookkeeping of the stream identities vs actual codec instances.

This would allow us to not touch/update all the plugins where dealing with stream identities in codecs is not important, and simply update the ones like the file & lumberjack inputs to pass the value which makes sense as stream_identity to the decode method. For example, for the file input, the current code:
@tail.subscribe do |path, line|
@codec.decode(line) do |event|
would become:
@tail.subscribe do |path, line|
@codec.decode(line, path) do |event|
For the codecs nothing would change and we can all assume a single codec instance will always deal with a single consistent stream.
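A minimal sketch of the bookkeeping that decode(data, stream_identity = nil) could do in the base codec (hypothetical and simplified, not the actual Logstash base class; LineCodec is a made-up subclass for illustration):

```ruby
# Hypothetical base codec: when a stream_identity is given, decode
# routes to a lazily created per-stream instance of the same codec
# class; when it is nil, behavior is unchanged (backward compatible).
class BaseCodec
  def decode(data, stream_identity = nil, &block)
    if stream_identity.nil?
      do_decode(data, &block)
    else
      @streams ||= Hash.new { |h, id| h[id] = self.class.new }
      @streams[stream_identity].do_decode(data, &block)
    end
  end

  # Subclasses implement the actual decoding logic.
  def do_decode(data)
    yield data
  end
end

# Illustration: a codec that buffers what it has seen, per instance.
class LineCodec < BaseCodec
  def do_decode(data, &block)
    (@lines ||= []) << data
    yield data
  end
  def lines; @lines || []; end
end

codec = LineCodec.new
codec.decode("hello",    "/tmp/a") { |e| }
codec.decode(" b hello", "/tmp/b") { |e| }
codec.decode(" world",   "/tmp/a") { |e| }
```

Inputs with no notion of stream identity keep calling decode(data) exactly as before, while the file input passes the path as the second argument.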
Does that make sense?
+1, this would make it backward compatible.
moving discussion into elastic/logstash#4074
Fixed in 2.1
@suyograo I don't understand how it was fixed. In my case I have a gelf input (getting logs from docker containers) on which I would like to apply the multiline codec where my ideal stream identity would be the container id. Is it possible to have a proper identity mapping in this case?
@Teudimundo The input needs to be fixed to add support for stream identity; the identity is just more information for disambiguation, but each plugin needs to implement it.
I see you have created https://github.com/logstash-plugins/logstash-input-gelf/issues/37 so we can continue the discussion there.
For the case where the input comes in pre-tagged (like Lumberjack) the stream identity option from the multiline would be useful here.
Affected input issues: