logstash-plugins / logstash-codec-multiline


[WIP] proper delimiter support for line based input #26

Open colinsurprenant opened 8 years ago

colinsurprenant commented 8 years ago

I just realized I forgot to run the other specs, not just multiline_spec.rb ... I will run them and fix what breaks :P

colinsurprenant commented 8 years ago

So, thinking about it, I suggest we refactor multiline and extract the identity map support code required in the multiline codec into its own module or something like that. Currently, as it is, the multiline codec's code & structure is really confusing; it's not clear what does what.

guyboertje commented 8 years ago

Some observations:

colinsurprenant commented 8 years ago

Thanks @guyboertje for the clarifications.

guyboertje commented 8 years ago

I had always intended to move the IMC class into LS core once any bugs identified in its use with the file input were ironed out, because it is easier to release new versions of plugins in quick succession.

Further, I had imagined that the component channel idea would take off and the accept method in the ML codec would be subsumed by the component.

colinsurprenant commented 8 years ago

@guyboertje but my problem is not the IdentityMapCodec per se, it's the specific support code for it in the Multiline codec (the accept and listener stuff).

As for the component idea, that will probably happen at some point, but in the meantime we have to live with the current design :P

guyboertje commented 8 years ago

I am conflicted about this PR. It forces this codec to change its fundamental external behaviour - that of being a line-oriented data consumer. It means that line-oriented inputs like s3, kafka, http, redis and file must now append a delimiter to each piece of line data they send to a codec, but only if it is a bytes-oriented one - line, json_lines, multiline (after this patch).

Inputs supply data in these flavours: text bytes, text lines, serialized protocol representations.

Codecs consume data in these flavours: text bytes, text lines, serialized protocol representations.

Practically, in terms of GH issues and/or support queries about production use of the ML codec, we see a small amount of usage with text-bytes inputs but massive usage with text-lines inputs. This change is positive for the small usage but negative (double buffering and newline-append data mutation) for the massive usage (file input). I struggle to see any urgency for this change.

It is a bug that we cannot detect when a flavour mismatch occurs in a user's config, to either: a) warn the user or prevent LS from starting, or b) correct the mismatch by inserting an "adapter" between the input and the codec, e.g. a bytes-to-lines adapter between the stdin/tcp inputs and the multiline/json codecs.
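
A rough sketch of what such a bytes-to-lines adapter could look like (class name and shape are hypothetical, this is not an existing Logstash class): it buffers raw chunks from a bytes-oriented input and only hands complete lines to the wrapped lines-oriented codec.

class BytesToLinesAdapter
  # hypothetical adapter: sits between a bytes-oriented input (stdin, tcp)
  # and a lines-oriented codec, reassembling lines that span data chunks
  def initialize(codec, delimiter = "\n")
    @codec = codec
    @delimiter = delimiter
    @buffer = ""                        # trailing partial line kept between chunks
  end

  def decode(data, &block)
    @buffer << data
    lines = @buffer.split(@delimiter, -1)
    @buffer = lines.pop || ""           # last element is the unfinished line (or "")
    lines.each { |line| @codec.decode(line, &block) }
  end
end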

I want us to work on the problem of appropriate channel composition. We could never move the codec out of the input without this.

andrewvc commented 8 years ago

@guyboertje I think you raise some very valid concerns. It seems the fundamental issue here is that there's no type information carried from input -> codec. The contracts for other things are clear (filters are event -> [event], outputs are [event] -> nil), but there's just a lot of gray area here.

I like your proposal for automatic flavor detection + adapters.

colinsurprenant commented 8 years ago

@guyboertje

It means that line oriented inputs like s3, kafka, http, redis and file must now append a delimiter to each piece of line data it sends to a codec

I agree we should not change the external default behaviour. I already proposed that we keep the previous/legacy behaviour in that respect and not modify any current input to append a delimiter, but instead make that behaviour optional and keep the legacy behaviour as the default.

forces this codec to change its fundamental external behaviour - that of being a line oriented data consumer

The thing is that this codec, in its current released state, is actually somewhere between line-oriented and text-bytes. It was actually meant as text-bytes, since it was doing a split("\n") on the input and so assuming a blob of line-delimited text.

  # current released implementation: a naive per-chunk split on "\n"
  def decode(text, &block)
    text = @converter.convert(text)
    text.split("\n").each do |line|
      ...
    end
  end

But obviously this is both useless in the context of already line-delimited input and useless for text-bytes input, as it does not properly support lines that span data blocks.

Furthermore, the buffered tokenizer essentially does a .split(@delimiter) on the data, does not incur double buffering, and should not be more costly on line-oriented data.

On the other hand, using the buffered tokenizer in the context of text-bytes will actually provide the correct expected behaviour of supporting cross-block lines.
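
For illustration, the decode path with the buffered tokenizer could look roughly like this (a sketch only; it assumes the buftok-style FileWatch::BufferedTokenizer class that json_lines uses, plus a hypothetical handle_line helper standing in for the existing per-line matching):

def register
  # assumption: the same buftok-style tokenizer used by the json_lines codec
  @buffer = FileWatch::BufferedTokenizer.new(@delimiter)
end

def decode(data, &block)
  # extract returns only complete, delimiter-terminated lines; a trailing
  # partial line stays buffered until the next data block arrives
  @buffer.extract(data).each do |line|
    handle_line(@converter.convert(line), &block)   # hypothetical helper wrapping the current matching logic
  end
end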

I still believe that using the buffered tokenizer while respecting the current behaviour is the right thing to do, since it will make this codec work correctly the way it was intended to, with all types of inputs, and without any performance penalty.

@andrewvc @guyboertje I agree we have a contract confusion between inputs and codecs in that respect, and we should address it in a separate conversation with a proper design discussion around it. Until we sort this out I believe this PR is correct and will provide value.

colinsurprenant commented 8 years ago

I have reverted to the default behaviour of assuming line-based input data (all original tests pass as-is) and added this new config option:

# Assume data received from the input plugin is line based. For some input plugins,
# like stdin or tcp/udp, the data is not line based and this option should be set to false.
config :line_based_input, :validate => :boolean, :default => true

I believe this is an improvement as it makes the plugin more flexible and fixes a bug which made it not work correctly for non line-based input.
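
For example, with that option a bytes-oriented input could be configured like this (illustrative pipeline config assuming the option lands as shown above; the pattern/what values are placeholders):

input {
  tcp {
    port => 5000
    codec => multiline {
      pattern => "^\s"
      what => "previous"
      line_based_input => false   # tcp delivers raw byte chunks, not complete lines
    }
  }
}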

I would love some opinion on the identity map support specific code which intertwines with the actual plugin code and makes it less comprehensible. Since this code is used only when the codec runs inside the identity map codec, I suggest we move it into a module and include that module. That way we will avoid mixing the "real" codec public interface with the identity map support interface.
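
As a rough sketch of that suggestion (the module name and method bodies here are hypothetical, not actual code from the plugin):

# hypothetical module holding only the IdentityMapCodec support plumbing
module LogStash::Codecs::IdentityMapSupport
  # accept/listener wiring used only when the codec runs inside IdentityMapCodec
  def accept(listener)
    @last_seen_listener = listener      # illustrative: remember the listener for auto_flush
    decode(listener.data) { |event| listener.process_event(event) }
  end
end

class LogStash::Codecs::Multiline < LogStash::Codecs::Base
  include LogStash::Codecs::IdentityMapSupport

  # only the usual codec interface lives here: register, decode, encode, flush
end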

colinsurprenant commented 8 years ago

There is also clearly a bug with auto_flush, which assumes it is running within the identity map context with the listener initialized. I am surprised we did not catch this in reviews - or was it assumed that it would always run within the identity mapper?

guyboertje commented 8 years ago

@colinsurprenant

I would love some opinion on the identity map support specific code which intertwines with the actual plugin code and makes it less comprehensible. Since this code is used only when the codec runs inside the identity map codec

  • Please understand that the implementation of identity mapping, and later auto-flush, was done MVP style so we could iterate quickly by releasing new versions of this library when problems were found, e.g. the 20000 map limit.
  • You are correct, the IMC class and tests do not need to be in this repo. Maybe some common support classes for plugins can have their own library.

There is also clearly a bug with the auto_flush which assumes running within the identity map context and the listener being initialized

I already established in a previous comment that this bug existed. But it has nothing to do with an identity map context; the IMC is transparent. It has everything to do with inputs needing to be modified to provide the listener mechanism. The listener implementation transfers the references to objects defined in the input (previously held in the decode callback block's binding) to a listener, and establishes a callback API on the listener so that callbacks happen via method calls rather than block calls.
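
A simplified illustration of the two callback styles (not the actual input code; LineListener and its fields are made up for this example):

# Block style: the input's objects live in the decode block's binding.
codec.decode(chunk) do |event|
  event.set("path", path)   # path and queue come from the input's local scope
  queue << event
end

# Listener style: those references travel inside an explicit listener object,
# and the codec calls methods on it instead of yielding to a block.
class LineListener
  attr_reader :data

  def initialize(data, queue, path)
    @data, @queue, @path = data, queue, path
  end

  def process_event(event)
    event.set("path", @path)
    @queue << event
  end
end

codec.accept(LineListener.new(chunk, queue, path))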

colinsurprenant commented 8 years ago

@guyboertje

You are correct, the IMC class and test do not need to be in this repo. Maybe, some common support classes for plugins can have their own library.

At this stage my concern is not specifically about removing this code from this plugin repo (I agree it should eventually be removed) but about the IMC support code in the MLC code, like the accept method and the listener stuff. I am suggesting that this code should be factored out of the MLC code and put into a module, so that the MLC code looks like a proper codec with the usual interface methods implemented. Having this IMC support code in the MLC code makes it very confusing to understand the logic of what is going on.

There is also clearly a bug with the auto_flush which assumes running within the identity map context and the listener being initialized

I already established in a previous comment that this bug existed.

I am sorry about the confusion. I am talking about the auto_flush code itself; what I understood is that the bug you pointed to was specifically about calling auto_flush from the close method. What I am saying is that auto_flush itself depends on the listener stuff, which is only relevant in the context of the IMC, but auto_flush will run even outside the context of the IMC.

the IMC is transparent.

I am not sure what you mean by "transparent"? It is not transparent in the MLC code: the accept method is required to support the IMC, plus the listener support code, no?

jsvd commented 8 years ago

@colinsurprenant have you considered implementing this with logstash-codec-line instead of using buftok directly? This is already done in other plugins.

colinsurprenant commented 8 years ago

@jsvd actually I think it is a bad idea, see https://github.com/logstash-plugins/logstash-codec-json_lines/issues/17

colinsurprenant commented 8 years ago

Putting this PR on hold for @guyboertje's upcoming refactor PR of the auto-flushing and timer tasks.

jsvd commented 8 years ago

is this still blocked?

guyboertje commented 8 years ago

I would think that Event Mills, coming soon, will obsolete this.

cwurm commented 7 years ago

@guyboertje are Event Mills still coming?

If not, we could pick this back up.

jordansissel commented 7 years ago

@cwurm eventually mills may arrive.

Ultimately, this problem is due to a mismatch between how the file input works (what the multiline codec was intended to target) and how things like tcp input work. The file input already emits lines.

Fixing this before mills arrive (if/when they arrive) is possible.

cwurm commented 7 years ago

@jordansissel I think we should, there have been issues and PRs dating back to 2015: https://github.com/logstash-plugins/logstash-codec-multiline/pull/38, https://github.com/logstash-plugins/logstash-codec-multiline/issues/37, https://github.com/logstash-plugins/logstash-codec-multiline/issues/14, https://github.com/logstash-plugins/logstash-codec-multiline/pull/6

@colinsurprenant Would you be interested in picking this back up? If not, I'd see if I can find the time for it.

colinsurprenant commented 5 years ago

@cwurm just did in #63