logstash-plugins / logstash-codec-multiline


[WIP] proper delimiter support for line based input #26

Open colinsurprenant opened 8 years ago

colinsurprenant commented 8 years ago

I just realized I forgot to run the other specs, not just multiline_spec.rb ... I will run them and fix what breaks :P

colinsurprenant commented 8 years ago

So, thinking about it, I suggest we refactor multiline and extract the identity map support code required in the multiline codec into its own module or something like that. Currently, as it is, the multiline codec's code & structure is really confusing; it's not clear what does what.

guyboertje commented 8 years ago

Some observations:

colinsurprenant commented 8 years ago

Thanks @guyboertje for the clarifications.

guyboertje commented 8 years ago

I had always intended to move the IMC class into LS core once any bugs identified in its use with the file input were ironed out, because it is easier to release new versions of plugins in quick succession.

Further, I had imagined that the component channel idea would take off and the accept method in the ML codec would be subsumed by the component.

colinsurprenant commented 8 years ago

@guyboertje but my problem is not the IdentityMapCodec per se, it's the specific support code for it in the Multiline codec (the accept and listener stuff).

As for the component idea, that will probably happen at some point, but in the meantime we have to live with the current design :P

guyboertje commented 8 years ago

I am conflicted about this PR. It forces this codec to change its fundamental external behaviour - that of being a line-oriented data consumer. It means that line-oriented inputs like s3, kafka, http, redis and file must now append a delimiter to each piece of line data they send to a codec, but only if it is a bytes-oriented one - line, json_lines, multiline (after this patch).

Inputs supply data in these flavours: text bytes, text lines, serialized protocol representations.

Codecs consume data in these flavours: text bytes, text lines, serialized protocol representations.

Practically, in terms of GH issues and/or support queries about production use of the ML codec, we see a small amount of usage with text-bytes inputs but massive usage with text-lines inputs. This change is positive for the small usage but negative (double buffering and newline-append data mutation) for the massive usage (file input). I struggle to see any urgency for this change.

It is a bug that we cannot detect when a flavour mismatch occurs in a user's config, to either: a) warn the user or prevent LS from starting, or b) correct the mismatch by inserting an "adapter" between the input and the codec, e.g. a bytes-to-lines adapter between the stdin/tcp inputs and the multiline/json codecs.
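
A rough sketch of what such a bytes-to-lines adapter could look like (class name and shape are hypothetical, this is not an existing Logstash class): it buffers raw chunks from a bytes-oriented input and only hands complete lines to the wrapped lines-oriented codec.

class BytesToLinesAdapter
  # hypothetical adapter: sits between a bytes-oriented input (stdin, tcp)
  # and a lines-oriented codec, reassembling lines that span data chunks
  def initialize(codec, delimiter = "\n")
    @codec = codec
    @delimiter = delimiter
    @buffer = ""                        # trailing partial line kept between chunks
  end

  def decode(data, &block)
    @buffer << data
    lines = @buffer.split(@delimiter, -1)
    @buffer = lines.pop || ""           # last element is the unfinished line (or "")
    lines.each { |line| @codec.decode(line, &block) }
  end
end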

I want us to work on the problem of appropriate channel composition. We could never move the codec out of the input without this.

andrewvc commented 8 years ago

@guyboertje I think you raise some very valid concerns. It seems the fundamental issue here is that there's no type information carried from input -> codec. The contracts for other things are clear (filters are event -> [event], outputs are [event] -> nil), but there's just a lot of gray area here.

I like your proposal for automatic flavor detection + adapters.

colinsurprenant commented 8 years ago

@guyboertje

It means that line oriented inputs like s3, kafka, http, redis and file must now append a delimiter to each piece of line data it sends to a codec

I agree we should not change the external default behaviour. I already proposed that we keep the previous/legacy behaviour in that respect and not modify any current input to append a delimiter, but instead make that behaviour optional and keep the legacy behaviour as the default.

forces this codec to change its fundamental external behaviour - that of being a line oriented data consumer

The thing is that this codec, in its current released state, is actually somewhere between line-oriented and text-bytes. It was actually meant as text-bytes, since it was doing a split("\n") on the input and so assuming a blob of line-delimited text.

  # current released implementation: a naive per-chunk split on "\n"
  def decode(text, &block)
    text = @converter.convert(text)
    text.split("\n").each do |line|
      ...
    end
  end

But obviously this is both useless in the context of already line-delimited input and useless for text-bytes input, as it does not properly support lines that span data blocks.

Furthermore, the buffered tokenizer essentially does a .split(@delimiter) on the data, does not incur double buffering, and should not be more costly on line-oriented data.

On the other hand, using the buffered tokenizer in the context of text-bytes will actually provide the correct expected behaviour of supporting cross-block lines.
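
For illustration, the decode path with the buffered tokenizer could look roughly like this (a sketch only; it assumes the buftok-style FileWatch::BufferedTokenizer class that json_lines uses, plus a hypothetical handle_line helper standing in for the existing per-line matching):

def register
  # assumption: the same buftok-style tokenizer used by the json_lines codec
  @buffer = FileWatch::BufferedTokenizer.new(@delimiter)
end

def decode(data, &block)
  # extract returns only complete, delimiter-terminated lines; a trailing
  # partial line stays buffered until the next data block arrives
  @buffer.extract(data).each do |line|
    handle_line(@converter.convert(line), &block)   # hypothetical helper wrapping the current matching logic
  end
end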

I still believe that using the buffered tokenizer while respecting the current behaviour is the right thing to do, since it will make this codec work correctly the way it was intended to, with all types of inputs, and without any performance penalty.

@andrewvc @guyboertje I agree we have a contract confusion between inputs and codecs in that respect, and we should address it in a separate conversation with a proper design discussion around it. Until we sort this out I believe this PR is correct and will provide value.

colinsurprenant commented 8 years ago

I have reverted to the default behaviour of assuming line-based input data (all original tests pass as-is) and added this new config option:

# Assume data received from the input plugin is line based. For some input plugins,
# like stdin or tcp/udp, the data is not line based and this option should be set to false.
config :line_based_input, :validate => :boolean, :default => true

I believe this is an improvement as it makes the plugin more flexible and fixes a bug which made it not work correctly for non line-based input.
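
For example, with that option a bytes-oriented input could be configured like this (illustrative pipeline config assuming the option lands as shown above; the pattern/what values are placeholders):

input {
  tcp {
    port => 5000
    codec => multiline {
      pattern => "^\s"
      what => "previous"
      line_based_input => false   # tcp delivers raw byte chunks, not complete lines
    }
  }
}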

I would love some opinion on the identity map support specific code which intertwines with the actual plugin code and makes it less comprehensible. Since this code is used only when the codec runs inside the identity map codec, I suggest we move it into a module and include that module. That way we will avoid mixing the "real" codec public interface with the identity map support interface.
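
As a rough sketch of that suggestion (the module name and method bodies here are hypothetical, not actual code from the plugin):

# hypothetical module holding only the IdentityMapCodec support plumbing
module LogStash::Codecs::IdentityMapSupport
  # accept/listener wiring used only when the codec runs inside IdentityMapCodec
  def accept(listener)
    @last_seen_listener = listener      # illustrative: remember the listener for auto_flush
    decode(listener.data) { |event| listener.process_event(event) }
  end
end

class LogStash::Codecs::Multiline < LogStash::Codecs::Base
  include LogStash::Codecs::IdentityMapSupport

  # only the usual codec interface lives here: register, decode, encode, flush
end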

colinsurprenant commented 8 years ago

There is also clearly a bug with auto_flush, which assumes it is running within the identity map context with the listener initialized. I am surprised we did not catch this in reviews - or was it assumed that it would always run within the identity mapper?

guyboertje commented 8 years ago

@colinsurprenant

I would love some opinion on the identity map support specific code which intertwines with the actual plugin code and makes it less comprehensible. Since this code is used only when the codec runs inside the identity map codec

  • Please understand that the implementation of identity mapping, and later auto-flush, was done MVP style so we could iterate quickly by releasing new versions of this library when problems were found, e.g. the 20000 map limit.
  • You are correct, the IMC class and tests do not need to be in this repo. Maybe some common support classes for plugins can have their own library.

There is also clearly a bug with the auto_flush which assumes running within the identity map context and the listener being initialized

I already established in a previous comment that this bug existed. But it has nothing to do with an identity map context; the IMC is transparent. It has everything to do with inputs needing to be modified to provide the listener mechanism. The listener implementation transfers the references to objects defined in the input (previously held in the decode callback block's binding) to a listener, and establishes a callback API on the listener so that callbacks happen via method calls rather than block calls.
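
A simplified illustration of the two callback styles (not the actual input code; LineListener and its fields are made up for this example):

# Block style: the input's objects live in the decode block's binding.
codec.decode(chunk) do |event|
  event.set("path", path)   # path and queue come from the input's local scope
  queue << event
end

# Listener style: those references travel inside an explicit listener object,
# and the codec calls methods on it instead of yielding to a block.
class LineListener
  attr_reader :data

  def initialize(data, queue, path)
    @data, @queue, @path = data, queue, path
  end

  def process_event(event)
    event.set("path", @path)
    @queue << event
  end
end

codec.accept(LineListener.new(chunk, queue, path))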

colinsurprenant commented 8 years ago

@guyboertje

You are correct, the IMC class and test do not need to be in this repo. Maybe, some common support classes for plugins can have their own library.

At this stage my concern is not specifically about removing this code from this plugin repo (I agree it should eventually be removed) but about the IMC support code in the MLC code, like the accept method and the listener stuff. I am suggesting that this code should be factored out of the MLC code and put into a module, so that the MLC code looks like a proper codec with the usual interface methods implemented. Having this IMC support code in the MLC code makes it very confusing to understand the logic of what is going on.

There is also clearly a bug with the auto_flush which assumes running within the identity map context and the listener being initialized

I already established in a previous comment that this bug existed.

I am sorry about the confusion. I am talking about the auto_flush code itself; what I understood is that the bug you pointed to was specifically about calling auto_flush from the close method. What I am saying is that auto_flush itself depends on the listener stuff, which is only relevant in the context of the IMC, but auto_flush will run even outside the context of the IMC.

the IMC is transparent.

I am not sure what you mean by "transparent"? It is not transparent in the MLC code: the accept method is required to support the IMC, plus the listener support code, no?

jsvd commented 8 years ago

@colinsurprenant have you considered implementing this with logstash-codec-line instead of using buftok directly? This is already done in other plugins.

colinsurprenant commented 8 years ago

@jsvd actually I think it is a bad idea, see https://github.com/logstash-plugins/logstash-codec-json_lines/issues/17

colinsurprenant commented 8 years ago

Putting this PR on hold for @guyboertje's upcoming refactor PR of the auto-flushing and timer tasks.

jsvd commented 8 years ago

is this still blocked?

guyboertje commented 8 years ago

I would think that Event Mills, coming soon, will obsolete this.

cwurm commented 7 years ago

@guyboertje are Event Mills still coming?

If not, we could pick this back up.

jordansissel commented 7 years ago

@cwurm eventually mills may arrive.

Ultimately, this problem is due to a mismatch between how the file input works (what the multiline codec was intended to target) and how things like tcp input work. The file input already emits lines.

Fixing this before mills arrive (if/when they arrive) is possible.

cwurm commented 7 years ago

@jordansissel I think we should, there have been issues and PRs dating back to 2015: https://github.com/logstash-plugins/logstash-codec-multiline/pull/38, https://github.com/logstash-plugins/logstash-codec-multiline/issues/37, https://github.com/logstash-plugins/logstash-codec-multiline/issues/14, https://github.com/logstash-plugins/logstash-codec-multiline/pull/6

@colinsurprenant Would you be interested in picking this back up? If not, I'd see if I can find the time for it.

colinsurprenant commented 5 years ago

@cwurm just did in #63