[WIP] use BufferedTokenizer and configurable line delimiter

colinsurprenant commented 5 years ago

NOTE This is a WIP to discuss this proposed strategy of using the BufferedTokenizer and configurable line delimiter to extract lines instead of using a hardcoded splitter on \n.

This is essentially a reboot of #26, it would solve #14, #37, #38, #57, logstash-plugins/logstash-input-stdin#16 and replace #6.

The Problem

For historical reasons and because of the ambiguity between line-oriented vs streaming inputs in our input/codec architecture, the multiline codec in its current state is actually an in-between for handling line-oriented and streaming data. It was actually meant for handling streaming line-delimited data since it was doing a split("\n") on the input thus assuming blobs of line delimited text. But obviously this is both useless in the context of already line-delimited input and useless for text-bytes input as is does not properly support lines across data blocks.

Proposal

To correctly handle streaming input for delimited data, using the BufferedTokenizer and adding a configurable line delimiter will provide a similar functionality to the line codec.

Also adding a streaming_input config option (with a false default for BWC) will preserve current behaviour. Using true would provide support for streaming inputs such as stdin, tcp, udp. I believe this is a pragmatic proposal in todays ambiguous input/codec architecture . My last attempt at solving this was in 2016 and it was suggested we wait on the Milling concept to land. I do not think we need to wait for that to make it work in a practical way in our current imperfect architecture.

Current WIP State

Using streaming_input => false (default) will keep current behaviour.
Using streaming_input => true will make it work with streaming inputs such as stdin, tcp, udp

colinsurprenant commented 5 years ago

@jsvd @guyboertje I would appreciate your input!

guyboertje commented 5 years ago

Initial impressions: looks good. I think we should discuss this POC in EAH. I have no intentions of raising the original Milling concept. Perhaps we can talk about a pluggable boundary detector setting in true codecs.

colinsurprenant commented 5 years ago

@guyboertje my goal here is to offer a solution with what we have today. I am +1 on investigating for a better solution which could be applied to all inputs and codecs but -1 on waiting for it to be fleshed out. I believe this proposal is simple enough & BWC to be considered today. My guess is that whatever we decide for better codecs/inputs boundary detection, it will probably be a 7.0 feature.

colinsurprenant commented 5 years ago

What would be the potential problem in moving forward with this solution?

guyboertje commented 5 years ago

@colinsurprenant None I can see bar a test or two to verify the streaming flag.

sovetov commented 5 years ago

This issue is critical for me now as the single way to collect my text multiline logs is to upload files as raw TCP.

Is the following is correct way to use code from this repo?

Clone this repo (or my fork):
Edit /usr/share/logstash/Gemfile
gem "logstash-codec-multiline"
gem "logstash-codec-multiline", :path => "/home/george/logstash-codec-multiline"
Run bin/logstash-plugin install --no-verify (where is the plugin name?)
Restart logstash.

sovetov commented 5 years ago

And there is another question that may seem unrelated.

If I use json codec and upload it as raw TCP stream, will it work? If so, how do they solved this issue? Just based on syntax of JSON?

colinsurprenant commented 5 years ago

@sovetov you should be able to use the plugin from the repo by editing the Gemfile file in logstash home as you suggested but there is no need to run bin/logstash-plugin ... after, just restart logstash.

colinsurprenant commented 5 years ago

@sovetov I am not sure I understand your second question about the json codec. are you asking if the json codec will support streaming data, no it won't, the json codec expects a complete and valid json object when decoding and will not work will with tcp or udp stream.

On the other hand, with streaming input, you can use the json_lines codec which uses a BufferedTokenizer and has a configurable delimiter.

colinsurprenant commented 5 years ago

bump, any objection in moving this forward @guyboertje @jsvd ?

guyboertje commented 5 years ago

Lets merge it. :shipit:

remiville commented 4 years ago

I faced this issue and to limit it while having multiline for stacktraces, I used mutate to remove new lines only for inputs different than stacktraces. Please merge the fix.

TheVastyDeep commented 4 years ago

There is another use case where a codec does not interact well with line detection by the input. That is UTF-16. The file input will read half a character when it consumes the \n, leaving the rest of the file effectively flipped from UTF-16BE to UTF-16LE.

imnotteixeira commented 4 years ago

Hi, I've been banging my head into a wall trying to understand why my lines were being broken mid-line. I'm really glad I finally found this, as it seems to be the fix. Is there any ETA for merge/release?

colinsurprenant commented 4 years ago

Depending on what we decide in logstash-plugins/logstash-codec-csv#8 I'll followup here.

colinsurprenant commented 4 years ago

Opened https://github.com/elastic/logstash/issues/11885 for the broader discussion

colinsurprenant commented 4 years ago

closing, we can reopen when consensus will be reached on how to solve this.

logstash-plugins / logstash-codec-multiline