Closed colinsurprenant closed 4 years ago
@jsvd @guyboertje I would appreciate your input!
Initial impressions: looks good. I think we should discuss this POC in EAH. I have no intentions of raising the original Milling concept. Perhaps we can talk about a pluggable boundary detector setting in true codecs.
@guyboertje my goal here is to offer a solution with what we have today. I am +1 on investigating for a better solution which could be applied to all inputs and codecs but -1 on waiting for it to be fleshed out. I believe this proposal is simple enough & BWC to be considered today. My guess is that whatever we decide for better codecs/inputs boundary detection, it will probably be a 7.0 feature.
What would be the potential problem in moving forward with this solution?
@colinsurprenant None I can see bar a test or two to verify the streaming flag.
This issue is critical for me now, as the only way to collect my multiline text logs is to upload the files over raw TCP.
Is the following the correct way to use code from this repo?
/usr/share/logstash/Gemfile
bin/logstash-plugin install --no-verify (where is the plugin name?)
logstash.
And there is another question that may seem unrelated.
If I use the json codec and upload it as a raw TCP stream, will it work? If so, how did they solve this issue? Just based on the syntax of JSON?
@sovetov you should be able to use the plugin from the repo by editing the Gemfile file in logstash home as you suggested, but there is no need to run bin/logstash-plugin ... after, just restart logstash.
@sovetov I am not sure I understand your second question about the json codec.
If you are asking whether the json codec will support streaming data: no, it won't. The json codec expects a complete and valid JSON object when decoding and will not work well with a tcp or udp stream.
On the other hand, with streaming input, you can use the json_lines codec, which uses a BufferedTokenizer and has a configurable delimiter.
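The difference described above can be illustrated with a small Ruby sketch. It is not the code of either codec: the chunk contents are made up, and the buffering loop is a simplified stand-in for what a delimiter-based tokenizer does.

```ruby
require "json"

# Two JSON events as they might arrive over TCP. The chunk boundaries are
# arbitrary and do not line up with the event boundaries (made-up data).
chunks = ['{"msg": "fir', "st\"}\n{\"msg\"", ': "second"}' + "\n"]

# Decoding each chunk on its own fails, because no single chunk is a
# complete JSON document.
whole_object_failures = chunks.count do |chunk|
  begin
    JSON.parse(chunk)
    false
  rescue JSON::ParserError
    true
  end
end

# Buffering until the delimiter is seen, then decoding each complete line,
# recovers both events regardless of how the stream was chunked.
delimiter = "\n"
buffer = +""
events = []
chunks.each do |chunk|
  buffer << chunk
  while (index = buffer.index(delimiter))
    line = buffer.slice!(0, index + delimiter.length).chomp(delimiter)
    events << JSON.parse(line)
  end
end
```

Here every chunk fails when parsed alone, while the buffered loop yields both events.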
bump, any objection in moving this forward @guyboertje @jsvd ?
Let's merge it. :shipit:
I faced this issue. To limit it while keeping multiline for stack traces, I used mutate to remove newlines only for inputs other than stack traces. Please merge the fix.
There is another use case where a codec does not interact well with line detection by the input. That is UTF-16. The file input will read half a character when it consumes the \n, leaving the rest of the file effectively flipped from UTF-16BE to UTF-16LE.
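A minimal Ruby sketch of this byte-alignment problem follows. It assumes UTF-16LE purely for illustration, and it models the input as a plain byte split on 0A rather than using the actual file input code.

```ruby
# "ab\ncd" encoded as UTF-16LE: every character is two bytes,
# and the newline is the byte pair 0A 00.
bytes = "ab\ncd".encode("UTF-16LE").b

# A byte-oriented reader that splits on the single byte 0A
# consumes only half of the two-byte newline.
first, rest = bytes.split("\n".b, 2)

# The first line is still correctly aligned.
first_line = first.force_encoding("UTF-16LE").encode("UTF-8")

# The remainder starts with the orphaned 00 byte, so every following
# code unit is shifted by one byte.
leading_byte = rest.getbyte(0)

# Read with the opposite byte order, the shifted bytes happen to decode
# as the expected text, which is why the data looks "flipped".
misaligned = rest.byteslice(0, 4).force_encoding("UTF-16BE").encode("UTF-8")
```

In this sketch the first line decodes normally, but the remainder only makes sense when read with the opposite byte order, and it has an odd number of bytes.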
Hi, I've been banging my head into a wall trying to understand why my lines were being broken mid-line. I'm really glad I finally found this, as it seems to be the fix. Is there any ETA for merge/release?
Depending on what we decide in logstash-plugins/logstash-codec-csv#8, I'll follow up here.
Opened https://github.com/elastic/logstash/issues/11885 for the broader discussion
Closing, we can reopen when consensus is reached on how to solve this.
NOTE This is a WIP to discuss this proposed strategy of using the BufferedTokenizer and a configurable line delimiter to extract lines instead of using a hardcoded splitter on \n.
This is essentially a reboot of #26, it would solve #14, #37, #38, #57, logstash-plugins/logstash-input-stdin#16 and replace #6.
The Problem
For historical reasons, and because of the ambiguity between line-oriented and streaming inputs in our input/codec architecture, the multiline codec in its current state is an in-between for handling line-oriented and streaming data. It was actually meant for handling streaming line-delimited data, since it does a split("\n") on the input and thus assumes blobs of line-delimited text. But this is useless for already line-delimited input, and it is also useless for text-bytes input because it does not properly support lines that span data blocks.
Proposal
To correctly handle streaming input for delimited data, using the BufferedTokenizer and adding a configurable line delimiter will provide functionality similar to the line codec.
Also, adding a streaming_input config option (with a false default for BWC) will preserve current behaviour. Using true would provide support for streaming inputs such as stdin, tcp, udp. I believe this is a pragmatic proposal in today's ambiguous input/codec architecture. My last attempt at solving this was in 2016 and it was suggested we wait on the Milling concept to land. I do not think we need to wait for that to make it work in a practical way in our current imperfect architecture.
Current WIP State
Using streaming_input => false (default) will keep current behaviour.
Using streaming_input => true will make it work with streaming inputs such as stdin, tcp, udp.
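To illustrate the difference between the hardcoded split and the buffered approach discussed in this proposal, here is a minimal Ruby sketch. SketchTokenizer is a hypothetical stand-in written for this example, not the BufferedTokenizer shipped with Logstash, and the data blocks are made up.

```ruby
# Hypothetical stand-in for a buffered tokenizer: it keeps any trailing
# partial line until a later block completes it.
class SketchTokenizer
  def initialize(delimiter = "\n")
    @delimiter = delimiter
    @buffer = +""
  end

  # Appends data and returns only the lines that are now complete.
  def extract(data)
    @buffer << data
    lines = @buffer.split(@delimiter, -1)
    @buffer = lines.pop || +""
    lines
  end

  # Returns whatever is left in the buffer and clears it.
  def flush
    rest = @buffer
    @buffer = +""
    rest
  end
end

# Two data blocks as a streaming input might deliver them: the second
# line is cut in half by the block boundary.
blocks = ["line one\nline t", "wo\nline three\n"]

# Hardcoded per-block split: the line spanning the boundary is broken in two.
naive = blocks.flat_map { |block| block.split("\n") }

# Buffered extraction with a configurable delimiter: lines stay whole.
tokenizer = SketchTokenizer.new("\n")
buffered = blocks.flat_map { |block| tokenizer.extract(block) }
```

With the per-block split, "line two" comes out as two separate fragments, while the buffered version returns three intact lines. The delimiter argument plays the role of the configurable line delimiter, so passing a different string to SketchTokenizer.new splits on that string instead.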