PhaedrusTheGreek opened 6 years ago
I'm confused. This is not a buffer. We do a 16kb read and pass it to the codec. This is how tcp input works also. Can we add a way to reproduce this?
Yep, I will pass you a reproduction offline
I tried to reproduce it, but I don't know enough about the described behavior to reproduce what you are talking about.
Here's my attempt:
Write a single line, larger than 16kb, to stdin. Have Logstash write this line to a file.
% rm /tmp/x; ruby -e 'puts "x" * (16384+10) + "\n"' | bin/logstash -e 'input { stdin { codec => line } } output { file { path => "/tmp/x" codec => line { format => "%{message}" } } }'
Check the file:
% wc -l /tmp/x
1 /tmp/x
% wc -c /tmp/x
16395 /tmp/x
1 line, 16395 bytes.
% ruby -e 'puts "x" * (16384+10) + "\n"' | wc -c
16395
Identical to the input, and longer than 16kb.
This is not a buffer.
It is a buffer: https://github.com/colinsurprenant/jruby-stdin-channel/blob/master/src/main/java/com/jrubystdinchannel/StdinChannelLibrary.java#L93
@jakelandis let me rephrase. We don't buffer two reads together. We do 1 read, then ship that data into the codec immediately. The contents of the buffer are (should be?) lost after that because the codec has done its work.
By "buffering", in this context, I mean this strategy:
data = read(16384)
data << read(16384)
data << read(16384)
^^ accumulating multiple reads into data like this is buffering.
In stdin, we do this:
1) data = read() 2) codec.decode(data) 3) goto 1
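The loop above can be sketched in plain Ruby (my simplification, not the actual stdin plugin code):

```ruby
require 'stringio'

CHUNK_SIZE = 16_384  # the 16kb read size discussed above

# Read-then-decode loop: each chunk goes straight to the codec;
# no two reads are ever concatenated before decoding.
def run(io, codec)
  loop do
    data = io.readpartial(CHUNK_SIZE)  # 1) one read, up to 16kb
    codec.decode(data)                 # 2) ship it to the codec immediately
  end                                  # 3) goto 1
rescue EOFError
  # input closed; nothing is buffered, so nothing to flush
end
```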
This is not a bug in stdin. It's not stated in this issue, but I believe the context is the use of the multiline codec with stdin.
And yes, stdin+multiline is broken. It's a bug caused by the fact that multiline was primarily written for the file input, and the file input itself is already line-oriented. stdin is not.
examples of this working with stdin + line codec:
# Write 10 large lines (>16kb per line)
% rm /tmp/x; ruby -e '10.times { puts "x" * (16384+10) + "\n" }' | bin/logstash -e 'input { stdin { codec => line } } output { file { path => "/tmp/x" codec => line { format => "%{message}" } } }'
# check the result
% wc -l /tmp/x
10 /tmp/x
Now let's test with multiline codec:
% rm /tmp/x; ruby -e '10.times { puts "x" * (16384+10) + "\n" }' | bin/logstash -e 'input { stdin { codec => multiline { pattern => "^\\s+" what => previous } } } output { file { path => "/tmp/x" codec => line { format => "%{message}" } } }'
...
% wc -l /tmp/x
19 /tmp/x
^^ 19 lines? There are clearly only 10 lines provided on input, and none match the multiline pattern /^\s+/, so there should be 10 lines output.
Strange!
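To see where the extra "lines" come from, here is a rough simulation (my sketch, not the codec itself) of what a naive chunk.split("\n") sees when the same 10-line stream arrives in 16kb chunks:

```ruby
# Ten lines of 16394 bytes each, newline-terminated -- same shape as the input above.
input = ("x" * (16_384 + 10) + "\n") * 10

# Slice into 16kb chunks, as the input does, then split each chunk on "\n"
# independently, as the multiline codec does.
chunks = input.scan(/.{1,16384}/m)
fragments = chunks.sum { |c| c.split("\n").size }

puts input.count("\n")  # 10 real lines
puts fragments          # 20 fragments: most lines are cut in two by a chunk boundary
```

Each chunk boundary that falls mid-line turns one real line into two fragments, so the codec sees roughly twice as many "lines" as actually exist; the exact observed count (19 here, 34 for tcp below) depends on where the boundaries land and on the codec's flush behavior.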
The tcp input works the same way (read a chunk of bytes, hand it to the codec):
% rm /tmp/x;bin/logstash -e 'input { tcp { port => 10000 codec => multiline { pattern => "^\\s+" what => previous } } } output { file { path => "/tmp/x" codec => line { format => "%{message}" } } }'
...
% ruby -e '10.times { puts "x" * (16384+10) + "\n" }' | nc localhost 10000
...
% wc -l /tmp/x
34 /tmp/x
But notice: if I just use the line codec on tcp, it works correctly:
% rm /tmp/x;bin/logstash -e 'input { tcp { port => 10000 codec => line } } output { file { path => "/tmp/x" codec => line { format => "%{message}" } } }'
...
% wc -l /tmp/x
10 /tmp/x
This is a bug in the multiline codec, which was designed around the file input; the file input itself has a bug in the way it provides data to Logstash (this is an ancient issue, but we can fix it now if desired).
The bug is in the multiline codec, which assumes each decode() call receives a complete event (or a complete sequence of events) -- https://github.com/logstash-plugins/logstash-codec-multiline/blob/master/lib/logstash/codecs/multiline.rb#L198
The multiline codec causes this bug because it makes some false assumptions (this is a really old bug, from what I remember).
def decode(text, &block)
  text = @converter.convert(text)
  text.split("\n").each do |line|
This causes a bug because a partial read (of a full event) will be sent like this:
If we said "hello world" and only read "hello" so far:
decode("hello")
which calls "hello".split("\n")
which returns ["hello"]
as the "line".
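The failure mode is easy to demonstrate in plain Ruby (my illustration, not codec code):

```ruby
# A partial read: only "hello" of "hello world\n" has arrived so far.
chunk = "hello"

# The multiline codec's decode does the equivalent of:
lines = chunk.split("\n")
puts lines.inspect  # ["hello"] -- a half-line treated as a complete line
```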
The multiline codec needs to buffer until it has at least a full line, but it does not do this.
The line codec does this, for what it's worth.
Historically, the multiline codec was primarily focused on the file input which is why I think it has this behavior; the file input only emits one whole line at a time, so it accidentally dodges this bug in the multiline codec.
Yes, to echo what @jordansissel said: this is not a stdin input problem but a multiline codec problem. This was the interpretation of the original problem which triggered the creation of this issue by @PhaedrusTheGreek:
text.split("\n").each do |line|
on the buffer, the last incomplete line that fell on the 16k boundary has now been converted to a proper string and added to the multiline buffer here, which explains the rogue \n and the resulting invalid JSON. Obviously, changing the buffer size is not the solution here.
Our long-standing milling concept would be the long-term solution to this, I believe. Otherwise, as @jordansissel mentioned, maybe just adding line buffering in the multiline codec using the BufferedTokenizer, just like the line codec does, would work here?
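The line-buffering idea can be sketched as a minimal tokenizer (plain Ruby; the real BufferedTokenizer in Logstash differs in detail, this only shows the principle of holding the trailing partial line back):

```ruby
class SimpleLineTokenizer
  def initialize(delimiter = "\n")
    @delimiter = delimiter
    @buffer = +""   # holds the trailing incomplete line between extract calls
  end

  # Appends data and returns only the complete lines seen so far.
  def extract(data)
    @buffer << data
    lines = @buffer.split(@delimiter, -1)  # -1 keeps a trailing empty piece
    @buffer = lines.pop                    # the incomplete remainder stays buffered
    lines
  end
end

tok = SimpleLineTokenizer.new
p tok.extract("hel")      # []        -- no complete line yet
p tok.extract("lo\nwor")  # ["hello"] -- the chunk boundary no longer splits a line
p tok.extract("ld\n")     # ["world"]
```

Feeding the multiline codec through something like this, instead of raw 16kb chunks, would guarantee decode() only ever sees whole lines.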
Ran into the same issue today. I am making a project that uses the stdin and multiline plugin. A lot of lines are rejected by my grok filter because of the newlines. Any new info about this issue?
FYI, I just ran into this issue too. I added detailed comments in #37.
Until we move forward with proper boundary detection across inputs/codecs, I am proposing this solution to the multiline codec logstash-plugins/logstash-codec-multiline#63
After 16kb input, a line break is inserted into the messages due to a buffer limit:
per @jakelandis:
This problem makes it difficult to use STDIN as a "read files and exit logstash" use case.