PhaedrusTheGreek opened 6 years ago
I'm confused. This is not a buffer. We do a 16kb read and pass it to the codec. This is how tcp input works also. Can we add a way to reproduce this?
Yep, I will pass you a reproduction offline
I tried to reproduce it, but I don't know enough about the described behavior to reproduce what you are talking about.
Here's my attempt:
Write a single line, larger than 16kb, to stdin. Have Logstash write this line to a file.
% rm /tmp/x; ruby -e 'puts "x" * (16384+10) + "\n"' | bin/logstash -e 'input { stdin { codec => line } } output { file { path => "/tmp/x" codec => line { format => "%{message}" } } }'
Check the file:
% wc -l /tmp/x
1 /tmp/x
% wc -c /tmp/x
16395 /tmp/x
1 line, 16395 bytes.
% ruby -e 'puts "x" * (16384+10) + "\n"' | wc -c
16395
Identical to the input, and longer than 16kb.
This is not a buffer.
It is a buffer: https://github.com/colinsurprenant/jruby-stdin-channel/blob/master/src/main/java/com/jrubystdinchannel/StdinChannelLibrary.java#L93
@jakelandis let me rephrase. We don't buffer two reads together. We do 1 read, then ship that data into the codec immediately. The contents of the buffer are (should be?) lost after that because the codec has done its work.
By "buffering", in this context, I mean this strategy:
data = read(16384)
data << read(16384)
data << read(16384)
^^ accumulating multiple reads into data like this is buffering.
In stdin, we do this:
1) data = read() 2) codec.decode(data) 3) goto 1
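The loop above can be sketched in plain Ruby (my simplification, not the actual stdin plugin code):

```ruby
require 'stringio'

CHUNK_SIZE = 16_384  # the 16kb read size discussed above

# Read-then-decode loop: each chunk goes straight to the codec;
# no two reads are ever concatenated before decoding.
def run(io, codec)
  loop do
    data = io.readpartial(CHUNK_SIZE)  # 1) one read, up to 16kb
    codec.decode(data)                 # 2) ship it to the codec immediately
  end                                  # 3) goto 1
rescue EOFError
  # input closed; nothing is buffered, so nothing to flush
end
```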
This is not a bug in stdin. It's not stated in this issue, but I believe the context is the use of the multiline codec with stdin.
And yes, stdin+multiline is broken. It's a bug caused by the fact that multiline was primarily written for the file input, and the file input itself is already line-oriented. stdin is not.
examples of this working with stdin + line codec:
# Write 10 large lines (>16kb per line)
% rm /tmp/x; ruby -e '10.times { puts "x" * (16384+10) + "\n" }' | bin/logstash -e 'input { stdin { codec => line } } output { file { path => "/tmp/x" codec => line { format => "%{message}" } } }'
# check the result
% wc -l /tmp/x
10 /tmp/x
Now let's test with multiline codec:
% rm /tmp/x; ruby -e '10.times { puts "x" * (16384+10) + "\n" }' | bin/logstash -e 'input { stdin { codec => multiline { pattern => "^\\s+" what => previous } } } output { file { path => "/tmp/x" codec => line { format => "%{message}" } } }'
...
% wc -l /tmp/x
19 /tmp/x
^^ 19 lines? There are clearly only 10 lines provided on input, and none match the multiline pattern /^\s+/, so there should be 10 lines output.
Strange!
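To see where the extra "lines" come from, here is a rough simulation (my sketch, not the codec itself) of what a naive chunk.split("\n") sees when the same 10-line stream arrives in 16kb chunks:

```ruby
# Ten lines of 16394 bytes each, newline-terminated -- same shape as the input above.
input = ("x" * (16_384 + 10) + "\n") * 10

# Slice into 16kb chunks, as the input does, then split each chunk on "\n"
# independently, as the multiline codec does.
chunks = input.scan(/.{1,16384}/m)
fragments = chunks.sum { |c| c.split("\n").size }

puts input.count("\n")  # 10 real lines
puts fragments          # 20 fragments: most lines are cut in two by a chunk boundary
```

Each chunk boundary that falls mid-line turns one real line into two fragments, so the codec sees roughly twice as many "lines" as actually exist; the exact observed count (19 here, 34 for tcp below) depends on where the boundaries land and on the codec's flush behavior.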
The tcp input works the same way (read a chunk of bytes, hand it to the codec):
% rm /tmp/x;bin/logstash -e 'input { tcp { port => 10000 codec => multiline { pattern => "^\\s+" what => previous } } } output { file { path => "/tmp/x" codec => line { format => "%{message}" } } }'
...
% ruby -e '10.times { puts "x" * (16384+10) + "\n" }' | nc localhost 10000
...
% wc -l /tmp/x
34 /tmp/x
But notice: if I just use the line codec on tcp, it works correctly:
% rm /tmp/x;bin/logstash -e 'input { tcp { port => 10000 codec => line } } output { file { path => "/tmp/x" codec => line { format => "%{message}" } } }'
...
% wc -l /tmp/x
10 /tmp/x
This is a bug in the multiline codec, which was designed around the file input; the file input itself has a bug in the way it provides data to Logstash (this is an ancient issue, but we can fix it now if desired).
The bug is in the multiline codec, which assumes each decode() call receives a complete event (or a complete sequence of events) -- https://github.com/logstash-plugins/logstash-codec-multiline/blob/master/lib/logstash/codecs/multiline.rb#L198
The multiline codec causes this bug because it makes some false assumptions (this is a really old bug, from what I remember).
def decode(text, &block)
  text = @converter.convert(text)
  text.split("\n").each do |line|
This causes a bug because a partial read (of a full event) will be sent like this:
If we said "hello world" and only read "hello" so far:
decode("hello")
which calls "hello".split("\n")
which returns ["hello"]
as the "line".
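The failure mode is easy to demonstrate in plain Ruby (my illustration, not codec code):

```ruby
# A partial read: only "hello" of "hello world\n" has arrived so far.
chunk = "hello"

# The multiline codec's decode does the equivalent of:
lines = chunk.split("\n")
puts lines.inspect  # ["hello"] -- a half-line treated as a complete line
```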
The multiline codec needs to buffer until it has at least a full line, but it does not do this.
The line codec does this, for what it's worth.
Historically, the multiline codec was primarily focused on the file input which is why I think it has this behavior; the file input only emits one whole line at a time, so it accidentally dodges this bug in the multiline codec.
Yes, to echo what @jordansissel said: this is not a stdin input problem but a multiline codec problem. This was the interpretation of the original problem which triggered the creation of this issue by @PhaedrusTheGreek:
text.split("\n").each do |line|
on the buffer, the last incomplete line that fell on the 16k boundary has now been converted to a proper string and added to the multiline buffer here, which explains the rogue \n and the resulting invalid JSON. Obviously, changing the buffer size is not the solution here.
Our long-standing milling concept would be the long-term solution to this, I believe. Otherwise, as @jordansissel mentioned, maybe just adding line buffering in the multiline codec using the BufferedTokenizer, just like the line codec does, would work here?
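The line-buffering idea can be sketched as a minimal tokenizer (plain Ruby; the real BufferedTokenizer in Logstash differs in detail, this only shows the principle of holding the trailing partial line back):

```ruby
class SimpleLineTokenizer
  def initialize(delimiter = "\n")
    @delimiter = delimiter
    @buffer = +""   # holds the trailing incomplete line between extract calls
  end

  # Appends data and returns only the complete lines seen so far.
  def extract(data)
    @buffer << data
    lines = @buffer.split(@delimiter, -1)  # -1 keeps a trailing empty piece
    @buffer = lines.pop                    # the incomplete remainder stays buffered
    lines
  end
end

tok = SimpleLineTokenizer.new
p tok.extract("hel")      # []        -- no complete line yet
p tok.extract("lo\nwor")  # ["hello"] -- the chunk boundary no longer splits a line
p tok.extract("ld\n")     # ["world"]
```

Feeding the multiline codec through something like this, instead of raw 16kb chunks, would guarantee decode() only ever sees whole lines.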
Ran into the same issue today. I am making a project that uses the stdin and multiline plugin. A lot of lines are rejected by my grok filter because of the newlines. Any new info about this issue?
FYI, I just ran into this issue too. I added detailed comments in #37.
Until we move forward with proper boundary detection across inputs/codecs, I am proposing this solution to the multiline codec logstash-plugins/logstash-codec-multiline#63
After 16kb input, a line break is inserted into the messages due to a buffer limit:
per @jakelandis:
This problem makes it difficult to use STDIN as a "read files and exit logstash" use case.