logstash-plugins / logstash-input-file


OOM when using the json_lines codec #210

Open colinsurprenant opened 6 years ago

colinsurprenant commented 6 years ago

The json_lines codec used in the file input crashes logstash with an OOM.

To reproduce:

bin/logstash -e 'input{file{codec => json_lines path => "/path/to/bigfile.json" start_position => "beginning"}} output {stdout{codec => dots}}'
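
A workaround sketch that should avoid the double boundary detection (assuming each line of bigfile.json is a self-contained JSON document): let the file input do the line splitting and use the plain json codec, which decodes one already-bounded line at a time:

bin/logstash -e 'input{file{codec => json path => "/path/to/bigfile.json" start_position => "beginning"}} output {stdout{codec => dots}}'
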
guyboertje commented 6 years ago

@colinsurprenant

We already know this. It is double boundary detection: once in the file input and again in any *_lines codec, except that the \n boundary never arrives (the file input has already consumed it), so the codec buffers indefinitely. This and multiline are what sparked off the original Milling debate. That debate was derailed by continuous delays via "after graph execution we can remove the codec altogether and everything is a filter" to-ing and fro-ing.

@jsvd

Maybe it is time now to revisit this problem of boundary detection, multiline reassembly and codec mismatching.

colinsurprenant commented 6 years ago

@guyboertje I hear you - but OTOH why wouldn't the file input behave similarly to the stdin input in that respect?

"input line" + delimiter -> stdin input + line codec -> event with "input line" message

Why wouldn't this work with the file input, same codec?
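
For comparison, a minimal sketch of the equivalent stdin pipeline, which works today because the stdin input hands raw chunks, newlines included, to the codec:

cat /path/to/bigfile.json | bin/logstash -e 'input{stdin{codec => json_lines}} output {stdout{codec => dots}}'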

guyboertje commented 6 years ago

I don't think the stdin input consumes the newline.

colinsurprenant commented 6 years ago

No it doesn't; it's the codec's job to do so, hence the line codec default.

In #211 you mentioned "Add support to bypass the delimiter boundary detection" - that would probably work for this scenario?

From a conceptual perspective, and given today's architectural reality, I am not sure why the file input should behave differently and not use a line codec for example to deal with delimited lines, or any other codec for decoding for that matter. Isn't the current identity map logic in the file input precisely there to map a codec to each stream identity?

guyboertje commented 6 years ago

"I am not sure why the file input should behave differently and not use a line codec for example to deal with delimited lines"

This is exactly what sparked the Milling "mini pipeline inside an input" conversation.

Philosophically, the line codec is not a codec: it is a boundary detector, and the multiline codec is a boundary detector that assembles lines until some boundary is detected. True codecs receive a logically bounded chunk of data and decode it, and they can be stateful (cef, netflow); a filter can do the decoding too, but not statefully.

Take the graphite codec - it is not a true codec at all, it is a line boundary detector combined with the equivalent of a mutate split and a date filter. I guess it was conceived to remove the need to know to combine line + mutate + date, or line + a graphite filter (if one were ever made).
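
A rough sketch of that equivalence (the port and field names are hypothetical, and dissect stands in for mutate split for brevity):

input {
  tcp {
    port => 2003    # hypothetical Carbon plaintext port
    codec => line   # boundary detection only, no decoding
  }
}

filter {
  # roughly what the graphite codec bundles: split "metric value timestamp" and set the timestamp
  dissect { mapping => { "message" => "%{metric} %{value} %{ts}" } }
  date { match => ["ts", "UNIX"] }
}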

The collectd codec is a true codec but it expects that the input is supplying logical "lines" of data.

The cef codec can optionally do boundary detection if a delimiter is specified. Such conditional operation means that if we are to have a codec advertise whether it will handle boundary detection, that advertisement needs to be dynamic.
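
For illustration, a hypothetical sketch of the two modes of the same codec, switched purely by configuration:

# the codec does its own boundary detection on the TCP byte stream
input { tcp { port => 5000 codec => cef { delimiter => "\n" } } }

# the codec expects the input (one datagram per message) to supply bounded chunks
input { udp { port => 5000 codec => cef } }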

The multiline codec can be changed to accept chunks and not lines.

I suppose the question is to what degree inputs are supposed to supply transactional units of data. Even if those units contain multiple "lines", it should never be the case that some of one unit's data "belongs" to the next unit in the stream - which is what would happen if the file input made no attempt to boundary detect by, say, holding back the data that follows the last delimiter in a chunk and then appending the next chunk to it.

colinsurprenant commented 6 years ago

@guyboertje I get what you are saying and I am +1 on reopening this discussion, but I don't think here is the best place to have it.

With what we have today in terms of input/codec architecture, I still don't see why we couldn't use a line or json_lines codec with the file input, or any other codec that can be used with the stdin, tcp, udp inputs, for example.

andsel commented 1 month ago

To test this, consider the test plan explained here; even without constraining Xmx, the actual default (1GB) makes it crash with a 1GB file.

Change the input section, eliminating the codec part (it's useless for this test); you should obtain something like:

input {
  file {
    path => "/path/to/big_single_line.json"
    sincedb_path => "/tmp/sincedb"
    mode => "read"
    file_completed_action => "log"
    file_completed_log_path => "/tmp/processed.log"
  }
}

output {
  stdout {
    codec => rubydebug
  }
}
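
With this configuration (the conf file path below is hypothetical), the 1GB default heap can be made explicit via the standard LS_JAVA_OPTS environment variable when reproducing the crash:

LS_JAVA_OPTS="-Xmx1g" bin/logstash -f /path/to/test.conf

Note that for repeated runs the sincedb file at /tmp/sincedb must be deleted, otherwise the already-read file is skipped.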