multiple multiline messages testcases don't work with --sockets flag

dawi commented 7 years ago

I have a problem testing multiline messages with Logstash Filter Verifier, and I am not sure whether it is a bug or intended behaviour. Either way, a section in the readme about testing multiline messages would help a lot.

I am using the "json" codec to test multiline messages.

The issue is that if you use the --sockets flag to speed up the tests, you cannot have more than one multiline test case per test file.

In this case you currently have two options:

Don't use the --sockets flag (which will result in slow tests)
Put each multiline test case in a separate file.

Is there a reason why it is not possible to have multiple multiline test cases in one file when the --sockets flag is used?

magnusbaeck commented 7 years ago

Could you supply an example testcase file that exhibits the problem?

dawi commented 7 years ago

Yes of course, I will create one.

dawi commented 7 years ago

The attached testcases.zip contains one pipeline configuration and two test directories.

Directory tests1 contains one test file with two test cases. Directory tests2 contains the same two test cases but in two separate files.

testcases.zip

tests1 runs successfully without --sockets, but fails with --sockets. tests2 always runs successfully.

magnusbaeck commented 7 years ago

Thanks, I'll have a look as soon as I can.

dawi commented 7 years ago

Many thanks for your efforts. :)

breml commented 7 years ago

@dawi Could you please try again with the json_lines codec instead of json? If it is still not working, please provide the error messages (set --loglevel to DEBUG and add --logstash-output).

I tried to quickly run your tests, but I failed because you are using a quite new feature of the grok filter (pattern_definitions) and I don't have such a recent version of Logstash ready to run the tests.

dawi commented 7 years ago

Ok, it works with json_lines.

At the beginning I wanted to use json_lines, but maybe I typed json_line instead of json_lines (which obviously cannot work) and came to the conclusion that I had to use the json codec to be able to test multiline messages.

Anyway, the problem still exists with the json codec.
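
For readers wondering how a multiline message fits into a line-delimited codec at all, here is a minimal Python sketch (not LFV or Logstash code). It only illustrates that JSON serialization escapes embedded newlines, so each event occupies exactly one physical line. The field name `message` and the sample texts are made up for illustration.

```python
import json

# Hypothetical multiline log events, e.g. a stack trace.
events = [
    {"message": "Exception in thread main\n  at Foo.bar(Foo.java:1)"},
    {"message": "second event\nwith two lines"},
]

# json.dumps escapes each embedded newline as the two characters "\" and "n",
# so every serialized event stays on a single physical line.
lines = [json.dumps(event) for event in events]
payload = "\n".join(lines) + "\n"

# Splitting on real newlines therefore recovers exactly one line per event.
decoded = [json.loads(line) for line in payload.splitlines()]
```

Because the only raw newlines in `payload` are the ones between events, a newline-delimited codec can find the event boundaries without understanding the content of the messages.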

breml commented 7 years ago

@dawi True, but this cannot be resolved because of the way the logstash-input-unix plugin works. The difference between logstash-input-stdin and logstash-input-unix is this: in https://github.com/logstash-plugins/logstash-input-stdin/blob/master/lib/logstash/inputs/stdin.rb#L37, the stdin plugin reads the input line by line (regardless of the codec in use), whereas in https://github.com/logstash-plugins/logstash-input-unix/blob/master/lib/logstash/inputs/unix.rb#L88 the unix input reads the available data in chunks of up to 16384 bytes and leaves the identification of events within those chunks entirely to the codec. The json codec does not delimit events on a line-by-line basis. The stdin input compensates for this, as described above, but the unix input does not.

I suggest closing this issue, as it works fine with the json_lines codec.
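
The difference can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the plugins' actual code: `read_line_by_line` and `read_as_chunk` are hypothetical helpers, and Python's `json.loads` stands in for a codec that expects exactly one JSON document per call. The real json codec's failure mode may differ.

```python
import json

# Two JSON events, one per line, as a test harness might send them.
stream = b'{"message": "first\\nevent"}\n{"message": "second\\nevent"}\n'


def read_line_by_line(data: bytes) -> list:
    """Mimic an input that hands the codec one line at a time."""
    return [json.loads(line) for line in data.splitlines() if line]


def read_as_chunk(data: bytes, chunk_size: int = 16384) -> list:
    """Mimic an input that hands the codec whole chunks of available data."""
    events = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        try:
            events.append(json.loads(chunk))
        except json.JSONDecodeError as err:
            # Two documents in one chunk are not a single valid JSON document.
            events.append({"parse_error": str(err)})
    return events


line_events = read_line_by_line(stream)
chunk_events = read_as_chunk(stream)
```

With line-by-line reading, both events decode. With chunked reading, the single chunk contains two documents, so a decoder that expects one document per call cannot parse it.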

dawi commented 7 years ago

Ok, I agree, but it would be good if the readme were more explicit about this. I am wondering if there is any reason to use json instead of json_lines at all with logstash-filter-verifier. If not, then maybe the use of this codec should be forbidden in logstash-filter-verifier, or a warning could be printed.

breml commented 7 years ago

@dawi Currently the readme states that the codec should normally be one of line or json_lines (https://github.com/magnusbaeck/logstash-filter-verifier/blame/master/README.md#L202). Additionally, there is a hint for use with --sockets that in this case it is especially important to use either line or json_lines (https://github.com/magnusbaeck/logstash-filter-verifier/blame/master/README.md#L251). Also, LFV defaults to the line codec, which works in both cases (with or without --sockets).

What else do you have in mind? If you want the readme to be more explicit about this issue, maybe you could create a PR.

dawi commented 7 years ago

@breml Yes, I will think about it. But I find it difficult to decide what makes sense and what does not, since I have only been using Logstash for two weeks now. I am currently wondering if it makes sense to use LFV with any codec other than line or json_lines. And if not, why not forbid the use of codecs that are known to cause errors in some cases?

magnusbaeck commented 7 years ago

Issuing a moratorium on other codecs is probably a mistake since someone's bound to figure out clever ways to make use of other codecs (possibly custom ones that we don't even know exist). However, warning users that the codec they've configured most likely isn't the best choice would be totally doable. What do you think?

breml commented 7 years ago

TL;DR: I think it is safe to raise a warning if a user uses a codec other than logstash-codec-line or logstash-codec-json_lines together with --sockets.

In my opinion, the main issue with logstash-input-unix (as well as logstash-input-tcp) is that it is not an application-level protocol with a definition of a message (a message being a log event in this case), but rather a transport protocol that carries a stream of data. It is the responsibility of the application-layer protocol to define when a message ends and the next one starts. So we actually use the codecs logstash-codec-line and logstash-codec-json_lines to split our data stream into messages (our "protocol", from LFV's point of view, is that each message is separated by a newline). In this regard, logstash-input-stdin acts much like an application-layer protocol, because every line of input is automatically considered a message.
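
As a rough illustration of such newline framing on top of a byte stream, here is a small Python sketch. `LineFramer` is a hypothetical name and not part of Logstash or LFV. It buffers arbitrary chunks and only emits a message once its terminating newline has arrived.

```python
class LineFramer:
    """Accumulates arbitrary chunks and emits complete newline-terminated messages."""

    def __init__(self) -> None:
        self._buffer = b""

    def feed(self, chunk: bytes) -> list:
        self._buffer += chunk
        # Everything before the last newline is complete; the rest stays buffered.
        *complete, self._buffer = self._buffer.split(b"\n")
        return complete


framer = LineFramer()
messages = []
# Chunk boundaries deliberately fall in the middle of messages.
for chunk in (b"first mess", b"age\nsecond message\nthi", b"rd message\n"):
    messages.extend(framer.feed(chunk))
```

The transport is free to cut the stream anywhere. Only the framing layer knows where one message ends and the next begins, which is the role the line-oriented codecs play here.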

This means that all codecs that assume they receive messages already properly separated (e.g. logstash-codec-csv, logstash-codec-compress_spooler) will not work in our current setup.

There is another problem: LFV does not allow the codec plugin to be configured, which means that our "application layer protocol" (each message on a line) must be supported by the codec by default. For example, logstash-codec-cef would allow a delimiter to be configured (which could be \n), but by default none is set, which means that this codec does not work with LFV either. So in the end, I think there are only a few codecs that could possibly work with LFV at the moment:

logstash-codec-gzip_lines
logstash-codec-es_bulk
logstash-codec-graphite
logstash-codec-edn_lines

So, I do not expect the majority of the codecs to currently work with LFV.
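
The warning proposed in the TL;DR could look roughly like the following Python sketch. It is purely illustrative: LFV itself is not written in Python, and the names `LINE_DELIMITED_CODECS` and `codec_warning` as well as the message wording are assumptions, not taken from the project.

```python
from typing import Optional

# Codecs named in this thread as splitting the stream on newlines by default.
LINE_DELIMITED_CODECS = {"line", "json_lines"}


def codec_warning(codec: str, use_sockets: bool) -> Optional[str]:
    """Return a warning for a risky codec/--sockets combination, else None."""
    if use_sockets and codec not in LINE_DELIMITED_CODECS:
        return (
            f"codec {codec!r} may not delimit events by newline; "
            "with --sockets, consider 'line' or 'json_lines' instead"
        )
    return None


warning = codec_warning("json", use_sockets=True)
```

Returning a message instead of raising an error keeps other codecs usable, which matches the preference above for a warning over an outright ban.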

magnusbaeck commented 7 years ago

Thanks for the analysis @breml! I've pushed a commit that adds a warning when select codecs are used.

magnusbaeck / logstash-filter-verifier

multiple multiline messages testcases don't work with --sockets flag #39