`message_stop_indicator` for file source

anton-ryzhov commented 4 years ago

In addition to message_start_indicator, opposite parameter can be very useful.

For example, docker json log driver splits long lines by 16k. And only very last chunk contains \r\n.

{"log":"First line\r\n","stream":"stdout","time":"2019-12-16T17:25:16.44039104Z"}
{"log":"Second...","stream":"stdout","time":"2019-12-16T17:28:28.448678893Z"}
{"log":"... line\r\n","stream":"stdout","time":"2019-12-16T17:28:28.448700857Z"}
{"log":"Third line\r\n","stream":"stdout","time":"2019-12-16T17:28:28.448700857Z"}

By defining message_start_indicator = '\\n"' I'm getting "Second..." merged to first line rather than to its continuation.

So only difference is how to combine unmatched lines — to previous or to next matched line.

binarylogic commented 4 years ago

Thanks @anton-ryzhov, I was actually just discussing this exact problem on Docker with @MOZGIII. @MOZGIII what do you think about taking this as your next issue?

MOZGIII commented 4 years ago

Right, this is exactly the problem we discussed. I can work on it.

MOZGIII commented 4 years ago

message_start_indicator = '\\n"' is a smart trick, however, the proper fix for this issue should probably be implemented on a different level, to ensure it works with gelf driver too.

anton-ryzhov commented 4 years ago

@binarylogic @MOZGIII good to know.

Docker problem is more complex in the end. So I just started with this more universal feature request of complementary message split. It could be useful not only for docker.

Speaking about docker (seems it should be a separate ticket) — I feel this should be handled on the next level, as transform. But for doing this we need to split messages properly at "source" level. Detecting by '\\n"' looks like dirty hack for me.

Another reason to merge docker logs at transform step — docker logs could be read from journald and docker sources. And same logic should be applied. But we can't merge two events at transform step.

Is it possible to have transformation which will have access to current and previous events [from the same source] and will be able to keep/delete/replace them. Kind of reducing function & sliding window.

MOZGIII commented 4 years ago

So I just started with this more universal feature request of complementary message split.

Sounds good!

I was thinking about solving the particular case with docker where the inner log messages are JSON - in that case, it'd be possible to just use a streaming JSON parser and join across multiple messages until the parser completes or reaches an unexpected token. It may be possible to solve this issue more generically, so having a detailed feature request in the first place would be very valuable.

anton-ryzhov commented 4 years ago

I want to share my approach/workaround. Doesn't work as expected because of this ticket, but maybe it will help you to come up with a better solution.

[sources.docker_logs]
  type = "file"
  include = ["containers/*/*.log"]

  message_start_indicator = '\\n"'
  # Should be:
  # message_stop_indicator = '\\n"'

[transforms.newlines_to_array]
  type = "lua"
  inputs = ["docker_logs"]
  source = """
   event["message"] = "{\\"chunks\\": [" .. string.gsub(event["message"], "\\n", ",") .. "]}"
  """

[transforms.json_parser]
  type = "json_parser"
  inputs = ["newlines_to_array"]

[transforms.merge_chunks]
  type = "lua"
  inputs = ["json_parser"]
  source = """
    event["time"] = event["chunks[0].time"]
    event["stream"] = event["chunks[0].stream"]
    event["log"] = ""

    local i = 0
    while event["chunks[" .. i .. "].time"] do
      event["log"] = event["log"] .. event["chunks[" .. i .. "].log"]
      event["chunks[" .. i .. "].time"] = nil
      event["chunks[" .. i .. "].log"] = nil
      event["chunks[" .. i .. "].stream"] = nil
      i = i + 1
    end
  """

MOZGIII commented 4 years ago

Now, when #1504 landed, this should be possible to achieve what you're looking for here via a merge transform. We'll continue to look into this issue. We might integrate the merge transform into the file source to provide an built-in solution.

anton-ryzhov commented 4 years ago

When there is a universal merge transform — maybe sources shouldn't do that in general (but it's 100% reasonable for docker source).

So maybe idea of that issue is outdated now.

MOZGIII commented 4 years ago

It's an ongoing process to determine what sources should support built-in merging, and how.

With build-in merges, we looking for the following:

better defaults where applicable - docker is an obvious example, same goes for kubernetes; journald is not so trivial though, we can discuss it at #1719 if you're interested;
better performance - in theory, we can get significant performance wins in some cases if we implement merge transform at the source level;
better internal implementation - this is related to both points above, but a bit different; in essence, when you have a source that can produce multiple streams (like docker or journald) in the source internals those streams can be accessed individually with less overhead and reliability compared to using the stream_discriminant_fields; so, in those cases, we can provide an easier to understand configuration interface for the end-users - even if merging is not enabled by default for a particular source.

So, I guess this completes the list of items that we consider when we consider implementing merging at the source. I am under the impression that we definitely can't cover all possible needs by merging at sources, so we need a merge transform.

The current implementation has a design trait that will stash the partial events and won't release them until it can provide a merged one. Configuring this can be error-prone at times and can cause the transform to consume all system memory if used incorrectly. Another issue is that when the partial event is read from the source and gets stashed at the merge transform, the source is not aware that the event is not processed, and will not redeliver it in case vector restarts. This is the main problem we see with the design, and the reason we're currently only shipping the transform as a beta.

We probably should add it to the documentation, so people are aware.

MOZGIII commented 4 years ago

We'll be adding some new parameters to file source, taking inspiration from https://app.scalyr.com/help/parsing-logs#multiline

In particular, in addition to message_start_indicator, we'll add new parameters, based on how Scalyr does it:

continueThrough: all consecutive lines matching this pattern are included in the group. The first line (the line that matched the start pattern) does not need to match the continueThrough pattern. This is useful in cases such as a Java stack trace, where some indicator in the line (such as leading whitespace) indicates that it is an extension of the preceeding line.
continuePast: all consecutive lines matching this pattern, plus one additional line, are included in the group. This is useful in cases where a log message ends with a continuation marker, such as a backslash, indicating that the following line is part of the same message.
haltBefore: all consecutive lines not matching this pattern are included in the group. This is useful where a log line contains a marker indicating that it begins a new message.
haltWith: all consecutive lines, up to and including the first line matching this pattern, are included in the group. This is useful where a log line ends with a termination marker, such as a semicolon.

binarylogic commented 4 years ago

That makes sense to me. While we're adding these options, would it be possible to nest them under a new multiline table? I would like to keep the current message_start_indicator option, but simply remove it from the docs. Ex:

[sources.in]
  type = "file"

  [sources.in.multiline]
    start = "..."
    continue_through = "..."
    continue_past = "..."
    halt_before = "..."
    halt_with = "..."

What do you think? Also, notice the name is snake case.

MOZGIII commented 4 years ago

I like the nesting! I think it'll map to the internal implementation better than the way we're doing it right now. Regarding the exact config options set and names - definitely snake case, but having it as Scalyr does is a bit confusing - you have to specify a start pattern + exactly one of the continuation patterns. It'd be easier for everyone if we just have three fields:

start
continuation
mode - an enum of continue_through/continue_past/halt_before/halt_with

Do you like this?

binarylogic commented 4 years ago

I’m a fan. The ‘mode’ option looks good to me.

anton-ryzhov commented 4 years ago

But for halt mode it is not a continuation.

Don't just copy their config, you can make it much more intuitive.

continue and halt could be defined by mutual exclusive pattern definitions without explicit mode.

For continueThrough:

start = "^Error"
through = "^     at"

For haltWith:

start = "^Error"
stop = "Summary line"

haltBefore mode sounds for me like start pattern of the next message. If you met start again — just start a new event.

continuePast mode for backslash example could be stop="[^\\]$"

MOZGIII commented 4 years ago

Thanks, @anton-ryzhov!

I've started working on the implementation, and I too noticed that we can probably do it even better than with just mode. I'm working with the mode field for now, and I think the better configuration parameters will emerge in the process.

Your point is definitely valid, and it was an overlook at the initial design pass. What you're proposing looks better, and I'll consider modeling the configuration like that - cause it seems to also be a better fit for the implementation needs.

By the way, I've added a couple of real-world test cases, and I'm looking to add more to make our implementation easier to maintain.

For example, for Java stack traces I have this:

let lines = vec![
    "java.lang.Exception",
    "    at com.foo.bar(bar.java:123)",
    "    at com.foo.baz(baz.java:456)",
];
let config = Config {
    start_pattern: Regex::new("^[^\\s]").unwrap(),
    condition_pattern: Regex::new("^[\\s]+at").unwrap(),
    mode: Mode::ContinueThrough,
    // ...
};
let expected = vec![
    concat!(
        "java.lang.Exception\n",
        "    at com.foo.bar(bar.java:123)\n",
        "    at com.foo.baz(baz.java:456)",
    ),
];
run_and_assert(&lines, config, &expected);

I also have a use case test for Ruby stack traces.

If you'd like to, now would be a good time to share the real-world usage examples that are important to you - and we add them as test cases too!

MOZGIII commented 4 years ago

Hey @anton-ryzhov, a follow up after we merged the PR! We ended up going with start_pattern, condition_pattern, mode and timeout_ms fields.

I'll continue with a follow-up issue on improving the configuration fields.

vectordotdev / vector

`message_stop_indicator` for file source #1431