vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0
17.52k stars 1.54k forks

Regex transform failing occasionally #2080

Closed: gtie closed this 1 year ago

gtie commented 4 years ago

Occasionally (once every million or two log lines), Vector reports an error like:

Mar 17 07:47:23 m80 vector[22684]: Mar 17 07:47:23.645  WARN transform{name=tinydns-anycast-regex type=regex}: vector::transforms::regex_parser: 1 "Regex pattern failed to match." events were rate limited. rate_limit_secs=5
Mar 17 07:47:23 m80 vector[22684]: Mar 17 07:47:23.645  WARN transform{name=tinydns-anycast-regex type=regex}: vector::transforms::regex_parser: Regex pattern failed to match. field="0000000000000ffff0d39b447:1cce:fb3b + 000c 92.143.112.1[...]" rate_limit_secs=30

The log line that it is getting stuck on is perfectly formatted in the log file. Due to specifics of our setup, each user query results in two log entries with timestamp being the only difference between them. One of the log lines makes it out to the final sink, and the other generates the error above (and is lost).

The twin log lines look like this:

@400000005e7080952657f0dc 00000000000000000000ffff0d39b447:1cce:fb3b + 000c 92.143.112.146.in-addr.arpa
@400000005e708095265bb5b4 00000000000000000000ffff0d39b447:1cce:fb3b + 000c 92.143.112.146.in-addr.arpa

The vector.toml source/transform configuration that is processing them looks like this:

data_dir = "/var/lib/vector"

[sources.tinydns-anycast-file]
  type = "file"
  include = ["/var/log/tinydns-anycast/current"]

# Parse out the data to distinct fields
[transforms.tinydns-anycast-regex]
  type = "regex_parser"
  inputs = ["tinydns-anycast-file"]
  regex = '\s*(?P<tai64n>\S+) (?P<ip>[^:]+):(?P<port>[^:]+):(?P<id>[^: ]+) ((?P<res>\S) )?(?P<type>\S+) (?P<qry>\S+)\s*'

<snip>

FWIW, the errors/missed lines occur hours away from the rotation of the input file (/var/log/tinydns-anycast/current in this case).

Any ideas why the transform might be failing and how to avoid those failures?

ghost commented 4 years ago

I wonder whether upgrading the regex dependency would help with this. Currently the version in Cargo.lock is 1.3.4, and the latest version on crates.io is 1.3.5.

binarylogic commented 4 years ago

Thanks, @gtie, and apologies for the trouble. Nothing jumps out at me, and as @a-rodin pointed out, I think it is worth upgrading our regex dependency. @a-rodin, does anything jump out at you in the regex changelog?

That said, I think we should consider setting this up as a reliability test in our test harness. Reliability tests run indefinitely and report on errors. Maybe we can reproduce the error that way. @gtie would you be willing to share a sample log file that we could run this test against? Otherwise, we'll use the 2 lines listed above.

ghost commented 4 years ago

The only significant new change in the regex crate is added support for Unicode 13.

However, I tried to reproduce the issue using the stdin source and console sink with the two lines listed above, and they were parsed successfully. Because of that, I think it is possible that the issue is caused by the file source rather than by the regex transform. It might have something to do with file writes being flushed without aligning to line delimiters, so that sometimes the produced log lines are truncated and thus cannot be parsed.
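The truncation hypothesis is easy to check outside Vector. Below is a minimal Python sketch (using Python's re module as a stand-in for the Rust regex crate; the pattern only uses constructs the two engines share) showing that the configured pattern matches a complete line but fails on a prefix cut mid-field, as a racing flush could produce:

```python
import re

# The regex from the vector.toml above, unchanged.
PATTERN = re.compile(
    r'\s*(?P<tai64n>\S+) (?P<ip>[^:]+):(?P<port>[^:]+):(?P<id>[^: ]+) '
    r'((?P<res>\S) )?(?P<type>\S+) (?P<qry>\S+)\s*'
)

full = ("@400000005e7080952657f0dc "
        "00000000000000000000ffff0d39b447:1cce:fb3b + 000c "
        "92.143.112.146.in-addr.arpa")
truncated = full[:50]  # simulate a read that raced the writer's flush

print(PATTERN.match(full) is not None)       # the complete line parses
print(PATTERN.match(truncated) is not None)  # the cut-off prefix does not
```

If the file source ever hands the transform a line like `truncated`, the "Regex pattern failed to match" warning is exactly what you would expect.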

gtie commented 4 years ago

@binarylogic, unfortunately the full log files have too much sensitive data to be shared. I'd suspect that a file source plus the above regex transform, fed a few million lines per day, would be enough to trigger this within a day or so.

@a-rodin, it does indeed make sense for the file source to be the culprit here, feeding the regex with incomplete lines. Checking the WARN message closely reveals that the beginning of the line (the tai64n timestamp) is not actually printed in this case: the printout starts from the second field.

Perhaps a good start would be to increase the length of the line printed in the WARN message, to confirm what string actually makes it to the regex transform?

P.S. Thanks a lot for your quick review!

gtie commented 4 years ago

FYI, this is still an issue even with 0.10.0, happening a couple of times per hour:

Aug 22 03:18:17 m82 vector[22916]: Aug 22 03:18:17.907  WARN transform{name=tinydns-anycast-regex type=regex}: vector::internal_events::regex: regex pattern failed to match. field="@400000005f408e8335eeb1ac 2620011900300000000000000" rate_limit_secs=30
Aug 22 03:18:17 m82 vector[22916]: Aug 22 03:18:17.907  WARN transform{name=tinydns-anycast-regex type=regex}: vector::internal_events::regex: "regex pattern failed to match." is being rate limited. rate_limit_secs=5
Aug 22 03:28:57 m82 vector[22916]: Aug 22 03:28:57.687  WARN transform{name=tinydns-anycast-regex type=regex}: vector::internal_events::regex: 2 "regex pattern failed to match." events were rate limited. rate_limit_secs=5
Aug 22 03:47:50 m82 vector[22916]: Aug 22 03:47:50.784  INFO source{name=tinydns-anycast-file type=file}:file_server: file_source::file_server: Found file to watch. path="/var/log/tinydns-anycast/current" file_position=0
Aug 22 04:17:59 m82 vector[22916]: Aug 22 04:17:59.581  WARN source{name=tinydns-anycast-file type=file}:file_server: file_source::file_server: Ignoring file smaller than fingerprint_bytes file="/var/log/tinydns-anycast/current"
Aug 22 04:18:00 m82 vector[22916]: Aug 22 04:18:00.769  INFO source{name=tinydns-anycast-file type=file}:file_server: file_source::file_server: Found file to watch. path="/var/log/tinydns-anycast/current" file_position=0
Aug 22 04:47:09 m82 vector[22916]: Aug 22 04:47:09.015  INFO source{name=tinydns-anycast-file type=file}:file_server: file_source::file_server: Found file to watch. path="/var/log/tinydns-anycast/current" file_position=0
Aug 22 05:16:31 m82 vector[22916]: Aug 22 05:16:31.345  INFO source{name=tinydns-anycast-file type=file}:file_server: file_source::file_server: Found file to watch. path="/var/log/tinydns-anycast/current" file_position=0
Aug 22 05:41:19 m82 vector[22916]: Aug 22 05:41:19.444  WARN transform{name=tinydns-anycast-regex type=regex}: vector::internal_events::regex: 1 "regex pattern failed to match." events were rate limited. rate_limit_secs=5
Aug 22 05:46:58 m82 vector[22916]: Aug 22 05:46:58.148  WARN transform{name=tinydns-anycast-regex type=regex}: vector::internal_events::regex: regex pattern failed to match. field="@400000005f40b15c0" rate_limit_secs=30

The times of the WARN events (the errors) generally differ from the INFO events (file rotations). I suspect the errors might be the result of reads done before the full line is flushed to disk.

Since the file source has a max_line_bytes parameter, it should be possible to have at least a band-aid solution by adding a "min_line_bytes", i.e. do not consider a line complete, even if you encounter EOF, while it is below that number of bytes.

Another possible solution would be to "ignore" EOF as a line separator: insist on reading until an EOL, or until the file rotates.
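That second idea (treating a newline, not EOF, as the only valid line terminator) can be sketched in a few lines of Python. This is a hypothetical reader, not Vector's actual file source: it holds back any bytes after the last newline until more data arrives.

```python
def complete_lines(chunks):
    """Yield only newline-terminated lines from a stream of raw chunks,
    holding back a partial tail until its newline arrives."""
    buf = ""
    for chunk in chunks:
        buf += chunk
        while "\n" in buf:
            line, buf = buf.split("\n", 1)
            yield line
    # buf may still hold a partial line here; a real tailer would keep
    # polling for more data rather than emitting it.

# Two flushes that split one log line mid-field:
chunks = ["@400000005e7080952657f0dc 00000000000000000000ffff",
          "0d39b447:1cce:fb3b + 000c 92.143.112.146.in-addr.arpa\n"]
print(list(complete_lines(chunks)))  # one complete line, not two fragments
```

With this policy the racing-flush scenario produces a delayed read instead of a truncated event.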

binarylogic commented 3 years ago

@gtie, just an update on this. I think the best solution here is to finish the dead-letter queue work so you can inspect events that fail. Also, the regex_parser transform is being replaced with the new remap transform (the parse_regex function specifically).

splitice commented 2 years ago

Can confirm this is still an issue with 0.18.1.

2021-12-17T15:07:19.383312Z WARN vector::internal_events::regex_parser: Regex pattern failed to match. field="an upstream response is buffered to a temporary file /v[...]" internal_log_rate_secs=30

Perhaps either the regex_parser transform should be removed (sad) or fixed?

jszwedko commented 2 years ago

As others have mentioned, I think the issue here is not the regex parsing, but rather getting a partial line from the file source.

@splitice are you also using a file source?

splitice commented 2 years ago

Not file source, vector source.

jszwedko commented 2 years ago

Not file source, vector source.

What is the original source of the data that is being forwarded over the vector source/sink?

splitice commented 2 years ago

UDP syslog

P.S. I am aware of the nginx parsing functions (at the moment I'm just porting over an existing pipeline; I will look into improvements like that later).

jszwedko commented 2 years ago

I see, thanks! It sounds like your issue may be different from the one reported here. In this issue it seems like Vector may be reading partial lines from the file source. In your case, it could be that the events are being truncated to fit the UDP packet length. Are you confident the packets contain the full lines? Feel free to open a separate issue with more details.

As noted in this issue, the regex_parser transform was replaced by remap. You can leverage that to get more details about the message that failed to parse, with VRL like:

parsed, err = parse_regex(.message, r'some regex')
if err != null {
  log(.) # log the event that failed to parse
}
splitice commented 2 years ago

@jszwedko, aware (and that's what I've been doing over the past week). Although I never thought of doing log(.), neat.

regex_parser has one very handy feature though: sequential array execution. When ported to remap we ended up with ~30 ??-separated parse_regex calls for the same effect, which reads a bit less optimally.
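For reference, regex_parser's sequential array execution is first-match-wins over an ordered list of patterns. A minimal Python sketch of that semantics (the patterns here are hypothetical, not from the original config):

```python
import re

# Hypothetical pattern list, tried in order like regex_parser's array form.
PATTERNS = [
    re.compile(r'(?P<ip>\d+\.\d+\.\d+\.\d+) - (?P<code>\d{3})'),
    re.compile(r'(?P<msg>.+)'),  # catch-all fallback, tried last
]

def parse_first_match(line):
    """Return named groups from the first pattern that matches,
    mirroring regex_parser's sequential execution."""
    for pat in PATTERNS:
        m = pat.match(line)
        if m:
            return m.groupdict()
    return None  # no pattern matched

print(parse_first_match("1.2.3.4 - 200"))  # {'ip': '1.2.3.4', 'code': '200'}
print(parse_first_match("anything else"))  # falls through to the catch-all
```

The chained `a ?? b ?? c` form in VRL expresses the same fallback order, just one pattern per clause.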

StephenWakely commented 1 year ago

In your case, it could be that the events are being truncated to fit into UDP packet length?

It may be worth noting that Vector now drops the packet if a UDP event is truncated. I wonder if this means the problem no longer occurs here.

StephenWakely commented 1 year ago

I am going to close this issue.

There may still be an issue with the file source. If this crops up again, we will either reopen this issue or create a new one specifically for the file source.