vectordotdev / vrl

Vector Remap Language
Mozilla Public License 2.0
132 stars 61 forks source link

`logfmt` does not conform to spec #62

Open jalaziz opened 2 years ago

jalaziz commented 2 years ago

A note for the community

Problem

Even with vectordotdev/vector#12700, Vector's logfmt encoding does not conform to the semi-official logfmt spec.

In particular, the handling of quoted keys and values is still off. Some example scenarios are:

  1. Keys that include whitespace, equal signs, and quotes. These are all technically invalid according to the spec. The go-logfmt library seems to handle this by removing the invalid characters.
  2. Quoted keys. Related to #1. Keys should never be quoted.
  3. Values containing equal signs. These values should be quoted.

Configuration

[sources.in]
type = "stdin"

[transforms.modify]
type = "remap"
inputs = ["in"]
source = '''
  . = parse_json!(.message)
  . = encode_logfmt(.)
'''

[sinks.out]
inputs = ["modify"]
type = "console"
encoding.codec = "text"

Version

0.22

Debug Output

No response

Example Data

{"test key": "test value", "test=key": "test=value", "test\"key": "test\"value"}

Additional Context

I've put together some basic tests of go-logfmt in this Go playground.

The current implementation of logfmt in Vector relies on encode_key_value. Despite logfmt being the primary use case for encode_key_value, fixing these issues seems logfmt-specific. One option might be preprocessing keys and values to match the spec?

Additionally, even though go-logfmt drops invalid characters from keys, other options may be more desirable (e.g. replacing invalid characters with underscores).

References

vectordotdev/vector#12674

jszwedko commented 2 years ago

Thanks for this thorough evaluation @jalaziz !

  • Keys that include whitespace, equal signs, and quotes. These are all technically invalid according to the spec. The go-logfmt seems to handle this by removing the invalid characters.
  • Quoted keys. Related to #1. Keys should never be quoted.
  • Values containing equal signs. These values should be quoted.

Interesting, yeah, that makes sense. For encode_key_value I think it would make sense to quote keys and values that contain the key_value_delimiter (in logfmt's case, =) or field_delimiter (in logfmt's case, space) characters.

But for encode_logfmt we should choose an option like character replacement or simply dropping invalid characters. This also leaves the issue open that other whitespace (like tabs) would not be quoted.

This is making me thing that encode_logfmt (and parse_logfmt) is diverging enough from encode_key_value (and parse_key_value) that it should just have a specialized implementation rather than delegating to encode_key_value. I'm curious what you think though.

jalaziz commented 2 years ago

For encode_key_value I think it would make sense to quote keys and values that contain the key_value_delimiter (in logfmt's case, =) or field_delimiter (in logfmt's case, space) characters.

@jszwedko I was thinking the same thing last night.

This is making me thing that encode_logfmt (and parse_logfmt) is diverging enough from encode_key_value (and parse_key_value) that it should just have a specialized implementation rather than delegating to encode_key_value. I'm curious what you think though.

I thought the same and started looking into this approach. Looking at the code, all the logic to walk through keys and values is within encode_key_value and it seemed like there would be quite a bit of duplication (or refactor work -- which seemed scary given my complete lack of Rust experience). So after giving it some thought, a simpler option might be to add "key preprocessing" and "value preprocessing" lambdas as arguments to encode_key_value (or some extracted inner version to keep the original signature intact).

If the key is preprocessed to remove/replace invalid chars (such that quotes are never needed) and we pre-quote values as needed, then the rest of the encode_key_value logic should be fine. The most complicated part of this approach is the VRL version of encode_log_fmt calls onto the VRL version of encode_key_value instead of the vector-common version, so that may need to be split up.

jalaziz commented 2 years ago

@jszwedko Any thoughts on this? I'm new to rust, but wouldn't mind trying to contribute the simpler "lambda" version if you think that's the way to go.

jszwedko commented 2 years ago

@jszwedko Any thoughts on this? I'm new to rust, but wouldn't mind trying to contribute the simpler "lambda" version if you think that's the way to go.

Hey! Apologies, I remember drafting a response, but I must have got interrupted before sending it.

I did a quick survey of the implementations linked to from https://brandur.org/logfmt and, what a mess 😅

Go

https://pkg.go.dev/github.com/kr/logfmt (it seems like the package you found is based on that one)

Only handles parsing.

Has a EBNF description:

ident_byte = any byte greater than ' ', excluding '=' and '"'
string_byte = any byte excluding '"' and '\'
garbage = !ident_byte
ident = ident_byte, { ident byte }
key = ident
value = ident | '"', { string_byte | '\', '"' }, '"'
pair = key, '=', value | key, '=' | key
message = { garbage, pair }, garbage

This seems pretty reasonable, formalized, and strict enough to be likely to be parsable by other implementation: all of which support parsing values with "s except the clojure one. It's my favorite option. I think we can strip characters from key that don't fit, rather than replacing (at least for this initial implementation).

Clojure

https://github.com/yeller/logfmt

Only handles generation.

Seems to not do any particular escaping or quoting of values.

Its keys are Clojure symbols, so the character set is restricted, but still pretty permissive: https://clojure.org/reference/reader#_symbols

Seems like a fairly naive implementation.

Python

https://github.com/jteppinette/python-logfmter/blob/main/logfmter/formatter.py

Only handles generation.

Quotes values that contain spaces, newlines, quotes, or equals; escaping quotes and newline characters.

Keys are stripped of newlines or spaces, but nothing else (e.g. =s).

Ruby

https://github.com/cyberdelia/logfmt-ruby/blob/master/lib/logfmt/parser.rb

Only handles parsing.

Seems to be modeled after the https://pkg.go.dev/github.com/kr/logfmt , parsing restricted keys and quoted values.

Implementation

With regards to implementation, I think we could accept an approach using closures to pre-process the keys and values, and then pass along to encode_key_value, with sufficient testing at the encode_logfmt level. I agree there would be sufficient duplication if we did separate the implementations so it would be nice re-share some logic if we did separate them. I recognize you mentioned you were new to Rust, so it would be a lot take on :smile: and I think, again with sufficient tests, that your proposed preprocessing approach could suffice.

jalaziz commented 2 years ago

I did a quick survey of the implementations linked to from https://brandur.org/logfmt and, what a mess 😅

Not surprised actually. logfmt has no formal spec, so edge cases are bound to slip through.

I actually discovered the https://pkg.go.dev/github.com/kr/logfmt package because Vector actually links to it as the logfmt spec. However, we use Loki which appears to use go-logfmt and which also happens to refer to https://pkg.go.dev/github.com/kr/logfmt as the "most authoritative public specification to date".

All this is to say that I personally think it's safe to use that as the spec as it appears to be the strictest.

With regards to implementation, I think we could accept an approach using closures to pre-process the keys and values, and then pass along to encode_key_value, with sufficient testing at the encode_logfmt level. I agree there would be sufficient duplication if we did separate the implementations so it would be nice re-share some logic if we did separate them. I recognize you mentioned you were new to Rust, so it would be a lot take on 😄 and I think, again with sufficient tests, that your proposed preprocessing approach could suffice.

Great. Can't promise this will happen anytime this week, but will certainly give it a shot soon.

Side note: Seems like Rust could use a fully featured logfmt crate.

jszwedko commented 2 years ago

Side note: Seems like Rust could use a fully features logfmt crate.

It looks like there is actually a crate from the author of https://brandur.org/logfmt, themselves, for parsing: https://github.com/brandur/logfmt . It has a relatively simple implementation that doesn't interact with serde as ours does. Probably a good additional reference though. I wouldn't be opposed to us extracting our implementation to create a new crate for encoding and decoding logfmt with serde. I could see other Rust projects finding it useful.

gadisn commented 7 months ago

Hi @jalaziz @jszwedko, did anything changed on this? I still see that spaces are allowed when parsing logfmt. Would this be a good first contribution I could try helping with?

jszwedko commented 7 months ago

Hi @jalaziz @jszwedko, did anything changed on this? I still see that spaces are allowed when parsing logfmt. Would this be a good first contribution I could try helping with?

Hey! Yeah, nothing has changed around this yet. We'd be happy to see a contribution! See the above comments for ideas on what the implementation should look like (likely a specialized parser rather than wrapping parse_key_value).