Open jalaziz opened 2 years ago
Thanks for this thorough evaluation @jalaziz !
- Keys that include whitespace, equal signs, and quotes. These are all technically invalid according to the spec. The go-logfmt seems to handle this by removing the invalid characters.
- Quoted keys. Related to #1. Keys should never be quoted.
- Values containing equal signs. These values should be quoted.
Interesting, yeah, that makes sense. For encode_key_value
I think it would make sense to quote keys and values that contain the key_value_delimiter
(in logfmt
's case, =
) or field_delimiter
(in logfmt
's case, space) characters.
But for encode_logfmt
we should choose an option like character replacement or simply dropping invalid characters. This also leaves the issue open that other whitespace (like tabs) would not be quoted.
This is making me thing that encode_logfmt
(and parse_logfmt
) is diverging enough from encode_key_value
(and parse_key_value
) that it should just have a specialized implementation rather than delegating to encode_key_value
. I'm curious what you think though.
For
encode_key_value
I think it would make sense to quote keys and values that contain thekey_value_delimiter
(inlogfmt
's case,=
) orfield_delimiter
(inlogfmt
's case, space) characters.
@jszwedko I was thinking the same thing last night.
This is making me thing that
encode_logfmt
(andparse_logfmt
) is diverging enough fromencode_key_value
(andparse_key_value
) that it should just have a specialized implementation rather than delegating toencode_key_value
. I'm curious what you think though.
I thought the same and started looking into this approach. Looking at the code, all the logic to walk through keys and values is within encode_key_value
and it seemed like there would be quite a bit of duplication (or refactor work -- which seemed scary given my complete lack of Rust experience). So after giving it some thought, a simpler option might be to add "key preprocessing" and "value preprocessing" lambdas as arguments to encode_key_value
(or some extracted inner version to keep the original signature intact).
If the key is preprocessed to remove/replace invalid chars (such that quotes are never needed) and we pre-quote values as needed, then the rest of the encode_key_value
logic should be fine. The most complicated part of this approach is the VRL version of encode_log_fmt
calls onto the VRL version of encode_key_value
instead of the vector-common
version, so that may need to be split up.
@jszwedko Any thoughts on this? I'm new to rust, but wouldn't mind trying to contribute the simpler "lambda" version if you think that's the way to go.
@jszwedko Any thoughts on this? I'm new to rust, but wouldn't mind trying to contribute the simpler "lambda" version if you think that's the way to go.
Hey! Apologies, I remember drafting a response, but I must have got interrupted before sending it.
I did a quick survey of the implementations linked to from https://brandur.org/logfmt and, what a mess 😅
https://pkg.go.dev/github.com/kr/logfmt (it seems like the package you found is based on that one)
Only handles parsing.
Has a EBNF description:
ident_byte = any byte greater than ' ', excluding '=' and '"'
string_byte = any byte excluding '"' and '\'
garbage = !ident_byte
ident = ident_byte, { ident byte }
key = ident
value = ident | '"', { string_byte | '\', '"' }, '"'
pair = key, '=', value | key, '=' | key
message = { garbage, pair }, garbage
This seems pretty reasonable, formalized, and strict enough to be likely to be parsable by other implementation: all of which support parsing values with "
s except the clojure one. It's my favorite option. I think we can strip characters from key that don't fit, rather than replacing (at least for this initial implementation).
https://github.com/yeller/logfmt
Only handles generation.
Seems to not do any particular escaping or quoting of values.
Its keys are Clojure symbols, so the character set is restricted, but still pretty permissive: https://clojure.org/reference/reader#_symbols
Seems like a fairly naive implementation.
https://github.com/jteppinette/python-logfmter/blob/main/logfmter/formatter.py
Only handles generation.
Quotes values that contain spaces, newlines, quotes, or equals; escaping quotes and newline characters.
Keys are stripped of newlines or spaces, but nothing else (e.g. =
s).
https://github.com/cyberdelia/logfmt-ruby/blob/master/lib/logfmt/parser.rb
Only handles parsing.
Seems to be modeled after the https://pkg.go.dev/github.com/kr/logfmt , parsing restricted keys and quoted values.
With regards to implementation, I think we could accept an approach using closures to pre-process the keys and values, and then pass along to encode_key_value
, with sufficient testing at the encode_logfmt
level. I agree there would be sufficient duplication if we did separate the implementations so it would be nice re-share some logic if we did separate them. I recognize you mentioned you were new to Rust, so it would be a lot take on :smile: and I think, again with sufficient tests, that your proposed preprocessing approach could suffice.
I did a quick survey of the implementations linked to from https://brandur.org/logfmt and, what a mess 😅
Not surprised actually. logfmt
has no formal spec, so edge cases are bound to slip through.
I actually discovered the https://pkg.go.dev/github.com/kr/logfmt package because Vector actually links to it as the logfmt spec. However, we use Loki which appears to use go-logfmt
and which also happens to refer to https://pkg.go.dev/github.com/kr/logfmt as the "most authoritative public specification to date".
All this is to say that I personally think it's safe to use that as the spec as it appears to be the strictest.
With regards to implementation, I think we could accept an approach using closures to pre-process the keys and values, and then pass along to encode_key_value, with sufficient testing at the encode_logfmt level. I agree there would be sufficient duplication if we did separate the implementations so it would be nice re-share some logic if we did separate them. I recognize you mentioned you were new to Rust, so it would be a lot take on 😄 and I think, again with sufficient tests, that your proposed preprocessing approach could suffice.
Great. Can't promise this will happen anytime this week, but will certainly give it a shot soon.
Side note: Seems like Rust could use a fully featured logfmt crate.
Side note: Seems like Rust could use a fully features logfmt crate.
It looks like there is actually a crate from the author of https://brandur.org/logfmt, themselves, for parsing: https://github.com/brandur/logfmt . It has a relatively simple implementation that doesn't interact with serde as ours does. Probably a good additional reference though. I wouldn't be opposed to us extracting our implementation to create a new crate for encoding and decoding logfmt with serde. I could see other Rust projects finding it useful.
Hi @jalaziz @jszwedko, did anything changed on this? I still see that spaces are allowed when parsing logfmt. Would this be a good first contribution I could try helping with?
Hi @jalaziz @jszwedko, did anything changed on this? I still see that spaces are allowed when parsing logfmt. Would this be a good first contribution I could try helping with?
Hey! Yeah, nothing has changed around this yet. We'd be happy to see a contribution! See the above comments for ideas on what the implementation should look like (likely a specialized parser rather than wrapping parse_key_value
).
A note for the community
Problem
Even with vectordotdev/vector#12700, Vector's
logfmt
encoding does not conform to the semi-official logfmt spec.In particular, the handling of quoted keys and values is still off. Some example scenarios are:
Configuration
Version
0.22
Debug Output
No response
Example Data
Additional Context
I've put together some basic tests of go-logfmt in this Go playground.
The current implementation of
logfmt
in Vector relies onencode_key_value
. Despitelogfmt
being the primary use case forencode_key_value
, fixing these issues seemslogfmt
-specific. One option might be preprocessing keys and values to match the spec?Additionally, even though
go-logfmt
drops invalid characters from keys, other options may be more desirable (e.g. replacing invalid characters with underscores).References
vectordotdev/vector#12674