jteppinette / python-logfmter

Using the stdlib logging module and without changing a single logging call, logfmter supports global (first and third party) logfmt structured logging.
https://pypi.org/project/logfmter/
MIT License
49 stars 5 forks source link

Newlines in values probably shouldn't be escaped #14

Open Feuermurmel opened 4 weeks ago

Feuermurmel commented 4 weeks ago

The documentation of Go package kr/logfmt states:

value = ident | '"', { string_byte | '\', '"' }, '"'

I.e. that a \ can only be followed by a ".

I tested Grafana's logfmt parser and it doesn't accept e.g foo="a\nb". It simply discards the pair. Tested with a query like this:

{...} | line_format "foo=\"a\\nb\"" | logfmt

But this works:

{...} | line_format "foo=\"a\nb\"" | logfmt

Producing a field containing a newline.

jteppinette commented 4 weeks ago

@Feuermurmel Thanks for this report.

Interesting, are you saying that this {...} | line_format "foo=\"a\\nb\"" | logfmt also doesn't work?

I am surprised that grafana's logfmt parser works just by espacing the outer quotes as seen here: foo=\"a\nb\".

Let me work on this a bit.

Let me know if you have any thoughts on resolution.

Feuermurmel commented 2 weeks ago

Interesting, are you saying that this {...} | line_format "foo=\"a\\nb\"" | logfmt also doesn't work?

I think I'm mistaken here. It seems that a \n embedded in a field does actually work and is converted to a newline on parsing:

image

I am surprised that grafana's logfmt parser works just by espacing the outer quotes as seen here: foo=\"a\nb\".

Yes, this does work:

image

It seems that a field's value is followed from the opening " across newlines until a terminating ". Keep in mind that Loki records logs as separate log entries, which can consist of multiple lines, rather than a stream of text where boundaries between log records would be unclear.

Let me know if you have any thoughts on resolution.

In our case, this feature of Loki's Logfmt decoder is helpful because we're looking at logs both when processed via a Logfmt parser (i.e. in Grafana), as well as in raw text form (logs written to text files or to Systemd's journal). So for our case, I wrote a Logfmt encode that leaves newlines as-is.

Here's the implementation I created. It hasn't been tested more than a few days now, but seems to work:

# Allowed characters in names and unquoted values:
# Printable ASCII except ` `, `"` and `=`.
# In keys, other characters are replaced by `_`.
# In values, the value will be quoted.
_disallowed_char_in_idents = re.compile(r'[^!-~]|["=]')

# Characters that need to be escaped in quoted values: `"` and `\`.
_escaped_char_in_values = re.compile(r'(?=["\\])')

def value_to_str(value: object) -> str:
    # This can be extended to support additional types,
    # but in most cases it's probably better to convert the value to a string
    # at the site where the logging method is called.
    if isinstance(value, str):
        return value
    elif isinstance(value, datetime):
        return value.isoformat()
    else:
        return str(value)

def encode_field(name: str, value: str) -> str:
    # Replace invalid characters in field names.
    name = _disallowed_char_in_idents.sub("_", name)

    # Quote the value if necessary.
    if _disallowed_char_in_idents.search(value):
        escaped_value = _escaped_char_in_values.sub(r"\\", value)
        value = f'"{escaped_value}"'

    return f"{name}={value}"

def encode_fields(fields: Sequence[tuple[str, object]]) -> str:
    # Sequence of lines, each containing a sequence of encoded fields.
    lines: list[list[str]] = [[]]

    for name, value in fields:
        value_str = value_to_str(value)
        # Place field with multi-line values on their own line to improve readability.
        # Otherwise, fields are all placed on a single line.
        has_newline = "\n" in value_str

        if has_newline and lines[-1]:
            lines.append([])

        lines[-1].append(encode_field(name, value_str))

        if has_newline:
            lines.append([])

    if not lines[-1]:
        del lines[-1]

    return "\n".join(map(" ".join, lines))
Feuermurmel commented 2 weeks ago

I am surprised that grafana's logfmt parser works just by espacing the outer quotes as seen here: foo=\"a\nb\".

Oh, or did you mean that the Logfmt parser parses the field even when the " at the start and end of the value are escaped as \"?

These aren't actually escaped in the string that the Logfmt parser sees, the escaping comes from the syntax used to call line_format. The logfmt parser sees foo="a\nb" as its input. This can be seen in the screen shot on the line in the output that also shows the timestamp.