vrl: null and unicode escapes not working in string literals

hhromic commented 2 years ago

A note for the community

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

I recently found out that the null and unicode character escape sequences are not working in string literal expressions in VRL. The documentation makes sense, therefore I think something is off in the VRL parser.

Examples from the VRL REPL:

$ "hello\0world"
error[E202]: syntax error
  ┌─ :1:1
  │
1 │ "hello\0world"
  │ ^^^^^^^^^^^^^^ unexpected error: invalid escape character: \0
  │
  = see language documentation at https://vrl.dev

$ "hello\u{1F30E}world"  # also tested with \U{1F30E} and \u1F30E just in case
error[E202]: syntax error
  ┌─ :1:1
  │
1 │ "hello\u{1F30E}world"
  │ ^^^^^^^^^^^^^^^^^^^^^ unexpected error: invalid escape character: \u
  │
  = see language documentation at https://vrl.dev

All the other documented escape sequences seems to be working fine (although not all printed correctly):

$ "hello\nworld"
"hello\nworld"

$ "hello\"world"
"hello\"world"

$ "hello\'world"
"hello'world"

$ "hello\\world"
"hello\\world"

$ "hello\nworld"
"hello\nworld"

$ "hello\rworld"  # \r (carriage-return) was rendered instead of printed with an escape
world"

$ "hello\tworld"  # \t (tab) was rendered instead of printed with an escape
"hello  world"

Configuration

N/A

Version

vector 0.20.0 (x86_64-unknown-linux-gnu 2a706a3 2022-02-11)

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

hhromic commented 2 years ago

Looks like inded those escape characters are not implemented in the VRL lexer:

https://github.com/vectordotdev/vector/blob/7cb951daed57448b80463303a07184ed537d9c54/lib/vrl/parser/src/lex.rs#L1112-L1128

https://github.com/vectordotdev/vector/blob/7cb951daed57448b80463303a07184ed537d9c54/lib/vrl/parser/src/lex.rs#L1164-L1197

hhromic commented 2 years ago

After further investigation, it seems that these escape sequences used to work before, but they were not ported over in the VRL parser/compiler re-implementation PR vectordotdev/vector#6353.

jszwedko commented 2 years ago

Interesting, thanks for the sleuthing! cc/ @StephenWakely @JeanMertz for thoughts on this.

JeanMertz commented 2 years ago

This is correct. I didn't port them over, not for any particular reason, other than wanting to start out simple, and adding more escape sequences as requested by the community.

I think it makes sense to handle escape characters similar to how the Rust compiler handles them.

hhromic commented 2 years ago

I'm glad the omission was just for simplicity of the initial implementation instead of a limitation :) And yes, having parity with Rust's escape character mechanism would be very nice and complete indeed.

In fairness, I don't think the missing escape characters I reported are a big use-case, I haven't seen anyone asking for them except for me in this issue after a long time since the new lexer was introduced.

The use-case for me that triggered discovering the missing escape characters is that I was testing my implementations of ip_ntop() and ip_pton() using the VRL REPL and I wasn't able to generate strings with arbitrary bytes. In other words, the escape I needed was the binary \xNN escape. Then the documentation pointed me to \uNNNNNN (which is not exactly the same but close-enough), which didn't work either.

EdN-terascope commented 1 year ago

In VRL, I'm having troubles with the limits Rust has for escaped characters and regex. PCRE2, Python or Golang style is preferable to Rust or ECMAScript here.

Thx

arthmoeros commented 1 year ago

Is anyone by any chances working on this?, I did try to add null escape in the mentioned new lexer, but I can't get it to work, still says \0 is an invalid escape.

I'm not very well versed on Rust so maybe I'm missing something, this is my fork, if anybody could point me in the right direction, any help would be appreciated.

https://github.com/vectordotdev/vrl/compare/main...arthmoeros:vrl:null_escape_lexer

We do have a use case where we need to remove null characters in a message, a workaround would also be welcome. In the meantime I will try with a lua transform.

StephenWakely commented 1 year ago

@arthmoeros Your code looks fine and works for me.

I'm curious how you are testing it? Perhaps you are still running the Vector project that is pulling VRL from Github and not from your local repo?

arthmoeros commented 1 year ago

@StephenWakely I compiled Vector changing every reference of VRL using my Github repo, I just pushed my fork of Vector where I made the change.

https://github.com/arthmoeros/vector/commit/4f8eb9e3af052d23147f3cb9389dcc9b7661e284

So maybe I'm doing something wrong on Vector compilation? (I dont really know what though hehe)

StephenWakely commented 1 year ago

@StephenWakely I compiled Vector changing every reference of VRL using my Github repo, I just pushed my fork of Vector where I made the change.

arthmoeros/vector@4f8eb9e

So maybe I'm doing something wrong on Vector compilation? (I dont really know what though hehe)

Ah, you want to specify branch rather than rev.

vrl = { package = "vrl", git = "https://github.com/arthmoeros/vrl", branch = "null_escape_lexer" }

An easier way to test directly in the VRL repo is to just run the cli project:

> cd lib/cli
> cargo run

arthmoeros commented 1 year ago

@StephenWakely I compiled Vector changing every reference of VRL using my Github repo, I just pushed my fork of Vector where I made the change. arthmoeros/vector@4f8eb9e So maybe I'm doing something wrong on Vector compilation? (I dont really know what though hehe)

Ah, you want to specify branch rather than rev.
vrl = { package = "vrl", git = "https://github.com/arthmoeros/vrl", branch = "null_escape_lexer" }
An easier way to test directly in the VRL repo is to just run the cli project:
> cd lib/cli
> cargo run

Wonderful, thanks for the tip on the cli project, worked like charm and indeed easier to test.

Just submitted the PR: https://github.com/vectordotdev/vrl/pull/219

Thanks a lot for your help!

dss010101 commented 1 year ago

would this be the reason that this:

source = '''
    . |= parse_regex!(
            .message, 
            pattern = [r"\[(?P<log_time>.+)\]\[(?P<process>\d+):(?P<thread>\d+)\]\[(?P<severity>\w+)\]\[(?P<module>\w+)\.(?P<func>\w+)\] (?P<msg>.*$)"]
    )
    .log_time, err = parse_timestamp(.log_time, "%Y-%m-%d %H:%M:%S,%f")
    .process = to_int!(.process)
    .thread = to_int!(.thread)
'''

causes this exception:


        | 2023-05-15T04:21:51.724291Z  INFO vector::app: Loading configs. paths=["/etc/vector/vector.toml"]
        | 2023-05-15T04:21:51.747923Z ERROR vector::topology: Configuration error. error=Transform "log_parser":
        | error[E202]: syntax error
        |   ┌─ :1:1
        |   │
        | 1 │ ╭     . |= parse_regex!(
        | 2 │ │             .message,
        | 3 │ │             pattern = [r"\[(?P<log_time>.+)\]\[(?P<process>\d+):(?P<thread>\d+)\]\[(?P<severity>\w+)\]\[(?P<module>\w+)\.(?P<func>\w+)\] (?P<msg>.*$)"]
        | 4 │ │     )
        |   · │
        | 7 │ │     .thread = to_int!(.thread)
        | 8 │ │
        |   │ ╰^ unexpected error: invalid escape character: \[

Is there a workaround for this?

StephenWakely commented 1 year ago

Is there a workaround for this?

Hi. There are multiple issues with your code, none of which are related to this issue.

Named parameters use : - so use pattern: rather than pattern = for the pattern parameter.
Regex literals use r'..' rather than r".." - this is the primary cause the error you are getting.
The pattern parameter is a single regex, not an array - so don't enclose the pattern with [...].

This should work:

. |= parse_regex!(
            .message, 
            pattern: r'\[(?P<log_time>.+)\]\[(?P<process>\d+):(?P<thread>\d+)\]\[(?P<severity>\w+)\]\[(?P<module>\w+)\.(?P<func>\w+)\] (?P<msg>.*$)'
    )

dss010101 commented 1 year ago

Is there a workaround for this?

Hi. There are multiple issues with your code, none of which are related to this issue.

Named parameters use : - so use pattern: rather than pattern = for the pattern parameter.

Regex literals use r'..' rather than r".." - this is the primary cause the error you are getting.

The pattern parameter is a single regex, not an array - so don't enclose the pattern with [...].

This should work:
. |= parse_regex!(
            .message, 
            pattern: r'\[(?P<log_time>.+)\]\[(?P<process>\d+):(?P<thread>\d+)\]\[(?P<severity>\w+)\]\[(?P<module>\w+)\.(?P<func>\w+)\] (?P<msg>.*$)'
    )

ok thank you - now i know. i actually had success with the grok patterns. but have a question. In the below i have several parsing patterns. if the 'msg' field i parse out may contain a pattern like 'app_id=app_1234', can i post parse and add 'app_id' as a field, rather than include another parser pattern for it as i do below? It seems it would be cleaner to post parse and add optional fields that may or may not be in the msg. Are there any examples of doing this?

source = '''
    . |= parse_groks!(
            .message, 
            patterns:[
                "\\[%{TIMESTAMP_ISO8601:log_time}\\]\\[%{INT:process}:%{INT:thread}\\]\\[%{WORD:severity}\\]\\[%{WORD:module}.%{WORD:func}\\] app_id:%{NOTSPACE:app_id} %{GREEDYDATA:msg}",
                "\\[%{TIMESTAMP_ISO8601:log_time}\\]\\[%{INT:process}:%{INT:thread}\\]\\[%{WORD:severity}\\]\\[%{WORD:module}.%{WORD:func}\\] %{GREEDYDATA:msg}",
                "\\[%{TIMESTAMP_ISO8601:log_time}\\]\\[%{INT:process}:%{INT:thread}\\]%{WORD:severity}:%{WORD:module}:%{GREEDYDATA:msg}"
            ]
    )

'''

example logs with/and without app_id:

[2023-05-15 11:52:29,401][2555:140431094425152][INFO][default.run] initalizing application
[2023-05-15 11:52:29,401][2555:140431094425152][INFO][default.run] app_id:my_app_12345 initialized.

dss010101 commented 1 year ago

single regex, not an array - so don't enclose the pattern with [...].

Is it only parse_groks that allow any array of patterns then? as im learning more - i read that grok internally uses regex anyway, so i thought it may be a bit more performant if i use parse_regex. but i have logs that have different formats...and wondering if there is a way to pass these different patterns to parse_regex. or do i have to have multiple parse_regex calls?

jszwedko commented 1 year ago

single regex, not an array - so don't enclose the pattern with [...].

Is it only parse_groks that allow any array of patterns then? as im learning more - i read that grok internally uses regex anyway, so i thought it may be a bit more performant if i use parse_regex. but i have logs that have different formats...and wondering if there is a way to pass these different patterns to parse_regex. or do i have to have multiple parse_regex calls?

👋 do you mind opening this as a separate discussion post since it is unrelated to this issue?

dss010101 commented 1 year ago

Done...sorry for hijacking threads...https://github.com/vectordotdev/vrl/issues/246

polarathene commented 11 months ago

Just chiming in to mention that the docs example for strip_ansi_escape_codes!() fails with \e escape code. It suggests to use the VRL REPL which likewise fails:

$ vector vrl
$ strip_ansi_escape_codes("\e[46mfoo\e[0m bar")

error[E202]: syntax error
  ┌─ :1:1
  │
1 │ strip_ansi_escape_codes("\e[46mfoo\e[0m bar")
  │ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ unexpected error: invalid escape character: \e
  │
  = see language documentation at https://vrl.dev
  = try your code in the VRL REPL, learn more at https://vrl.dev/examples

In my case, I was actually trying to apply ANSI escape codes for coloured output in the console sink, which either outputs unescaped (or partial) ANSI codes as plain text, or fails with an error like above.

I wasn't sure if it was unsupported by the sink, but perhaps it's related to this issue (Vectors own internal log output leveraged colour without issue):

fingon commented 8 months ago

It is somewhat frustrating that I cannot cut-n-paste (json encoded, e.g. \u..) ANSI escape strings from logs to vector test so that I could ensure that my pipeline (correctly) handles their stripping. In regexps it is less important, I guess, although it would be nice to have it working there too.

DylanRJohnston commented 4 months ago

Should probably adjust the documentation on string expressions until support for these escape sequences is re-added https://vector.dev/docs/reference/vrl/expressions/#string-characteristics. I thought I was doing something wrong.

vectordotdev / vrl