Open hhromic opened 2 years ago
Looks like inded those escape characters are not implemented in the VRL lexer:
After further investigation, it seems that these escape sequences used to work before, but they were not ported over in the VRL parser/compiler re-implementation PR vectordotdev/vector#6353.
Interesting, thanks for the sleuthing! cc/ @StephenWakely @JeanMertz for thoughts on this.
This is correct. I didn't port them over, not for any particular reason, other than wanting to start out simple, and adding more escape sequences as requested by the community.
I think it makes sense to handle escape characters similar to how the Rust compiler handles them.
I'm glad the omission was just for simplicity of the initial implementation instead of a limitation :) And yes, having parity with Rust's escape character mechanism would be very nice and complete indeed.
In fairness, I don't think the missing escape characters I reported are a big use-case, I haven't seen anyone asking for them except for me in this issue after a long time since the new lexer was introduced.
The use-case for me that triggered discovering the missing escape characters is that I was testing my implementations of ip_ntop()
and ip_pton()
using the VRL REPL and I wasn't able to generate strings with arbitrary bytes. In other words, the escape I needed was the binary \xNN
escape. Then the documentation pointed me to \uNNNNNN
(which is not exactly the same but close-enough), which didn't work either.
In VRL, I'm having troubles with the limits Rust has for escaped characters and regex. PCRE2, Python or Golang style is preferable to Rust or ECMAScript here.
Thx
Is anyone by any chances working on this?, I did try to add null escape in the mentioned new lexer, but I can't get it to work, still says \0
is an invalid escape.
I'm not very well versed on Rust so maybe I'm missing something, this is my fork, if anybody could point me in the right direction, any help would be appreciated.
https://github.com/vectordotdev/vrl/compare/main...arthmoeros:vrl:null_escape_lexer
We do have a use case where we need to remove null characters in a message, a workaround would also be welcome. In the meantime I will try with a lua transform.
@arthmoeros Your code looks fine and works for me.
I'm curious how you are testing it? Perhaps you are still running the Vector project that is pulling VRL from Github and not from your local repo?
@StephenWakely I compiled Vector changing every reference of VRL using my Github repo, I just pushed my fork of Vector where I made the change.
https://github.com/arthmoeros/vector/commit/4f8eb9e3af052d23147f3cb9389dcc9b7661e284
So maybe I'm doing something wrong on Vector compilation? (I dont really know what though hehe)
@StephenWakely I compiled Vector changing every reference of VRL using my Github repo, I just pushed my fork of Vector where I made the change.
So maybe I'm doing something wrong on Vector compilation? (I dont really know what though hehe)
Ah, you want to specify branch
rather than rev
.
vrl = { package = "vrl", git = "https://github.com/arthmoeros/vrl", branch = "null_escape_lexer" }
An easier way to test directly in the VRL repo is to just run the cli
project:
> cd lib/cli
> cargo run
@StephenWakely I compiled Vector changing every reference of VRL using my Github repo, I just pushed my fork of Vector where I made the change. arthmoeros/vector@4f8eb9e So maybe I'm doing something wrong on Vector compilation? (I dont really know what though hehe)
Ah, you want to specify
branch
rather thanrev
.vrl = { package = "vrl", git = "https://github.com/arthmoeros/vrl", branch = "null_escape_lexer" }
An easier way to test directly in the VRL repo is to just run the
cli
project:> cd lib/cli > cargo run
Wonderful, thanks for the tip on the cli project, worked like charm and indeed easier to test.
Just submitted the PR: https://github.com/vectordotdev/vrl/pull/219
Thanks a lot for your help!
would this be the reason that this:
source = '''
. |= parse_regex!(
.message,
pattern = [r"\[(?P<log_time>.+)\]\[(?P<process>\d+):(?P<thread>\d+)\]\[(?P<severity>\w+)\]\[(?P<module>\w+)\.(?P<func>\w+)\] (?P<msg>.*$)"]
)
.log_time, err = parse_timestamp(.log_time, "%Y-%m-%d %H:%M:%S,%f")
.process = to_int!(.process)
.thread = to_int!(.thread)
'''
causes this exception:
| 2023-05-15T04:21:51.724291Z INFO vector::app: Loading configs. paths=["/etc/vector/vector.toml"]
| 2023-05-15T04:21:51.747923Z ERROR vector::topology: Configuration error. error=Transform "log_parser":
| error[E202]: syntax error
| โโ :1:1
| โ
| 1 โ โญ . |= parse_regex!(
| 2 โ โ .message,
| 3 โ โ pattern = [r"\[(?P<log_time>.+)\]\[(?P<process>\d+):(?P<thread>\d+)\]\[(?P<severity>\w+)\]\[(?P<module>\w+)\.(?P<func>\w+)\] (?P<msg>.*$)"]
| 4 โ โ )
| ยท โ
| 7 โ โ .thread = to_int!(.thread)
| 8 โ โ
| โ โฐ^ unexpected error: invalid escape character: \[
Is there a workaround for this?
Is there a workaround for this?
Hi. There are multiple issues with your code, none of which are related to this issue.
:
- so use pattern:
rather than pattern =
for the pattern parameter.r'..'
rather than r".."
- this is the primary cause the error you are getting.[...]
.This should work:
. |= parse_regex!(
.message,
pattern: r'\[(?P<log_time>.+)\]\[(?P<process>\d+):(?P<thread>\d+)\]\[(?P<severity>\w+)\]\[(?P<module>\w+)\.(?P<func>\w+)\] (?P<msg>.*$)'
)
Is there a workaround for this?
Hi. There are multiple issues with your code, none of which are related to this issue.
- Named parameters use
:
- so usepattern:
rather thanpattern =
for the pattern parameter.- Regex literals use
r'..'
rather thanr".."
- this is the primary cause the error you are getting.- The pattern parameter is a single regex, not an array - so don't enclose the pattern with
[...]
.This should work:
. |= parse_regex!( .message, pattern: r'\[(?P<log_time>.+)\]\[(?P<process>\d+):(?P<thread>\d+)\]\[(?P<severity>\w+)\]\[(?P<module>\w+)\.(?P<func>\w+)\] (?P<msg>.*$)' )
ok thank you - now i know. i actually had success with the grok patterns. but have a question. In the below i have several parsing patterns. if the 'msg' field i parse out may contain a pattern like 'app_id=app_1234', can i post parse and add 'app_id' as a field, rather than include another parser pattern for it as i do below? It seems it would be cleaner to post parse and add optional fields that may or may not be in the msg. Are there any examples of doing this?
source = '''
. |= parse_groks!(
.message,
patterns:[
"\\[%{TIMESTAMP_ISO8601:log_time}\\]\\[%{INT:process}:%{INT:thread}\\]\\[%{WORD:severity}\\]\\[%{WORD:module}.%{WORD:func}\\] app_id:%{NOTSPACE:app_id} %{GREEDYDATA:msg}",
"\\[%{TIMESTAMP_ISO8601:log_time}\\]\\[%{INT:process}:%{INT:thread}\\]\\[%{WORD:severity}\\]\\[%{WORD:module}.%{WORD:func}\\] %{GREEDYDATA:msg}",
"\\[%{TIMESTAMP_ISO8601:log_time}\\]\\[%{INT:process}:%{INT:thread}\\]%{WORD:severity}:%{WORD:module}:%{GREEDYDATA:msg}"
]
)
'''
example logs with/and without app_id:
[2023-05-15 11:52:29,401][2555:140431094425152][INFO][default.run] initalizing application
[2023-05-15 11:52:29,401][2555:140431094425152][INFO][default.run] app_id:my_app_12345 initialized.
- single regex, not an array - so don't enclose the pattern with
[...]
.
Is it only parse_groks that allow any array of patterns then? as im learning more - i read that grok internally uses regex anyway, so i thought it may be a bit more performant if i use parse_regex. but i have logs that have different formats...and wondering if there is a way to pass these different patterns to parse_regex. or do i have to have multiple parse_regex calls?
- single regex, not an array - so don't enclose the pattern with
[...]
.Is it only parse_groks that allow any array of patterns then? as im learning more - i read that grok internally uses regex anyway, so i thought it may be a bit more performant if i use parse_regex. but i have logs that have different formats...and wondering if there is a way to pass these different patterns to parse_regex. or do i have to have multiple parse_regex calls?
๐ do you mind opening this as a separate discussion post since it is unrelated to this issue?
Done...sorry for hijacking threads...https://github.com/vectordotdev/vrl/issues/246
Just chiming in to mention that the docs example for strip_ansi_escape_codes!()
fails with \e
escape code. It suggests to use the VRL REPL which likewise fails:
$ vector vrl
$ strip_ansi_escape_codes("\e[46mfoo\e[0m bar")
error[E202]: syntax error
โโ :1:1
โ
1 โ strip_ansi_escape_codes("\e[46mfoo\e[0m bar")
โ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ unexpected error: invalid escape character: \e
โ
= see language documentation at https://vrl.dev
= try your code in the VRL REPL, learn more at https://vrl.dev/examples
In my case, I was actually trying to apply ANSI escape codes for coloured output in the console
sink, which either outputs unescaped (or partial) ANSI codes as plain text, or fails with an error like above.
I wasn't sure if it was unsupported by the sink, but perhaps it's related to this issue (Vectors own internal log output leveraged colour without issue):
It is somewhat frustrating that I cannot cut-n-paste (json encoded, e.g. \u..) ANSI escape strings from logs to vector test so that I could ensure that my pipeline (correctly) handles their stripping. In regexps it is less important, I guess, although it would be nice to have it working there too.
Should probably adjust the documentation on string expressions until support for these escape sequences is re-added https://vector.dev/docs/reference/vrl/expressions/#string-characteristics. I thought I was doing something wrong.
A note for the community
Problem
I recently found out that the null and unicode character escape sequences are not working in string literal expressions in VRL. The documentation makes sense, therefore I think something is off in the VRL parser.
Examples from the VRL REPL:
All the other documented escape sequences seems to be working fine (although not all printed correctly):
Configuration
Version
vector 0.20.0 (x86_64-unknown-linux-gnu 2a706a3 2022-02-11)
Debug Output
No response
Example Data
No response
Additional Context
No response
References
No response