Open jszwedko opened 3 years ago
+1 for this. I was migrating from a Logstash-based config where I'm using gsub to achieve this. I wanted to preserve the first and last few characters of a sensitive token field, but it looks like that isn't possible.
For example, if this could work: replace(.message,r'(my_token)(.*?):(.*?)(\S{8})', r'\1*\3')
<- I wanted to preserve the field name itself and the last 8 chars.
Is there any workaround using other string substitution methods?
@mr-karan you can still use replace to achieve this, but you need to use $1 to reference capture groups, and the replacement has to be a string:
$ .message = "my_token:abcdefghijklmnopqrstuvwxyz"
"my_token:abcdefghijklmnopqrstuvwxyz"
$ replace(.message, r'(my_token):(.*)(\S{8})', "$1*$3")
"my_token*stuvwxyz"
You can try it out yourself by running vector vrl.
@JeanMertz Thanks for the help. Works well :+1:
@JeanMertz A bit perplexed here. I tried out the replace in vrl and it worked perfectly fine. However, it doesn't work in the actual pipeline. I wrote a small unit test for you to check. (I can open a new issue if that is more relevant.)
[transforms.format_logs]
type = "remap"
inputs = ["haproxy_logs"]
source = '''
.message = replace!(.message,r'(auth=token)(.*?):(.*?)(\S{8})&', "$1$2:*$4&")
'''
[[tests]]
name = "check if token is redacted"
[[tests.inputs]]
insert_at = "format_logs"
type = "raw"
value = "auth=token myapp:GDyB5onL3Whi69RY2MELVPLWs1nVYamq&-"
[[tests.outputs]]
extract_from = "format_logs"
[[tests.outputs.conditions]]
type = "check_fields"
"message.equals" = "auth=token myapp:*s1nVYamq&-"
When running vector test:
Running tests
Jul 02 10:58:25.195 WARN vector::conditions::check_fields: The `check_fields` condition is deprecated, use `remap` instead.
test check if token is redacted ... failed
failures:
test check if token is redacted:
check transform 'format_logs' failed conditions:
condition[0]: predicates failed: [ message.equals: "auth=token myapp:*s1nVYamq&-" ]
payloads (events encoded as JSON):
input: {"message":"auth=token myapp:GDyB5onL3Whi69RY2MELVPLWs1nVYamq&-","timestamp":"2021-07-02T05:28:25.194973128Z"}
output: {"message":":*&-","timestamp":"2021-07-02T05:28:25.194973128Z"}
When doing the same thing with vector vrl:
$ msg = "auth=token myapp:GDyB5onL3Whi69RY2MELVPLWs1nVYamq&-"
"auth=token myapp:GDyB5onL3Whi69RY2MELVPLWs1nVYamq&-"
$ replace(msg,r'(auth=token)(.*?):(.*?)(\S{8})&', "$1$2:*$4&")
"auth=token myapp:*s1nVYamq&-"
I am really confused how this is happening :dizzy_face:
Hi @mr-karan. I think you're running into the same issue as https://github.com/timberio/vector/issues/8067.
The issue is that $1 is interpreted when Vector loads the config to mean you want to inject the environment variable $1 into the config file. This behavior is described here: https://vector.dev/docs/reference/configuration/#environment-variables
You can escape the $ via $$, so something like replace(msg,r'(auth=token)(.*?):(.*?)(\S{8})&', "$$1$$2:*$$4&") should work for you.
This is a pretty big gotcha, though, as the replacement groups use $. I'm wondering if we could improve the error messaging here, although right now you should at least see a warning in the output of vector that $1, $2, etc. are undefined when starting Vector. Certainly we could call it out in the documentation at least.
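To make the failure mode concrete, here is a rough Python sketch of what seems to happen to the replacement string. It is a simplified model for illustration only, not Vector's actual implementation: interpolate_env and replace_with_groups are hypothetical helpers, the interpolation regex is an assumption, and undefined variables are assumed to expand to an empty string (which is consistent with the ":*&-" output in the failing test above).

```python
import re


def interpolate_env(text, env):
    """Simplified, assumed model of config-time interpolation.

    "$$" becomes a literal "$"; "$NAME" and "${NAME}" become the value
    from env, or an empty string when the variable is undefined.
    """
    def expand(match):
        if match.group(0) == "$$":
            return "$"
        name = match.group(1) or match.group(2)
        return env.get(name, "")

    return re.sub(r"\$\$|\$(\w+)|\$\{(\w+)\}", expand, text)


def replace_with_groups(message, pattern, template):
    """Apply a "$N"-style replacement template using Python's re module."""
    # Python writes group references as \g<N>, so translate "$N" first.
    py_template = re.sub(r"\$(\d+)", r"\\g<\1>", template)
    return re.sub(pattern, py_template, message)


PATTERN = r"(auth=token)(.*?):(.*?)(\S{8})&"
MESSAGE = "auth=token myapp:GDyB5onL3Whi69RY2MELVPLWs1nVYamq&-"

# Unescaped: "$1", "$2" and "$4" are treated as undefined variables and vanish.
broken_template = interpolate_env("$1$2:*$4&", env={})
broken_result = replace_with_groups(MESSAGE, PATTERN, broken_template)

# Escaped: "$$" survives interpolation as "$", so the group references remain.
fixed_template = interpolate_env("$$1$$2:*$$4&", env={})
fixed_result = replace_with_groups(MESSAGE, PATTERN, fixed_template)

print(broken_result)  # :*&-
print(fixed_result)   # auth=token myapp:*s1nVYamq&-
```

The vector vrl REPL never goes through config loading, which would explain why the unescaped form works there but not in the pipeline.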
Thanks @jszwedko for the explanation :) Escaping $ worked!
Broken off from https://github.com/timberio/vector/pull/7250#discussion_r631912508
The initial implementation of redact just had one redactor that always replaced with [REDACTED]. We should expand this to support additional redactors like: