vectordotdev / vrl

Vector Remap Language
Mozilla Public License 2.0

Expand supported redactors for redact #112

Open jszwedko opened 3 years ago

jszwedko commented 3 years ago

Broken off from https://github.com/timberio/vector/pull/7250#discussion_r631912508

The initial implementation of redact just had one redactor that always replaced with [REDACTED]. We should expand this to support additional redactors like:

mr-karan commented 3 years ago

+1 for this. I was migrating from a Logstash-based config where I'm using gsub to achieve this. I wanted to preserve the first and last few characters of a sensitive token field, but it looks like that isn't possible.

For example if this could work: replace(.message,r'(my_token)(.*?):(.*?)(\S{8})', r'\1*\3') <-I wanted to preserve the field name itself and the last 8 chars.

Is there any workaround using other string substitution methods?

JeanMertz commented 3 years ago

@mr-karan you can still use replace to achieve this, but:

  1. You need to use $1 to reference capture groups
  2. The third argument to replace has to be a string
$ .message = "my_token:abcdefghijklmnopqrstuvwxyz"
"my_token:abcdefghijklmnopqrstuvwxyz"

$ replace(.message, r'(my_token):(.*)(\S{8})', "$1*$3")
"my_token*stuvwxyz"

You can try it out yourself by running vector vrl.

mr-karan commented 3 years ago

@JeanMertz Thanks for the help. Works well :+1:

mr-karan commented 3 years ago

@JeanMertz A bit perplexed here. I tried out the replace in vrl and it worked perfectly fine. However, it doesn't work in the actual pipeline. I wrote a small unit test for you to check. (I can open a new issue if that is more relevant.)

[transforms.format_logs]
type = "remap"
inputs = ["haproxy_logs"]
source = '''
.message = replace!(.message,r'(auth=token)(.*?):(.*?)(\S{8})&', "$1$2:*$4&")
'''

[[tests]]
  name = "check if token is redacted"

  [[tests.inputs]]
    insert_at = "format_logs"
    type = "raw"
    value = "auth=token myapp:GDyB5onL3Whi69RY2MELVPLWs1nVYamq&-"

  [[tests.outputs]]
    extract_from = "format_logs"

    [[tests.outputs.conditions]]
      type = "check_fields"
      "message.equals" = "auth=token myapp:*s1nVYamq&-"

When running vector test:

Running tests
Jul 02 10:58:25.195  WARN vector::conditions::check_fields: The `check_fields` condition is deprecated, use `remap` instead.
test check if token is redacted ... failed

failures:

test check if token is redacted:

check transform 'format_logs' failed conditions:
  condition[0]: predicates failed: [ message.equals: "auth=token myapp:*s1nVYamq&-" ]
payloads (events encoded as JSON):
   input: {"message":"auth=token myapp:GDyB5onL3Whi69RY2MELVPLWs1nVYamq&-","timestamp":"2021-07-02T05:28:25.194973128Z"}
  output: {"message":":*&-","timestamp":"2021-07-02T05:28:25.194973128Z"}

When doing the same thing with vector vrl:

$ msg = "auth=token myapp:GDyB5onL3Whi69RY2MELVPLWs1nVYamq&-"
"auth=token myapp:GDyB5onL3Whi69RY2MELVPLWs1nVYamq&-"

$ replace(msg,r'(auth=token)(.*?):(.*?)(\S{8})&', "$1$2:*$4&")
"auth=token myapp:*s1nVYamq&-"

I am really confused how this is happening :dizzy_face:

jszwedko commented 3 years ago

Hi @mr-karan. I think you're running into the same issue as https://github.com/timberio/vector/issues/8067.

The issue is that $1 is interpreted when Vector loads the config to mean you want to inject the environment variable $1 into the config file. This behavior is described here: https://vector.dev/docs/reference/configuration/#environment-variables

You can escape the $ via $$ so something like replace(msg,r'(auth=token)(.*?):(.*?)(\S{8})&', "$$1$$2:*$$4&") should work for you.
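For reference, here is a sketch of the format_logs transform from the unit test above with that escaping applied. It assumes nothing else in the config changes; only the replacement string differs from the original.

```toml
[transforms.format_logs]
type = "remap"
inputs = ["haproxy_logs"]
source = '''
# Each doubled dollar sign is unescaped to a single one when the config is
# loaded, so VRL sees capture-group references instead of environment variables.
.message = replace!(.message, r'(auth=token)(.*?):(.*?)(\S{8})&', "$$1$$2:*$$4&")
'''
```

Without the escaping, the undefined variables are replaced with empty strings, so the replacement string collapses to ":*&". That is consistent with the ":*&-" output in the failing test.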

This is a pretty big gotcha, though, as the replacement groups use $. I'm wondering if we could improve the error messaging here; right now you should at least see a warning in Vector's output at startup that $1, $2, etc. are undefined. At the very least, we could call it out in the documentation.

mr-karan commented 3 years ago

Thanks @jszwedko for the explanation :) Escaping $ worked!