vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0
17.42k stars 1.51k forks source link

Remap traversal and function application #4518

Closed binarylogic closed 3 years ago

binarylogic commented 3 years ago

A very powerful feature of the Remap language is the ability to traverse nested structured and apply functions to the provided values. This is especially useful for PII redaction.

Given that the Remap language does not have the ability to loop over data, the syntax for this function will need consensus.

Examples

To replace SSNs:

traverse(., replace(/^(?!666|000|9\\d{2})\\d{3}-(?!00)\\d{2}-(?!0{4})\\d{4}$/, "XXX-XXX-XXXX"))

Obviously this is not a great syntax, but I'm unsure what precedence we have for this type of thing. I'm open to ideas.

cc @JeanMertz @FungusHumungus for ideas.

lukesteensen commented 3 years ago

For some prior art to consider, jq has both the recurse and walk functions that seem very similar to this.

JeanMertz commented 3 years ago

As an exercise to find the best possible syntax for this, I've written out a somewhat complex example.

I should note that there is a case to be made for people to resort to Lua or Wasm if they need something as complex as this example, but I think it's right on the border of something we'd still want to support with TRL.

Given this input:

{
    "orders": {
        "a": 1,
        "b": 2,
        "c": 3,
        "d": [4, 5, 6],
        "e": {
            "regular": 7,
            "special": 8,
        }
    }
}

I want to get this output:

{
    "total": 36,
    "products": [
        "a",
        "b",
        "c",
        "d",
        "e/regular",
        "e/special",
    ],
    "orders": { /* ... */ },
}

To achieve this, I need to:


Here is one solution that I've come up with:

// new fields
.total = 0
.products = []

// store the root object in a variable, so we can access it within `walk`
$root = .

// walk the object, starting at the "orders" field
walk .orders {
    // for the orders map, we want to store their keys as product names
    if . == $root.orders {
        // we use `entries` to gain access to the individual key/value pairs in the map
        walk entries(.) {
            // alternative syntax: we add a new `push` function, or support appending to arrays using `+`
            .products = merge(.products, [.key])

            if is_map(.value) {
                // delete the last product, store its name for future reference
                $product = del(.products[len(.products) - 1])

                // iterate over the subcategories and append them to the list of products
                walk keys(.value) {
                    push(.products, $product + "/" + .)
                }
            }
        }
    } else if is_int(.) {
        .total = total + .
    }
}

Let's quickly unpack this (although I think it's fairly straightforward):

We can name walk whatever we like (iter, traverse, for, etc), but the concept remains the same: you provide two expressions, and get access to the iterated value(s) of that expression through the . dot syntax.

There's one major drawback I currently see (but I'm sure there are more, so please weigh in here):

There's a big caveat to the above script, which is that even if we supported walk, the script won't work. This is because we are missing a fairly large set of features used above:


The goal of this exercise for me was to find a non-trivial example and see which syntax feels best while still giving the most flexibility.

For reference, here are a few others I played with, but deemed non-ideal for different reasons (replace walk with whatever name you prefer):

// pure function — feels more cluttered, and walking should be a built-in
// statement similar to if-statements
walk(.orders, {
    // ...
})
// closure-like approach — also more cluttered, and you get into cases where
// $key is undefined (or maybe the index) when iterating an array, also this
// requires more special syntax, and it gives the impression that we have _true_
// closures, which we do not
walk .orders |key, value| {
    // access $key and $value here
}
// substitute walk for a special symbol — I actually like this one, but having
// an actual word (similar to "if") makes it much less cryptic for the average
// reader of these scripts what's going on
.orders -> {

}
// another variant that I considered, and definitely like. Especially because it makes the
// block-less variant of walking a bit less weird to read. I would probably be in favor if this
// variant, but I can also see the reasoning that it adds more symbol cluttering that we want to
// avoid.
walk .orders -> . = to_int(.)

One other thing this does expose, is that our current feature-set tends to favour rightward drift of your code if you have a lot of nested if/walk statements.

We could avoid this by introducing control-flow statements such as continue, break (both only work inside walk) and return (stops script execution, returns value), but I also realize this is eating into our complexity budget, so there's a trade-off to be made here.

Here's the same walk example, using control-flow statements (and the -> alternative syntax):

walk .orders -> {
    if . != $root.orders {
        if is_int(.) {
            .total = total + .
        }

        continue
    }

    walk entries(.) -> {
        .products = merge(.products, [.key])

        if !is_map(.value) {
            continue
        }

        $product = del(.products[len(.products) - 1])
        walk keys(.) -> .products = merge(.products, $product + "/" + .)
    }
}

Or, walking twice (without control flow):

// sum all integers
walk . -> if is_int(.) { .total = .total + . }

// build product list
walk entries(.orders) {
    .products = merge(.products, [.key])

    if is_map(.value) {
        $product = del(.products[len(.products) - 1])
        walk keys(.) -> .products = merge(.products, $product + "/" + .)
    }
}

Finally, here's the example posted by @binarylogic, using the proposed syntax:

// without block
walk . replace(., /^(?!666|000|9\\d{2})\\d{3}-(?!00)\\d{2}-(?!0{4})\\d{4}$/, "XXX-XXX-XXXX")

// with block
walk . {
    replace(., /^(?!666|000|9\\d{2})\\d{3}-(?!00)\\d{2}-(?!0{4})\\d{4}$/, "XXX-XXX-XXXX")
}

// without block, using `->` variant
walk . -> replace(., /^(?!666|000|9\\d{2})\\d{3}-(?!00)\\d{2}-(?!0{4})\\d{4}$/, "XXX-XXX-XXXX")

Curious what your thoughts are!

jszwedko commented 3 years ago

Nice work! I think this looks pretty good, but have a couple of questions below. I like the syntax using both the block and ->:

walk .orders -> {
}

You can't access the root path . inside walk, since . is repurposed to mean "the current root". It would be nice to find a solution to this, but I am a bit hesitant to introduce special syntax to do so, as I demonstrated above that you can easily do $root = . before you start walking to keep track of the object root.

I was going to point out that Go's text/template has this same problem of reassigning .. They use $. to refer to the global scope, but that might collide with our variable syntax. Assigning $root outside when users need it seems like a reasonable trade-off for now, we can always add syntax later.

For your replace example:

walk . -> replace(., /^(?!666|000|9\\d{2})\\d{3}-(?!00)\\d{2}-(?!0{4})\\d{4}$/, "XXX-XXX-XXXX")

Given that replace, I believe, returns a new value rather than modifying the given one. Does this mean that walk would replace the value represented with . by the last value of the block? If so, how does this work with walk entries(.) where the value being traversed is temporary?

I am having trouble figuring out what Elasticsearch dedot script (https://github.com/timberio/vector/issues/3588#issuecomment-692706150) would look like with this approach though. Maybe that'd be another good example to try with this? Especially given that https://github.com/timberio/vector/issues/3588 deferred to be implemented in remap. I feel like it might require functions and recursion.

JeanMertz commented 3 years ago

They use $. to refer to the global scope, but that might collide with our variable syntax.

Since . isn't a valid ident (and thus a variable with that name cannot be created by the user), we are free to use the $. syntax to reference the global root path. But I agree, we can punt on that for now 👍

For your replace example:

walk . -> replace(., /^(?!666|000|9\\d{2})\\d{3}-(?!00)\\d{2}-(?!0{4})\\d{4}$/, "XXX-XXX-XXXX")

Given that replace, I believe, returns a new value rather than modifying the given one. Does this mean that walk would replace the value represented with . by the last value of the block?

Ah shoot! I totally missed that.

So for replace you'd write:

walk . -> . = replace(., /^(?!666|000|9\\d{2})\\d{3}-(?!00)\\d{2}-(?!0{4})\\d{4}$/, "XXX-XXX-XXXX")

// or
walk . -> {
    . = replace(., /^(?!666|000|9\\d{2})\\d{3}-(?!00)\\d{2}-(?!0{4})\\d{4}$/, "XXX-XXX-XXXX")
}

I am having trouble figuring out what Elasticsearch dedot script (#3588 (comment)

Thanks for that example. My first instinct would be something like this:

if !is_map(.kubernetes) {
    return // assuming we support this, otherwise wrap the entire script in an if-statement
}

$map = {}
walk entries(.kubernetes) -> {
    .key = replace(.key, ".", "_")

    // we don't have this function yet, but we'd need something that
    // allows us to dynamically set object paths
    insert($map, .key, .value)
}

.kubernetes = $map

But I realize that this risks recursing into nested maps/arrays. We'll likely want a similar syntax to iterate over maps or arrays without recursing. I could see us adding the same syntax but using the iter keyword to only traverse the top-level container, but I haven't fully thought that through yet.

Something to think of over the weekend 🤔

jszwedko commented 3 years ago

I think in the ES case, we would actually want to traverse nested maps and arrays though that lua script does not. I think we'd want to translate things like:

{
  "kubernetes.pod_labels": {
    "addonmanager.kubernetes.io/mode": "Reconcile",
    "gcp-auth-skip-secret": "true",
    "integration-test": "storage-provisioner"
  }
}

into

{
  "kubernetes_pod_labels": {
    "addonmanager_kubernetes_io/mode": "Reconcile",
    "gcp-auth-skip-secret": "true",
    "integration-test": "storage-provisioner"
  }
}

Not recursing may be a need for other use-cases though.

StephenWakely commented 3 years ago

I don't think there is too much involved to get this working.

We are fortunate in that all function parameters are lazily evaluated, so there should not be too much concern there.

We already have (or will do soon) the concept of a block that is executed by an Expression in the If statement. That is a parenthesis delimited, semicolon or new line separated set of statements. It would not be too much of a strecth to do the same for Walk.

The only complication is what to call the variable we are walking over. I don't like the idea of clobbering an existing identifier, so using . as this introduces the concept of scope and Remap only works with a single global scope. As you note with recursive walks this could get complicated.

I'm almost tempted to suggest introducing a new syntactical element - perhaps % to represent the current object. That would ensure that we never get name conflicts, but I'm not quite sure about the ramifications to the underlying Expression code would be yet..

(Sorry, didn't mean to close the issue - hit the wrong button)

binarylogic commented 3 years ago

I haven't had time to really sit with this, so please don't read too heavily into my assessment. Seeing it in action is making me think my original proposal was too complicated. There are a few things about this syntax that feel off:

walk entries(.kubernetes) -> {
    .key = replace(.key, ".", "_")
    insert($map, .key, .value)
}
  1. It's non-obvious what is happening. I realize that's the point of this discussion 😄, but I wanted to mention it since a principle of the language is to be self-documenting.
  2. Both walk and entries(..) feel redundant.
  3. I'm unsure what the .key value would be. Is it the direct value or the full path?
  4. Having access to both the .key and .value attributes makes me think I'm performing a traditional, single-level loop, not traversing the data structure.

Perhaps we can start simpler before trying to break this down too much?

traverse_values(., redact($1, ...))

Or

traverse_keys(., replace($1, '.', '_')

I realize this is limiting, but it still leaves the door open for more complex loops if we come across a use case that requires it. The examples in previous comments do not appear to be real-world in the observability use-case. Again, two principles of this language are to be self-documenting and safe.

bruceg commented 3 years ago
* `walk` walks _all_ paths, and thus recurses into nested maps and arrays

If so, we also need a related expression that does not recurse.

* as with all other statements, this one is an expression as well, returning the last expression executed at the last walk iteration

I think making it an expression makes sense, however the "last expression executed" value seems a little arbitrary. Is there any value in creating a new object during traversal and returning that?

// substitute walk for a special symbol — I actually like this one, but having // an actual word (similar to "if") makes it much less cryptic for the average // reader of these scripts what's going on

I agree. Naming it makes it much less cryptic.

Perhaps we can start simpler before trying to break this down too much?

traverse_values(., redact($1, ...))
traverse_keys(., replace($1, '.', '_')

This looks like a very simple and powerful first step, particularly when combined with if-as-an-expression and things like is_map. Events that we will want to traverse are rarely arbitrarily deep. Since they have a known shape, full recursion may not be needed or even useful.

In these examples, it appears you are suggesting they replace the values and keys. I would recommend replace_values and replace_keys instead then.

JeanMertz commented 3 years ago

Thank you for the input everyone!

Note that I had written a line-by-line response to all the replies so far, but after multiple iterations/edits, I've trimmed this down to discuss @binarylogic's proposal, as I think that's what people are converging on.

Perhaps we can start simpler before trying to break this down too much?

traverse_values(., redact($1, ...))

Or

traverse_keys(., replace($1, '.', '_'))

After thinking this through and writing and refining my reply to this proposal four or five times, I've come to the conclusion that your example (once it considers the intrinsic details) is the same as my proposal, except for four changes:

  1. It uses the proposed function-like alternative syntax
  2. It splits key/value traversing up into two separate functions
  3. It handles assignments implicitly
  4. It uses $1 instead of . to reference the value/key being traversed

That is to say, these result in the same outcome:

walk . -> . = redact(., ...)

(using @bruceg's proposed name)

replace_values(., redact($1, ...))

Both syntaxes have their pro's and cons. Let’s discuss the four differences individually:

  1. function-like syntax

    I’m not against this, but as I (briefly) touched upon in my alternative syntax, it does feel a bit arbitrary that we have special syntax for if/else statements, but not for traversing key/values. The one argument I could make is that the former is a well-known paradigm in programming languages, whereas the latter is not really something you see in languages, other than the ones specifically built for the purpose of remapping structured data.

    That would also be exactly why I think it’s still worth considering making this a first-class statement, but it’s not something I have a heavy preference for, one way or the other.

    The other thing worth mentioning — but again it’s not that Important — is that TRL currently has no built-in functions. All functions we provide are implemented in Vector itself and injected into the TRL runtime. This was done for three reasons:

    1. It allows us to decide which functions are available in what context
    2. It keeps the core language simple/small (only object/variable accessors/assignment, if-statements and arithmetic)
    3. It increases our development speed (the combined set of functions pull in many external crates, increasing compile times significantly)

    So if we went this route, I would also implement this function in Vector to start. This has no impact on the outcome currently, but it’s worth mentioning.

  2. separate value/key traversal functions

    I don’t have too much to add here. It’s basically deciding between:

    replace_values(., ...) and replace_keys(., ...)

    vs:

    replace(., ...) + replace(keys(.), ...)

    I feel like we should be able to compose replace with a function such as keys and entries, but I also see the simplicity in just offering two separate functions for this. With my “language hat” on, I’d lean towards the second proposal because of the enhanced composability it provides, but there are still some unknowns that would have to be fleshed out there, and the former does make it explicit/easy to understand what you’re doing.

  3. in-place assignment

    If we take your example again, the function would basically turn this:

    replace_values(., redact($1, ...))

    Into this:

    replace_values(., $1 = redact($1, ...))

    That is to say, under the hood it would implicitly assign the result of the redact function to the value being traversed.

    This matches what jq does with walk, but it also can be confusing.

    Consider this example:

    { "foo": { "bar": "baz" } }
    replace_values(., replace($1, "baz", "qux"))

    What do you expect the outcome of this call to be? The outcome would be a runtime error:

    remap error: function “replace” error: expected string type for “value” argument, got map

    This happens because the function would first traverse the { "bar": "baz" } map value, before moving into that map and handling the ”baz” value for the bar key.

    You can see how a similar jq script deals with this:

    walk(if type == "array" then sort else . end)

    If we do in-place mutation, you’d have to do the same here:

    replace_values(., if is_string($1) {
       replace($1, "baz", "qux")
    } else {
       $1
    })

    _(quick aside: I suspect many calls to traverse_values will result in multi-line expressions, which doesn't look that great within function arguments, which is also why I'd lean towards dedicated syntax for this)_

    Again, under the hood this would translate to $1 = if is_string($1) ....

    This works, but I feel like removing the implicit assignment might makes things easier to understand:

    // switching to `traverse_values` instead of `replace_values`. since we're no longer implicitly mutating
    traverse_values(., if is_string($1) {
       $1 = replace($1, "baz", "qux")
    })

    Additionally, this might have performance benefits, as we no longer replace all values in a map (since the above else branch would translate to $1 = $1), but instead we only assign when we need to (although we might be able to turn $1 = $1 into a no-op as well).

    This brings me to my next point.

  4. reference current value/key using $1

    If we removed implicit assignments, then using the $1 = ... syntax feels out-of-place compared to the rest of the language.

    It’s certainly doable, but I would advocate to use my proposal of using . to reference to the existing key/value, instead of $1:

    traverse_values(., if is_string(.) {
       . = replace(., "baz", "qux")
    })

    This reads much more like “regular TRL syntax” than using $1.

    Granted, it does require one to know that within traverse_values the scope is updated and so . no longer points to the root object, but there’s definitely precedence for something like this (see Go, Jq, etc…).

    It should also be noted that if we went with $1, we’d have to special-case this in our grammar, since you wouldn’t actually be assigning to a variable, but to a reference that looks like a variable, but is in fact a reference to the object itself, which I find confusing.


Finally, it’s worth mentioning that traverse_key still has some open questions.

If we consider the traverse_values function, it traverses all key segments individually, and then runs the expression against that segment.

That is to say, given this value:

{ "foo": { "bar": { "baz": true } }, "qux": 10 }

traverse_values(. ,...) would iterate as follows:

  1. .foo => { "bar": { "baz": true } }
  2. .foo.bar => { "baz": true }
  3. .foo.bar.baz => true
  4. .qux => 10

How would traverse_keys iterate?

  1. . => foo
  2. . => qux
  3. .foo => bar
  4. .foo.bar => baz

That's one way to iterate, but I’m curious what others expect to happen here.

As mentioned before, you could also reason that traverse_keys is redundant, as you could do something like:

traverse(paths(.), {
    // mumble-mumble, hand-waving, do something with paths here
})

For traverse_values I think we have to decide on three questions:

  1. Should traversal use a built-in syntax or be implemented as a function?
  2. Should we implicitly or explicitly mutate values?
  3. Should we use $1 or . (or this, or …) to reference the current value?

Sorry for the long reply. This is quite a complex topic, and even after looking at other (similar) languages such as jq (using walk), tremor (using match), blob (using map_each method), etc, it's not clear to me that any one of them has the "perfect solution", and it's definitely clear to me that we don't have yet either.

bruceg commented 3 years ago

it does feel a bit arbitrary that we have special syntax for if/else statements, but not for traversing key/values.

This is a good point. It is good to syntactically separate those things that change control flow from a linear step. This introduces our first loop operation, and it makes sense to format it like conditional execution.

The one argument I could make is that the former is a well-known paradigm in programming languages, whereas the latter is not really something you see in languages, other than the ones specifically built for the purpose of remapping structured data.

It's not really that different from a loop, which is a well known paradigm. The semantics previously described give it recursion, but iterators in Rust can be given any kind of semantics, so that's not all that unusual.

I still think there needs to be some way of doing a walk non-recursively separate from recursing.

   I feel like we should be able to compose `replace` with a function such as `keys` and `entries`, but I also see the simplicity in just offering two separate functions for this. With my “language hat” on, I’d lean towards the second proposal because of the enhanced composability it provides, but there are still some unknowns that would have to be fleshed out there, and the former does make it explicit/easy to understand what you’re doing.

Agreed, composability would be preferred over special casing. My reading of the proposal for separate functions was to reduce the complexity of the initial implementation.

   If we removed implicit assignments, then using the `$1 = ...` syntax feels out-of-place compared to the rest of the language.

Would $. or $_ (ala Perl) be preferable to show that it is not the same as . outside of the walk and is changing each time through? In either case, we do need a "current value" special variable. I agree that using . reads much more like regular TRL syntax, but it's not the same . as outside the operation, which is very not like the rest of TRL.

traverse_values(. ,...) would iterate as follows:

1. `.foo` => `{ "bar": { "baz": true } }`
2. `.foo.bar` => `{ "baz": true }`
3. `.foo.bar.baz` => `true`
4. `.qux` => `10`

Should it do this order, or reverse 1-3 to do depth first?

How would traverse_keys iterate?

1. `.` => `foo`
2. `.` => `qux`
3. `.foo` => `bar`
4. `.foo.bar` => `baz`

If not recursing, then just 1 and 2. If recursing (the less useful case IMO), then probably 4, 3, 1, 2.

For traverse_values I think we have to decide on three questions:

1. Should traversal use a built-in syntax or be implemented as a function?

After considering the above, I think built-in syntax makes more sense, as it is a control-flow change and not a normal function call. The only motivating factor for function syntax is to get something working quickly. If function syntax does not actually help that, then I see no advantage and several disadvantages.

2. Should we implicitly or explicitly mutate values?

Implicit mutation appeals to the functional side of me, and could possibly be optimized away internally with a comparison before assignment. It would likely result in simplified expressions as well. Consider changing key names, one of the known use cases. This requires both a del and an assignment, and would be messy to optimize within the TRL script if it proves to be desirable.

3. Should we use `$1` or `.` (or `this`, or …) to reference the current value?

No strong preference, other than that overloading . seems to introduce confusion. OTOH none of the other suggestions are obviously less confusing.

binarylogic commented 3 years ago

After coming back to this two weeks later, it's clear that the higher level traversal constructs are confusing. Let's start with individual functions that perform the traversal in a way that's hidden from the user. Like redact_all, replace_keys, and replace_values. I'm closing in favor of those functions.