Encryption filter configuration

tombentley commented 4 months ago

I wanted a place to discuss the vision for how to configure the encryption filter. I'm using this issue.

Here's an annotated example to illustrate something which I think should work and be flexible:

    - type: EnvelopeEncryption
      config:
        kms: VaultKmsService
        kmsConfig:
          vaultTransitEngineUrl: http://vault.vault.svc.cluster.local:8200/v1/transit
          vaultToken: mytoken
        defaultPolicy:       # <1>
          cipherSpec: AES_128_GCM_128
          keyEncryptionKey: general-key
          aad: 
          - batch_metadata
          encrypt: 
          - recordKey
          - recordValue
          - recordHeaders
        topicPolicies:       # <2>
        - includeTopics:       # <3> 
            named: ["invoice"]       # <4>
            startingWith: ["invoice_"]       # <5>
          policy:       # <6>
            keyEncryptionKey: "${topic.name}"       # <7>
            encrypt: 
            - recordKey           
        - includeTopics:
            named: ["users"]
          policy:
            keyEncryptionKey: "user-${utf8(record.headers['username'])}"       # <8>
        - includedTopics: 
            startingWith: ["user_"]
          policy:
            keyEncryptionKey: "user-${jsonpath(utf8(record.key), '$.username')}"       # <9>

The defaultPolicy applies to any topic that's not matched by an element in topicPolicies (see 2). It's equivalent to specifying an element in topicPolicies with an include prefix of "" (because the empty string is a prefix of all strings). Within this object we can specify things which are currently in the EncryptionScheme record. We use keyEncryptionKey to name the key (within the KMS) to the used for encryption. If the default is not to encrypt at all you can specify an empty defaultPolicy, defaultPolicy: {}. We can decide on sensible defaults of each of the other properties here.
The defaultPolicy can be overridden for specific topics.
Those topics are identified by criteria defined in an includeTopics object. The criteria individually select topics. The policy (see 6) is applied when a topic matches any criterion. It's possible to narrow an inclusion with an exclusion (i.e. a butExcludeTopics property sibling to includeTopics; not shown).
We support inclusion by name.
We support inclusion by name prefix. We could add other criteria later.
A policy object supports exactly the same schema as the defaultPolicy. You can say policy: {} to mean "no encryption" in the same way.
As well as supporting literal key names the keyEncryptionKey is subject to interpolation of expressions within ${ and }.
We will eventually want to support more sophisticated interpolation expressions. This example shows looking up a header value by header key and converting it to text using a function called utf8. In this example, its evaluation might result in using the key encryption user-tombentley. We don't have to agree on the expression language as part of this discussion.
This is another hypothetical example showing a function called jsonpath. jsonpath parses its first argument as JSON and then returns the result of evaluating the jsonpath given in its 2nd argument against that JSON structure. This example demonstrates a possibly mechanism for getting KEK aliases from within fields of a record key. In this example, its evaluation might also result in using the key encryption user-tombentley

Some properties of this proposal to highlight:

The defaultPolicy and topic policy schemas can evolve over time if we add more features. In particular they cannot collide with future changes to the topic inclusion/exclusion mechanism
The inclusion and exclusion mechanisms can evolve in the future. They don't have the same schema.
We would apply some checks to the inclusions and exclusions to avoid ambiguity.
- No dupes in a named or startingWith. e.g. named: [foo, foo] is not allowed.
- The intersection of all named must be empty. e.g. you can't have two topic policies both with named: foo.
- The intersection of all startingWith must be empty. e.g. you can't have two topic policies both with startingWith: foo.
- When there exists a named that's equal to a startingWith then the named matches. e.g. it is allowed have one named: foo and another startingWith: foo, but this should result in a WARN being logged, because of the scope for confusion.
- When there exists a startingWith that's a prefix of another startingWith then the longer (more specific) startingWith matches.
We have plenty of wriggle-room to figure out the template expressions in the future, and these can add a lot of power/flexibility. I don't think we should fixate on specifying this in any detail right now. However, some points worth noting:
- Either it's the same syntax as the config interpolation mechanism proposed in #964, or it's different. If it's different it should look different (e.g. not use the same delimiters). If it's the same, we would want the ability to evaluate the env var and system property expressions at config loading time, but defer the evaluation of topic and record expressions till produce-time (in which context we have a topic and a record).
- The functions supported are probably not pluggable. It's better to have a few well-thought-out functions, which are sympathetic to the Kafka ecosystem.
- Although the examples imply the existence of some particular schema for records in the user_* topics, it's not proposed
- This would be an expression language only. No support for variables or control flow.
- We might want to support some way of sanity-checking the evaluated expressions (e.g. against a regex) to avoid spamming the KMS when we counter bad data.
- Ideally we'd analyse the expressions to determine their dependency on the topic and record. Doing this would allow some optimizations. If the expression doesn't depend on the record then there's no need to iterate and decompress the batch.

tombentley commented 4 months ago

This relates also to https://github.com/kroxylicious/kroxylicious/pull/991 and https://github.com/kroxylicious/kroxylicious/pull/1036

robobario commented 4 months ago

Is there any inheritance type relationship between defaultPolicy and topicPolicies[*]policy? Like does setting the cipherSpec in defaultPolicy impact the topicPolicies? Or is it more that cipherSpec would have a default value and doesn't need to be set in any particular policy.

tombentley commented 4 months ago

Is there any inheritance type relationship between defaultPolicy and topicPolicies[*]policy? Like does setting the cipherSpec in defaultPolicy impact the topicPolicies? Or is it more that cipherSpec would have a default value and doesn't need to be set in any particular policy.

I was thinking that:

There would be built-in defaults for at least some of the properties of the defaultPolicy (inevitable anyway, when we added features requiring config parameters after adopting this schema, in order to guarantee backwards compatibility when reading older config files). As an example, if you don't specify defaultPolicy.aad then it would be as if you had set defaultPolicy.aad to that default. I don't think we could have a built-in default for keyEncryptionKey, at least.
The topicPolicies override what's in the defaultPolicy. So if you don't set something at the topic level and then change the default then you'd get the changed thing for that/those topic(s).

I suppose there's the possibility for some confusion, because I've said that the empty policy {} means "no encryption", whereas

if every property in the defaultPolicy were optional then you'd expect it to mean "the built-in defaults".
at the topic level you'd expect {} to mean, "the default policy", and unlike the case for defaultPolicy, {} ought to be legal at the topic level.

So maybe we should make "no encryption" more explicit, by using keyEncryptionKey: null. When it's null then none of the other properties would be allowed. Wdyt?

SamBarker commented 4 months ago

Thanks @tombentley thats quite a full and well thought through proposal.

I have a couple of high(er) level thoughts. It feels like a complicated schema which can apply to all our encryption use cases, but I'm not sure we need or even want one schema which expresses them all! While its appealing from an implementation perspective I'm not sure its helpful to users. They might be better served by having simpler schemas which are more directly connected to each of the filters. I wouldn't have a problem if we were to convert the dedicated schemas into a model like this internally

~In your model how would a user express I want a default encryption policy of X but I only want it applied to certain topics. As the default is applied everywhere not otherwise matched.~ Oh wait you do cover that in point 6.

My other thought is instead of (or maybe as well as) a defaultPolicy key we should offer a map of policy names to policies, and then allow them to be referenced in the topicPolicies section. That would separate the concerns of policy definition from the topics it applies to and allow groups of topics to share policies. Which would help in the case of a multi tenant deployment where only specific topics would be encrypted but all would have a common prefix.

Hmm that leads to another thought (its not really new per se). There is actually an ordering requirement between recordEncryption and any filter that does topic name manipulation. Say we put the multi tenant filter first we are going to screw up all prefix matching. Does that mean we need another mapping mechanism, such as header to key? (the topic mangling filter could add a original_topic_name header and the record encryption filter uses that to find the policy).

I can see the appeal of all the various topic matchers and expression language support but I'm a little concerned about the performance costs of all of that. Particularly for large clusters shared between multiple app teams that can self serve topics and policies. Its probably a suck it and see situation but I wanted to give voice to the thought.

Some more detail oriented points.

When there exists a startingWith that's a prefix of another startingWith then the longer (more specific) startingWith matches.

Should we also log a warn in this case as there is the same possibility for confusion as the named & starts_with clash?

You've used both includeTopics and includedTopics as keys I'm presuming includeTopics is a typo?

tombentley commented 4 months ago

Thanks for responding @SamBarker

They might be better served by having simpler schemas which are more directly connected to each of the filters

I don't quite understand what you mean. Could you elaborate?

My other thought is instead of (or maybe as well as) a defaultPolicy key we should offer a map of policy names to policies, and then allow them to be referenced in the topicPolicies section.

It's not a bad idea, but isn't that what YAML anchors and references are intended to support? There seems to be mixed reports of Jackson's support for this but it might be worth experiementing with it before deciding on this point.

Which would help in the case of a multi tenant deployment where only specific topics would be encrypted but all would have a common prefix.

I think there are different possible models for multi-tenant + encryption:

KMS-per-tenant
Token-per-tenant
Including the tenant id in the key names, e.g. with the defaultPolicy only
Supporting different policies for different tenants

...

There is actually an ordering requirement between recordEncryption and any filter that does topic name manipulation. Say we put the multi tenant filter first we are going to screw up all prefix matching.

... Given the above I think you mean something like the latter,. To be honest I don't have a good feeling for how often real user requirements would mandate that. If it were common it's harder to say it's towards the RHS of "simple things should be simple, hard things should be possible". But it's a good example to show how the matching approach can break down when what you're trying to match depends on other filters. Hmm.

I can see the appeal of all the various topic matchers and expression language support but I'm a little concerned about the performance costs of all of that. Particularly for large clusters shared between multiple app teams that can self serve topics and policies. Its probably a suck it and see situation but I wanted to give voice to the thought.

Performance is secondary right now, to understanding how a model like this (or any other model) applies in various use cases. And the MT + encryption example is a good one.

When there exists a startingWith that's a prefix of another startingWith then the longer (more specific) startingWith matches.

Should we also log a warn in this case as there is the same possibility for confusion as the named & starts_with clash?

That feels more likely to be intentional to me. It's the mechanism for carving out an exception to a more general rule; it would apply to the defaultPolicy and any other prefix matcher, for instance.

I'm presuming includeTopics is a typo?

Thanks, consistency restored.

SamBarker commented 4 months ago

They might be better served by having simpler schemas which are more directly connected to each of the filters

I don't quite understand what you mean. Could you elaborate?

What I'm trying to get at is we should think about specific schemas /config models for each application of encryption rather than one model to fit all use cases. Looking at your proposal with fresh eyes its maybe not quite as generically applicable as I thought last night. However to follow through on the thought:

I can easily imagine the encrypt node in the policy being expanded to cover fields within a message. Which would absolutely make sense to do for a field level encryption filter but would just be confusing in the recordEncryption filter.
The expression language support for. things like keyEncryptionKey: "user-${utf8(record.headers['username'])}" might be better served by an explicit
```
keyPerHeader:
header: 'username'
kekId: ''
persistIn: 'jdbc://mongoDb'
```
I'm not suggesting that is a real object but just that we might want different properties in different contexts. As I say I probably treated it as overly generic previously.

It's not a bad idea, but isn't that what YAML anchors and references are intended to support? There seems to be mixed reports of Jackson's support for this but it might be worth experiementing with it before deciding on this point.

YAML anchors might work well for people hand editing YAML but I don't think they provide a great option for people trying to manage the config through a higher level operator/platform. How does it know which policies need anchors? How does it generate an anchor? Its going to want to provide a set of known policies etc I think a policy map provides simpler way to interact with the config without really costing us in complexity.

To be honest I don't have a good feeling for how often real user requirements would mandate that. If it were common it's harder to say it's towards the RHS of "simple things should be simple, hard things should be possible".

I don't have a good picture either, but a lot of the interest I've seen suggests it might be quite common. As I think the multi tenant filter plays nicely into the platform engineering movement if we can address some of the noisy neighbour issues it could well be quite a common deployment pattern.

tombentley commented 4 months ago

What I'm trying to get at is we should think about specific schemas /config models for each application of encryption rather than one model to fit all use cases.

OK, that makes more sense, thanks for clarifying.

The problem I have with that is it's not clear how many or few those use cases really are, and how discrete they are. I suspect they form a spectrum which isn't very amenable to specific schemas. Using your example, what if you want to select a kek based on both the user name (in a header) and something else? I'm sure we could say "people can write their own plugins", but I think the person who wants to configure this might be more ops than dev and might not relish having to whip out an IDE to write some java plugin just to be able to say kekEncryptionKey: ${topicname}-${user}. So maybe that works as a strategy, or maybe not.

The things I liked about the expression language idea were:

It's 'easy' to pass variables like record and topic to be used in the expression however the user wants to use them.
It's backwards compatible to add new ones later (e.g. channel, principal).
- And maybe that's a way to support MT. For example there could be a tenant variable if there's a MT filter in the chain.
There's a lot of power in adding functions to support deeper inspection of record headers, keys and values. Adding functions is also backwards compatible.
And because these things are a language they compose really nicely.
With a little care we could guarantee that these expressions are pure and definitely terminate, which seem like valuable properties (and which isn't the case for some plugin written by the user).

On the other hand, I'm not aware of any off the shelf expression language with exactly the features I'd like.

I can easily imagine the encrypt node in the policy being expanded to cover fields within a message. Which would absolutely make sense to do for a field level encryption filter but would just be confusing in the recordEncryption filter.

Yes, I wrote it with that possibility in mind. But let's not get side tracked on that right now.

k-wall commented 4 months ago

@tombentley thanks for putting the proposal together. I really like it.

keyEncryptionKey - should that be keyEncryptionKeyAlias or simply alias to adhere to the naming scheme already used by the KMS

What if an expression such as ${utf8(record.headers['username'])} failed to resolve at runtime? e.g. there was no username field in the header or the username resulted in invalid UTF-8. Presumable we'd fallback to the defaultPolicy or reject the batch of messages? We'd need to think about how to let the user know about the runtime failures.

If we had such flexibility in deriving the keyAlias from the record, it wouldn't be long before there was a feature request for the ability to create a key in the KMS on-the-fly. I'd see it is a separate piece of work, but I wanted to mention it here.

tombentley commented 3 months ago

keyEncryptionKey - should that be keyEncryptionKeyAlias or simply alias to adhere to the naming scheme already used by the KMS

It's an alias, so that probably deserves to be in the name. And keyEncryptionKeyAlias is too verbose to make the user repeat. And KEK is not a very pretty acronym. So maybe alias, or keyAlias is a good choice.

What if an expression such as ${utf8(record.headers['username'])} failed to resolve at runtime? e.g. there was no username field in the header or the username resulted in invalid UTF-8. Presumable we'd fallback to the defaultPolicy or reject the batch of messages? We'd need to think about how to let the user know about the runtime failures.

Exceptions are a bit tricky. If the header is missing, then I think we should reject that batch of messages, because it indicates that the expectations of the person configuring the system haven't been met (i.e. it's most likely a configuration error). Invalid record data is slightly different (the producer is producing something unexpected), but I think we should also reject.

If we had such flexibility in deriving the keyAlias from the record, it wouldn't be long before there was a feature request for the ability to create a key in the KMS on-the-fly. I'd see it is a separate piece of work, but I wanted to mention it here.

Absolutely it's a separate piece of work.

tombentley commented 3 months ago

@gracegrimwood I'm interested in your PoV on this one too.

tombentley commented 3 months ago

Another thing... this proposal uses ${topic.name} and in doing so establishes a precedent which would allow us to add support for things like record.headers, record.key and record.value later on. But we actually support ${topicName} currently. We should decide whether we actually want to switch, and allow users to write topic.name in 0.5.0.

ddaviesbrackett-dwp commented 1 month ago

piggybacking a small feature request onto this issue - RecordEncryption currently hardcodes 5 million encryptions per DEK and it would be nice if that value were configurable - or, failing that, if it were set low enough not to encrypt more data than would weaken the GCM key given a suitably pessimistic average message size.

After reading that last half sentence, I think it'd be better for it to be configurable. I don't personally have a use-case for having different DEK rotation requirements for different levels of sensitive data, so a single value is enough for me.

edit by @k-wall, I think this deserves to be a separate issue. Raised as https://github.com/kroxylicious/kroxylicious/issues/1289

ddaviesbrackett-dwp commented 4 weeks ago

another other thing - given #1226 defines ways to select which filters apply, does there need to be more than one policy per encrypting filter? i.e. once topic selectors are a thing, would it not be better to define multiple encryption filters each with a disjoint set of topics it cares about?

kroxylicious / kroxylicious

Encryption filter configuration #1042