kroxylicious / kroxylicious

An open-source network proxy framework for Apache Kafka
https://kroxylicious.io
Apache License 2.0
101 stars 43 forks source link

[System Test] Add KMS Key rotation system test #1341

Closed franvila closed 5 days ago

franvila commented 1 month ago

Is your feature request related to a problem? Please describe. We have detected that the KMS key rotation is not taken by Kroxylicious, so the messages are not encrypted with a different key (see #1339)

Describe the solution you'd like In order to test that the rotation is taken by Kroxylicious record encryption filter, we need to test if the encrypted message is different than the previous one.

ddaviesbrackett-dwp commented 1 month ago

I think there's several scenarios in which DEK rotation happens that could be tested:

In the first two cases, the test would probably need to set the lifetime (in data volume and in time, respectively) to something very small sos that the DEK could expire over the course of the test. The third one is tricky to test precisely unless the KMS notifies the client application that the key rotation has happened. The time-based expiry is one way to ensure the DEK associated with the 'old' KEK doesn't hang around too long but then the test is indistinguishable from time-based expiry.

franvila commented 3 weeks ago

Currently we are going to start by something basic to go step by step and then increase the number of test cases to increase the coverage.

For that @kroxylicious/developers I'm thinking on adding an Experimental record that contain the current values to refresh the alias/DeK. But I have a question about what do you think it could be the best approach for that:

WDYT?

SamBarker commented 3 weeks ago

I'm thinking on adding an Experimental record that contain the current values to refresh the alias/DeK.

I don't think I understand what your aiming for. Record is an overloaded term here do you mean

  1. A Kafka Record?
  2. A Java Record?

I initially assumed record referred to 1 but that doesn't make a lot of sense especially given the API methods mentioned. So I assume its actually 2.

TestKmsFacade is meant to be an extension of the Kms interface to support the additional activities we need for testing e.g. creating Key Encryption Keys (KEKs) as in normal operations that is not something the production code needs. So if your plan fits into that model then go for it.

However I think what your describing is a way to inject some state into the system tests so I suspect it would be better to keep it somewhere specific to the system tests.

I'm also not keen on Experimental as a type name its too vague and could be applied in far too many contexts (we already have an ExperimentalConfiguration section in the RecordEncryption config model). As I'm not clear what your actually aiming for I don't feel able to offer alternative naming options as yet.

franvila commented 3 weeks ago

Sorry, but my explanation was too vague.

I'm also not keen on Experimental as a type name its too vague and could be applied in far too many contexts (we already have an ExperimentalConfiguration section in the RecordEncryption config model). As I'm not clear what your actually aiming for I don't feel able to offer alternative naming options as yet.

This is actually what I'm aiming for. Find the proper way to define the Experimental section for the system tests. What I need is to refresh the Dek/alias so I have to define the new configuration elements in a Java record.

  1. Define all together in the same record and inject it only for system tests in the Kroxylicious config map.
public record Experimental(
                            @JsonProperty int resolvedAliasExpireAfterWriteSeconds,
                            @JsonProperty int resolvedAliasRefreshAfterWriteSeconds,
                            @JsonProperty int encryptionDekRefreshAfterWriteSeconds,
                            @JsonProperty int encryptionDekExpireAfterWriteSeconds) { 
}
private static String buildEncryptionFilter(TestKmsFacade<?, ?, ?> testKmsFacade, Experimental experimental) {
...
}
  1. Define each configuration in separate java records for each Kms (AWS, Vault) and include it in the TestKmsFacade interface:
For AWS:

public record Experimental(
                            @JsonProperty int resolvedAliasExpireAfterWriteSeconds,
                            @JsonProperty int resolvedAliasRefreshAfterWriteSeconds) { 
}

For Vault:

public record Experimental(
                            @JsonProperty int encryptionDekRefreshAfterWriteSeconds,
                            @JsonProperty int encryptionDekExpireAfterWriteSeconds) { 
}
private static String buildEncryptionFilter(TestKmsFacade<?, ?, ?, ?> testKmsFacade) {
           ....
            testKmsFacade.getExperimentalConfig();
}

@k-wall anything from your side?

UPDATE: From my POV, option 2 does not allow us to configure the expiration/refresh values in an easy way for each test. Normally each test should not configure that as it should take the default value, but for the new test case to test the rotation, we would need to configure those values to avoid waiting 1 hour (I think this is the default value) the Dek to be refreshed/expired.

k-wall commented 3 weeks ago

In order to test that the rotation is taken by Kroxylicious record encryption filter, we need to test if the encrypted message is different than the previous one.

That approach won't work. Kroxylicious is using https://en.wikipedia.org/wiki/Semantic_security. That means if you encrypt the same plaintext twice, you'll get back different ciphertext, even without a key rotation.

The difficult with this test is somehow recover from the ciphertext record what version of the key was use encrypt the record (or at least assert that the version has changed). The system test doesn't know anything about how the KMS actually encrypted the data (that's all beneath the abstraction).

k-wall commented 3 weeks ago

I'm actually not sure I see a way to write this test in a way for that will work for both AWS and Vault.

Delete key version

For Vault, this approach would work:

  1. Produce a message
  2. Rotate key
  3. Produce another message
  4. Delete key version 0 (new TestKmsFacade API using https://developer.hashicorp.com/vault/api-docs/secret/transit#update-key-configuration)
  5. Verify that message 0 is not decryptable (because it is encrypted with key version 0 which no longer exists)
  6. Verify that message 1 is decryptable (because it is encrypted with key version 1 which still exists)

For AWS, the same approach does not work. AFAICS there is no API for delete a key version.

Retrieve key version from ciphertext blob

The other idea is to try to get the key version back from the ciphertext blob in some implementation specific way. In Vault, the version is trivially encoded in the EDEK. We could pull out the v1.

tradesvault:v1:+EfJ977UG1XkjI9yh7vxpgN2E1DKaIkDuxE+eCprVTKr+sskFuChcTe/KpR/c8ZDyP76W3itExmEzLOl����x)�Ũ�z�:S�������tБ��v���

For AWS, it is not so obvious. AWS doesn't document the formation of the ciphertext blob. I don't see any discussion online about how it is formed.

https://docs.aws.amazon.com/kms/latest/APIReference/API_Encrypt.html#API_Encrypt_ResponseSyntax

We could try some experiments to see if we can work out how it behaves.

Or maybe others have better ideas.

franvila commented 3 weeks ago

I know this is not the ideal approach, but I found a pattern that is working for both and would work for different KMS. I'm comparing the similarity of the encrypted message for a KEK and a rotated KEK, using the JaroWinklerDistance class. The results are the following (more than 50 test runs):

So we can set a threshold from where we can decide if the message has been encrypted with the same KEK or not.

Using this approach we eliminate the complexity of taking the version information from somewhere that is not that accessible as we could expect.

k-wall commented 3 weeks ago

Good thinking, but I don't think a probabilistic approach is really what we want.

Could you examine the byte differences between AWS CiphertextBlob for the same plain text with a key that is rotated several times? My guess is that they are probably encoding a version number somewhere. Base64 decode the output and compare the responses. You should be able to run a test using the AWS command line.

If we can spot it and it is reasonably easy to predict, we'll have a technique we can apply to both AWS/Vault. It still won't be ideal as it would rely on the reverse engineered knowledge of an implementation we don't control.

franvila commented 2 weeks ago

After investigating the CiphertextBlob with @k-wall, we reach a deadend. We didn't found any version sent by AWS when encrypting a message.

@kroxylicious/developers do you have any other idea about how to tackle this test case (different than my probabilistic proposal)?

k-wall commented 2 weeks ago

@franvila what do you think to the idea of asking https://support.console.aws.amazon.com/support/home# a well targeted question? Explain the use-case - the fact we are trying to write a test to know that our application is correctly handling the rotate use-case - i.e. it has requested a new DEK and the new DEK is encrypted using the new key version.

k-wall commented 2 weeks ago

The other thing that comes to mind is that we write the system test so that it skips on AWS. Testing against the one KMS is better than none and the guts of Kroxylicious Record Encryption is shared code.

franvila commented 2 weeks ago

@franvila what do you think to the idea of asking https://support.console.aws.amazon.com/support/home# a well targeted question? Explain the use-case - the fact we are trying to write a test to know that our application is correctly handling the rotate use-case - i.e. it has requested a new DEK and the new DEK is encrypted using the new key version.

Done. I have created an issue to ask for any possible solution to get the version of the KEK used after a rotation.

Anyway, if there is not ideas about how to test it and there is no response soon from AWS support, we could opt for skipping AWS for that test case if the probabilistic option is not suitable.

franvila commented 2 weeks ago

As we don't have access to the metadata of the KEK (for Vault we are lucky it is sent at the beginning of the encrypted message), we can simplify the system test by testing the refreshing/expiring of the DEK at unit/integration test level and then at system test level check that the messages are well received after a rotation by the final user.

WDYT?

SamBarker commented 2 weeks ago

The only other ~bright~ almost sane idea I've had is we build a an encrypted record reader in the system tests and that consumes the records from Kafka itself extracts the cipher text and validates that it can decrypt the new record by specifying the KEK reference itself rather than using the reference the proxy embedded.

franvila commented 2 weeks ago

The only other ~bright~ almost sane idea I've had is we build a an encrypted record reader in the system tests and that consumes the records from Kafka itself extracts the cipher text and validates that it can decrypt the new record by specifying the KEK reference itself rather than using the reference the proxy embedded.

Extract the ciphertext is not an easy task.

For those reasons I suggested to make this kind of tests into the integration tests, as we could have this ciphertext information easily as we are inside the Kroxylicious application and proceed into the system tests with a simpler tests with the information that we have.

SamBarker commented 2 weeks ago

I don't think it is entirely out of scope for the system tests to talk to the underlying Kafka brokers (as thats something a real world client could do). In theory the wire format is documented (or at least supposed to be) so a 2nd implementation that reads the encrypted record from Kafka and decrypted itself is possible.

That said it probably is better handled as an integration test with a tighter integration.

k-wall commented 1 week ago

The only other bright almost sane idea I've had is we build a an encrypted record reader in the system tests and that consumes the records from Kafka itself extracts the cipher text and validates that it can decrypt the new record by specifying the KEK reference itself rather than using the reference the proxy embedded.

The objective is to prove that the filter is using the rotated key material for the encrypt path. For Vault, the version is trivially in the cipher text blob (:v1) so we could verify that the new record has been encrypted with the new version. However, for AWS, we haven't been successful working out how it is storing which version of the key material was used to produce the cipher text blob.

The other thing that comes to mind is that we write the system test so that it skips on AWS. Testing against the one KMS is better than none and the guts of Kroxylicious Record Encryption is shared code.

that's why I made this suggestion ^^^^

SamBarker commented 1 week ago

The objective is to prove that the filter is using the rotated key material for the encrypt path. Yeah I screwed my terminology up. What I was meaning was if the test controlled the rotation and thus knows which are the DEK's from which version of the KEK it can manually ensure that the message decrypts with the appropriate DEK and not with the others. (though I'm not entirely convinced that really works).

That said I've no objections to having a test that only works with Vault (as long as we have a comment explaining why).

franvila commented 1 week ago

Received this response from AWS customer support:

After checking upon this internally, I can confirm that currently there is no such API or method available from which you can extract the information of the KMS key that is used to create the data key. 
The metadata that is added to the ciphertextblob can only be accessed by the KMS service.

So there is currently no way to get the version.

k-wall commented 1 week ago

Received this response from AWS customer support:

After checking upon this internally, I can confirm that currently there is no such API or method available from which you can extract the information of the KMS key that is used to create the data key. 
The metadata that is added to the ciphertextblob can only be accessed by the KMS service.

So there is currently no way to get the version.

Not the response we hoped for, but at least we now know we haven't misread the docs.