hpgrahsl / kryptonite-for-kafka

Kryptonite for Kafka is a client-side 🔒 field level 🔓 cryptography library for Apache Kafka® offering a Kafka Connect SMT, ksqlDB UDFs, and a standalone HTTP API service. It's an ! UNOFFICIAL ! community project

Decryption using Python and AWS KMS #19

Open panditrahulsharma opened 9 months ago

panditrahulsharma commented 9 months ago

Hello @hpgrahsl, we are planning to build an encryption/decryption architecture using kryptonite-for-kafka in a Debezium source connector, but we are facing some issues, mentioned below:

  1. I have successfully produced encrypted data in Kafka using the kryptonite transformation package, but I want to decrypt this data using Python/PySpark. How can I achieve this? As per my understanding, your code uses Kryo serialization, but this is not available in Python. Can you please help me with this or provide a sample Python script for decryption?

  2. How can we pass the AWS KMS key payload directly in the source connector configuration?

    transforms.cipher.cipher_data_keys: {
      "KeyMetadata": {
          "AWSAccountId": "123456789012",
          "KeyId": "arn:aws:kms:us-east-1:123456789012:key/abcd1234-a123-456a-a12b-a123b4cd56ef",
          "Arn": "arn:aws:kms:us-east-1:123456789012:key/abcd1234-a123-456a-a12b-a123b4cd56ef",
          "CreationDate": 1642604273.418,
          "Enabled": true,
          "Description": "",
          "KeyUsage": "ENCRYPT_DECRYPT",
          "KeyState": "Enabled",
          "Origin": "AWS_KMS",
          "KeyManager": "CUSTOMER",
          "CustomerMasterKeySpec": "SYMMETRIC_DEFAULT",
          "EncryptionAlgorithms": [
              "SYMMETRIC_DEFAULT"
          ],
          "SigningAlgorithms": [
              "RSASSA_PSS_SHA_512"
          ]
      }
    }
  3. How to use field-level keys (different keys for different fields)?

    e.g.: table1 has three columns c1, c2, and c3, and I want to encrypt each column with a different key.

  4. I have a single source connector for multiple fact tables. How do I configure the transforms.cipher.field_config parameter for different tables with different fields? For example:

table.include.list: 'dbo.table1,dbo.table2,dbo.table3,...dbo.tableN'
encrypt.fields.table1: 'mobile'
encrypt.fields.table2: 'userid'

I hope you can respond with some sample examples.

hpgrahsl commented 9 months ago

Hi @panditrahulsharma,

THX for reaching out with some questions about the project. I'm answering below:

  1. At the moment there is no direct way for this project to let you decrypt the data in Python using PySpark. You're right that the Kryo serialization is not available in Python. There are two approaches you could take:

a) Kryptonite can support other serialization mechanisms that you implement on your own; a simple one would be to add your own JSON serialization. Doing that, you could then use Tink (the crypto library from Google that this project is based on) and decrypt the data natively in Python (a rough sketch of what that could look like is included further below).

b) There is funqy-http-kryptonite https://github.com/hpgrahsl/kryptonite-for-kafka/blob/master/funqy-http-kryptonite/README.md which you could run and call via HTTP from Python to decrypt data; a minimal sketch follows below. Whether this is a viable option for your use case in the context of PySpark jobs is something you have to try out.
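To give an idea of option b), here is a minimal Python sketch that calls such an HTTP service with the requests library. The base URL, endpoint path, and payload shape below are placeholders for illustration only; the actual function names and request formats are documented in the funqy-http-kryptonite README linked above.

    import requests

    # Placeholder base URL -- point this at wherever funqy-http-kryptonite runs.
    FUNQY_BASE_URL = "http://localhost:8080"

    def decrypt_field(encrypted_value):
        # "decrypt-value" and the {"data": ...} payload are hypothetical names
        # used for illustration; check the service's README for the real ones.
        resp = requests.post(f"{FUNQY_BASE_URL}/decrypt-value",
                             json={"data": encrypted_value})
        resp.raise_for_status()
        return resp.json()

And going back to option a): once such a JSON (or otherwise Python-friendly) serialization exists and you can extract the raw Tink ciphertext bytes from a record, native decryption with tink-python would look roughly like the sketch below. The keyset JSON would be the same material configured in cipher_data_keys; the empty associated data is an assumption, and the exact keyset-loading helpers vary a bit between tink-python versions.

    import tink
    from tink import aead, cleartext_keyset_handle

    aead.register()

    def decrypt_with_tink(keyset_json: str, ciphertext: bytes) -> bytes:
        # keyset_json: the plain Tink keyset as a JSON string
        keyset_handle = cleartext_keyset_handle.read(tink.JsonKeysetReader(keyset_json))
        primitive = keyset_handle.primitive(aead.Aead)
        # empty associated data is an assumption -- adapt it to whatever your
        # serialization carries alongside the ciphertext
        return primitive.decrypt(ciphertext, b"")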

  2. Kryptonite uses Tink and hence follows the Tink keyset specification, for which you can find a brief example/description here: https://github.com/hpgrahsl/kryptonite-for-kafka/tree/master/connect-transform-kryptonite#tink-keysets. This means the key material needs to be specified in that format, either as plain or encrypted keysets. You cannot use the AWS KMS key payload directly (see the keyset sketch at the end of this comment).

  3. You can define different key materials for different payload fields. Each field_config entry can specify a keyId that refers to different key material (see the sketch at the end of this comment).

  4. If you process multiple different tables with one connector config, you can make use of predicates and use them to apply the Kryptonite SMT with specific settings to payloads from different topics (example at the end of this comment).
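To make 2. a bit more concrete: instead of the AWS KMS DescribeKey payload, cipher_data_keys expects entries holding Tink keysets. Purely for illustration, with a made-up identifier and dummy key material (the README section linked above is the authoritative reference for the exact structure), a plain keyset entry looks roughly like this:

    transforms.cipher.cipher_data_keys: [
      {
        "identifier": "my-key-1",
        "material": {
          "primaryKeyId": 1000000001,
          "key": [
            {
              "keyData": {
                "typeUrl": "type.googleapis.com/google.crypto.tink.AesGcmKey",
                "value": "<base64-encoded key material>",
                "keyMaterialType": "SYMMETRIC"
              },
              "status": "ENABLED",
              "keyId": 1000000001,
              "outputPrefixType": "TINK"
            }
          ]
        }
      }
    ]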
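Regarding 3., the per-field key selection happens via the keyId attribute of the individual field_config entries, each referring to an identifier defined in cipher_data_keys. A sketch for the c1/c2/c3 example (attribute names as I read them from the README; please double-check there):

    transforms.cipher.cipher_data_keys: '[{"identifier":"key-1","material":{...}},{"identifier":"key-2","material":{...}},{"identifier":"key-3","material":{...}}]'
    transforms.cipher.field_config: '[{"name":"c1","keyId":"key-1"},{"name":"c2","keyId":"key-2"},{"name":"c3","keyId":"key-3"}]'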
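And regarding 4., Kafka Connect predicates let you attach differently configured instances of the SMT to different topics of the same connector. A rough sketch for two of the tables from your example; the SMT class name and topic patterns reflect my reading of the README and typical Debezium topic naming, so adapt them to your setup and add the usual cipher_data_keys / cipher_mode settings per transform:

    transforms: 'cipherTable1,cipherTable2'
    transforms.cipherTable1.type: 'com.github.hpgrahsl.kafka.connect.transforms.kryptonite.CipherField$Value'
    transforms.cipherTable1.predicate: 'isTable1'
    transforms.cipherTable1.field_config: '[{"name":"mobile"}]'
    transforms.cipherTable2.type: 'com.github.hpgrahsl.kafka.connect.transforms.kryptonite.CipherField$Value'
    transforms.cipherTable2.predicate: 'isTable2'
    transforms.cipherTable2.field_config: '[{"name":"userid"}]'
    predicates: 'isTable1,isTable2'
    predicates.isTable1.type: 'org.apache.kafka.connect.transforms.predicates.TopicNameMatches'
    predicates.isTable1.pattern: '.*\.dbo\.table1'
    predicates.isTable2.type: 'org.apache.kafka.connect.transforms.predicates.TopicNameMatches'
    predicates.isTable2.pattern: '.*\.dbo\.table2'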

panditrahulsharma commented 9 months ago

@hpgrahsl thanks for your response. All points are sorted except for JSON serialization (or another serialization that Python also supports). Can you please add this support to the framework? My team works only in Python and PySpark, so it would be a great start if you could provide this.

hpgrahsl commented 8 months ago

@panditrahulsharma great to hear it was helpful! At the moment I don't have the time to work on that, but in general it will be a good thing to have going forward, so I'll keep it in the "backlog" for upcoming releases.

Until then I want to highlight that you could try to make use of funqy-http-kryptonite and call it via HTTP from Python.
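On the PySpark side, one interim way to do that is to wrap such an HTTP call in a plain Python UDF. This is only a sketch reusing the hypothetical endpoint from my earlier comment and ignoring batching/performance concerns:

    import requests
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    # hypothetical HTTP helper as sketched earlier in this thread
    def decrypt_field(encrypted_value):
        resp = requests.post("http://localhost:8080/decrypt-value",
                             json={"data": encrypted_value})
        resp.raise_for_status()
        return resp.json()

    spark = SparkSession.builder.getOrCreate()
    decrypt_udf = udf(decrypt_field, StringType())

    # assuming a DataFrame with an encrypted string column named "mobile"
    df = spark.read.json("s3://your-bucket/cdc-events/")  # placeholder source
    decrypted = df.withColumn("mobile_plain", decrypt_udf(col("mobile")))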

Also, another interesting approach that I might add is support in Spark SQL directly, based on a custom UDF. I recently built a PoC with Flink SQL and custom UDFs, and it works quite nicely.

That being said, contributions are always welcome. So if you want to help implement any of these, let me know. Happy to provide some guidance.