uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can also be used from pure Python code.

Parquet column/modular encryption support for Petastorm #736

Open RobindeGrootNL opened 2 years ago

RobindeGrootNL commented 2 years ago

Recently, Parquet added support for columnar/modular encryption in parquet-mr 1.12 (IBM, GitHub), meaning that only the footer and selected columns of a parquet file are encrypted, minimising encrypt/decrypt overhead while keeping sensitive columns safe. Since Petastorm is built on top of the parquet format, and this encryption method became available in Spark 2.3.0 with improved support in Spark 3.0.0, I was wondering whether Petastorm already supports it, or whether there is a way to make it work. For my application that would be the perfect combination, considering current and probable future EU regulation around encryption and cloud storage!

df.write.option("parquet.encryption.footer.key", "k1").option("parquet.encryption.column.keys", "k2:DeviceName,image_id").parquet(STORE_PATH)

This is how it works with spark dataframes in Databricks (which is what I am using).
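For completeness, the full write-side setup might look like the following in PySpark. This is an untested sketch: it assumes parquet-mr's stock PropertiesDrivenCryptoFactory together with its mock in-memory KMS, and the class names and base64 keys below are illustrative test values, not production practice.

# Hedged sketch: configure Parquet modular encryption for writing.
# InMemoryKMS is parquet-mr's mock KMS, intended for testing only; a real
# deployment would point parquet.encryption.kms.client.class at a proper KMS.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("parquet.crypto.factory.class",
          "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
hconf.set("parquet.encryption.kms.client.class",
          "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
hconf.set("parquet.encryption.key.list",
          "k1:AAECAwQFBgcICQoLDA0ODw==, k2:AAECAAECAAECAAECAAECAA==")

(df.write
   .option("parquet.encryption.footer.key", "k1")
   .option("parquet.encryption.column.keys", "k2:DeviceName,image_id")
   .parquet(STORE_PATH))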

An example of reading the data is as follows (source: IBM):

sc.hadoopConfiguration.set("parquet.encryption.key.list" ,
                   "k1:AAECAwQFBgcICQoLDA0ODw== ,  k2:AAECAAECAAECAAECAAECAA==")
sc.hadoopConfiguration.set("parquet.crypto.factory.class" ,
                   "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
// Read encrypted dataframe files
val df2 = spark.read.parquet("/path/to/table.parquet.encrypted")
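The PySpark equivalent of that Scala snippet would be along these lines (again an untested sketch; depending on the build, parquet.encryption.kms.client.class may also need to be set, as in the write example above):

# Hedged sketch: the same read-side configuration through the Python API.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("parquet.encryption.key.list",
          "k1:AAECAwQFBgcICQoLDA0ODw==, k2:AAECAAECAAECAAECAAECAA==")
hconf.set("parquet.crypto.factory.class",
          "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
# Read encrypted dataframe files
df2 = spark.read.parquet("/path/to/table.parquet.encrypted")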

This works beautifully with Spark DataFrames, but for my use case it would be great to use Petastorm, as I am looking to train a neural network in TensorFlow on this data. Apache Arrow already has support for this, at least in C++, and I've seen some discussion about implementing it in PyArrow, though I can't tell whether that has happened yet.

Any help with getting column encryption to work with Petastorm would be highly appreciated, but learning that this is not (yet) supported and that there is no workaround would also be helpful, since I could then look at different options. Thanks in advance!

selitvin commented 2 years ago

Petastorm uses pyarrow to read parquet. Once pyarrow has support for this feature, we can make sure petastorm allows you to use it. The pyarrow discussion you linked is from 1.5 years ago; I wonder if anything has happened since then.

RobindeGrootNL commented 2 years ago

Thanks for the response! I see that the folks over at PyArrow are working on a PR that adds this encryption/decryption functionality. Once they release it, how complex do you expect implementing this in Petastorm would be? Just to get an idea of the timeframe we would be talking about.

selitvin commented 2 years ago

I don't think it should be too bad. Depending on how it ends up looking, we might just need to forward a parameter or two to the pyarrow API calls. Let's keep the issue open. If you can let me know when PyArrow is ready with this feature, we can pick it up from there (also, feel free to take a shot at proposing a petastorm PR; it might be faster that way).

RobindeGrootNL commented 2 years ago

It does indeed look that way: we should just need to pass some parameters to PyArrow, which then passes them on to the parquet library. Implementing support for writing encrypted parquet files while still allowing the Petastorm metadata to be computed might be a bit more complex, but as long as make_batch_reader() can read an encrypted parquet file I would already be happy. I'll keep an eye on the pull request and let you know when it's merged and released; I'll also see if I can contribute something, though my knowledge of the inner workings of Petastorm is admittedly limited.

import pyarrow.parquet as pq

# crypto_factory and kms_connection_config are assumed to have been created
# beforehand, as described below.
decryption_properties = crypto_factory.file_decryption_properties(
    kms_connection_config)
parquet_file = pq.ParquetFile(filename,
                              decryption_properties=decryption_properties)

In order to create the encryption and decryption properties, a pyarrow.parquet.CryptoFactory should be created and initialized with KMS client details, as described below.

with pq.ParquetWriter(path,
                      table.schema,
                      encryption_properties=file_encryption_properties) \
        as writer:
    writer.write_table(table)

source
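To make those fragments concrete, here is a minimal end-to-end sketch using the Python API from that PR (exposed as pyarrow.parquet.encryption in later pyarrow releases). The InMemoryKmsClient is modelled on pyarrow's own test_encryption.py and is emphatically not a secure KMS; the key values, paths, and column names are illustrative.

import base64
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.parquet.encryption as pe

class InMemoryKmsClient(pe.KmsClient):
    """Toy KMS client for testing only, modelled on pyarrow's test_encryption.py."""
    def __init__(self, kms_connection_config):
        pe.KmsClient.__init__(self)
        self.master_keys_map = kms_connection_config.custom_kms_conf

    def wrap_key(self, key_bytes, master_key_identifier):
        # Not a real cipher: prepend the master key and base64-encode.
        master_key = self.master_keys_map[master_key_identifier].encode('utf-8')
        return base64.b64encode(master_key + key_bytes)

    def unwrap_key(self, wrapped_key, master_key_identifier):
        master_key = self.master_keys_map[master_key_identifier].encode('utf-8')
        decoded = base64.b64decode(wrapped_key)
        assert decoded[:len(master_key)] == master_key
        return decoded[len(master_key):]

# 16-byte master keys, keyed by the ids referenced in the encryption config.
kms_connection_config = pe.KmsConnectionConfig(
    custom_kms_conf={"k1": "0123456789012345", "k2": "1234567890123450"})
crypto_factory = pe.CryptoFactory(
    lambda config: InMemoryKmsClient(config))

# Write side: encrypt two columns with k2 and the footer with k1.
table = pa.table({"DeviceName": ["a", "b"], "image_id": [1, 2]})
encryption_config = pe.EncryptionConfiguration(
    footer_key="k1",
    column_keys={"k2": ["DeviceName", "image_id"]})
file_encryption_properties = crypto_factory.file_encryption_properties(
    kms_connection_config, encryption_config)
with pq.ParquetWriter("/tmp/table.parquet.encrypted", table.schema,
                      encryption_properties=file_encryption_properties) as writer:
    writer.write_table(table)

# Read side: build decryption properties and hand them to the reader.
decryption_properties = crypto_factory.file_decryption_properties(
    kms_connection_config)
parquet_file = pq.ParquetFile("/tmp/table.parquet.encrypted",
                              decryption_properties=decryption_properties)
print(parquet_file.read())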

selitvin commented 2 years ago

Sounds good.

RobindeGrootNL commented 2 years ago

The pull request has been merged into the master branch of pyarrow but has not yet been part of an official release. I asked the developers whether they have a roadmap for this, but so far I have not received a reply.

To me it looks like a fairly simple change for petastorm: I think we can get away with just passing the decryption_properties through to the parquet reader, and either call crypto_factory.file_decryption_properties in our own code or add that call to the petastorm code itself. I'll see if I have some time tomorrow to work on a pull request, but I'm curious to hear what you think! Incorporating this into the writer will be a bit more complicated, I think, because the column metadata is also encrypted in this parquet encryption scheme, but maybe that is possible as well; it is not my priority, though, since make_batch_reader() can be used on non-petastorm parquet datasets.

(The most helpful files to look at, imo, are test_encryption.py and sample_vault_kms_client.py.)
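To make the proposal concrete, usage could look roughly like the sketch below. Note that the decryption_properties argument to make_batch_reader is hypothetical: it does not exist in petastorm today and is precisely what the proposed PR would add, presumably by forwarding it to pyarrow's ParquetFile/dataset machinery.

# Hypothetical sketch: decryption_properties is NOT a real make_batch_reader
# parameter yet; it is the forwarding the proposed PR would introduce.
from petastorm import make_batch_reader

# crypto_factory and kms_connection_config built as in the pyarrow sketch above.
decryption_properties = crypto_factory.file_decryption_properties(
    kms_connection_config)

with make_batch_reader("file:///tmp/table.parquet.encrypted",
                       decryption_properties=decryption_properties) as reader:
    for batch in reader:
        pass  # feed batches into TensorFlow/PyTorch as usual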

selitvin commented 2 years ago

I think this is a good plan to try. Taking care of make_batch_reader should be sufficient; I am not sure how much of our user base actually uses the writer part of petastorm.

RobindeGrootNL commented 2 years ago

That makes sense; then that is the focus for now, and I hope to have enough time this week to make a PR implementing this.