pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Decrypting file in read_csv #44097

Open linogaliana opened 3 years ago

linogaliana commented 3 years ago

Basic idea of the feature request

I was trying to read an encrypted file with pandas. As far as I know, there is no way to pass something to read_csv (or any other read_* function) that decrypts a file while it is being read, other than applying ex-post applymap functions as in this Stack Overflow thread.

The solution proposed in that thread seems quite slow with data larger than a few MB.

My solution has been to decrypt the file using the cryptography package and write the result to a temporary location (there is room for improvement in the functions I propose below, I am aware of that). This works, but I was hoping pandas could offer an option to decrypt the input stream while reading. This would probably lead to:

Here is an example that reproduces the situation:

  1. encrypt_data is only here to reproduce the setting of having an encrypted file
  2. It would be great to avoid the decrypt_data step and use read_csv directly with an extra argument.

import pandas as pd
from cryptography.fernet import Fernet

df = pd.DataFrame({'user': ['Bob', 'Jane', 'Alice'], 
                   'income': [40000, 50000, 42000]})
df.to_csv("toto.csv")

def encrypt_data(path, key, outpath = None):

    if outpath is None:
        outpath = '{}_encrypted'.format(path)

    f = Fernet(key)
    # opening the original file to encrypt
    with open(path, 'rb') as file:
        original = file.read()
    # encrypting the file
    encrypted = f.encrypt(original)  
    # opening the file in write mode and 
    # writing the encrypted data
    with open(outpath, 'wb') as encrypted_file:
        encrypted_file.write(encrypted)

    print("file {} encrypted ; written at {} location".format(path, outpath))

def decrypt_data(path, key, outpath = None):
    if outpath is None:
        outpath = '{}_decrypted'.format(path)
    f = Fernet(key)
    # opening the encrypted file to decrypt
    with open(path, 'rb') as file:
        encrypted = file.read()
    # decrypting the file
    decrypted = f.decrypt(encrypted)
    # opening the file in write mode and 
    # writing the decrypted data
    with open(outpath, 'wb') as dfile:
        dfile.write(decrypted)
    print("file {} decrypted ; written at {} location".format(path, outpath))

dummykey = Fernet.generate_key()
encrypt_data("toto.csv", dummykey, outpath = "toto_crypt.csv")
decrypt_data("toto_crypt.csv", dummykey, outpath = "toto_decrypt.csv")

pd.read_csv("toto_crypt.csv")    # encrypted content, not usable as-is
pd.read_csv("toto_decrypt.csv")  # decrypted copy written to disk

A possible approach

Let's say we call this argument encryption. We could pass an object from cryptography that decrypts the data stream directly in the pd.read_csv call. For instance:

pd.read_csv("toto_crypt.csv", encryption = Fernet(dummykey))

The same approach could be used for to_csv (or other writing functions) to write encrypted data directly to disk.

However, this solution might require the python engine. Providing the key and the encryption method (e.g. Fernet) directly might work better with the C engine (I am not familiar with C, but there is probably a method equivalent to the one I applied in Python).

API breaking implications

As far as I understand how I/O works, this extra argument would not break any existing code if it defaults to None.

jreback commented 3 years ago

not really in favor of this as out of scope here. adding complexity w/o much value. That said if a fully formed PR that works generally, wouldn't object.

could be a doc recipe instead.

twoertwein commented 3 years ago

I'm not familiar with the 3rd-party cryptography library. If it provides you with a file handle, you can simply pass that to pd.read_csv.

twoertwein commented 3 years ago

  • improved security since you don't write decrypted (and thus potentially sensitive) data to disk, even temporarily

Instead of writing the content (str/bytes) to a file, you can simply wrap it inside io.StringIO or io.BytesIO and then give that to read_csv.
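For readers following along, here is a minimal sketch of that suggestion, assuming the same Fernet setup as in the original post. The variable names (token, decrypted, result) are only illustrative, and the encrypted payload is built in memory to stand in for the content of an encrypted file.

```python
import io

import pandas as pd
from cryptography.fernet import Fernet

# Build a small encrypted CSV payload in memory
# (it stands in for the bytes read from an encrypted file).
key = Fernet.generate_key()
fernet = Fernet(key)
df = pd.DataFrame({"user": ["Bob", "Jane", "Alice"],
                   "income": [40000, 50000, 42000]})
token = fernet.encrypt(df.to_csv(index=False).encode("utf-8"))

# Decrypt to bytes and wrap them in a file-like object.
# The decrypted data never touches the disk.
decrypted = fernet.decrypt(token)
result = pd.read_csv(io.BytesIO(decrypted))
```

The whole decrypted content is held in memory, so this sketch fits files that comfortably fit in RAM.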

linogaliana commented 3 years ago

Thanks for the quick reply.

I understand the maintainer point of view that it is not necessary to add extra complexity if not needed. I agree with @jreback that it would maybe make more sense as a doc recipe.

I will have a look at io.StringIO or io.BytesIO; maybe this would avoid my overcomplicated solution. If I'm happy with it, I will make a PR to add it to the documentation.

slremy commented 2 years ago

Hello all, is the suggestion that this should be implemented as a method which returns a file-like object?

à la

pd.read_csv(decrypt_data("toto_crypt.csv", dummykey)) or pd.read_csv(decrypt_data("https://server/toto_crypt.csv", dummykey))
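One way such a method could look is sketched below, under stated assumptions: open_decrypted is a hypothetical helper name (not part of pandas or cryptography), and it only handles local paths. For a URL, the encrypted bytes would first have to be downloaded by some other means, which is not covered here.

```python
import io
import os
import tempfile

import pandas as pd
from cryptography.fernet import Fernet


def open_decrypted(path, key):
    """Return an in-memory binary file object with the decrypted content.

    Hypothetical helper: the file is read and decrypted in one piece,
    since a Fernet token has to be decrypted as a whole.
    """
    with open(path, "rb") as encrypted_file:
        token = encrypted_file.read()
    return io.BytesIO(Fernet(key).decrypt(token))


# Usage: create an encrypted file, then read it back
# without writing a decrypted copy to disk.
key = Fernet.generate_key()
csv_bytes = b"user,income\nBob,40000\nJane,50000\n"
with tempfile.TemporaryDirectory() as tmpdir:
    encrypted_path = os.path.join(tmpdir, "toto_crypt.csv")
    with open(encrypted_path, "wb") as out:
        out.write(Fernet(key).encrypt(csv_bytes))
    df = pd.read_csv(open_decrypted(encrypted_path, key))
```

Because the helper returns an ordinary file-like object, it should also be usable with other read_* functions that accept binary buffers.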

twoertwein commented 2 years ago

Pandas has convenient methods for compression, but I think adding support for a particular non-stdlib en/decryption package would be a very niche feature that might not justify the added complexity.

I think the best solution would be if cryptography.fernet implements a function to return a decrypted file handle. This should work with read_csv/to_csv/...
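In the absence of such a handle, the write side can be approximated in user code. The following is a rough sketch, assuming Fernet as in the original post; it relies on to_csv returning a string when no path is given, and the names token and roundtrip are only illustrative.

```python
import io

import pandas as pd
from cryptography.fernet import Fernet

key = Fernet.generate_key()
fernet = Fernet(key)
df = pd.DataFrame({"user": ["Bob", "Jane"], "income": [40000, 50000]})

# Write side: to_csv returns the CSV text when no path is given,
# so it can be encrypted before anything is written out.
csv_text = df.to_csv(index=False)
token = fernet.encrypt(csv_text.encode("utf-8"))  # bytes one could write to disk

# Read side: decrypt and hand an in-memory buffer to read_csv.
roundtrip = pd.read_csv(io.BytesIO(fernet.decrypt(token)))
```

Only the encrypted token would ever need to be written to a file (for example with open(path, "wb")), so no plaintext copy is created on disk.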