Open linogaliana opened 3 years ago
Not really in favor of this, as it is out of scope here: it adds complexity without much value. That said, I wouldn't object to a fully formed PR that works generally.
It could be a doc recipe instead.
I'm not familiar with the 3rd-party `cryptography` library. If it provides you with a file handle, you can simply pass that to `pd.read_csv`.
- improved security since you don't write decrypted (and thus potentially sensitive) data to disk, even for a temporary purpose
Instead of writing the content (str/bytes) to a file, you can simply wrap it inside `io.StringIO` or `io.BytesIO` and then give that to `read_csv`.
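A minimal sketch of that suggestion, assuming the data was encrypted with Fernet from the third-party `cryptography` package (the key and the CSV content below are made up for illustration). The decrypted bytes stay in memory and are handed to `read_csv` through `io.BytesIO`; `io.StringIO` works the same way once the bytes are decoded to `str`.

```python
import io

import pandas as pd
from cryptography.fernet import Fernet

# Illustrative setup: encrypt a small CSV in memory with a throwaway key.
key = Fernet.generate_key()
fernet = Fernet(key)
token = fernet.encrypt(b"a,b\n1,2\n3,4\n")

# Decrypt in memory; nothing is written to disk.
decrypted = fernet.decrypt(token)  # bytes

# Bytes content: wrap it in BytesIO.
df_bytes = pd.read_csv(io.BytesIO(decrypted))

# Text content: decode first, then wrap it in StringIO.
df_text = pd.read_csv(io.StringIO(decrypted.decode("utf-8")))
```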
Thanks for the quick reply.
I understand the maintainers' point of view that extra complexity should not be added if it is not needed. I agree with @jreback that it would maybe make more sense as a doc recipe.
I will have a look at `io.StringIO` or `io.BytesIO`; maybe this would avoid my overcomplicated solution. If I'm happy with it, I will make a PR adding that to the documentation.
Hello all, is the suggestion that this should be implemented as a method which returns a file-like object?
à la
pd.read_csv(decrypt_data("toto_crypt.csv", dummykey))
or
pd.read_csv(decrypt_data("https://server/toto_crypt.csv", dummykey))
Pandas has convenient methods for compression, but I think support for a particular non-stdlib en/decryption package would be a very niche feature that might not justify the added complexity.
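One way to read that suggestion is a small helper that does the decryption and returns an in-memory file-like object. The sketch below is only an illustration: `decrypt_data` here is a hypothetical user-side function (not part of pandas or `cryptography`), and it assumes the whole file is a single Fernet token. Anything containing `://` is treated as a URL and fetched with `urllib`; everything else is opened as a local path.

```python
import io
from urllib.request import urlopen

from cryptography.fernet import Fernet


def decrypt_data(source: str, key: bytes) -> io.BytesIO:
    """Return the decrypted content of `source` as an in-memory file object.

    Hypothetical helper: assumes `source` holds exactly one Fernet token.
    """
    if "://" in source:
        # URL source: download the encrypted bytes.
        with urlopen(source) as response:
            token = response.read()
    else:
        # Local path: read the encrypted bytes from disk.
        with open(source, "rb") as fh:
            token = fh.read()
    # Fernet decrypts the whole token at once; the plaintext stays in memory.
    return io.BytesIO(Fernet(key).decrypt(token))
```

With such a helper, the calls shown above would work as written, provided `dummykey` is the Fernet key that was used for encryption.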
I think the best solution would be if `cryptography.fernet` implemented a function that returns a decrypted file handle. This should work with read_csv/to_csv/...
Basic idea of the feature request
I was trying to read an encrypted file in `pandas`. As far as I know, there is no way to provide something to `read_csv` (or any other `read_*` function) to decrypt a file when reading (and not with ex-post `applymap` functions as in this Stack Overflow thread).
The solution proposed in the aforementioned post seems quite slow with data larger than several MB.
My solution has been to decrypt the file using the `cryptography` package and write the result to a temporary location (there's room for improvement in the functions I will propose below, I am aware of that). This works, but I was hoping it would be possible to have an option in `pandas` to decrypt when reading the input stream. This would probably lead to:
- improved security since you don't write decrypted (and thus potentially sensitive) data to disk, even for a temporary purpose
Here is an example that makes it possible to reproduce the feature:
- `encrypt_data` is just here to reproduce the setting of having an encrypted file
- the `decrypt_data` step is what I would like to replace by calling `read_csv` directly with an extra argument.
A possible approach
Let's say we call this argument `encryption`. We could provide an object from `cryptography` to decode the data stream directly in the `pd.read_csv` call. For instance:
The same approach could be used for `to_csv` (or other writing functions) to write encrypted data directly to disk.
However, this solution might require the `python` engine. Directly providing the key and the encryption method (e.g. Fernet) might work better with the C engine (I am not familiar with C, but there is probably an equivalent of the method I applied in Python).
API breaking implications
As far as I understand how I/O works, I think this extra argument would not break any existing code as long as its default value is `None`.
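For the writing direction mentioned under "A possible approach", a similar recipe is possible today without a new argument: `to_csv` returns the CSV as a string when no path is given, and that string can be encrypted before it is written. A minimal sketch, again assuming Fernet from `cryptography`; `write_encrypted_csv` is a hypothetical helper name, not a pandas function.

```python
import io
import os
import tempfile

import pandas as pd
from cryptography.fernet import Fernet


def write_encrypted_csv(df: pd.DataFrame, path: str, key: bytes) -> None:
    """Serialize `df` to CSV in memory and write only the encrypted bytes."""
    csv_text = df.to_csv(index=False)  # no path given, so a str is returned
    token = Fernet(key).encrypt(csv_text.encode("utf-8"))
    with open(path, "wb") as fh:
        fh.write(token)


# Round trip with a throwaway key and a temporary file.
key = Fernet.generate_key()
df = pd.DataFrame({"a": [1, 3], "b": [2, 4]})
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toto_crypt.csv")
    write_encrypted_csv(df, path, key)
    with open(path, "rb") as fh:
        raw = fh.read()
    # Reading back reuses the in-memory buffer approach from the comments.
    restored = pd.read_csv(io.BytesIO(Fernet(key).decrypt(raw)))
```

Only the encrypted token reaches the disk; the plaintext CSV exists in memory only.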