itadventurer / kafka-backup

Backup and Restore for Apache Kafka
Apache License 2.0
164 stars 46 forks

On-disk file encryption #30

Open jay7x opened 4 years ago

jay7x commented 4 years ago

It'd be great to store files on disk in encrypted form. That would make it easy to upload the data nightly to public cloud storage for long-term retention, with no additional encryption step required. Even simple symmetric encryption should be enough for a first version.

itadventurer commented 4 years ago

I am not sure if this should be part of Kafka Backup. There are so many ways to do this (wrong), and it depends heavily on your use case :thinking: This is basically the same question as with Kafka itself: should we really do the encryption ourselves, or should we rather rely on the underlying systems (OS-level encryption, encryption provided by S3 et al.)?

I will leave it open and mark it as wontfix for now.

jay7x commented 4 years ago

I understand your point. Though I'll try to describe my use case again.

As I have a directory with a streaming backup, I'd like to just run azcopy nightly to upload it to Azure Blob Storage and forget about it (azcopy is a kind of Azure-specific rsync). However, I can't do that, as the files may contain customers' private data, for example. So I need to encrypt the files before uploading them. That is doable, but it requires more resources on my VM. With built-in encryption I wouldn't need them, so I could go straight to azcopy :)
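As a stopgap before any built-in support, the encrypt-before-upload step can be sketched in a few lines. The snippet below is a toy illustration only: it derives a key with stdlib `scrypt` and XORs the file with SHA-256 counter blocks. It is unauthenticated and not production crypto; a real pipeline should run a vetted tool such as gpg or age over the files before handing them to azcopy. All function names here are hypothetical and not part of kafka-backup.

```python
import hashlib
import os


def derive_key(passphrase: bytes, salt: bytes) -> bytes:
    # scrypt key derivation from the stdlib; parameters are illustrative.
    return hashlib.scrypt(passphrase, salt=salt, n=2**14, r=8, p=1, dklen=32)


def keystream_xor(key: bytes, data: bytes) -> bytes:
    # Toy CTR-style construction: XOR the data with SHA-256(key || counter)
    # blocks. XOR is symmetric, so the same call decrypts. Unauthenticated;
    # use gpg/age or a vetted crypto library for anything real.
    out = bytearray(len(data))
    for i in range(0, len(data), 32):
        block = hashlib.sha256(key + i.to_bytes(8, "big")).digest()
        for j, b in enumerate(data[i:i + 32]):
            out[i + j] = b ^ block[j]
    return bytes(out)


def encrypt_for_upload(path: str, passphrase: bytes) -> str:
    # Writes <path>.enc with a random 16-byte salt prefix; the .enc file
    # is what the nightly azcopy job would pick up instead of the plaintext.
    salt = os.urandom(16)
    key = derive_key(passphrase, salt)
    with open(path, "rb") as f:
        ciphertext = keystream_xor(key, f.read())
    enc_path = path + ".enc"
    with open(enc_path, "wb") as f:
        f.write(salt + ciphertext)
    return enc_path
```

Decryption would read the 16-byte salt back from the file, re-derive the key, and apply `keystream_xor` again. The point of the sketch is only that the extra CPU/disk cost of this step is what built-in encryption in kafka-backup would remove.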

itadventurer commented 4 years ago

If you are fine with losing up to 24h of data, your approach using azcopy is currently the easiest one. I will think more about this, because it is quite an interesting use case (combination of #26, #29, #30, #31) ;)

jay7x commented 4 years ago

Yeah, Azure Blob Storage is the last-resort long-term recovery option. The short-term one is the data directory on the VM running kafka-backup :) Though it's still not clear to me whether there is any benefit to azcopy-ing the whole backup directory over tar-ing it and uploading the archive nightly.

Let's say I have kafka-backup running for 1yr on some cluster, storing data locally. During a restore, everything will be replayed back into the Kafka cluster, right? So every consumer will see the whole year of data? Or will it be limited by the topics' retention (TTL) only?

jay7x commented 4 years ago

Wrt the original issue: it can be worked around with eCryptfs/EncFS/fscrypt, JFYI. So it's not that critical now.

jay7x commented 4 years ago

Though it'd be nice to have once kafka-backup gets pluggable storage backend support, so it can write backups directly to an Azure Blob Storage container or S3 bucket (as there will be no place to insert the encryption step then).