openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

sending encrypted data with -w does not actually dedup on receiving end #11778

Open daanw1978 opened 3 years ago

daanw1978 commented 3 years ago

System information

Type | Version/Name
Distribution Name | Debian Linux
Distribution Version | 10.8 / Buster
Linux Kernel | 4.19.0-14-amd64 #1 SMP Debian 4.19.171-2 (2021-01-30) x86_64 GNU/Linux
Architecture | amd64
ZFS Version | 2.0.3-1~bpo10+1
SPL Version | 2.0.3-1~bpo10+1

Describe the problem you're observing

Sending encrypted data using 'zfs send -w' to a dataset with 'dedup=on' does not actually dedup data on the receiving end. When the encrypted data is also deduplicated on the source dataset, it is deduplicated on the receiving end as expected.

Describe how to reproduce the problem

Create a dataset data_enc with encryption=on and dedup=off. Send it to an empty pool tank2 with:

zfs send -w tank/data_enc@now | zfs recv -o dedup=on tank2/test

After completing:
zfs get dedup tank2/test shows dedup=on
zpool list tank2 shows DEDUP 1.00x

Create a dataset data_enc_dedup with encryption=on and dedup=on. Send it to an empty pool tank2 with:

zfs send -w tank/data_enc_dedup@now | zfs recv -o dedup=on tank2/test

After completing:
zfs get dedup tank2/test shows dedup=on
zpool list tank2 shows DEDUP 1.30x (or any other value higher than 1.00x, depending on the data)
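For anyone repeating the check above in a script, the following is a minimal sketch that reads the pool-wide dedupratio property rather than eyeballing the zpool list output. The helper names are made up for illustration, and the pool name tank2 is taken from the steps above. Only the parsing helper can be exercised without ZFS installed.

```python
import subprocess


def parse_dedup_ratio(text: str) -> float:
    """Turn a value such as '1.30x' into the float 1.3."""
    return float(text.strip().rstrip("x"))


def pool_dedup_ratio(pool: str) -> float:
    """Ask zpool for the pool-wide dedupratio property (needs ZFS installed)."""
    out = subprocess.run(
        ["zpool", "get", "-H", "-o", "value", "dedupratio", pool],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return parse_dedup_ratio(out)


def dedup_is_effective(ratio: float) -> bool:
    """A ratio of exactly 1.00x means no blocks were deduplicated."""
    return ratio > 1.0


# Example on a machine with the pool imported (not run here):
#   dedup_is_effective(pool_dedup_ratio("tank2"))
```

Note that this ratio is reported per pool, so it only reflects a single dataset when that dataset is the only one on the pool.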

Include any warning/errors/backtraces from the system logs

No error in dmesg output, nor /var/log/syslog nor /var/log/kern.log.

ahrens commented 3 years ago

I think that this is (unfortunately) working as designed. When dedup and encryption are combined, the way the data is encrypted is changed (compared to encryption without dedup), so that the same plaintext results in the same encrypted data. Therefore, if the data has already been encrypted in the normal way (same plaintext -> different encrypted data), it can't be deduplicated, even after turning dedup on.
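The effect described above can be pictured with a toy model. This is only a sketch: the cipher below is a made-up XOR keystream, not the cipher ZFS uses, and deriving the IV from a keyed hash of the plaintext is an assumption about how "same plaintext results in the same encrypted data" could be achieved. It counts how many distinct ciphertext blocks a dedup table would see in each mode.

```python
import hashlib
import hmac
import os


def toy_encrypt(key: bytes, iv: bytes, plaintext: bytes) -> bytes:
    """Toy stream cipher (XOR with a SHA-256 keystream). NOT what ZFS uses."""
    stream = b""
    counter = 0
    while len(stream) < len(plaintext):
        stream += hashlib.sha256(key + iv + counter.to_bytes(8, "big")).digest()
        counter += 1
    return iv + bytes(p ^ s for p, s in zip(plaintext, stream))


def random_iv(key: bytes, plaintext: bytes) -> bytes:
    """'Normal' mode in this model: a fresh random IV for every block."""
    return os.urandom(12)


def plaintext_derived_iv(key: bytes, plaintext: bytes) -> bytes:
    """'Dedup' mode in this model: the IV depends only on key and plaintext."""
    return hmac.new(key, plaintext, hashlib.sha256).digest()[:12]


def unique_blocks(blocks, key, iv_fn) -> int:
    """Count distinct ciphertext checksums, i.e. what a dedup table would see."""
    return len(
        {hashlib.sha256(toy_encrypt(key, iv_fn(key, b), b)).digest() for b in blocks}
    )


key = os.urandom(32)
blocks = [b"A" * 4096, b"B" * 4096, b"A" * 4096]  # first and third are identical

normal = unique_blocks(blocks, key, random_iv)
dedup_friendly = unique_blocks(blocks, key, plaintext_derived_iv)
print(normal, dedup_friendly)
```

With random IVs, all three ciphertext blocks differ, so nothing can be deduplicated after the fact. With plaintext-derived IVs, the two identical plaintext blocks collapse into one.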

This could definitely be better explained, and we might be able to throw an error when trying to combine encrypted receive with dedup=on.

cc @tcaputi

daanw1978 commented 3 years ago

Thanks for the quick reply. That answer was to be expected in a way, and the explanation seems logical. It's too bad, though, since this removes the possibility of sending encrypted datasets to less trusted/offsite backup servers without a key and storing them there deduplicated for maximum storage efficiency using only zfs, especially since the performance hit of dedup is less of a problem on an unattended backup server. It's pretty much what userspace tools like borgbackup offer (backups that are both encrypted and deduplicated).

Regarding the documentation: yes, an explanation of this caveat would be a good addition. Another one in the same category that comes to mind is that with encrypted datasets, dedup is applied only within each dataset individually, not across all datasets with dedup on as it would be without encryption.

ahrens commented 3 years ago

I agree.

> Another one in the same category that comes to mind is that with encrypted datasets, dedup is applied only within each dataset individually, not across all datasets with dedup on as it would be without encryption.

This is covered in the zfs-change-key.8 manpage: "Deduplication is still possible with encryption enabled but for security, datasets will only dedup against themselves, their snapshots, and their clones."
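One way to picture why cross-dataset dedup drops out is sketched below. It assumes, hypothetically, that each encrypted dataset has its own key (shared with its snapshots and clones) and that the IV is derived from a keyed hash of the plaintext. Neither detail is stated in the manpage quote, so treat this as an illustration rather than a description of the implementation. The key values are placeholders.

```python
import hashlib
import hmac


def derived_iv(dataset_key: bytes, plaintext: bytes) -> bytes:
    """Keyed hash of the plaintext, truncated to an IV-sized value."""
    return hmac.new(dataset_key, plaintext, hashlib.sha256).digest()[:12]


# Hypothetical per-dataset keys, for illustration only.
key_a = hashlib.sha256(b"hypothetical key for dataset A").digest()
key_b = hashlib.sha256(b"hypothetical key for dataset B").digest()
block = b"same plaintext block" * 100

# Same dataset (or its snapshot/clone, sharing the key): IVs match.
same_dataset_match = derived_iv(key_a, block) == derived_iv(key_a, block)
# Different datasets with different keys: IVs differ, so ciphertexts differ.
cross_dataset_match = derived_iv(key_a, block) == derived_iv(key_b, block)
print(same_dataset_match, cross_dataset_match)
```

Under these assumptions, identical plaintext in two datasets never produces identical stored blocks, so there is nothing for dedup to match across datasets.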

daanw1978 commented 3 years ago

I noticed, but I think it would be good if the general openzfs documentation were more explicit about these kinds of caveats and restrictions regarding dedup and encryption too.

The confusing part here is mainly that when sending with -w and receiving with -o dedup=on, the receiving dataset reports dedup=on but no actual deduplication is done. This is counterintuitive and also hard to detect if the given dataset is not the only dataset in the pool (only then does the pool dedup ratio equal the dataset dedup ratio).

Regarding dedup being done only per encrypted dataset instead of against the whole pool: since you cannot get the dedup ratio per dataset, you simply don't know about these changed mechanics if you didn't read that phrase in the man page. Since the effect on the actual deduplication rate (and therefore on the decision whether dedup is worth it) can be significant, I think it deserves a more prominent spot in the general openzfs documentation on dedup.

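Just to be sure:

When sending encrypted (and/or deduped) data without -w and with the -o dedup=on receive option, the data is decrypted before sending, and the decrypted data is then encrypted and deduped from scratch on the receiving end?

When sending encrypted and deduped data with -w and the -o dedup=on receive option, the raw data stream is stored encrypted/deduped exactly as on the source dataset?

When sending encrypted and deduped data with -w and the -o dedup=off receive option, the data is stored with the same encryption as on the source dataset, but the deduplication is dropped and the data is stored un-deduped on the receiving end?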

ahrens commented 3 years ago

> The confusing part here is mainly that when sending with -w and receiving with -o dedup=on, the receiving dataset reports dedup=on but no actual deduplication is done. This is counterintuitive and also hard to detect if the given dataset is not the only dataset in the pool (only then does the pool dedup ratio equal the dataset dedup ratio).

> Regarding dedup being done only per encrypted dataset instead of against the whole pool: since you cannot get the dedup ratio per dataset, you simply don't know about these changed mechanics if you didn't read that phrase in the man page. Since the effect on the actual deduplication rate (and therefore on the decision whether dedup is worth it) can be significant, I think it deserves a more prominent spot in the general openzfs documentation on dedup.

I agree. I think that PRs to update the documentation to reflect this would be welcome.

> When sending encrypted (and/or deduped) data without -w and with the -o dedup=on receive option, the data is decrypted before sending, and the decrypted data is then encrypted and deduped from scratch on the receiving end?

That's right.

> When sending encrypted and deduped data with -w and the -o dedup=on receive option, the raw data stream is stored encrypted/deduped exactly as on the source dataset?

That's right.

> When sending encrypted and deduped data with -w and the -o dedup=off receive option, the data is stored with the same encryption as on the source dataset, but the deduplication is dropped and the data is stored un-deduped on the receiving end?

That's my understanding.

stale[bot] commented 2 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

daanw1978 commented 2 years ago

Just a quick followup on this one.

> When sending encrypted and deduped data with -w and the -o dedup=off receive option, the data is stored with the same encryption as on the source dataset, but the deduplication is dropped and the data is stored un-deduped on the receiving end?

Test setup: a deduplicated and encrypted dataset with a deduplication ratio of about 1.5, sent with zfs send -w and received with dedup=off.

The size of the dataset on the receiving end is the same as on the source. It seems that the receiving end stores the data encrypted and deduped after all; otherwise the dataset should have been roughly 1.5x the size of the source. The dedup ratio on the receiving end, however, shows 1.0, implying that the receiving end doesn't know the received data is deduped. I guess setting dedup=on on the receiving end would mean deduplicating the stream of already deduplicated and encrypted data again. I didn't try this, but it also doesn't really make sense.

The purpose of my replication is purely backup, so all is good if I can store the encrypted/deduped dataset elsewhere this way. I have no intention of mounting the dataset on the receiving end, but I wonder what would happen if the data suddenly appeared deduplicated after all?