Open daanw1978 opened 3 years ago
I think that this is (unfortunately) working as designed. When dedup and encryption are combined, the way the data is encrypted is changed (compared to encryption without dedup), so that the same plaintext results in the same encrypted data. Therefore, if the data has already been encrypted in the normal way (same plaintext -> different encrypted data), it can't be deduplicated, even after turning dedup on.
This could definitely be better explained, and we might be able to throw an error when trying to combine encrypted receive with dedup=on.
cc @tcaputi
Thanks for the quick reply. That answer was in a way to be expected and the explanation seems logical. Too bad though since this removes the possibility of sending encrypted datasets to less trusted/offsite backup servers without a key and store it there deduplicated for maximum storage efficiency using only zfs. Especially since the performance hit of dedup is less of a problem on an unattended backup server. It's pretty much what userspace tools like borgbackup offer (backing up both encrypted and deduplicated).
Regarding the documentation; yes, an explanation of this caveat would be a good addition. The other one that also comes to mind and a bit in the same category is that with encrypted datasets, dedup is only applied for each dataset individually, not spanning across all datasets with dedup on as it would do without encryption.
I agree.
The other one that also comes to mind and a bit in the same category is that with encrypted datasets, dedup is only applied for each dataset individually, not spanning across all datasets with dedup on as it would do without encryption.
This is covered in the zfs-change-key.8 manpage:
Deduplication is still possible with encryption enabled but for security, datasets will only dedup against themselves, their snapshots, and their clones.
I noticed, but I think it would be good if the general openzfs documentation were more elaborate about these kinds of caveats/restrictions regarding dedup and encryption too.
The confusing part here is mainly that when sending with -w and the -o dedup=on receive option, the receiving dataset says dedup=on but there is no actual deduplication being done. This is counterintuitive and actually also hard to find out about if the given dataset is not the only dataset in the pool (and therefore the pool dedup ratio = dataset dedup ratio).
Regarding dedup being done only per encrypted dataset instead of against the whole pool; since you cannot get the dedup ratio per dataset you simply don't know about these changed mechanics if you didn't read that phrase in the man page. Since the effect on the actual deduplication rate (and therefore the consideration if dedup is actually worth it) can be significant I think it deserves a more prominent spotlight in the general openzfs documentation on dedup.
Just to be sure:
when sending encrypted (and/or deduped) data without -w and with -o dedup=on receive option the data is decrypted before sending and the decrypted data is then encrypted and deduped from scratch on the receiving end?
when sending encrypted and deduped data with -w and -o dedup=on receive option the raw datastream is stored encrypted/deduped exactly like on the source dataset?
when sending encrypted and deduped data with -w and -o dedup=off receive option the data is stored with the same encryption as on the source dataset but the deduplication is neglected and stored un-deduped on the receiving end?
The confusing part here is mainly that when sending with -w and the -o dedup=on receive option, the receiving dataset says dedup=on but there is no actual deduplication being done. This is counterintuitive and actually also hard to find out about if the given dataset is not the only dataset in the pool (and therefore the pool dedup ratio = dataset dedup ratio).
Regarding dedup being done only per encrypted dataset instead of against the whole pool; since you cannot get the dedup ratio per dataset you simply don't know about these changed mechanics if you didn't read that phrase in the man page. Since the effect on the actual deduplication rate (and therefore the consideration if dedup is actually worth it) can be significant I think it deserves a more prominent spotlight in the general openzfs documentation on dedup.
I agree. I think that PRs to update the documentation to reflect this would be welcome.
when sending encrypted (and/or deduped) data without -w and with -o dedup=on receive option the data is decrypted before sending and the decrypted data is then encrypted and deduped from scratch on the receiving end?
That's right.
when sending encrypted and deduped data with -w and -o dedup=on receive option the raw datastream is stored encrypted/deduped exactly like on the source dataset?
That's right.
when sending encrypted and deduped data with -w and -o dedup=off receive option the data is stored with the same encryption as on the source dataset but the deduplication is neglected and stored un-deduped on the receiving end?
That's my understanding.
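For reference, the three cases asked about above can be written out as commands. This is only a sketch: `replicate` is a hypothetical helper name (not part of ZFS), and the dataset names in the usage comment are the ones from the issue report. It uses only `zfs send`, `zfs send -w`, and `zfs recv -o dedup=...` as they appear in this thread, and it makes no claim about what the receiving pool actually does with the blocks.

```shell
# replicate SNAPSHOT TARGET MODE DEDUP
#   MODE  = raw   -> zfs send -w (stream stays encrypted as on the source)
#   MODE  = plain -> zfs send    (blocks are decrypted on the sender first)
#   DEDUP = on|off, passed to the receive side as -o dedup=...
# "replicate" is a hypothetical helper name, not a ZFS command.
replicate() {
    snap=$1
    target=$2
    mode=$3
    dedup=$4
    case "$mode" in
        raw)
            zfs send -w "$snap" | zfs recv -o dedup="$dedup" "$target"
            ;;
        plain)
            zfs send "$snap" | zfs recv -o dedup="$dedup" "$target"
            ;;
        *)
            echo "usage: replicate SNAPSHOT TARGET raw|plain on|off" >&2
            return 2
            ;;
    esac
}

# Usage (the three cases, in the order they were asked):
#   replicate tank/data_enc_dedup@now tank2/test plain on
#   replicate tank/data_enc_dedup@now tank2/test raw   on
#   replicate tank/data_enc_dedup@now tank2/test raw   off
```

Whether a plain (non-raw) receive ends up encrypted again depends on how the target is set up on the receiving side; the sketch does not configure that.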
Just a quick followup on this one.
when sending encrypted and deduped data with -w and -o dedup=off receive option the data is stored with the same encryption as on the source dataset but the deduplication is neglected and stored un-deduped on the receiving end?
deduplicated and encrypted dataset, deduplication rate is about 1.5
zfs send -w with receive option dedup=off
The size of the dataset on the receiving end is the same as on the source. It seems that the receiving end stores the data encrypted and deduped after all, otherwise the dataset should have been roughly 1.5x the size of the source. The dedup ratio on the receiving end however shows 1.0, implying the receiving end doesn't know that the received data is deduped. I guess turning dedup=on on the receiving end means deduplicating the stream of the already deduplicated/encrypted data again. I didn't try this, but it also doesn't really make sense.
The purpose of my replication is purely backup, so all good if I can store the encrypted/deduped dataset elsewhere this way. I have no intention of mounting the dataset on the receiving end, but I wonder what would happen when all of a sudden the data appears deduplicated after all?
System information
Describe the problem you're observing
Sending encrypted data using 'zfs send -w' to a dataset with 'dedup=on' does not actually dedup data on the receiving end. When sending encrypted data that is also deduplicated on the source dataset, data is deduplicated on the receiving end as expected.
Describe how to reproduce the problem
Create a dataset data with encryption=on and dedup=off. Send to an empty pool tank2 with:
zfs send -w tank/data_enc@now | zfs recv -o dedup=on tank2/test
After completing:
zfs get dedup tank2/test
shows dedup=on
zpool list tank2
shows DEDUP 1.00x
Create a dataset data_dedup with encryption=on and dedup=on. Send to an empty pool tank2 with:
zfs send -w tank/data_enc_dedup@now | zfs recv -o dedup=on tank2/test
After completing:
zfs get dedup tank2/test
shows dedup=on
zpool list tank2
shows DEDUP 1.30x (or any other value higher than 1.00x depending on the data)
Include any warning/errors/backtraces from the system logs
No error in dmesg output, nor /var/log/syslog nor /var/log/kern.log.
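The check used in the reproduction steps can be scripted. A minimal sketch, assuming the received dataset is the only data in the target pool (the dedup ratio is reported per pool, not per dataset, as noted elsewhere in this thread). The helper names `pool_dedup_ratio` and `received_data_deduped` are hypothetical; the sketch reads the pool's `dedupratio` property via `zpool get`, which is the same value `zpool list` shows in its DEDUP column.

```shell
# pool_dedup_ratio POOL: print the pool-wide dedup ratio, e.g. "1.00x".
pool_dedup_ratio() {
    zpool get -H -o value dedupratio "$1"
}

# received_data_deduped POOL: succeed if the pool reports any dedup savings.
# Only meaningful when the received dataset is the sole data in POOL,
# because the ratio is tracked per pool, not per dataset.
received_data_deduped() {
    [ "$(pool_dedup_ratio "$1")" != "1.00x" ]
}

# Usage, after one of the send/receive pipelines from the steps above:
#   if received_data_deduped tank2; then
#       echo "dedup ratio above 1.00x"
#   else
#       echo "no deduplication reported"
#   fi
```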