rclone / rclone

"rsync for cloud storage" - Google Drive, S3, Dropbox, Backblaze B2, One Drive, Swift, Hubic, Wasabi, Google Cloud Storage, Azure Blob, Azure Files, Yandex Files
https://rclone.org
MIT License

hdfs: Hadoop TDE support #5726

Open alanmiller opened 2 years ago

alanmiller commented 2 years ago

The associated forum post URL from https://forum.rclone.org

Forum Post: https://forum.rclone.org/t/does-rclone-support-hdfs-tde/26996/9

What is your current rclone version (output from rclone version)?

rclone v1.56.2

What problem are you trying to solve?

I'm trying to copy HDFS files from a source cluster that has Hadoop TDE enabled to a destination cluster where TDE is not enabled. I've posted examples in the forum post above but the summary is that:

  1. In my source HDFS cluster (remote name: san_prod), the /prod hierarchy is an "encryption zone".
  2. In my destination HDFS cluster (remote name: las_prod), TDE is not enabled.
  3. This command copies all files, but the files in the destination cluster are corrupt: rclone copy san_prod:/prod1/test las_prod:/prod1/test
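One way to confirm which paths are affected is to compare them against the cluster's list of encryption zones. The following is only a sketch: it assumes `hdfs crypto -listZones` prints one zone per line as a path followed by a key name, and that zone paths contain no spaces. `in_zone` is a hypothetical helper, not part of rclone or Hadoop.

```shell
#!/bin/sh
# in_zone PATH
# Reads zone listings ("<zone path> <key name>" per line) on stdin and
# succeeds if PATH is a zone root or lies beneath one.
in_zone() {
    target=$1
    while read -r zone _key; do
        # Skip blank lines and anything that is not an absolute path.
        case "$zone" in
            /*) ;;
            *) continue ;;
        esac
        # Match the zone itself or any path below it (not mere name prefixes).
        case "$target" in
            "$zone"|"${zone%/}"/*) return 0 ;;
        esac
    done
    return 1
}

# Possible usage on the source cluster (listing zones normally requires
# HDFS superuser privileges):
#   hdfs crypto -listZones | in_zone /prod/test && echo "inside an encryption zone"
```

Files under a matching zone are the ones rclone v1.56.2 would copy in their encrypted form.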

How do you think rclone should be changed to solve that?

For the immediate term, I'd suggest mentioning this limitation in the documentation. For the longer term, rclone should retrieve the unencrypted content of the HDFS files if they are in an encryption zone and transmit those contents to the destination cluster.


ncw commented 2 years ago

@urykhy as our resident HDFS expert do you have an opinion on how difficult it would be to add TDE support?

ncw commented 2 years ago

What it sounds like is happening is that rclone is copying the encrypted files as-is, without decrypting them.

> For the immediate term, I'd suggest mentioning this limitation in the documentation.

This certainly sounds like a good short-term measure.

alanmiller commented 2 years ago

If the documentation is updated to mention this limitation, it could also include this workaround: copy the data you intend to transfer out of the TDE encryption zone, rclone that copy, then delete the copy. E.g.:

  1. In source cluster (with TDE): hdfs distcp /encrypted-zone/data /un-encrypted/data
  2. Then run rclone: rclone copy source_cluster:/un-encrypted/data dest_prod:/encrypted-zone/data
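The workaround steps, including the final cleanup, could be scripted roughly as follows. This is a sketch under several assumptions: the paths and remote names are the placeholders from the steps above, the staging copy is made with `hadoop distcp`, and `-update -skipcrccheck` is added because the Hadoop documentation recommends those flags when copying between encrypted and unencrypted locations. `run` and `stage_and_copy` are hypothetical helper names. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Placeholder paths and remote names taken from the workaround steps.
SRC_ZONE="/encrypted-zone/data"
STAGING="/un-encrypted/data"
SRC_REMOTE="source_cluster"
DEST="dest_prod:/encrypted-zone/data"

# Print each command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

stage_and_copy() {
    # 1. Copy out of the encryption zone; the HDFS client decrypts on read,
    #    so the staged files are plaintext.
    run hadoop distcp -update -skipcrccheck "$SRC_ZONE" "$STAGING" || return 1
    # 2. Transfer the plaintext staging copy with rclone.
    run rclone copy "$SRC_REMOTE:$STAGING" "$DEST" || return 1
    # 3. Delete the staging copy so plaintext does not linger outside the zone.
    run hdfs dfs -rm -r -skipTrash "$STAGING"
}

stage_and_copy
```

Stopping at the first failed step keeps the staging copy in place, so a failed transfer can be retried without repeating the distcp.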

urykhy commented 2 years ago

> 2. Then run rclone: rclone copy source_cluster:/un-encrypted/data dest_prod:/encrypted-zone/data

Is there a mistake here? In my experiments, we can't upload to an encryption zone:

HADOOP_CONF_DIR=/tmp/xhadoop-conf-dir rclone copy test.plain hadoop:/test/key/
2021/10/31 22:07:02 ERROR : test.plain: Failed to copy: create /test/key/test.plain: create call failed with ERROR_APPLICATION (org.apache.hadoop.hdfs.UnknownCryptoProtocolVersionException)

By the way, I'm currently working on implementing TDE support upstream.

ncw commented 2 years ago

> By the way, I'm currently working on implementing TDE support upstream.

Thank you :-)

channaba commented 7 months ago

Are there any updates on resolving this issue (feature)?

urykhy commented 6 months ago

> Are there any updates on resolving this issue (feature)?

Still waiting on https://github.com/colinmarc/hdfs/pull/281