rclone / rclone

"rsync for cloud storage" - Google Drive, S3, Dropbox, Backblaze B2, One Drive, Swift, Hubic, Wasabi, Google Cloud Storage, Azure Blob, Azure Files, Yandex Files
https://rclone.org
MIT License

HDFS support #42

Closed. mlanner closed this 3 years ago.

mlanner commented 9 years ago

Hi,

Any thoughts or plans around HDFS support? It could be very nice to have a way to, for example, bring a given data set out of Swift into Hadoop to run jobs on.

ncw commented 9 years ago

Interesting idea! I don't know anything about HDFS so I've done a small amount of reading...

rclone would work fine with HDFS if it was mounted into the filesystem (e.g. using an NFS proxy or FUSE). rclone is very undemanding on the local filesystem: it scans directories, then opens and reads or writes files in sequence, all of which I think should be supported by HDFS.

Alternatively it could use the API. There has been some work on using Go with Hadoop.

Or it could use libhdfs.

I don't have access to a hadoop cluster so I can't work on this, but would be grateful for assistance or patches!
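
For example, assuming HDFS were exposed at /mnt/hdfs via the NFS gateway or fuse-dfs (the mount point, remote name and paths here are only illustrative), pulling a data set out of Swift into Hadoop would just be:

rclone copy swift:mycontainer/dataset /mnt/hdfs/user/hadoop/dataset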

mlanner commented 9 years ago

Hi Nick,

Thanks for the quick response and links. I started looking at those ... and some others. I will probably spin up a test Hadoop cluster to see what I can do. I'll update this when/if I have any updates.

seanorama commented 5 years ago

Anyone played more with making this happen?

ncw commented 5 years ago

Anyone played more with making this happen?

Not as far as I know. Do you want to have a go?

ei-grad commented 4 years ago

Here is a (go-)native hdfs client library: https://github.com/colinmarc/hdfs
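
A rough sketch of what using it could look like, going by its README (namenode address and paths are made up, and I haven't run this against a real cluster):

package main

import (
	"fmt"
	"io"
	"log"
	"os"

	"github.com/colinmarc/hdfs/v2"
)

func main() {
	// Connect to the namenode (address is just an example).
	client, err := hdfs.New("namenode.example:8020")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// List a directory - the kind of sequential operation an rclone backend needs.
	entries, err := client.ReadDir("/user/hadoop")
	if err != nil {
		log.Fatal(err)
	}
	for _, e := range entries {
		fmt.Println(e.Name(), e.Size())
	}

	// Stream a file back - again just sequential reads.
	f, err := client.Open("/user/hadoop/example.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	io.Copy(os.Stdout, f)
}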

urykhy commented 3 years ago

I'm trying to add HDFS support. The backend tests pass, but hashes are not implemented yet.

ncw commented 3 years ago

I'm trying to add HDFS support. The backend tests pass, but hashes are not implemented yet.

Great work!

urykhy commented 3 years ago

Is there a docker image we can test against?

I failed to find a simple docker image; maybe this can help.

What have you been testing against?

I run tests with this image. It's a bit heavy; Spark can be removed.

Are there hashes we could use?

I have read about Hadoop checksums, and it's complicated:

Are you going to submit a PR?

Sure.

ncw commented 3 years ago

Is there a docker image we can test against?

I failed to find a simple docker image; maybe this can help.

What have you been testing against?

I run tests with this image. It's a bit heavy; Spark can be removed.

OK

Are there hashes we could use?

I have read about Hadoop checksums, and it's complicated:

MD5MD5CRC looks like the S3 scheme of md5sums of md5sums, which in practice is not at all useful since you don't know the block sizes.

Composite CRC looks like it could be useful, though I didn't see how to calculate it... Rclone supports CRC32-C already, BTW.

Are you going to submit a PR?

I see it :-)

urykhy commented 3 years ago

Just to note, I'm working on a PR to support composite_crc(crc32) checksums.
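
For reference, CRC32-C itself is just the Castagnoli polynomial from Go's standard library; the part the PR has to handle is matching Hadoop's composite CRC, i.e. combining the per-block CRCs the datanodes report into one value. A minimal local sketch of the plain checksum (the file name is made up, and this does not do the composite combination):

package main

import (
	"fmt"
	"hash/crc32"
	"io"
	"log"
	"os"
)

func main() {
	f, err := os.Open("example.txt") // just an example path
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// CRC32-C (Castagnoli), the checksum rclone already knows how to compare.
	h := crc32.New(crc32.MakeTable(crc32.Castagnoli))
	if _, err := io.Copy(h, f); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("crc32c: %08x\n", h.Sum32())
}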

axsaucedo commented 3 years ago

+1 - this would be extremely useful 👍

ncw commented 3 years ago

Thanks to @urykhy I've merged the HDFS backend into the latest beta now.

The first beta with the code in is v1.54.0-beta.5040.71edc75ca on branch master (uploaded in 15-30 mins)

Please test and put comments here - thank you :-)

Note that this doesn't support hashes yet - I think @urykhy is working on that at the moment.
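
A quick smoke test could be as simple as adding a remote like this (namenode and username are only examples):

[myhdfs]
type = hdfs
namenode = namenode.example:8020
username = hdfs

and then running:

rclone lsd myhdfs:
rclone copy /tmp/somefile myhdfs:rclone-test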

miguelpuyol commented 3 years ago

I tested it in a kubernetes cluster to upload data to another S3 remote and it works like a charm! Thank you all for the great tool!

RafalSkolasinski commented 3 years ago

Thank you guys for this. I see that the beta image is available on Docker Hub. We will probably test the HDFS functionality at some point over the next week or so and let you know how it goes!

RafalSkolasinski commented 3 years ago

Hi, I tested the beta

rclone v1.54.0-beta.5058.35a4de203
- os/arch: linux/amd64
- go version: go1.15.6

with basic file / directory operations and it worked fine.

I tested against your docker image

docker run --rm --name "rclone-hdfs" -p 127.0.0.1:9866:9866 -p 127.0.0.1:8020:8020 --hostname "rclone-hdfs" rclone/test-hdfs

using

[hdfs-docker]
type = hdfs
namenode = localhost:8020
username = root

config, as well as against a k8s deployment in kind using this helm chart.

For k8s the config was

[hdfs]
type = hdfs
namenode = ${HDFS_SERVICE_HOST}:${HDFS_SERVICE_PORT}
username = hdfs

and I tested it from inside the k8s cluster so that the datanodes are reachable. The environment variables correspond to the host and port of the k8s service created by the gradiant/hdfs chart.
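
By basic operations I mean things along these lines against the docker remote above (file and directory names are just examples):

rclone mkdir hdfs-docker:test-dir
rclone copy ./local-file.txt hdfs-docker:test-dir
rclone ls hdfs-docker:test-dir
rclone cat hdfs-docker:test-dir/local-file.txt
rclone purge hdfs-docker:test-dir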

RafalSkolasinski commented 3 years ago

I wonder if you have any plans to include kerberos auth support in the hdfs module?

everpeace commented 3 years ago

kerberos auth support in the hdfs module?

+1

axsaucedo commented 3 years ago

kerberos auth support in the hdfs module? +1

RafalSkolasinski commented 3 years ago

Also, it would be good to know whether HttpFS support for HDFS is something rclone could include.

ncw commented 3 years ago

@urykhy what do you think?

urykhy commented 3 years ago

Kerberos is supported by the library, so it should be simple to enable. I will play with it.

On HttpFS: I think we can implement it as well.
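
If it lands, I'd expect the config to look roughly like this (option names are not final and may differ in the release):

[hdfs-krb]
type = hdfs
namenode = namenode.example:8020
username = rclone@EXAMPLE.COM
service_principal_name = hdfs/namenode.example
data_transfer_protection = privacy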

ivandeex commented 3 years ago

@RafalSkolasinski

We are close to release 1.54 with hdfs+kerberos.

Also, it would be good to know whether HttpFS support for HDFS is something rclone could include.

Please submit as a separate request so it does not get lost in this overgrown ticket.

Thanks

ncw commented 3 years ago

I'm going to close this ticket now - thank you very much for implementing it @urykhy :-)

Please make new tickets with feature requests for the HDFS backend.