Closed. mlanner closed this issue 3 years ago.
Interesting idea! I don't know anything about HDFS so I've done a small amount of reading...
rclone would work fine with HDFS if it were mounted in the filesystem (eg using NFS Proxy or FUSE). rclone is very undemanding on the local filesystem: it scans directories, opens files, and reads or writes them in sequence, all of which I think should be supported by HDFS.
Alternatively it could use the API. There has been some work using Go with Hadoop
Or use LibHDFS
I don't have access to a hadoop cluster so I can't work on this, but would be grateful for assistance or patches!
Hi Nick,
Thanks for the quick response and links. I started looking at those, and some others. I will probably spin up a test Hadoop cluster to see what I can do, and will post here if/when I have any updates.
Anyone played more with making this happen?
Not as far as I know. Do you want to have a go?
Here is a (go-)native hdfs client library: https://github.com/colinmarc/hdfs
I had a go at implementing HDFS support. The backend tests pass, but hashes are not implemented yet.
Great work!
Is there a docker image we can test against?
I failed to find a simple docker image; maybe this can help.
What have you been testing against?
I run tests with this image. It's a bit heavy; Spark can be removed.
OK
Are there hashes we could use?
I have read about Hadoop checksums, and it's complicated:
- Hadoop supports different hashing methods: MD5MD5CRC and COMPOSITE_CRC
- the default method, MD5MD5CRC, depends on the configuration
MD5MD5CRC looks like the S3 scheme of md5sums-of-md5sums, which in practice is not at all useful since you don't know the block sizes.
Composite CRC looks like it could be useful, though I didn't see how to calculate it... Rclone supports CRC32C already, BTW.
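To make the block-size problem concrete, here is a tiny illustrative sketch (Python, not Hadoop's actual implementation) of an md5-of-md5s scheme: the same bytes hashed with different block sizes give different final digests, so the checksum is only comparable if both sides agree on the block size.

```python
import hashlib

def md5_of_md5s(data: bytes, block_size: int) -> str:
    """MD5 each fixed-size block, then MD5 the concatenation of those digests."""
    digests = b"".join(
        hashlib.md5(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    )
    return hashlib.md5(digests).hexdigest()

payload = b"x" * 1024
# Same bytes, different block sizes -> different final digests,
# which is why the block size must be known to verify anything.
assert md5_of_md5s(payload, 128) != md5_of_md5s(payload, 256)
```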
Are you going to submit a PR?
I see it :-)
Just to note, I'm working on a PR to support COMPOSITE_CRC (CRC32) checksums.
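For readers wondering why composite CRCs are even possible: CRCs are linear over GF(2), so crc(A + B) can be computed from crc(A), crc(B), and len(B) alone, without rereading the data. The sketch below is a pure-Python port of zlib's well-known crc32_combine() algorithm; it uses plain CRC32 (what Python's zlib exposes), not the CRC32C that Hadoop's COMPOSITE_CRC uses, but the combining technique is the same.

```python
import zlib

# CRC-32 (reflected) polynomial, as used by zlib.crc32.
POLY = 0xEDB88320

def _gf2_times(mat, vec):
    # Multiply a 32x32 GF(2) matrix (list of 32 column ints) by a 32-bit vector.
    result, i = 0, 0
    while vec:
        if vec & 1:
            result ^= mat[i]
        vec >>= 1
        i += 1
    return result

def _gf2_square(mat):
    # Square a GF(2) matrix: apply the matrix to each of its own columns.
    return [_gf2_times(mat, col) for col in mat]

def crc32_combine(crc1, crc2, len2):
    """Combine crc32(A) and crc32(B) into crc32(A + B), where len2 = len(B).

    Port of zlib's crc32_combine(): advance crc1 through len2 zero bytes
    via GF(2) matrix exponentiation, then XOR in crc2.
    """
    if len2 == 0:
        return crc1
    odd = [POLY] + [1 << n for n in range(31)]  # operator for one zero bit
    even = _gf2_square(odd)                     # two zero bits
    odd = _gf2_square(even)                     # four zero bits
    while True:
        even = _gf2_square(odd)                 # first pass: one zero byte
        if len2 & 1:
            crc1 = _gf2_times(even, crc1)
        len2 >>= 1
        if len2 == 0:
            break
        odd = _gf2_square(even)
        if len2 & 1:
            crc1 = _gf2_times(odd, crc1)
        len2 >>= 1
        if len2 == 0:
            break
    return crc1 ^ crc2

part1, part2 = b"hello ", b"hdfs"
combined = crc32_combine(zlib.crc32(part1), zlib.crc32(part2), len(part2))
assert combined == zlib.crc32(part1 + part2)
```

This is why a filesystem can store one CRC per block yet still report a single checksum for the whole file.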
+1 - this would be extremely useful 👍
Thanks to @urykhy I've merged the HDFS backend into the latest beta now.
The first beta with the code in is v1.54.0-beta.5040.71edc75ca on branch master (uploaded in 15-30 mins)
Please test and put comments here - thank you :-)
Note that this doesn't support hashes yet - I think @urykhy is working on that at the moment.
I tested it in a kubernetes cluster to upload data to another S3 remote and it works like a charm! Thank you all for the great tool!
Thank you guys for this, I see that a beta image is available on Docker Hub. We will probably test hdfs functionality at some point over the next week or so and let you know how it goes!
Hi, I tested the beta
rclone v1.54.0-beta.5058.35a4de203
- os/arch: linux/amd64
- go version: go1.15.6
with basic file / directory operations and it worked fine.
I tested it against your Docker image
docker run --rm --name "rclone-hdfs" -p 127.0.0.1:9866:9866 -p 127.0.0.1:8020:8020 --hostname "rclone-hdfs" rclone/test-hdfs
using this config
[hdfs-docker]
type = hdfs
namenode = localhost:8020
username = root
as well as against a k8s deployment in kind using these helm charts.
For k8s, the config was
[hdfs]
type = hdfs
namenode = ${HDFS_SERVICE_HOST}:${HDFS_SERVICE_PORT}
username = hdfs
and I tested it from inside the k8s cluster so that the datanodes are reachable.
The environment variables correspond to the host and port of the k8s service created by the gradiant/hdfs chart.
I wonder if you have any plans to include kerberos auth support in the hdfs module?
kerberos auth support in the hdfs module? +1
Also would be good to know if httpfs support of HDFS will be something rclone could include?
@urykhy what do you think?
@RafalSkolasinski We are close to releasing 1.54 with hdfs+kerberos.
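For anyone landing here looking for the resulting setup: judging from the rclone hdfs backend options, a kerberised remote should look roughly like the following. The hostnames and realm are placeholders, and option names should be verified against `rclone help backend hdfs` for your version.

```
[hdfs-krb]
type = hdfs
namenode = namenode.example.com:8020
username = hdfs@EXAMPLE.COM
service_principal_name = hdfs/namenode.example.com
data_transfer_protection = privacy
```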
Also would be good to know if httpfs support of HDFS will be something rclone could include?
Please submit as a separate request so it does not get lost in this overgrown ticket.
Thanks
I'm going to close this ticket now - thank you very much for implementing it @urykhy :-)
Please make new tickets with feature requests for the HDFS backend.
Hi,
Any thoughts or plans around HDFS support? It would be very nice to have a way to, for example, bring a given data set out of Swift into Hadoop to run jobs on.