stackabletech / issues

This repository is only for issues that concern multiple repositories or don't fit into any specific repository
Provide means to run distcp or alternatives on a SDP HDFS stacklet #643

Open razvan opened 1 month ago

razvan commented 1 month ago

Description

Starting with SDP 24.7, the Hadoop image no longer includes the MapReduce jars. This was done to reduce the image size and the supply chain attack surface.

An unfortunate side effect of this is that the hdfs distcp command doesn't work anymore. This command is used in some Stackable demos and is popular among HDFS users.

Possible solutions

  1. Create a separate image called hdfs-tools that includes the MapReduce jars.
  2. Find or implement an alternative tool with similar characteristics (fast, distributed copy between clusters and S3).

Acceptance criteria

TODO

jradmacher commented 1 month ago

Another problem is that distcp runs as a MapReduce job: it either depends on YARN to distribute the copy or runs "locally" in a single process, which can hurt performance.

Sadly, the only alternative I found is spark-distcp, but it has not changed since 2022.

jradmacher commented 1 month ago

Another workaround: do it manually, as described here: https://kb.databricks.com/dbfs/parallelize-fs-operations
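
To make the "manual" idea concrete, here is a minimal sketch in Python. It is not the code from the linked article: it swaps in a plain thread pool on a single client and shells out to hdfs dfs -cp, which should not need the MapReduce jars. The function names (hdfs_cp, parallel_copy) and the worker count are made up for illustration, and it assumes the hdfs CLI is on the PATH. Since everything goes through one client, throughput is limited by that machine, so this is not a distributed replacement for distcp.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable, List, Tuple


def hdfs_cp(src: str, dst: str) -> None:
    """Copy one path with the plain HDFS shell (assumes `hdfs` is on PATH)."""
    # -f overwrites the destination if it already exists.
    subprocess.run(["hdfs", "dfs", "-cp", "-f", src, dst], check=True)


def parallel_copy(
    pairs: Iterable[Tuple[str, str]],
    copy: Callable[[str, str], None] = hdfs_cp,
    workers: int = 8,
) -> List[Tuple[str, str, str]]:
    """Run copy(src, dst) for every pair concurrently.

    Returns a list of (src, dst, error message) for the copies that failed,
    so the caller can retry or report them.
    """
    failures: List[Tuple[str, str, str]] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each future back to its (src, dst) pair for error reporting.
        futures = {pool.submit(copy, s, d): (s, d) for s, d in pairs}
        for fut in as_completed(futures):
            err = fut.exception()
            if err is not None:
                s, d = futures[fut]
                failures.append((s, d, str(err)))
    return failures
```

Usage would look something like parallel_copy([("hdfs://cluster-a/data/part-0", "hdfs://cluster-b/data/part-0")]), where the paths are placeholders. The copy argument is injectable so that a different client (or a stub in tests) can be used instead of the CLI.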