Open razvan opened 1 month ago
Another problem is that distcp either depends on YARN or runs in local mode, and the latter can hurt performance.
Sadly, the only alternative I found is spark-distcp, but it has not been updated since 2022.
Another workaround: do it manually, as described here: https://kb.databricks.com/dbfs/parallelize-fs-operations
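To make the manual route a bit more concrete, here is a rough sketch of parallelizing per-file copies from a single client with a thread pool. It is not what the linked article does (as far as I can tell, that one distributes the work with Spark), and the names `build_copy_command`, `run_copy` and `parallel_copy` are made up for illustration. It assumes the `hdfs` CLI is on the PATH and that the list of source/destination pairs has already been computed.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_copy_command(src, dst):
    # "hdfs dfs -cp -f" overwrites the destination if it already exists.
    return ["hdfs", "dfs", "-cp", "-f", src, dst]


def run_copy(src, dst):
    # Assumption: the hdfs CLI is installed and configured on this machine.
    result = subprocess.run(
        build_copy_command(src, dst), capture_output=True, text=True
    )
    return result.returncode


def parallel_copy(pairs, copy_fn=run_copy, max_workers=8):
    """Copy every (src, dst) pair concurrently and return the pairs that failed."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        codes = list(pool.map(lambda pair: copy_fn(*pair), pairs))
    return [pair for pair, code in zip(pairs, codes) if code != 0]
```

Calling `parallel_copy([("/data/a", "/backup/a"), ("/data/b", "/backup/b")])` returns an empty list when every copy succeeds (the paths are placeholders). Note that `hdfs dfs -cp` streams the data through the client, so throughput is still bounded by the machine running the script. That makes this a stopgap at best, not a real distcp replacement.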
Description
Starting with SDP 24.7, the Hadoop image no longer includes the MapReduce jars. This was done to reduce the image size and the supply chain attack surface.
An unfortunate side effect of this is that the hdfs distcp command doesn't work anymore. This command is used in some Stackable demos and is popular among HDFS users.
Possible solutions
Acceptance criteria
TODO