jupyter / docker-stacks

Ready-to-run Docker images containing Jupyter applications
https://jupyter-docker-stacks.readthedocs.io

Update to latest Hadoop 3.3.6 #1937

Closed mikev closed 10 months ago

mikev commented 1 year ago

What docker image(s) are you using?

all-spark-notebook

Host OS system and architecture running docker image

Ubuntu 22.04

What Docker command are you running?

docker run -it -p 8888:8888 --user root -e GRANT_SUDO=yes -v $(pwd):/home/jovyan/work jupyter/all-spark-notebook:spark-3.4.1

How to Reproduce the problem?

Visit localhost:8888

Open Terminal from Launcher

(base) jovyan@745e84c0ed21:/home$ find /usr/local/spark-3.4.1-bin-hadoop3/ -name "hadoop*"
/usr/local/spark-3.4.1-bin-hadoop3/jars/hadoop-yarn-server-web-proxy-3.3.4.jar
/usr/local/spark-3.4.1-bin-hadoop3/jars/hadoop-shaded-guava-1.1.1.jar
/usr/local/spark-3.4.1-bin-hadoop3/jars/hadoop-client-runtime-3.3.4.jar
/usr/local/spark-3.4.1-bin-hadoop3/jars/hadoop-client-api-3.3.4.jar

Command output

No response

Expected behavior

Expect to see hadoop-client-api-3.3.6.jar. Hadoop should be updated to the latest release, which is 3.3.6 or greater.

Actual behavior

Although Spark is at version 3.4.1, the bundled Hadoop library is still at 3.3.4:

(base) jovyan@745e84c0ed21:/home$ find /usr/local/spark-3.4.1-bin-hadoop3/ -name "hadoop*"
/usr/local/spark-3.4.1-bin-hadoop3/jars/hadoop-yarn-server-web-proxy-3.3.4.jar
/usr/local/spark-3.4.1-bin-hadoop3/jars/hadoop-shaded-guava-1.1.1.jar
/usr/local/spark-3.4.1-bin-hadoop3/jars/hadoop-client-runtime-3.3.4.jar
/usr/local/spark-3.4.1-bin-hadoop3/jars/hadoop-client-api-3.3.4.jar
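For reference, the bundled Hadoop version can be read directly off the jar names in Spark's jars/ directory. A small helper sketch (the function name is hypothetical; it assumes the naming convention shown in the listing above, where hadoop-client-api-<version>.jar tracks the Hadoop release Spark was built against):

```python
import re

def bundled_hadoop_version(jar_names):
    """Infer the bundled Hadoop version from a listing of Spark's jars/ directory."""
    for name in jar_names:
        # hadoop-client-api-<major>.<minor>.<patch>.jar carries the Hadoop release.
        match = re.match(r"hadoop-client-api-(\d+\.\d+\.\d+)\.jar$", name)
        if match:
            return match.group(1)
    return None

# Jar names taken from the `find` output above.
jars = [
    "hadoop-yarn-server-web-proxy-3.3.4.jar",
    "hadoop-shaded-guava-1.1.1.jar",
    "hadoop-client-runtime-3.3.4.jar",
    "hadoop-client-api-3.3.4.jar",
]
```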

Anything else?

Our project uses AWS S3 and requires the requester-pays header on all S3 requests. This issue was described and fixed in Hadoop 3.3.5.

https://issues.apache.org/jira/browse/HADOOP-14661 The patch is here: https://issues.apache.org/jira/secure/attachment/12877218/HADOOP-14661.patch

Per the patch, we're required to set "fs.s3a.requester-pays.enabled" to "true". This fix landed in hadoop-aws 3.3.5, released on Mar 27, 2023.
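A sketch of how that option would typically be set from PySpark, assuming a bundled hadoop-aws of 3.3.5 or newer (the apply_conf helper is hypothetical; "spark.hadoop."-prefixed keys are forwarded by Spark to the underlying Hadoop configuration):

```python
# S3A requester-pays option from HADOOP-14661. Spark forwards
# "spark.hadoop.*" keys into the Hadoop Configuration, so this only
# takes effect once the bundled hadoop-aws is >= 3.3.5.
requester_pays_conf = {
    "spark.hadoop.fs.s3a.requester-pays.enabled": "true",
}

def apply_conf(builder, conf):
    # Apply each key/value pair to a SparkSession builder-like object
    # (anything exposing .config(key, value) that returns the builder).
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

With a real SparkSession.builder this would be followed by .getOrCreate().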

I've tried to upgrade Hadoop in various ways, but it still doesn't work. I finally noticed that my Hadoop is pinned at version 3.3.4, and I can't seem to upgrade to 3.3.5. However, Hadoop 3.3.5 was released only recently, so maybe something extra is needed to get the upgrade into the Jupyter images.

Latest Docker version

mikev commented 1 year ago

Also, you are supposed to be able to specify the Hadoop version when building the image, per the image-specifics instructions: https://jupyter-docker-stacks.readthedocs.io/en/latest/using/specifics.html

docker build --rm --force-rm -t jupyter/all-spark-notebook:spark-3.4.1 . --build-arg hadoop_version=3.3.6

This also failed to set Hadoop to version 3.3.6

mikev commented 1 year ago

It appears that Hadoop is bundled with Spark, so this is likely not a Jupyter build issue. In other words, Hadoop 3.3.4 is bundled with Spark 3.4.1:

michael@PC:/mnt/c/Users/mvier/code/helium/spark-3.4.1-bin-hadoop3$ find . -name "hadoop*"
./jars/hadoop-client-api-3.3.4.jar
./jars/hadoop-client-runtime-3.3.4.jar
./jars/hadoop-shaded-guava-1.1.1.jar
./jars/hadoop-yarn-server-web-proxy-3.3.4.jar
michael@PC:/mnt/c/Users/mvier/code/helium/spark-3.4.1-bin-hadoop3$ cd ..
michael@PC:/mnt/c/Users/mvier/code/helium$ ls spark-3.4.1-bin-hadoop3.tgz
spark-3.4.1-bin-hadoop3.tgz
michael@PC:/mnt/c/Users/mvier/code/helium$ wget https://dlcdn.apache.org/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz

mikev commented 1 year ago

Hadoop 3.3.6 was added to the Spark build files last week: https://github.com/apache/spark/blob/f6e0b3906d533ab719f2423bd136d79215bfa315/pom.xml#L125

It appears we just need to wait for the next Spark release, 3.4.2, which will include Hadoop 3.3.6.

mikev commented 1 year ago

Before this issue is closed, I'm wondering why --build-arg hadoop_version=3.3.6 has no effect.

Per the specifics doc, you are supposed to be able to specify the Hadoop version when building the image: https://jupyter-docker-stacks.readthedocs.io/en/latest/using/specifics.html

Is there a work-around to configure a different Hadoop version?

mikev commented 1 year ago

Recap:

Attempted to dynamically update Hadoop to 3.3.6 via three methods:

One:
my_packages = ["org.apache.hadoop:hadoop-aws:3.3.6"]
spark = configure_spark_with_delta_pip(builder, extra_packages=my_packages).getOrCreate()

Two:
docker build --rm --force-rm -t jupyter/all-spark-notebook:spark-3.4.1 . --build-arg hadoop_version=3.3.6

Three:
Open Jupyter Terminal
pip3 install aws-hadoop

None of the methods worked.

mathbunnyru commented 1 year ago

Also you are supposed to be able to specify the Hadoop version when launching the image per the image specifics instructions: https://jupyter-docker-stacks.readthedocs.io/en/latest/using/specifics.html

docker build --rm --force-rm -t jupyter/all-spark-notebook:spark-3.4.1 . --build-arg hadoop_version=3.3.6

This also failed to set Hadoop to version 3.3.6

You need to build jupyter/pyspark-notebook first; that's where Spark is actually installed.
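A sketch of that two-step build, assuming a local checkout of docker-stacks (directory names and build args vary between docker-stacks versions, so adjust the paths to match your checkout):

```shell
# Build the image that actually installs Spark first,
# passing the Spark/Hadoop build args at this stage...
docker build --rm --force-rm \
    -t jupyter/pyspark-notebook:spark-3.4.1 \
    --build-arg spark_version=3.4.1 \
    ./pyspark-notebook

# ...then build all-spark-notebook on top of it, since its
# Dockerfile uses jupyter/pyspark-notebook as the base image.
docker build --rm --force-rm \
    -t jupyter/all-spark-notebook:spark-3.4.1 \
    ./all-spark-notebook
```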

mathbunnyru commented 1 year ago

Overall, you're right, and we're only using the bundled Hadoop. So, we'll have to wait for an upstream release.

bjornjorgensen commented 1 year ago

Yes, Hadoop is bundled in Apache Spark.

Apache Spark 3.5.0 will soon start its RC process: https://lists.apache.org/thread/z27z5nkzch66plpw88dkbmpt8gdlq044

bjornjorgensen commented 1 year ago

docker build --rm --force-rm -t jupyter/all-spark-notebook:spark-3.4.1 . --build-arg hadoop_version=3.3.6

This also failed to set Hadoop to version 3.3.6

That build arg was for choosing between Hadoop version 2 and version 3, not an exact release like 3.3.6.

bjornjorgensen commented 1 year ago

There are some problems with Hadoop 3.3.6 https://github.com/apache/hadoop/pull/5706

https://lists.apache.org/thread/o7ockmppo5yqk2cm7f1kvo7plfgx6xnc