Also, you are supposed to be able to specify the Hadoop version when building the image, per the image-specifics instructions: https://jupyter-docker-stacks.readthedocs.io/en/latest/using/specifics.html
docker build --rm --force-rm -t jupyter/all-spark-notebook:spark-3.4.1 . --build-arg hadoop_version=3.3.6
This also failed to set Hadoop to version 3.3.6.
It appears that Hadoop is bundled with Spark, so this is likely not a Jupyter build issue. In other words, Hadoop 3.3.4 is bundled with Spark 3.4.1:
michael@PC:/mnt/c/Users/mvier/code/helium/spark-3.4.1-bin-hadoop3$ find . -name "hadoop*"
./jars/hadoop-client-api-3.3.4.jar
./jars/hadoop-client-runtime-3.3.4.jar
./jars/hadoop-shaded-guava-1.1.1.jar
./jars/hadoop-yarn-server-web-proxy-3.3.4.jar
michael@PC:/mnt/c/Users/mvier/code/helium/spark-3.4.1-bin-hadoop3$ cd ..
michael@PC:/mnt/c/Users/mvier/code/helium$ ls spark-3.4.1-bin-hadoop3.tgz
spark-3.4.1-bin-hadoop3.tgz
michael@PC:/mnt/c/Users/mvier/code/helium$ wget https://dlcdn.apache.org/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz
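For what it's worth, the bundled version can also be confirmed from a live PySpark session inside the container (a debugging trick rather than a stable API, since sparkContext._jvm is a private py4j handle):
python3 -c "from pyspark.sql import SparkSession; spark = SparkSession.builder.master('local[1]').getOrCreate(); print(spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())"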
Hadoop 3.3.6 was added to the Spark build files last week: https://github.com/apache/spark/blob/f6e0b3906d533ab719f2423bd136d79215bfa315/pom.xml#L125
It appears we just need to wait for the next Spark release, 3.4.2, which will include Hadoop 3.3.6.
Before this issue is closed, I'm wondering why --build-arg hadoop_version=3.3.6 has no effect.
Per the specifics doc, you are supposed to be able to specify the Hadoop version when building the image: https://jupyter-docker-stacks.readthedocs.io/en/latest/using/specifics.html
Is there a work-around to configure a different Hadoop version?
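One possible work-around (an untested sketch; swapping Hadoop point releases underneath Spark is not officially supported and may break other integrations) is to replace the bundled hadoop-client jars under ${SPARK_HOME}/jars with the 3.3.6 ones from Maven Central, e.g. from a root terminal in the container:
cd "${SPARK_HOME}/jars"
# remove the bundled 3.3.4 client jars
rm hadoop-client-api-3.3.4.jar hadoop-client-runtime-3.3.4.jar
# fetch the matching 3.3.6 jars from Maven Central
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-client-api/3.3.6/hadoop-client-api-3.3.6.jar
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-client-runtime/3.3.6/hadoop-client-runtime-3.3.6.jar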
Recap:
Attempted to dynamically update Hadoop to 3.3.6 via three methods:
One:
my_packages = ["org.apache.hadoop:hadoop-aws:3.3.6"]
spark = configure_spark_with_delta_pip(builder, extra_packages=my_packages).getOrCreate()
Two:
docker build --rm --force-rm -t jupyter/all-spark-notebook:spark-3.4.1 . --build-arg hadoop_version=3.3.6
Three:
Open a Jupyter terminal and run pip3 install aws-hadoop
None of the methods worked.
You need to build jupyter/pyspark-notebook first; that's where Spark is actually installed.
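For example (a sketch only; the directory names and build args are assumed from the docker-stacks repo layout at the time):
# build the image that actually installs Spark first
docker build --rm --force-rm -t jupyter/pyspark-notebook:spark-3.4.1 ./pyspark-notebook --build-arg spark_version=3.4.1
# then build all-spark-notebook on top of it
docker build --rm --force-rm -t jupyter/all-spark-notebook:spark-3.4.1 ./all-spark-notebook --build-arg BASE_CONTAINER=jupyter/pyspark-notebook:spark-3.4.1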
Overall, you're right, and we're only using the bundled Hadoop. So, we'll have to wait for an upstream release.
Yes, Hadoop is bundled in Apache Spark.
Apache Spark 3.5.0 will soon start its RC process: https://lists.apache.org/thread/z27z5nkzch66plpw88dkbmpt8gdlq044
docker build --rm --force-rm -t jupyter/all-spark-notebook:spark-3.4.1 . --build-arg hadoop_version=3.3.6
This also failed to set Hadoop to version 3.3.6.
That build arg was for selecting Hadoop version 2 or version 3 (the major line the Spark distribution is bundled against), not a specific point release like 3.3.6.
There are some known problems with Hadoop 3.3.6:
https://github.com/apache/hadoop/pull/5706
https://lists.apache.org/thread/o7ockmppo5yqk2cm7f1kvo7plfgx6xnc
What docker image(s) are you using?
all-spark-notebook
Host OS system and architecture running docker image
Ubuntu 22.04
What Docker command are you running?
docker run -it -p 8888:8888 --user root -e GRANT_SUDO=yes -v $(pwd):/home/jovyan/work jupyter/all-spark-notebook:spark-3.4.1
How to Reproduce the problem?
Visit localhost:8888
Open Terminal from Launcher
(base) jovyan@745e84c0ed21:/home$ find /usr/local/spark-3.4.1-bin-hadoop3/ -name "hadoop*"
/usr/local/spark-3.4.1-bin-hadoop3/jars/hadoop-yarn-server-web-proxy-3.3.4.jar
/usr/local/spark-3.4.1-bin-hadoop3/jars/hadoop-shaded-guava-1.1.1.jar
/usr/local/spark-3.4.1-bin-hadoop3/jars/hadoop-client-runtime-3.3.4.jar
/usr/local/spark-3.4.1-bin-hadoop3/jars/hadoop-client-api-3.3.4.jar
Command output
No response
Expected behavior
Expected to see hadoop-client-api-3.3.6.jar. Hadoop should be updated to the latest release, which is 3.3.6 or greater.
Actual behavior
Although Spark is at version 3.4.1, the Hadoop libraries are still at 3.3.4:
(base) jovyan@745e84c0ed21:/home$ find /usr/local/spark-3.4.1-bin-hadoop3/ -name "hadoop*"
/usr/local/spark-3.4.1-bin-hadoop3/jars/hadoop-yarn-server-web-proxy-3.3.4.jar
/usr/local/spark-3.4.1-bin-hadoop3/jars/hadoop-shaded-guava-1.1.1.jar
/usr/local/spark-3.4.1-bin-hadoop3/jars/hadoop-client-runtime-3.3.4.jar
/usr/local/spark-3.4.1-bin-hadoop3/jars/hadoop-client-api-3.3.4.jar
Anything else?
Our project uses AWS S3 and requires the requester-pays header on all S3 requests. This issue was described and fixed in Hadoop 3.3.5.
https://issues.apache.org/jira/browse/HADOOP-14661 The patch is here: https://issues.apache.org/jira/secure/attachment/12877218/HADOOP-14661.patch
Per the patch, we're required to set "fs.s3a.requester-pays.enabled" to "true". This fix shipped in hadoop-aws 3.3.5, released on Mar 27, 2023.
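Once a Hadoop/hadoop-aws at 3.3.5 or newer is actually on the classpath, one way to apply the setting globally (a hedged sketch; Spark copies any spark.hadoop.* property into the Hadoop configuration, so this avoids setting it in every notebook) is via spark-defaults.conf inside the container:
echo "spark.hadoop.fs.s3a.requester-pays.enabled true" >> "${SPARK_HOME}/conf/spark-defaults.conf"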
I've tried to upgrade Hadoop in various ways and it still doesn't work. I finally noticed that my Hadoop is pinned at version 3.3.4; somehow I can't upgrade to 3.3.5. However, Hadoop 3.3.5 was released very recently, so maybe something extra is needed to get the upgrade into Jupyter.
Latest Docker version