jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/

Bump spark version and use apache cdn to improve download time for Spark #1210

Closed luong-komorebi closed 1 year ago

luong-komorebi commented 1 year ago

This PR bumps Spark to the latest 3.2.x release and switches the download to the Apache CDN. The goal is to shorten the Spark download, which in turn reduces Docker build time. This helps anyone who needs fast Docker rebuilds in order to work with other systems (Kubernetes, Docker Swarm).

The latest version that the CDN supports is 3.2.3. Notably, from my location (Southeast Asia), the current apache.org link makes it impossible to build the Docker image in under 4 hours.
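As a rough illustration of the link change, the download URL could be assembled as below. This is only a sketch: the helper `spark_download_url`, the constants, and the choice of `dlcdn.apache.org` and `archive.apache.org/dist` as base URLs are assumptions made for illustration and are not taken from this PR's diff. The `hadoop2.7` build name follows the file name used in the links under Reference.

```python
# Hypothetical helper: builds a Spark binary download URL for a given base.
# The base URLs below are assumptions for illustration, not this PR's values.
CDN_BASE = "https://dlcdn.apache.org"              # CDN, hosts current releases only
ARCHIVE_BASE = "https://archive.apache.org/dist"   # archive, keeps older releases


def spark_download_url(version: str, hadoop: str = "2.7", base: str = CDN_BASE) -> str:
    """Return the URL of the Spark binary tarball for `version` under `base`."""
    name = f"spark-{version}-bin-hadoop{hadoop}"
    return f"{base}/spark/spark-{version}/{name}.tgz"


if __name__ == "__main__":
    # Version served by the CDN, per the PR description.
    print(spark_download_url("3.2.3"))
    # An older version would have to come from the archive instead.
    print(spark_download_url("3.2.1", base=ARCHIVE_BASE))
```

Keeping the version and base as separate parameters means a later bump only touches one value, which matters because a CDN that hosts only current releases will stop serving a version once it is superseded.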

While changing the download link, I also re-ran the itest-docker tests locally. HAProxy was not running, so I fixed that as well. The tests can never succeed if the kernel images are not present, so they are now preloaded before the test run in itest-docker-prep.
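The preloading idea can be sketched roughly as follows. The PR does this in the itest-docker-prep step; the Python below is a stand-in for illustration, and the function names and image names are hypothetical. It only relies on the standard `docker pull IMAGE` command.

```python
import subprocess
from typing import Iterable, List


def pull_commands(images: Iterable[str]) -> List[List[str]]:
    """Build one `docker pull` command per image, without running anything."""
    return [["docker", "pull", image] for image in images]


def preload(images: Iterable[str], dry_run: bool = True) -> List[List[str]]:
    """Pull every image before the tests start; with dry_run, only report the commands."""
    commands = pull_commands(images)
    if not dry_run:
        for cmd in commands:
            # check=True stops early if an image cannot be pulled.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical image names; substitute the kernel images the tests use.
    for cmd in preload(["example/kernel-py:dev", "example/kernel-scala:dev"]):
        print(" ".join(cmd))
```

Failing fast on a missing image makes the cause visible before the test run, rather than surfacing later as a kernel that never starts.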

Running itest cases gives:

>       self.assertRegex(interrupted_result, "java.lang.InterruptedException")
E       AssertionError: Regex didn't match: 'java.lang.InterruptedException' not found in 'begin\nend\n'

enterprise_gateway/itests/test_scala_kernel.py:65: AssertionError
=============================================================== slowest 10 durations ================================================================
62.15s call     enterprise_gateway/itests/test_scala_kernel.py::TestScalaKernelLocal::test_interrupt
13.75s call     enterprise_gateway/itests/test_scala_kernel.py::TestScalaKernelLocal::test_restart
11.38s setup    enterprise_gateway/itests/test_scala_kernel.py::TestScalaKernelLocal::test_get_hostname
5.80s call     enterprise_gateway/itests/test_r_kernel.py::TestRKernelLocal::test_restart
5.59s call     enterprise_gateway/itests/test_python_kernel.py::TestPythonKernelLocal::test_restart
5.37s setup    enterprise_gateway/itests/test_r_kernel.py::TestRKernelLocal::test_get_hostname
4.00s call     enterprise_gateway/itests/test_python_kernel.py::TestPythonKernelLocal::test_interrupt
3.76s setup    enterprise_gateway/itests/test_python_kernel.py::TestPythonKernelLocal::test_get_hostname
3.67s call     enterprise_gateway/itests/test_r_kernel.py::TestRKernelLocal::test_interrupt
1.00s call     enterprise_gateway/itests/test_scala_kernel.py::TestScalaKernelLocal::test_get_hostname
============================================================== short test summary info ==============================================================
FAILED enterprise_gateway/itests/test_scala_kernel.py::TestScalaKernelLocal::test_interrupt - AssertionError: Regex didn't match: 'java.lang.InterruptedException' not found in 'begin\nend\n'
===================================================== 1 failed, 12 passed in 120.50s (0:02:00) ======================================================

I am not very familiar with the Scala kernel, and it appears that the kernel is not being interrupted. I am currently stuck here. If this idea is worth pursuing, I may start over and debug the Scala integration test.
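To make the failure easier to read, here is a small sketch that applies the same regex check to the output captured in the log above. The `diagnose` helper and its labels are hypothetical. The interpretation is also an assumption: both markers appear without the exception, and the call took about 62 seconds, so the cell may have run to completion instead of being interrupted.

```python
import re

# Output and pattern copied from the assertion failure in the log above.
OBSERVED = "begin\nend\n"
EXPECTED_PATTERN = "java.lang.InterruptedException"


def diagnose(output: str, pattern: str = EXPECTED_PATTERN) -> str:
    """Classify kernel output the way the failing assertion would see it."""
    if re.search(pattern, output):
        return "interrupted"
    if "begin" in output and "end" in output:
        # Both markers printed and no exception: the cell was not cut short.
        return "ran to completion"
    if "begin" in output:
        return "stopped without the expected exception"
    return "no output"


if __name__ == "__main__":
    print(diagnose(OBSERVED))
```

If that reading is right, the thing to debug is whether the interrupt request ever reaches the Scala kernel process, not the assertion itself.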

Reference:

  1. The Apache Software Foundation is moving from mirrors to a CDN: https://news.apache.org/foundation/entry/apache-software-foundation-moves-to
  2. Linking directly to apache.org is discouraged: https://infra.apache.org/release-download-pages.html#:~:text=Do%20not%20link%20directly%20to%20dist.apache.org
  3. Version 3.2.1 is no longer available from the mirror when using the CDN: https://www.apache.org/dyn/closer.lua/spark/spark-3.2.1/spark-3.2.1-bin-hadoop2.7.tgz