jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/

Update Java certs to keep multi-arch builds from hanging #1241

Closed lresende closed 1 year ago

lresende commented 1 year ago

When building the images for the 3.2.0 release, the multi-arch build was hanging, and the image build logs showed that the Java download/install steps were never completing. Google came to the rescue and suggested updating the Java certs, which resolved the issue.

=> [linux/amd64  2/28] RUN dpkg --purge --force-depends ca-certificates-java     && apt-get update && apt-get -yq dist-upgrade     && apt-get install -yq --no-install-recommen  20118.7s
 => => # Setting up libfontconfig1:amd64 (2.13.1-4.2) ...                                                                                                                                 
 => => # Setting up libavahi-client3:amd64 (0.8-5+deb11u1) ...                                                                                                                            
 => => # Setting up libcups2:amd64 (2.3.3op2-3+deb11u2) ...                                                                                                                               
 => => # Setting up ca-certificates-java (20190909) ...                                                                                                                                   
 => => # /bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)                                                                                                        
 => => # head: cannot open '/etc/ssl/certs/java/cacerts' for reading: No such file or directory      

Note that, after building all images, only the three images updated in this PR had issues related to the Java certs.
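For context, the fix follows the commonly suggested pattern of purging the stale ca-certificates-java package before upgrading, then reinstalling it alongside the JRE so its postinst hook regenerates /etc/ssl/certs/java/cacerts (the file the build log above reports as missing). A minimal sketch of that Dockerfile pattern - the exact package set is an assumption based on the log, not necessarily the contents of this PR:

```dockerfile
# Purge the stale Java cert store first so later package hooks don't
# run against a missing/outdated /etc/ssl/certs/java/cacerts.
RUN dpkg --purge --force-depends ca-certificates-java \
 && apt-get update \
 && apt-get -yq dist-upgrade \
 # Reinstall ca-certificates-java *after* the JRE so its postinst hook
 # can regenerate the keystore from the refreshed CA bundle.
 && apt-get install -yq --no-install-recommends \
        ca-certificates \
        openjdk-8-jre-headless \
        ca-certificates-java \
 && rm -rf /var/lib/apt/lists/*
```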

kevin-bates commented 1 year ago

Hi @lresende - thanks for looking into this. Prior to removing this from WIP, could you please update the description with how or why java certificates interfere with multi-arch build's ability to complete in a timely manner? I think this would be good for folks to understand (including me). (Thanks)

I'm guessing the enterprise-gateway image requires an update as well since it installs java and spark also - but that's only a hunch.

kevin-bates commented 1 year ago

I've confirmed that the demo-base changes are interfering with signal handling in Java - so any interrupt (which also occurs during shutdown operations in order to stop the current processing) produces this stack trace:

[I 2023-01-19 19:08:51.476 EnterpriseGatewayApp] Kernel interrupted: d3d82f99-a46d-4e9d-bd1c-010f2934c0f1
[I 230119 19:08:51 web:2239] 204 POST /api/kernels/d3d82f99-a46d-4e9d-bd1c-010f2934c0f1/interrupt (172.17.0.1) 1.37ms
2023-01-19 19:08:51,485 WARN layer.StandardComponentInitialization$$anon$1: Locked to Scala interpreter with SparkIMain until decoupled!
2023-01-19 19:08:51,485 WARN layer.StandardComponentInitialization$$anon$1: Unable to control initialization of REPL class server!
Exception in thread "SIGINT handler" java.lang.ExceptionInInitializerError
    at org.apache.spark.package$.<init>(package.scala:93)
    at org.apache.spark.package$.<clinit>(package.scala)
    at org.apache.spark.SparkContext.$anonfun$new$1(SparkContext.scala:193)
    at org.apache.spark.internal.Logging.logInfo(Logging.scala:57)
    at org.apache.spark.internal.Logging.logInfo$(Logging.scala:56)
    at org.apache.spark.SparkContext.logInfo(SparkContext.scala:82)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:193)
    at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2690)
    at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:949)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:943)
    at org.apache.toree.kernel.api.Kernel.sparkSession(Kernel.scala:444)
    at org.apache.toree.kernel.api.Kernel.sparkContext(Kernel.scala:449)
    at org.apache.toree.kernel.interpreter.scala.ScalaInterpreter.interrupt(ScalaInterpreter.scala:168)
    at org.apache.toree.boot.layer.StandardHookInitialization$$anon$1.handle(HookInitialization.scala:83)
    at jdk.unsupported/sun.misc.Signal$InternalMiscHandler.handle(Signal.java:198)
    at java.base/jdk.internal.misc.Signal$1.run(Signal.java:220)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.NullPointerException
    at org.apache.spark.package$SparkBuildInfo$.<init>(package.scala:60)
    at org.apache.spark.package$SparkBuildInfo$.<clinit>(package.scala)
    ... 18 more

kevin-bates commented 1 year ago

Go figure. If I apply the following changes, all Scala and R tests pass, but all 4 interrupt tests on Python fail!

$ git diff
diff --git a/etc/docker/demo-base/Dockerfile b/etc/docker/demo-base/Dockerfile
index 4712994..1b852bc 100644
--- a/etc/docker/demo-base/Dockerfile
+++ b/etc/docker/demo-base/Dockerfile
@@ -41,7 +41,6 @@ RUN dpkg --purge --force-depends ca-certificates-java \
     less \
     nano \
     ca-certificates \
-    ca-certificates-java \
     libkrb5-dev \
     sudo \
     locales \
@@ -54,9 +53,10 @@ RUN dpkg --purge --force-depends ca-certificates-java \
     software-properties-common \
     openssh-server \
     openssh-client \
- && apt-add-repository 'deb http://security.debian.org/debian-security stretch/updates main' \
- && apt-get update && apt-get install -yq --no-install-recommends openjdk-8-jdk \
- && rm -rf /var/lib/apt/lists/*
+    && apt-add-repository 'deb http://security.debian.org/debian-security stretch/updates main' \
+    && apt-get update && apt-get install -yq --no-install-recommends openjdk-8-jre-headless \
+    ca-certificates-java \
+    && rm -rf /var/lib/apt/lists/*

 RUN ln -s $(readlink -f /usr/bin/javac | sed "s:/bin/javac::") ${JAVA_HOME}

It was not sufficient to only move the ca-certificates-java entry OR to only install the headless JRE (the headless JDK didn't work); BOTH changes had to be made. I thought I had it resolved, then opened the tests up to all kernels and found the Python interrupt tests failed!

I'm really at a loss how to proceed here.

Since demo-base is essentially static (it's only updated when the Spark version changes), perhaps we could skip the multi-arch build for it and just build for each architecture individually? Does that approach also encounter the hang issue?
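To sketch what that fallback could look like (image names, tags, and paths here are illustrative, not what our build scripts actually use): instead of one multi-arch buildx invocation, build each platform separately and then stitch the per-arch tags into a single manifest:

```shell
# Single multi-arch build (the variant that was hanging):
docker buildx build --platform linux/amd64,linux/arm64 \
    -t elyra/demo-base:dev --push etc/docker/demo-base

# Fallback: one build per architecture, then a merged manifest list.
docker buildx build --platform linux/amd64 \
    -t elyra/demo-base:dev-amd64 --push etc/docker/demo-base
docker buildx build --platform linux/arm64 \
    -t elyra/demo-base:dev-arm64 --push etc/docker/demo-base
docker buildx imagetools create -t elyra/demo-base:dev \
    elyra/demo-base:dev-amd64 elyra/demo-base:dev-arm64
```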

kevin-bates commented 1 year ago

@leoyifeng - since you contributed the multi-arch build support, do you have ideas on why the addition of ca-certificates-java (negatively) influences the way signals are processed?

Why was it necessary to move the installation of the JDK to after the add-repository command? (Sorry, I'm not that familiar with the Linux installation details - but does that imply that the JDK (and anything installed afterward) was pulled from that repository?)
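For what it's worth, my understanding of the ordering (hedged - I haven't traced the exact package origins): apt-add-repository only edits the sources list, and the new repository is consulted only after the package index is refreshed, so any install placed before that point cannot resolve packages from it. Roughly:

```shell
# apt-add-repository appends an entry to /etc/apt/sources.list(.d)...
apt-add-repository 'deb http://security.debian.org/debian-security stretch/updates main'
# ...but the new repo is only consulted once the index is refreshed:
apt-get update
# Installs after this point can draw openjdk-8-* (and its dependencies)
# from the newly added repo - which would explain why the JDK install
# had to come after the add-repository step.
apt-get install -yq --no-install-recommends openjdk-8-jre-headless
```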

It would be much appreciated if you could also review the previous comments.

Thank you.

kevin-bates commented 1 year ago

Good news! It turns out that this branch does not include the Python interrupt test changes from #1239. As a result, the Python interrupt failures are a red herring.

It does, however, look like the change to the headless JRE and the relocation of the ca-certificates-java installation are still necessary - but I'm going to reconfirm.
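A quick way to sanity-check a rebuilt image is to confirm the Java trust store was actually generated (the path is taken from the error in the build log above; keytool ships with the JDK, and "changeit" is the default store password used by ca-certificates-java):

```shell
# In the broken image this failed with "No such file or directory":
ls -l /etc/ssl/certs/java/cacerts
# If the store was regenerated correctly, this lists the CA entries:
keytool -list -keystore /etc/ssl/certs/java/cacerts -storepass changeit | head
```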

lresende commented 1 year ago

@kevin-bates I tried this locally and I believe I know what's going on. I'm rebuilding with some changes and will provide an update if it succeeds.