Closed: lresende closed this 1 year ago
Hi @lresende - thanks for looking into this. Prior to removing this from WIP, could you please update the description with how or why Java certificates interfere with the multi-arch build's ability to complete in a timely manner? I think this would be good for folks to understand (including me). (Thanks)
I'm guessing the `enterprise-gateway` image requires an update as well since it installs Java and Spark also - but that's only a hunch.
I've confirmed that the `demo-base` changes are interfering with signal handling in Java - so any interrupts (which also take place during shutdown operations in order to stop the current processing) produce this stack trace:
```
[I 2023-01-19 19:08:51.476 EnterpriseGatewayApp] Kernel interrupted: d3d82f99-a46d-4e9d-bd1c-010f2934c0f1
[I 230119 19:08:51 web:2239] 204 POST /api/kernels/d3d82f99-a46d-4e9d-bd1c-010f2934c0f1/interrupt (172.17.0.1) 1.37ms
2023-01-19 19:08:51,485 WARN layer.StandardComponentInitialization$$anon$1: Locked to Scala interpreter with SparkIMain until decoupled!
2023-01-19 19:08:51,485 WARN layer.StandardComponentInitialization$$anon$1: Unable to control initialization of REPL class server!
Exception in thread "SIGINT handler" java.lang.ExceptionInInitializerError
    at org.apache.spark.package$.<init>(package.scala:93)
    at org.apache.spark.package$.<clinit>(package.scala)
    at org.apache.spark.SparkContext.$anonfun$new$1(SparkContext.scala:193)
    at org.apache.spark.internal.Logging.logInfo(Logging.scala:57)
    at org.apache.spark.internal.Logging.logInfo$(Logging.scala:56)
    at org.apache.spark.SparkContext.logInfo(SparkContext.scala:82)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:193)
    at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2690)
    at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:949)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:943)
    at org.apache.toree.kernel.api.Kernel.sparkSession(Kernel.scala:444)
    at org.apache.toree.kernel.api.Kernel.sparkContext(Kernel.scala:449)
    at org.apache.toree.kernel.interpreter.scala.ScalaInterpreter.interrupt(ScalaInterpreter.scala:168)
    at org.apache.toree.boot.layer.StandardHookInitialization$$anon$1.handle(HookInitialization.scala:83)
    at jdk.unsupported/sun.misc.Signal$InternalMiscHandler.handle(Signal.java:198)
    at java.base/jdk.internal.misc.Signal$1.run(Signal.java:220)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.NullPointerException
    at org.apache.spark.package$SparkBuildInfo$.<init>(package.scala:60)
    at org.apache.spark.package$SparkBuildInfo$.<clinit>(package.scala)
    ... 18 more
```
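For anyone unfamiliar with that error shape: `ExceptionInInitializerError` is the JVM wrapping an exception that escaped a class's static initializer, which is what happens here when `SparkBuildInfo`'s `<clinit>` throws an NPE. A minimal sketch of the mechanism (class and resource names below are hypothetical, not Spark's actual code):

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// Sketch of the failure mode in the trace above: an exception thrown from
// a static initializer is re-surfaced by the JVM as an
// ExceptionInInitializerError wrapping the real cause, just as
// SparkBuildInfo's <clinit> wraps the NullPointerException.
public class InitFailureDemo {
    static class BuildInfo {
        static final String version;
        static {
            Properties props = new Properties();
            // Simulates a version-info resource that isn't visible on the
            // classpath: getResourceAsStream returns null for an absent
            // resource ...
            InputStream in = BuildInfo.class
                .getResourceAsStream("/no-such-version-info.properties");
            try {
                props.load(in); // ... and Properties.load(null) throws NPE
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
            version = props.getProperty("version");
        }
    }

    // The first touch of BuildInfo triggers <clinit>; the NPE escapes it
    // and arrives wrapped in ExceptionInInitializerError.
    public static String firstUse() {
        try {
            return BuildInfo.version;
        } catch (ExceptionInInitializerError e) {
            return "caught: " + e.getCause().getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(firstUse()); // prints "caught: NullPointerException"
    }
}
```

One plausible (unverified) mechanism for the real trace: if the build-info initializer resolves its properties resource through the current thread's context class loader, running it for the first time on the signal-handler thread could make that lookup return null.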
Go figure. If I apply the following changes, all Scala and R tests pass, but all 4 interrupt tests on Python fail!
```
$ git diff
diff --git a/etc/docker/demo-base/Dockerfile b/etc/docker/demo-base/Dockerfile
index 4712994..1b852bc 100644
--- a/etc/docker/demo-base/Dockerfile
+++ b/etc/docker/demo-base/Dockerfile
@@ -41,7 +41,6 @@ RUN dpkg --purge --force-depends ca-certificates-java \
     less \
     nano \
     ca-certificates \
-    ca-certificates-java \
     libkrb5-dev \
     sudo \
     locales \
@@ -54,9 +53,10 @@ RUN dpkg --purge --force-depends ca-certificates-java \
     software-properties-common \
     openssh-server \
     openssh-client \
-    && apt-add-repository 'deb http://security.debian.org/debian-security stretch/updates main' \
-    && apt-get update && apt-get install -yq --no-install-recommends openjdk-8-jdk \
-    && rm -rf /var/lib/apt/lists/*
+    && apt-add-repository 'deb http://security.debian.org/debian-security stretch/updates main' \
+    && apt-get update && apt-get install -yq --no-install-recommends openjdk-8-jre-headless \
+    ca-certificates-java \
+    && rm -rf /var/lib/apt/lists/*
 RUN ln -s $(readlink -f /usr/bin/javac | sed "s:/bin/javac::") ${JAVA_HOME}
```
It was not sufficient to only move the `ca-certificates-java` entry OR only install the headless JRE (the headless JDK didn't work) - BOTH changes had to be made. I thought I had it resolved, then opened the tests up to all kernels and found the Python interrupt tests failed!
I'm really at a loss how to proceed here.
Since `demo-base` is essentially static (it's only updated when the Spark version changes), perhaps we could forgo the multi-arch build for it and just build for each architecture individually? Does that approach encounter the hang issue?
@leoyifeng - since you contributed the multi-arch build support, do you have ideas on why the addition of `ca-certificates-java` (negatively) influences the way signals are processed?
Why was it necessary to move the installation of the JDK until after the `apt-add-repository` command? (Sorry, I'm not that familiar with Linux package installation, but does that imply that the JDK (and anything installed afterward) was pulled from that repository?)
It would be much appreciated if you could also review the previous comments.
Thank you.
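As background for the signal question above: per the stack trace, Toree's interrupt hook is registered through the unsupported `sun.misc.Signal` API, and the JVM dispatches such handlers on a dedicated thread (hence the trace starting in a thread named "SIGINT handler"). A minimal sketch with a trivial stand-in handler (names below are illustrative, not Toree's code):

```java
import sun.misc.Signal;

// Sketch of how a JVM-level SIGINT hook like Toree's is registered via
// sun.misc.Signal (from the jdk.unsupported module, as the trace shows).
// The JVM invokes the handler on its own signal-dispatch thread, so any
// lazy class initialization the handler triggers (e.g. touching
// SparkContext for the first time) happens on that thread, with that
// thread's context class loader.
public class SigintDemo {
    static volatile String handlerThread = null;

    public static String raiseAndHandle() throws InterruptedException {
        // Replace the default SIGINT behavior with our own handler.
        Signal.handle(new Signal("INT"),
            sig -> handlerThread = Thread.currentThread().getName());
        Signal.raise(new Signal("INT")); // deliver SIGINT to this process
        Thread.sleep(200);               // give the handler thread time to run
        return handlerThread;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("SIGINT handled on thread: " + raiseAndHandle());
    }
}
```

This doesn't explain why `ca-certificates-java` changes the behavior, but it shows why the failure only appears on the interrupt path: the handler thread is distinct from the kernel's main thread.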
Good news! It turns out that this branch does not include the changes relative to the Python interrupt tests in #1239. As a result, the Python interrupt failures are a red herring.
It does, however, look like the switch to the headless JRE and the relocation of the `ca-certificates-java` installation are still necessary - but I'm going to reconfirm.
@kevin-bates I tried this locally and I believe I know what's going on; I'm rebuilding with some changes and will provide an update if it succeeds.
When building the images for the 3.2.0 release, the multi-arch build was hanging, and you could see in the image build logs that it was trying to download/install Java and never completing. Google came to the rescue and suggested updating the Java certs, which resolved the issue.
Note that, after building all of the images, only the three updated in this PR had issues related to Java certs.