jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/

Spark 2.2 no longer supports SPARK_YARN_USER_ENV #218

Closed ckadner closed 6 years ago

ckadner commented 6 years ago

Currently our Python kernel specs for YARN Client and YARN Cluster mode set up this environment:

  "env": {
    "SPARK_HOME": "/usr/hdp/current/spark2-client",
    "PYSPARK_PYTHON": "/opt/anaconda2/bin/python",
    "PYTHONPATH": "${HOME}/.local/lib/python2.7/site-packages:/usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip",
    "SPARK_YARN_USER_ENV": "PYTHONUSERBASE=/home/yarn/.local,PYTHONPATH=/usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip,PATH=/opt/anaconda2/bin:$PATH",
    "SPARK_OPTS": "--master yarn --deploy-mode cluster --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID} --conf spark.yarn.submit.waitAppCompletion=false",
    "LAUNCH_OPTS": ""
  },

But the env variable SPARK_YARN_USER_ENV (along with other formerly deprecated environment variables) will no longer be respected in Spark 2.2:

"[SPARK-17979][SPARK-14453] Remove deprecated SPARK_YARN_USER_ENV ..." [8f0490e]:

@@ -748,14 +748,6 @@ resource-managers/yarn/src/main/.../deploy/yarn/Client.scala:

       .map { case (k, v) => (k.substring(amEnvPrefix.length), v) }
       .foreach { case (k, v) => YarnSparkHadoopUtil.addPathToEnvironment(env, k, v) }

-    // Keep this for backwards compatibility but users should move to the config
-    sys.env.get("SPARK_YARN_USER_ENV").foreach { userEnvs =>
-    // Allow users to specify some environment variables.
-      YarnSparkHadoopUtil.setEnvFromInputString(env, userEnvs)
-      // Pass SPARK_YARN_USER_ENV itself to the AM so it can use it to set up executor environments.
-      env("SPARK_YARN_USER_ENV") = userEnvs
-    }
-
     // If pyFiles contains any .py files, we need to add LOCALIZED_PYTHON_DIR to the PYTHONPATH
     // of the container processes too. Add all non-.py files directly to PYTHONPATH.
     //

Also see Spark PR #17212


The recommended way to configure the YARN user environment is via configuration properties specified in the ${SPARK_HOME}/conf/spark-defaults.conf file or via the --conf ... command line argument(s) to the spark-submit command.

... from the Spark docs:

When running Spark on YARN in cluster mode, environment variables need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName] property in your conf/spark-defaults.conf file. ... The user can specify multiple of these to set multiple environment variables. In cluster mode this controls the environment of the Spark driver and in client mode it only controls the environment of the executor launcher. ... Environment variables that are set in spark-env.sh will not be reflected in the YARN Application Master process in cluster mode. See the YARN-related Spark Properties for more information.
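
For illustration, the settings currently packed into SPARK_YARN_USER_ENV could be expressed as spark.yarn.appMasterEnv.* properties roughly as follows (a sketch only; the paths come from the kernel spec above, and whether a literal $PATH reference gets expanded inside the container would need to be verified):

    spark.yarn.appMasterEnv.PYTHONUSERBASE  /home/yarn/.local
    spark.yarn.appMasterEnv.PYTHONPATH      /usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip
    spark.yarn.appMasterEnv.PATH            /opt/anaconda2/bin:$PATH

These lines could live in conf/spark-defaults.conf, or each could be passed individually as a --conf key=value argument to spark-submit.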


Possible Solutions:

  1. Change the kernel.json files to add additional --conf spark.yarn.appMasterEnv.[EnvironmentVariableName] parameters to the existing spark-submit options in "SPARK_OPTS": "--master ...". This makes the kernel.json files less readable and could cause problems with nested quotes, escaped quotes, etc. (see the SPARK_OPTS sketch after this list).

  2. Change our run.sh files to add additional --conf spark.yarn.appMasterEnv.[EnvironmentVariableName] parameters to the spark-submit command, which would allow for more flexible variable expansion, processing, quoting, etc., but may end up "hiding away" important settings.

  3. Add a new "example" Spark config file alongside our kernel files that contains all the necessary configuration and then pass it to Spark by adding --properties-file <path-conf-file> to the "SPARK_OPTS": "--master ..." variable in the kernel.json files. The problem with this approach is that Spark only accepts one config file, so we cannot simply "add" extra properties in addition to what is defined in ${SPARK_HOME}/conf/spark-defaults.conf.

  4. Hybrid of (2) and (3): modify our run.sh files to dynamically compose a properties file from ${SPARK_HOME}/conf/spark-defaults.conf plus the additional properties required, either by adding a cat <<'EOF' section to append properties or by concatenating the default config file with our new "example" config file (see the run.sh sketch after this list). The "informed" assumption for this approach is that, if a property key is duplicated, whatever occurrence comes last takes precedence.
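
For option 1, a hedged sketch of how the SPARK_OPTS entry in the Python YARN cluster kernel.json might look, reusing the paths from the env block above (the value is a single long line in the actual file, and the $PATH entry may need special handling):

    "SPARK_OPTS": "--master yarn --deploy-mode cluster --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID} --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.appMasterEnv.PYTHONUSERBASE=/home/yarn/.local --conf spark.yarn.appMasterEnv.PYTHONPATH=/usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip --conf spark.yarn.appMasterEnv.PATH=/opt/anaconda2/bin:$PATH",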
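
For option 4, a minimal run.sh sketch under the stated assumption that the last occurrence of a duplicated key wins (the temp-file name is illustrative, and simple appends are used here in place of the cat <<'EOF' heredoc mentioned above):

    # Compose a per-kernel properties file from the system defaults plus the
    # settings that used to be carried in SPARK_YARN_USER_ENV.
    PROPS_FILE="$(mktemp /tmp/kernel-spark-props.XXXXXX)"
    if [ -f "${SPARK_HOME}/conf/spark-defaults.conf" ]; then
      cat "${SPARK_HOME}/conf/spark-defaults.conf" > "${PROPS_FILE}"
    fi
    {
      echo "spark.yarn.appMasterEnv.PYTHONUSERBASE /home/yarn/.local"
      echo "spark.yarn.appMasterEnv.PYTHONPATH /usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip"
    } >> "${PROPS_FILE}"

    # spark-submit would then be invoked with: --properties-file "${PROPS_FILE}"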


FYI, the environment variables PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON do remain. We should start setting PYSPARK_DRIVER_PYTHON in our Python kernel specs.
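
For example, the Python kernel specs could add a line like the following to the env block shown above, assuming the driver should use the same Anaconda interpreter already configured for PYSPARK_PYTHON:

    "PYSPARK_DRIVER_PYTHON": "/opt/anaconda2/bin/python",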

kevin-bates commented 6 years ago

Thanks @ckadner. This issue applies to the Scala (YARN cluster) kernel.json file as well. We do the following to convey the alternate SIGINT signal to Toree:

"SPARK_YARN_USER_ENV": "TOREE_ALTERNATE_SIGINT=USR2",

ckadner commented 6 years ago

Updated the description to include a 4th "hybrid" solution.

kevin-bates commented 6 years ago

@ckadner - Hmm. I was thinking the "hybrid" solution would be to take the locally defined config information that only includes kernel-specific settings (as described in solution 3) and add those into the appropriate places (as described in solution 2). The locally defined config information wouldn't necessarily be in a config-file format, but more in the format that is most easily merged into run.sh variables. Pulling in the system-wide config information was a surprise to me.

I strongly believe we should avoid reading the standard config files (from $SPARK_HOME) since they will more than likely contain many things that are not applicable to kernel runtime envs, and their location can even be changed away from $SPARK_HOME. As a result, I have to down-vote :-1: solution 4.

akchinSTC commented 6 years ago

Deployed a new environment with HDP and Spark 2.2 and came across this issue. Updated the kernel.json for R cluster mode with Christian's option 1, which seems to have resolved it. Change here: https://github.com/akchinSTC/enterprise_gateway/commit/d789061e7a49d4b679cac720594f544c5c383f06 Removed the old SPARK_YARN_USER_ENV as well. The Scala kernel is also having trouble coming up, but after looking at the logs it may be unrelated to this issue.

akchinSTC commented 6 years ago

Went with Christian's Option 1. PR merged here: https://github.com/jupyter-incubator/enterprise_gateway/pull/256