Closed ckadner closed 6 years ago
Thanks @ckadner. This issue also applies to the Scala (YARN cluster) `kernel.json` file. We do the following to convey the alternate signal to Toree:
"SPARK_YARN_USER_ENV": "TOREE_ALTERNATE_SIGINT=USR2",
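Since Spark 2.2 drops support for `SPARK_YARN_USER_ENV` (per the issue below), the same signal override would have to travel as a YARN Application Master environment property instead. A sketch of the equivalent `--conf` form — the full `SPARK_OPTS` value in the kernel spec is assumed, only the `appMasterEnv` property comes from this thread:

```shell
# SPARK_YARN_USER_ENV is no longer respected in Spark 2.2; the equivalent
# is a spark.yarn.appMasterEnv.* property passed via --conf.
# (Sketch only; the surrounding SPARK_OPTS contents are assumptions.)
SPARK_OPTS="--master yarn --deploy-mode cluster --conf spark.yarn.appMasterEnv.TOREE_ALTERNATE_SIGINT=USR2"
echo "$SPARK_OPTS"
```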
Updated the description to include a 4th "hybrid" solution.
@ckadner - Hmm. I was thinking the "hybrid" solution would be to take the locally defined config information that only includes kernel-specific settings (as described in solution 3) and add it into the appropriate places (as described in solution 2). The locally defined config information wouldn't necessarily be in a config-file format, but rather in whatever format is most easily merged into the `run.sh` variables. Pulling in the system-wide config information was a surprise to me.
I strongly believe we should avoid reading the standard config files (from `$SPARK_HOME`), since they will more than likely contain many things that are not applicable to kernel runtime envs, and their location can even be changed away from `$SPARK_HOME`. As a result, I have to down-vote :-1: solution 4.
Deployed a new env with HDP and Spark 2.2 and came across this issue. Updated the `kernel.json` for R cluster mode with Christian's option #1, which seems to have resolved it. Change here: https://github.com/akchinSTC/enterprise_gateway/commit/d789061e7a49d4b679cac720594f544c5c383f06 Removed the old `SPARK_YARN_USER_ENV` as well. The Scala kernel is also having issues coming up, but after looking at the logs it may be unrelated to this.
Went with Christian's Option 1. PR merged here: https://github.com/jupyter-incubator/enterprise_gateway/pull/256
Currently our Python kernel specs for YARN Client and YARN Cluster mode set up this environment:

But the env variable `SPARK_YARN_USER_ENV` (along with other formerly deprecated environment variables) will no longer be respected in Spark 2.2: "[SPARK-17979][SPARK-14453] Remove deprecated SPARK_YARN_USER_ENV ..." [8f0490e]. Also see Spark PR #17212.

From the Spark docs: the recommended way to configure the YARN user environment is via configuration properties specified in the `${SPARK_HOME}/conf/spark-defaults.conf` file or via `--conf ...` command line argument(s) to the `spark-submit` command.
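For instance, either of the following would convey an environment variable to the YARN Application Master — the `spark.yarn.appMasterEnv.[EnvironmentVariableName]` property pattern is Spark's documented mechanism, and the value shown is the Toree signal override discussed in this thread:

```
# in ${SPARK_HOME}/conf/spark-defaults.conf
spark.yarn.appMasterEnv.TOREE_ALTERNATE_SIGINT  USR2
```

or on the command line: `spark-submit --conf spark.yarn.appMasterEnv.TOREE_ALTERNATE_SIGINT=USR2 ...`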
Possible Solutions:

1. Change the `kernel.json` files to add additional `--conf spark.yarn.appMasterEnv.[EnvironmentVariableName]` parameters to the existing `spark-submit` options in `"SPARK_OPTS": "--master ..."`, but this will make the `kernel.json` files less readable and could cause problems with nested quotes, escaped quotes, etc.
2. Change our `run.sh` files to add additional `--conf spark.yarn.appMasterEnv.[EnvironmentVariableName]` parameters to the `spark-submit` command, which would allow for more flexible variable expansion, processing, quoting, etc., but may end up "hiding away" important settings.
3. Add a new "example" Spark config file alongside our kernel files that contains all the necessary configuration, and then pass it to Spark by adding `--properties-file <path-conf-file>` to the `"SPARK_OPTS": "--master ..."` variable in the `kernel.json` files. The problem with this approach is that Spark only accepts one config file, so we cannot simply "add" extra properties on top of what is defined in `${SPARK_HOME}/conf/spark-defaults.conf`.
4. Hybrid of (2) and (3): modify our `run.sh` files to dynamically compose a properties file from `${SPARK_HOME}/conf/spark-defaults.conf` plus the additional properties required, either by adding a `cat <<'EOF'` section to append properties or by concatenating the default config file with our new "example" config file. The "informed" assumption for this approach is that, if there are duplicated property keys, whichever property comes last takes precedence.

FYI, the environment variables
`PYSPARK_DRIVER_PYTHON` and `PYSPARK_PYTHON` do remain. We should start setting `PYSPARK_DRIVER_PYTHON` in our Python kernel specs.
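The "hybrid" solution (4) above could be sketched in `run.sh` roughly as follows. This is a sketch, not the actual Enterprise Gateway implementation: the temp-file naming and the appended property are illustrative, and it relies on the assumption stated above that a duplicated key appearing later in the file wins:

```shell
#!/bin/bash
# Sketch of the hybrid approach: compose a one-off properties file from
# the system-wide defaults plus kernel-specific overrides.
SPARK_DEFAULTS="${SPARK_HOME:-/usr/local/spark}/conf/spark-defaults.conf"
KERNEL_PROPS="$(mktemp /tmp/kernel-spark-props.XXXXXX)"

# Start from the system-wide defaults, if present.
if [ -f "$SPARK_DEFAULTS" ]; then
  cat "$SPARK_DEFAULTS" > "$KERNEL_PROPS"
fi

# Append kernel-specific properties; any key duplicated here comes last
# in the file, which (per the assumption above) takes precedence.
cat <<'EOF' >> "$KERNEL_PROPS"
spark.yarn.appMasterEnv.TOREE_ALTERNATE_SIGINT=USR2
EOF

# The composed file would then be handed to Spark:
#   spark-submit --properties-file "$KERNEL_PROPS" ...
```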