jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/

Initialization issue of spark context? #440

Closed ziedbouf closed 5 years ago

ziedbouf commented 6 years ago

I am setting up Jupyter Enterprise Gateway, but the initialization of the Spark context leads to errors in both client and cluster mode:

This also seems to lead to the following error on the second run:

Container: container_e03_1536582358787_0027_02_000001 on spark-worker-1.c.mozn-location.internal_45454
LogAggregationType: AGGREGATED
======================================================================================================
LogType:stdout
LogLastModifiedTime:Tue Sep 11 07:35:25 +0000 2018
LogLength:520
LogContents:
Using connection file '/tmp/kernel-a3c49386-71a3-44b8-94de-1b870914c5fb_jvq2h0jy.json' instead of '/home/elyra/.local/share/jupyter/runtime/kernel-a3c49386-71a3-44b8-94de-1b870914c5fb.json'
Signal socket bound to host: 0.0.0.0, port: 46611
Traceback (most recent call last):
  File "launch_ipykernel.py", line 319, in <module>
    lower_port, upper_port)
  File "launch_ipykernel.py", line 142, in return_connection_info
    s.connect((response_ip, response_port))
ConnectionRefusedError: [Errno 111] Connection refused

End of LogType:stdout
***********************************************************************

I am trying to set this variable through either export or directly on the kernel side:

[elyra@spark-master ~]$ cat /usr/local/share/jupyter/kernels/spark_python_yarn_client/kernel.json 
{
  "language": "python",
  "display_name": "Spark - Python (YARN Client Mode)",
  "process_proxy": {
    "class_name": "enterprise_gateway.services.processproxies.distributed.DistributedProcessProxy"
  },
  "env": {
    "SPARK_HOME": "/usr/hdp/current/spark2-client",
    "PYSPARK_PYTHON": "/opt/anaconda3/bin/python",
    "PYTHONPATH": "/opt/anaconda3/lib/python3.6/site-packages/:/usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip",
    "PYTHON_WORKER_FACTORY_SECRET": "w<X?u6I&Ekt>49n}K5kBJ^QM@Zz)Mf",
    "SPARK_OPTS": "--master yarn --deploy-mode client --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID}",
    "LAUNCH_OPTS": ""
  },
  "argv": [
    "/usr/local/share/jupyter/kernels/spark_python_yarn_client/bin/run.sh",
    "{connection_file}",
    "--RemoteProcessProxy.response-address",
    "{response_address}",
    "--RemoteProcessProxy.port-range",
    "{port_range}",
    "--RemoteProcessProxy.spark-context-initialization-mode",
    "lazy"
  ]
}

For client mode, do you think this is linked to how the launcher reads the OS environment variables for PYTHON_WORKER_FACTORY_SECRET and the Java gateway ports?

Regarding cluster mode, my understanding is that Spark's PythonRunner will automatically initialize the gateway secret key that is then used by java_gateway.

Am I missing some configuration?

kevin-bates commented 6 years ago

I suspect your issues are related to the variables not flowing to the YARN processes. While you have the variables defined in the env: stanza of kernel.json, that essentially makes those variables available only to the kernel launch - namely run.sh.

To flow these variables to the YARN application master (in cluster mode) or the workers (in client mode), you do so via other configuration settings:

In cluster mode, this is accomplished via --conf spark.yarn.appMasterEnv.<variable1>=<value1> --conf spark.yarn.appMasterEnv.<variable2>=<value2> ...
In client mode, this is accomplished via "SPARK_YARN_USER_ENV": "<variable1>=<value1>:<variable2>=<value2>:..."
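
For illustration, a sketch of what that might look like in the kernel.json env stanza (the variables and paths here are just examples pulled from this thread, not required values):

# cluster mode - flow variables to the application master:
"SPARK_OPTS": "--master yarn --deploy-mode cluster --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID} --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/opt/anaconda3/bin/python --conf spark.yarn.appMasterEnv.PYTHONPATH=/opt/anaconda3/lib/python3.6/site-packages",

# client mode - flow variables via the YARN user env:
"SPARK_YARN_USER_ENV": "PYSPARK_PYTHON=/opt/anaconda3/bin/python:PYTHONPATH=/opt/anaconda3/lib/python3.6/site-packages",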

By the way, if these represent per-user values, you can flow these from the client by adding their names to the KG_ENV_WHITELIST (comma-separated). NB2KG and Gateway will then ensure they are in the env for the kernel's launch (in this case in run.sh). Or create names prefixed with KERNEL_, then massage those values into the expected names via run.sh, etc.
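
For example (a sketch; where exactly these get set depends on your NB2KG/Enterprise Gateway deployment, and the variable names are only illustrations):

# on the client (Notebook/NB2KG) side, before starting the notebook server
export PYSPARK_GATEWAY_SECRET="<per-user-secret>"
export KG_ENV_WHITELIST="PYSPARK_GATEWAY_SECRET,PYTHON_WORKER_FACTORY_SECRET"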

cc @lresende for more specific Spark/YARN advice

ziedbouf commented 6 years ago

@kevin-bates I am using --conf spark.yarn.appMasterEnv.<variable1>=<value1> --conf spark.yarn.appMasterEnv.<variable2>=<value2>, but this leads to a timeout when spinning up the Spark context.

SPARK_OPTS": "--master yarn --deploy-mode cluster --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID} --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.appMasterEnv.PYSPARK_GATEWAY_SECRET=this_secret_key --conf spark.yarn.appMasterEnv.PYTHONUSERBASE=/opt/anaconda3 --conf spark.yarn.appMasterEnv.PYTHONPATH=/opt/anaconda3/lib/python3.6/site-packages/:/usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip --conf spark.yarn.appMasterEnv.PATH=/opt/anaconda3/bin/python:$PATH"

But this leads to a timeout. Here is the output of yarn logs -applicationId application_1536672003321_0007:

18/09/12 11:56:14 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(yarn, elyra); groups with view permissions: Set(); users  with modify permissions: Set(yarn, elyra); groups with modify permissions: Set()
18/09/12 11:56:14 INFO ApplicationMaster: Preparing Local resources
18/09/12 11:56:15 INFO ApplicationMaster: ApplicationAttemptId: appattempt_1536672003321_0007_000001
18/09/12 11:56:15 INFO ApplicationMaster: Starting the user application in a separate Thread
18/09/12 11:56:15 INFO ApplicationMaster: Waiting for spark context initialization...
18/09/12 11:57:55 ERROR ApplicationMaster: Uncaught exception: 
java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
    at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:345)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply$mcV$sp(ApplicationMaster.scala:260)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$5.run(ApplicationMaster.scala:815)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
    at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:814)
    at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:259)
    at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:839)
    at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
18/09/12 11:57:55 INFO ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: Uncaught exception: java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds])
18/09/12 11:57:55 INFO ShutdownHookManager: Shutdown hook called
(screenshot attached)

The same happens in client mode; I am still debugging to better understand the dynamics between the different components.

I think I messed up my setup :D and might need to redo things from scratch.

kevin-bates commented 6 years ago

I think you're getting closer. You might try setting the following option to provide more time to create a Spark context, although we typically have not had to do this with Python kernels: --conf spark.yarn.am.waitTime=1d.
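
For example, appended to the SPARK_OPTS you already have (a sketch based on the value in your kernel.json, keeping your other --conf entries):

"SPARK_OPTS": "--master yarn --deploy-mode cluster --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID} --conf spark.yarn.am.waitTime=1d --conf spark.yarn.submit.waitAppCompletion=false ...",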

It seems to me like you should focus on a single mode for now. I would focus on 'cluster mode' since it doesn't require the distribution of the kernelspecs to the worker nodes and is probably where you want to be in the end anyway.

These stack traces look odd to me. It's almost like you're not using the Python launcher (launch_ipykernel.py) that should reside in the scripts directory of your kernelspec.

After focusing only on cluster mode, adding the waitTime property, and reattempting, please provide the full EG log, run.sh, and the YARN cluster mode kernel.json. In addition, since you're definitely creating the YARN application, take a look at the application log files (stdout and stderr are typically the most helpful). You should see output from the launch_ipykernel.py script relating to the creation of the 5 ports, etc.

ziedbouf commented 6 years ago

I feel lost here; could you shed some light on the thought process? Here are the relevant files and logs:

[elyra@spark-master ~]$ cat /usr/local/share/jupyter/kernels/spark_python_yarn_cluster/kernel.json 
{
  "language": "python",
  "display_name": "Spark - Python (YARN Cluster Mode)",
  "process_proxy": {
    "class_name": "enterprise_gateway.services.processproxies.yarn.YarnClusterProcessProxy"
  },
  "env": {
    "SPARK_HOME": "/usr/hdp/current/spark2-client",
    "PYSPARK_PYTHON": "/opt/anaconda3/bin/python",
    "PYTHONPATH": "/opt/anaconda3/lib/python3.6/site-packages/:/usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip",
    "SPARK_OPTS": "--master yarn --deploy-mode cluster --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID} --conf spark.yarn.am.waitTime=1d --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.appMasterEnv.PYSPARK_GATEWAY_SECRET=thisjustblabalabala --conf spark.yarn.appMasterEnv.PYTHONUSERBASE=/opt/anaconda3 --conf spark.yarn.appMasterEnv.PYTHONPATH=/opt/anaconda3/lib/python3.6/site-packages/:/usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip --conf spark.yarn.appMasterEnv.PATH=/opt/anaconda3/bin/python:$PATH",
    "LAUNCH_OPTS": ""
  },
  "argv": [
    "/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/bin/run.sh",
    "{connection_file}",
    "--RemoteProcessProxy.response-address",
    "{response_address}",
    "--RemoteProcessProxy.port-range",
    "{port_range}",
    "--RemoteProcessProxy.spark-context-initialization-mode",
    "lazy"
  ]
}
[elyra@spark-master ~]$ cat /usr/local/share/jupyter/kernels/spark_python_yarn_cluster/bin/run.sh 
#!/usr/bin/env bash

export PYSPARK_GATEWAY_SECRET="w<X?u6I&Ekt>49n}K5kBJ^QM@Zz)Mf"

if [ "${EG_IMPERSONATION_ENABLED}" = "True" ]; then
        IMPERSONATION_OPTS="--proxy-user ${KERNEL_USERNAME:-UNSPECIFIED}"
        USER_CLAUSE="as user ${KERNEL_USERNAME:-UNSPECIFIED}"
else
        IMPERSONATION_OPTS=""
        USER_CLAUSE="on behalf of user ${KERNEL_USERNAME:-UNSPECIFIED}"
fi

echo
echo "Starting IPython kernel for Spark in Yarn Cluster mode ${USER_CLAUSE}"
echo

if [ -z "${SPARK_HOME}" ]; then
  echo "SPARK_HOME must be set to the location of a Spark distribution!"
  exit 1
fi

PROG_HOME="$(cd "`dirname "$0"`"/..; pwd)"

set -x
eval exec \
     "${SPARK_HOME}/bin/spark-submit" \
     "${SPARK_OPTS}" \
     "${IMPERSONATION_OPTS}" \
     "${PROG_HOME}/scripts/launch_ipykernel.py" \
     "${LAUNCH_OPTS}" \
     "$@"
set +x
[elyra@spark-master ~]$ nano  /usr/local/share/jupyter/kernels/spark_python_yarn_cluster/bin/run.sh 
[elyra@spark-master ~]$ exit
logout
[root@spark-master zied]# nano /usr/local/share/jupyter/kernels/spark_python_yarn_cluster/bin/run.sh 
[root@spark-master zied]# cat /usr/local/share/jupyter/kernels/spark_python_yarn_cluster/bin/run.sh 
#!/usr/bin/env bash

if [ "${EG_IMPERSONATION_ENABLED}" = "True" ]; then
        IMPERSONATION_OPTS="--proxy-user ${KERNEL_USERNAME:-UNSPECIFIED}"
        USER_CLAUSE="as user ${KERNEL_USERNAME:-UNSPECIFIED}"
else
        IMPERSONATION_OPTS=""
        USER_CLAUSE="on behalf of user ${KERNEL_USERNAME:-UNSPECIFIED}"
fi

echo
echo "Starting IPython kernel for Spark in Yarn Cluster mode ${USER_CLAUSE}"
echo

if [ -z "${SPARK_HOME}" ]; then
  echo "SPARK_HOME must be set to the location of a Spark distribution!"
  exit 1
fi

PROG_HOME="$(cd "`dirname "$0"`"/..; pwd)"

set -x
eval exec \
     "${SPARK_HOME}/bin/spark-submit" \
     "${SPARK_OPTS}" \
     "${IMPERSONATION_OPTS}" \
     "${PROG_HOME}/scripts/launch_ipykernel.py" \
     "${LAUNCH_OPTS}" \
     "$@"
set +x
[elyra@spark-master ~]$ tail -f /opt/elyra/log/enterprise_gateway_2018-09-12.log 
[D 2018-09-12 14:16:23.305 EnterpriseGatewayApp] Looking for jupyter_config in /opt/anaconda3/etc/jupyter
[D 2018-09-12 14:16:23.306 EnterpriseGatewayApp] Looking for jupyter_config in /home/elyra/.jupyter
[D 2018-09-12 14:16:23.306 EnterpriseGatewayApp] Looking for jupyter_config in /home/elyra
[D 2018-09-12 14:16:23.308 EnterpriseGatewayApp] Looking for jupyter_enterprise_gateway_config in /etc/jupyter
[D 2018-09-12 14:16:23.308 EnterpriseGatewayApp] Looking for jupyter_enterprise_gateway_config in /usr/local/etc/jupyter
[D 2018-09-12 14:16:23.308 EnterpriseGatewayApp] Looking for jupyter_enterprise_gateway_config in /opt/anaconda3/etc/jupyter
[D 2018-09-12 14:16:23.308 EnterpriseGatewayApp] Looking for jupyter_enterprise_gateway_config in /home/elyra/.jupyter
[D 2018-09-12 14:16:23.308 EnterpriseGatewayApp] Looking for jupyter_enterprise_gateway_config in /home/elyra
[D 180912 14:16:23 selector_events:65] Using selector: EpollSelector
[I 2018-09-12 14:16:23.325 EnterpriseGatewayApp] Jupyter Enterprise Gateway at http://0.0.0.0:8888
[D 2018-09-12 14:16:32.371 EnterpriseGatewayApp] Found kernel python3 in /opt/anaconda3/share/jupyter/kernels
[D 2018-09-12 14:16:32.371 EnterpriseGatewayApp] Found kernel ir in /opt/anaconda3/share/jupyter/kernels
[D 2018-09-12 14:16:32.371 EnterpriseGatewayApp] Found kernel spark_scala in /usr/local/share/jupyter/kernels
[D 2018-09-12 14:16:32.371 EnterpriseGatewayApp] Found kernel spark_python_yarn_cluster in /usr/local/share/jupyter/kernels
[D 2018-09-12 14:16:32.371 EnterpriseGatewayApp] Found kernel spark_python_yarn_client in /usr/local/share/jupyter/kernels
[I 180912 14:16:32 web:2106] 200 GET /api/kernelspecs (127.0.0.1) 218.21ms
[D 2018-09-12 14:16:44.069 EnterpriseGatewayApp] Found kernel python3 in /opt/anaconda3/share/jupyter/kernels
[D 2018-09-12 14:16:44.070 EnterpriseGatewayApp] Found kernel ir in /opt/anaconda3/share/jupyter/kernels
[D 2018-09-12 14:16:44.070 EnterpriseGatewayApp] Found kernel spark_scala in /usr/local/share/jupyter/kernels
[D 2018-09-12 14:16:44.070 EnterpriseGatewayApp] Found kernel spark_python_yarn_cluster in /usr/local/share/jupyter/kernels
[D 2018-09-12 14:16:44.070 EnterpriseGatewayApp] Found kernel spark_python_yarn_client in /usr/local/share/jupyter/kernels
[I 180912 14:16:44 web:2106] 200 GET /api/kernelspecs (127.0.0.1) 5.52ms
[D 2018-09-12 14:16:44.263 EnterpriseGatewayApp] RemoteMappingKernelManager.start_kernel: spark_python_yarn_cluster
[D 2018-09-12 14:16:44.273 EnterpriseGatewayApp] Instantiating kernel 'Spark - Python (YARN Cluster Mode)' with process proxy: enterprise_gateway.services.processproxies.yarn.YarnClusterProcessProxy
[D 2018-09-12 14:16:44.277 EnterpriseGatewayApp] Response socket launched on 10.132.0.4, port: 56820 using 5.0s timeout
[D 2018-09-12 14:16:44.278 EnterpriseGatewayApp] Starting kernel: ['/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/bin/run.sh', '/home/elyra/.local/share/jupyter/runtime/kernel-915100ad-6520-416c-b01d-8d7f8dd73344.json', '--RemoteProcessProxy.response-address', '10.132.0.4:56820', '--RemoteProcessProxy.port-range', '0..0', '--RemoteProcessProxy.spark-context-initialization-mode', 'lazy']
[D 2018-09-12 14:16:44.279 EnterpriseGatewayApp] Launching kernel: Spark - Python (YARN Cluster Mode) with command: ['/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/bin/run.sh', '/home/elyra/.local/share/jupyter/runtime/kernel-915100ad-6520-416c-b01d-8d7f8dd73344.json', '--RemoteProcessProxy.response-address', '10.132.0.4:56820', '--RemoteProcessProxy.port-range', '0..0', '--RemoteProcessProxy.spark-context-initialization-mode', 'lazy']
[D 2018-09-12 14:16:44.279 EnterpriseGatewayApp] BaseProcessProxy.launch_process() env: {'PATH': '/opt/anaconda3/bin:/usr/lib64/qt-3.3/bin:/usr/java/jdk1.8.0_181-amd64/bin:/usr/java/jdk1.8.0_181-amd64/jre/bin:/opt/anaconda3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/elyra/.local/bin:/home/elyra/bin', 'KERNEL_USERNAME': 'elyra', 'SPARK_HOME': '/usr/hdp/current/spark2-client', 'PYSPARK_PYTHON': '/opt/anaconda3/bin/python', 'PYTHONPATH': '/opt/anaconda3/lib/python3.6/site-packages/:/usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip', 'SPARK_OPTS': '--master yarn --deploy-mode cluster --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID} --conf spark.yarn.am.waitTime=1d --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.appMasterEnv.PYSPARK_GATEWAY_SECRET=thisjustblabalabala --conf spark.yarn.appMasterEnv.PYTHONUSERBASE=/opt/anaconda3 --conf spark.yarn.appMasterEnv.PYTHONPATH=/opt/anaconda3/lib/python3.6/site-packages/:/usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip --conf spark.yarn.appMasterEnv.PATH=/opt/anaconda3/bin/python:$PATH', 'LAUNCH_OPTS': '', 'KERNEL_GATEWAY': '1', 'EG_MIN_PORT_RANGE_SIZE': '1000', 'EG_MAX_PORT_RANGE_RETRIES': '5', 'KERNEL_ID': '915100ad-6520-416c-b01d-8d7f8dd73344', 'EG_IMPERSONATION_ENABLED': 'False'}
[D 2018-09-12 14:16:44.287 EnterpriseGatewayApp] Yarn cluster kernel launched using YARN endpoint: http://spark-master:8088/ws/v1/cluster, pid: 19827, Kernel ID: 915100ad-6520-416c-b01d-8d7f8dd73344, cmd: '['/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/bin/run.sh', '/home/elyra/.local/share/jupyter/runtime/kernel-915100ad-6520-416c-b01d-8d7f8dd73344.json', '--RemoteProcessProxy.response-address', '10.132.0.4:56820', '--RemoteProcessProxy.port-range', '0..0', '--RemoteProcessProxy.spark-context-initialization-mode', 'lazy']'

Starting IPython kernel for Spark in Yarn Cluster mode on behalf of user elyra

+ eval exec /usr/hdp/current/spark2-client/bin/spark-submit '--master yarn --deploy-mode cluster --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID} --conf spark.yarn.am.waitTime=1d --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.appMasterEnv.PYSPARK_GATEWAY_SECRET=thisjustblabalabala --conf spark.yarn.appMasterEnv.PYTHONUSERBASE=/opt/anaconda3 --conf spark.yarn.appMasterEnv.PYTHONPATH=/opt/anaconda3/lib/python3.6/site-packages/:/usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip --conf spark.yarn.appMasterEnv.PATH=/opt/anaconda3/bin/python:$PATH' '' /usr/local/share/jupyter/kernels/spark_python_yarn_cluster/scripts/launch_ipykernel.py '' /home/elyra/.local/share/jupyter/runtime/kernel-915100ad-6520-416c-b01d-8d7f8dd73344.json --RemoteProcessProxy.response-address 10.132.0.4:56820 --RemoteProcessProxy.port-range 0..0 --RemoteProcessProxy.spark-context-initialization-mode lazy
++ exec /usr/hdp/current/spark2-client/bin/spark-submit --master yarn --deploy-mode cluster --name 915100ad-6520-416c-b01d-8d7f8dd73344 --conf spark.yarn.am.waitTime=1d --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.appMasterEnv.PYSPARK_GATEWAY_SECRET=thisjustblabalabala --conf spark.yarn.appMasterEnv.PYTHONUSERBASE=/opt/anaconda3 --conf spark.yarn.appMasterEnv.PYTHONPATH=/opt/anaconda3/lib/python3.6/site-packages/:/usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip --conf spark.yarn.appMasterEnv.PATH=/opt/anaconda3/bin/python:/opt/anaconda3/bin:/usr/lib64/qt-3.3/bin:/usr/java/jdk1.8.0_181-amd64/bin:/usr/java/jdk1.8.0_181-amd64/jre/bin:/opt/anaconda3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/elyra/.local/bin:/home/elyra/bin /usr/local/share/jupyter/kernels/spark_python_yarn_cluster/scripts/launch_ipykernel.py /home/elyra/.local/share/jupyter/runtime/kernel-915100ad-6520-416c-b01d-8d7f8dd73344.json --RemoteProcessProxy.response-address 10.132.0.4:56820 --RemoteProcessProxy.port-range 0..0 --RemoteProcessProxy.spark-context-initialization-mode lazy
ls: cannot access /usr/hdp/2.6.5.0/hadoop/lib: No such file or directory
[D 2018-09-12 14:16:44.792 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '915100ad-6520-416c-b01d-8d7f8dd73344' - retrying...
[D 2018-09-12 14:16:45.296 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '915100ad-6520-416c-b01d-8d7f8dd73344' - retrying...
[D 2018-09-12 14:16:45.800 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '915100ad-6520-416c-b01d-8d7f8dd73344' - retrying...
18/09/12 14:16:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[D 2018-09-12 14:16:46.303 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '915100ad-6520-416c-b01d-8d7f8dd73344' - retrying...
[D 2018-09-12 14:16:46.806 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '915100ad-6520-416c-b01d-8d7f8dd73344' - retrying...
18/09/12 14:16:47 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
18/09/12 14:16:47 INFO RMProxy: Connecting to ResourceManager at spark-master.c.mozn-location.internal/10.132.0.4:8050
[D 2018-09-12 14:16:47.310 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '915100ad-6520-416c-b01d-8d7f8dd73344' - retrying...
18/09/12 14:16:47 INFO Client: Requesting a new application from cluster with 3 NodeManagers
18/09/12 14:16:47 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (24576 MB per container)
18/09/12 14:16:47 INFO Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
18/09/12 14:16:47 INFO Client: Setting up container launch context for our AM
18/09/12 14:16:47 INFO Client: Setting up the launch environment for our AM container
18/09/12 14:16:47 INFO Client: Preparing resources for our AM container
[D 2018-09-12 14:16:47.813 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '915100ad-6520-416c-b01d-8d7f8dd73344' - retrying...
[D 2018-09-12 14:16:48.317 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '915100ad-6520-416c-b01d-8d7f8dd73344' - retrying...
[D 2018-09-12 14:16:48.820 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '915100ad-6520-416c-b01d-8d7f8dd73344' - retrying...
18/09/12 14:16:49 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
[D 2018-09-12 14:16:49.324 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '915100ad-6520-416c-b01d-8d7f8dd73344' - retrying...
[D 2018-09-12 14:16:49.827 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '915100ad-6520-416c-b01d-8d7f8dd73344' - retrying...
[D 2018-09-12 14:16:50.331 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '915100ad-6520-416c-b01d-8d7f8dd73344' - retrying...
[D 2018-09-12 14:16:50.834 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '915100ad-6520-416c-b01d-8d7f8dd73344' - retrying...
18/09/12 14:16:51 INFO Client: Uploading resource file:/tmp/spark-c6c0b425-2acd-4b07-9df2-09731192d3d7/__spark_libs__8613768646819207724.zip -> hdfs://spark-master.c.mozn-location.internal:8020/user/elyra/.sparkStaging/application_1536672003321_0065/__spark_libs__8613768646819207724.zip
[D 2018-09-12 14:16:51.337 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '915100ad-6520-416c-b01d-8d7f8dd73344' - retrying...
[D 2018-09-12 14:16:51.841 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '915100ad-6520-416c-b01d-8d7f8dd73344' - retrying...
[D 2018-09-12 14:16:52.345 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '915100ad-6520-416c-b01d-8d7f8dd73344' - retrying...
18/09/12 14:16:52 INFO Client: Uploading resource file:/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/scripts/launch_ipykernel.py -> hdfs://spark-master.c.mozn-location.internal:8020/user/elyra/.sparkStaging/application_1536672003321_0065/launch_ipykernel.py
18/09/12 14:16:52 INFO Client: Uploading resource file:/usr/hdp/current/spark2-client/python/lib/pyspark.zip -> hdfs://spark-master.c.mozn-location.internal:8020/user/elyra/.sparkStaging/application_1536672003321_0065/pyspark.zip
18/09/12 14:16:52 INFO Client: Uploading resource file:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip -> hdfs://spark-master.c.mozn-location.internal:8020/user/elyra/.sparkStaging/application_1536672003321_0065/py4j-0.10.6-src.zip
18/09/12 14:16:52 INFO Client: Uploading resource file:/tmp/spark-c6c0b425-2acd-4b07-9df2-09731192d3d7/__spark_conf__178429387221043893.zip -> hdfs://spark-master.c.mozn-location.internal:8020/user/elyra/.sparkStaging/application_1536672003321_0065/__spark_conf__.zip
18/09/12 14:16:52 INFO SecurityManager: Changing view acls to: elyra
18/09/12 14:16:52 INFO SecurityManager: Changing modify acls to: elyra
18/09/12 14:16:52 INFO SecurityManager: Changing view acls groups to: 
18/09/12 14:16:52 INFO SecurityManager: Changing modify acls groups to: 
18/09/12 14:16:52 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(elyra); groups with view permissions: Set(); users  with modify permissions: Set(elyra); groups with modify permissions: Set()
18/09/12 14:16:52 INFO Client: Submitting application application_1536672003321_0065 to ResourceManager
18/09/12 14:16:52 INFO YarnClientImpl: Submitted application application_1536672003321_0065
18/09/12 14:16:52 INFO Client: Application report for application_1536672003321_0065 (state: ACCEPTED)
18/09/12 14:16:52 INFO Client: 
     client token: N/A
     diagnostics: [Wed Sep 12 14:16:52 +0000 2018] Application is Activated, waiting for resources to be assigned for AM.  Details : AM Partition = <DEFAULT_PARTITION> ; Partition Resource = <memory:73728, vCores:18> ; Queue's Absolute capacity = 100.0 % ; Queue's Absolute used capacity = 40.27778 % ; Queue's Absolute max capacity = 100.0 % ; 
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1536761812710
     final status: UNDEFINED
     tracking URL: http://spark-master.c.mozn-location.internal:8088/proxy/application_1536672003321_0065/
     user: elyra
18/09/12 14:16:52 INFO ShutdownHookManager: Shutdown hook called
18/09/12 14:16:52 INFO ShutdownHookManager: Deleting directory /tmp/spark-d62022bd-bf07-421f-bcc1-2f937ba5bfd0
18/09/12 14:16:52 INFO ShutdownHookManager: Deleting directory /tmp/spark-c6c0b425-2acd-4b07-9df2-09731192d3d7
[I 2018-09-12 14:16:52.849 EnterpriseGatewayApp] ApplicationID: 'application_1536672003321_0065' assigned for KernelID: '915100ad-6520-416c-b01d-8d7f8dd73344', state: ACCEPTED, 8.0 seconds after starting.
[D 2018-09-12 14:16:52.852 EnterpriseGatewayApp] 17: State: 'ACCEPTED', Host: '', KernelID: '915100ad-6520-416c-b01d-8d7f8dd73344', ApplicationID: 'application_1536672003321_0065'
[D 2018-09-12 14:16:53.357 EnterpriseGatewayApp] 18: State: 'ACCEPTED', Host: 'spark-worker-1.c.mozn-location.internal', KernelID: '915100ad-6520-416c-b01d-8d7f8dd73344', ApplicationID: 'application_1536672003321_0065'
[D 2018-09-12 14:16:58.362 EnterpriseGatewayApp] Waiting for KernelID '915100ad-6520-416c-b01d-8d7f8dd73344' to send connection info from host 'spark-worker-1.c.mozn-location.internal' - retrying...
[D 2018-09-12 14:16:58.867 EnterpriseGatewayApp] 19: State: 'ACCEPTED', Host: 'spark-worker-1.c.mozn-location.internal', KernelID: '915100ad-6520-416c-b01d-8d7f8dd73344', ApplicationID: 'application_1536672003321_0065'
[D 2018-09-12 14:16:58.867 EnterpriseGatewayApp] Received Payload 'xXseh4YIaBIjHK40EaJYxpeu0HoUetzbK7D9SGaZMbM7jCqE2Yk5ctbJsl9wlJQq/+JTW86mPhXQc3IDOcaGupugD141PZA5SNX4q/zOM/fjSQFzSAlc02fywPr3wW6TLp//ZSCJfJD5cWFX4I0y2xWJFwU7foalmKXREk52F+bgFWJ3cL5NxKML8GzaiEWRICPffVimPVG0b1UhgXyi+9ya64lFlJ9U+kpuOYqgEgkhmxstTlu/5f2u3w47CHomw1N4TqviMxM0RAiXZRfcyyIXpkF4JZzAS3ZucXaEuHDf++/XuZcdHl2Hz0ACoqF5T2/8pXKhk58l1tK81Pgjl0pcpWmXTtsaJrhHoz9FQpZ7qbLxKWd9Yt/cvrGTfpjNuxC1+olNqqMwsUMAKbjBrA=='
[D 2018-09-12 14:16:58.867 EnterpriseGatewayApp] Decrypted Payload '{"shell_port": 52748, "iopub_port": 43518, "stdin_port": 58806, "control_port": 42488, "hb_port": 50091, "ip": "0.0.0.0", "key": "afe8afab-0b1d-408e-9c69-d4812550fc4c", "transport": "tcp", "signature_scheme": "hmac-sha256", "kernel_name": "", "pid": "2617", "pgid": "2569", "comm_port": 34636}'
[D 2018-09-12 14:16:58.868 EnterpriseGatewayApp] Connect Info received from the launcher is as follows '{'shell_port': 52748, 'iopub_port': 43518, 'stdin_port': 58806, 'control_port': 42488, 'hb_port': 50091, 'ip': '0.0.0.0', 'key': 'afe8afab-0b1d-408e-9c69-d4812550fc4c', 'transport': 'tcp', 'signature_scheme': 'hmac-sha256', 'kernel_name': '', 'pid': '2617', 'pgid': '2569', 'comm_port': 34636}'
[D 2018-09-12 14:16:58.868 EnterpriseGatewayApp] Host assigned to the Kernel is: 'spark-worker-1.c.mozn-location.internal' '10.132.0.5'
[D 2018-09-12 14:16:58.868 EnterpriseGatewayApp] Established gateway communication to: 10.132.0.5:34636 for KernelID '915100ad-6520-416c-b01d-8d7f8dd73344'
[D 2018-09-12 14:16:58.868 EnterpriseGatewayApp] Updated pid to: 2617
[D 2018-09-12 14:16:58.868 EnterpriseGatewayApp] Updated pgid to: 2569
[D 2018-09-12 14:16:58.871 EnterpriseGatewayApp] Received connection info for KernelID '915100ad-6520-416c-b01d-8d7f8dd73344' from host 'spark-worker-1.c.mozn-location.internal': {'shell_port': 52748, 'iopub_port': 43518, 'stdin_port': 58806, 'control_port': 42488, 'hb_port': 50091, 'ip': '10.132.0.5', 'key': b'afe8afab-0b1d-408e-9c69-d4812550fc4c', 'transport': 'tcp', 'signature_scheme': 'hmac-sha256', 'kernel_name': '', 'comm_port': 34636}...
[D 2018-09-12 14:16:58.873 EnterpriseGatewayApp] Connecting to: tcp://10.132.0.5:42488
[D 2018-09-12 14:16:58.875 EnterpriseGatewayApp] Connecting to: tcp://10.132.0.5:43518
[I 2018-09-12 14:16:58.877 EnterpriseGatewayApp] Kernel started: 915100ad-6520-416c-b01d-8d7f8dd73344
[D 2018-09-12 14:16:58.877 EnterpriseGatewayApp] Kernel args: {'env': {'PATH': '/opt/anaconda3/bin:/usr/lib64/qt-3.3/bin:/usr/java/jdk1.8.0_181-amd64/bin:/usr/java/jdk1.8.0_181-amd64/jre/bin:/opt/anaconda3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/elyra/.local/bin:/home/elyra/bin', 'KERNEL_USERNAME': 'elyra'}, 'kernel_name': 'spark_python_yarn_cluster'}
[I 2018-09-12 14:16:58.877 EnterpriseGatewayApp] Culling kernels with idle durations > 600 seconds at 30 second intervals ...
[I 180912 14:16:58 web:2106] 201 POST /api/kernels (127.0.0.1) 14617.29ms
[I 180912 14:16:59 web:2106] 200 GET /api/kernels/915100ad-6520-416c-b01d-8d7f8dd73344 (127.0.0.1) 1.46ms
[D 2018-09-12 14:16:59.330 EnterpriseGatewayApp] Initializing websocket connection /api/kernels/915100ad-6520-416c-b01d-8d7f8dd73344/channels
[W 2018-09-12 14:16:59.332 EnterpriseGatewayApp] No session ID specified
[D 2018-09-12 14:16:59.333 EnterpriseGatewayApp] Requesting kernel info from 915100ad-6520-416c-b01d-8d7f8dd73344
[D 2018-09-12 14:16:59.333 EnterpriseGatewayApp] Connecting to: tcp://10.132.0.5:52748
[D 2018-09-12 14:16:59.342 EnterpriseGatewayApp] activity on 915100ad-6520-416c-b01d-8d7f8dd73344: status
[D 2018-09-12 14:16:59.343 EnterpriseGatewayApp] activity on 915100ad-6520-416c-b01d-8d7f8dd73344: status
[D 2018-09-12 14:16:59.343 EnterpriseGatewayApp] Received kernel info: {'status': 'ok', 'protocol_version': '5.1', 'implementation': 'ipython', 'implementation_version': '6.4.0', 'language_info': {'name': 'python', 'version': '3.6.5', 'mimetype': 'text/x-python', 'codemirror_mode': {'name': 'ipython', 'version': 3}, 'pygments_lexer': 'ipython3', 'nbconvert_exporter': 'python', 'file_extension': '.py'}, 'banner': "Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) \nType 'copyright', 'credits' or 'license' for more information\nIPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.\n", 'help_links': [{'text': 'Python Reference', 'url': 'https://docs.python.org/3.6'}, {'text': 'IPython Reference', 'url': 'https://ipython.org/documentation.html'}, {'text': 'NumPy Reference', 'url': 'https://docs.scipy.org/doc/numpy/reference/'}, {'text': 'SciPy Reference', 'url': 'https://docs.scipy.org/doc/scipy/reference/'}, {'text': 'Matplotlib Reference', 'url': 'https://matplotlib.org/contents.html'}, {'text': 'SymPy Reference', 'url': 'http://docs.sympy.org/latest/index.html'}, {'text': 'pandas Reference', 'url': 'https://pandas.pydata.org/pandas-docs/stable/'}]}
[I 2018-09-12 14:16:59.344 EnterpriseGatewayApp] Adapting to protocol v5.1 for kernel 915100ad-6520-416c-b01d-8d7f8dd73344
[I 180912 14:16:59 web:2106] 101 GET /api/kernels/915100ad-6520-416c-b01d-8d7f8dd73344/channels (127.0.0.1) 15.46ms
[D 2018-09-12 14:16:59.345 EnterpriseGatewayApp] Opening websocket /api/kernels/915100ad-6520-416c-b01d-8d7f8dd73344/channels
[D 2018-09-12 14:16:59.345 EnterpriseGatewayApp] Getting buffer for 915100ad-6520-416c-b01d-8d7f8dd73344
[D 2018-09-12 14:16:59.345 EnterpriseGatewayApp] Connecting to: tcp://10.132.0.5:52748
[D 2018-09-12 14:16:59.346 EnterpriseGatewayApp] Connecting to: tcp://10.132.0.5:43518
[D 2018-09-12 14:16:59.346 EnterpriseGatewayApp] Connecting to: tcp://10.132.0.5:58806
[D 2018-09-12 14:16:59.437 EnterpriseGatewayApp] activity on 915100ad-6520-416c-b01d-8d7f8dd73344: status
[D 2018-09-12 14:16:59.437 EnterpriseGatewayApp] activity on 915100ad-6520-416c-b01d-8d7f8dd73344: status
[D 2018-09-12 14:16:59.442 EnterpriseGatewayApp] activity on 915100ad-6520-416c-b01d-8d7f8dd73344: status
[D 2018-09-12 14:16:59.443 EnterpriseGatewayApp] activity on 915100ad-6520-416c-b01d-8d7f8dd73344: status
[D 2018-09-12 14:16:59.568 EnterpriseGatewayApp] activity on 915100ad-6520-416c-b01d-8d7f8dd73344: status
[D 2018-09-12 14:16:59.571 EnterpriseGatewayApp] activity on 915100ad-6520-416c-b01d-8d7f8dd73344: execute_input
[D 2018-09-12 14:16:59.571 EnterpriseGatewayApp] activity on 915100ad-6520-416c-b01d-8d7f8dd73344: execute_result
[D 2018-09-12 14:16:59.574 EnterpriseGatewayApp] activity on 915100ad-6520-416c-b01d-8d7f8dd73344: status
[D 2018-09-12 14:17:28.895 EnterpriseGatewayApp] Polling every 30 seconds for kernels idle > 600 seconds...
[D 2018-09-12 14:17:28.895 EnterpriseGatewayApp] kernel_id=915100ad-6520-416c-b01d-8d7f8dd73344, kernel_name=spark_python_yarn_cluster, last_activity=2018-09-12 14:16:59.574277+00:00
[D 2018-09-12 14:17:58.895 EnterpriseGatewayApp] Polling every 30 seconds for kernels idle > 600 seconds...
[D 2018-09-12 14:17:58.895 EnterpriseGatewayApp] kernel_id=915100ad-6520-416c-b01d-8d7f8dd73344, kernel_name=spark_python_yarn_cluster, last_activity=2018-09-12 14:16:59.574277+00:00
[elyra@spark-master ~]$ yarn logs -applicationId application_1536672003321_0065
18/09/12 14:17:33 INFO client.RMProxy: Connecting to ResourceManager at spark-master.c.mozn-location.internal/10.132.0.4:8050
18/09/12 14:17:33 INFO client.AHSProxy: Connecting to Application History server at spark-master.c.mozn-location.internal/10.132.0.4:10200
Container: container_e06_1536672003321_0065_01_000001 on spark-worker-1.c.mozn-location.internal:45454
LogAggregationType: LOCAL
======================================================================================================
LogType:directory.info
LogLastModifiedTime:Wed Sep 12 14:16:54 +0000 2018
LogLength:34810
LogContents:
ls -l:
total 20
-rw-r--r-- 1 yarn hadoop   69 Sep 12 14:16 container_tokens
-rwx------ 1 yarn hadoop  654 Sep 12 14:16 default_container_executor_session.sh
-rwx------ 1 yarn hadoop  708 Sep 12 14:16 default_container_executor.sh
-rwx------ 1 yarn hadoop 6385 Sep 12 14:16 launch_container.sh
lrwxrwxrwx 1 yarn hadoop   68 Sep 12 14:16 launch_ipykernel.py -> /hadoop/yarn/local/usercache/elyra/filecache/399/launch_ipykernel.py
lrwxrwxrwx 1 yarn hadoop   68 Sep 12 14:16 py4j-0.10.6-src.zip -> /hadoop/yarn/local/usercache/elyra/filecache/396/py4j-0.10.6-src.zip
lrwxrwxrwx 1 yarn hadoop   60 Sep 12 14:16 pyspark.zip -> /hadoop/yarn/local/usercache/elyra/filecache/395/pyspark.zip
lrwxrwxrwx 1 yarn hadoop   67 Sep 12 14:16 __spark_conf__ -> /hadoop/yarn/local/usercache/elyra/filecache/398/__spark_conf__.zip
lrwxrwxrwx 1 yarn hadoop   86 Sep 12 14:16 __spark_libs__ -> /hadoop/yarn/local/usercache/elyra/filecache/397/__spark_libs__8613768646819207724.zip
drwx--x--- 2 yarn hadoop    6 Sep 12 14:16 tmp
find -L . -maxdepth 5 -ls:
402660736    4 drwx--x---   3 yarn     hadoop       4096 Sep 12 14:16 .
419437566    0 drwx--x---   2 yarn     hadoop          6 Sep 12 14:16 ./tmp
402660737    4 -rw-r--r--   1 yarn     hadoop         69 Sep 12 14:16 ./container_tokens
402660738    4 -rw-r--r--   1 yarn     hadoop         12 Sep 12 14:16 ./.container_tokens.crc
402660739    8 -rwx------   1 yarn     hadoop       6385 Sep 12 14:16 ./launch_container.sh
402660740    4 -rw-r--r--   1 yarn     hadoop         60 Sep 12 14:16 ./.launch_container.sh.crc
402660741    4 -rwx------   1 yarn     hadoop        654 Sep 12 14:16 ./default_container_executor_session.sh
402660742    4 -rw-r--r--   1 yarn     hadoop         16 Sep 12 14:16 ./.default_container_executor_session.sh.crc
402660743    4 -rwx------   1 yarn     hadoop        708 Sep 12 14:16 ./default_container_executor.sh
402660744    4 -rw-r--r--   1 yarn     hadoop         16 Sep 12 14:16 ./.default_container_executor.sh.crc
285231561  532 -r-x------   1 yarn     hadoop     541536 Sep 12 14:16 ./pyspark.zip
327202114   16 drwx------   2 yarn     hadoop      12288 Sep 12 14:16 ./__spark_libs__
327202115  176 -r-x------   1 yarn     hadoop     178947 Sep 12 14:16 ./__spark_libs__/hk2-api-2.4.0-b34.jar
327202116   20 -r-x------   1 yarn     hadoop      16993 Sep 12 14:16 ./__spark_libs__/JavaEWAH-0.3.2.jar
327202117   96 -r-x------   1 yarn     hadoop      96221 Sep 12 14:16 ./__spark_libs__/commons-pool-1.5.4.jar
327202118  200 -r-x------   1 yarn     hadoop     201928 Sep 12 14:16 ./__spark_libs__/RoaringBitmap-0.5.11.jar
327202119  180 -r-x------   1 yarn     hadoop     181271 Sep 12 14:16 ./__spark_libs__/hk2-locator-2.4.0-b34.jar
327202120  232 -r-x------   1 yarn     hadoop     236660 Sep 12 14:16 ./__spark_libs__/ST4-4.0.4.jar
327202121   80 -r-x------   1 yarn     hadoop      79845 Sep 12 14:16 ./__spark_libs__/compress-lzf-1.0.3.jar
327202122   68 -r-x------   1 yarn     hadoop      69409 Sep 12 14:16 ./__spark_libs__/activation-1.1.1.jar
327202123  164 -r-x------   1 yarn     hadoop     164422 Sep 12 14:16 ./__spark_libs__/core-1.1.2.jar
327202124  128 -r-x------   1 yarn     hadoop     130802 Sep 12 14:16 ./__spark_libs__/aircompressor-0.8.jar
327202125  120 -r-x------   1 yarn     hadoop     118973 Sep 12 14:16 ./__spark_libs__/hk2-utils-2.4.0-b34.jar
327202126  436 -r-x------   1 yarn     hadoop     445288 Sep 12 14:16 ./__spark_libs__/antlr-2.7.7.jar
327202127   68 -r-x------   1 yarn     hadoop      69500 Sep 12 14:16 ./__spark_libs__/curator-client-2.7.1.jar
327202128  164 -r-x------   1 yarn     hadoop     164368 Sep 12 14:16 ./__spark_libs__/antlr-runtime-3.4.jar
327202129  184 -r-x------   1 yarn     hadoop     186273 Sep 12 14:16 ./__spark_libs__/curator-framework-2.7.1.jar
327202130  328 -r-x------   1 yarn     hadoop     334662 Sep 12 14:16 ./__spark_libs__/antlr4-runtime-4.7.jar
.....
327201763   32 -r-x------   1 yarn     hadoop      30108 Sep 12 14:16 ./__spark_libs__/spark-sketch_2.11-2.3.0.2.6.5.0-292.jar
327201764 8500 -r-x------   1 yarn     hadoop    8701418 Sep 12 14:16 ./__spark_libs__/spark-sql_2.11-2.3.0.2.6.5.0-292.jar
327201765 2120 -r-x------   1 yarn     hadoop    2170500 Sep 12 14:16 ./__spark_libs__/spark-streaming_2.11-2.3.0.2.6.5.0-292.jar
377489393   16 -r-x------   1 yarn     hadoop      13867 Sep 12 14:16 ./launch_ipykernel.py
302030326   80 -r-x------   1 yarn     hadoop      80352 Sep 12 14:16 ./py4j-0.10.6-src.zip
352325558    0 drwx------   3 yarn     hadoop        145 Sep 12 14:16 ./__spark_conf__
352325561    4 -r-x------   1 yarn     hadoop       1240 Sep 12 14:16 ./__spark_conf__/log4j.properties
352325562    8 -r-x------   1 yarn     hadoop       4956 Sep 12 14:16 ./__spark_conf__/metrics.properties
360727180    4 drwx------   2 yarn     hadoop       4096 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__
360727181    4 -r-x------   1 yarn     hadoop       2359 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/topology_script.py.backup
360727182    8 -r-x------   1 yarn     hadoop       7024 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/mapred-site.xml
360727183    8 -r-x------   1 yarn     hadoop       6355 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/hadoop-env.sh
360727184   12 -r-x------   1 yarn     hadoop      10449 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/log4j.properties
360727185    4 -r-x------   1 yarn     hadoop       2509 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/hadoop-metrics2.properties
360727186   20 -r-x------   1 yarn     hadoop      19415 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/yarn-site.xml
360727187    4 -r-x------   1 yarn     hadoop       3979 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/hadoop-env.cmd
360727188    4 -r-x------   1 yarn     hadoop          1 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/yarn.exclude
360727189    4 -r-x------   1 yarn     hadoop          1 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/dfs.exclude
360727190    8 -r-x------   1 yarn     hadoop       4273 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/core-site.xml
360727191    4 -r-x------   1 yarn     hadoop        244 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/spark-thrift-fairscheduler.xml
360727192    4 -r-x------   1 yarn     hadoop       1631 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/kms-log4j.properties
360727193    4 -r-x------   1 yarn     hadoop       2250 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/yarn-env.cmd
360727194    4 -r-x------   1 yarn     hadoop        884 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/ssl-client.xml
360727195    4 -r-x------   1 yarn     hadoop       2035 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/capacity-scheduler.xml
360727196    4 -r-x------   1 yarn     hadoop       3518 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/kms-acls.xml
360727197    4 -r-x------   1 yarn     hadoop       2358 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/topology_script.py
360727198    4 -r-x------   1 yarn     hadoop        758 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/mapred-site.xml.template
360727199    4 -r-x------   1 yarn     hadoop       1335 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/configuration.xsl
360728640    8 -r-x------   1 yarn     hadoop       5327 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/yarn-env.sh
360728641    8 -r-x------   1 yarn     hadoop       6909 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/hdfs-site.xml
360728642    4 -r-x------   1 yarn     hadoop       2319 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/gcs-connector-key.json
360728643    4 -r-x------   1 yarn     hadoop       1020 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/commons-logging.properties
360728644    4 -r-x------   1 yarn     hadoop       1019 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/container-executor.cfg
360728645    8 -r-x------   1 yarn     hadoop       4221 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/task-log4j.properties
360728646    4 -r-x------   1 yarn     hadoop       2490 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/hadoop-metrics.properties
360728647    4 -r-x------   1 yarn     hadoop        818 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/mapred-env.sh
360728648    4 -r-x------   1 yarn     hadoop       1602 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/health_check
360728649    4 -r-x------   1 yarn     hadoop        752 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/hive-site.xml
360728650    4 -r-x------   1 yarn     hadoop       2316 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/ssl-client.xml.example
360728651    4 -r-x------   1 yarn     hadoop       1527 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/kms-env.sh
360728652    4 -r-x------   1 yarn     hadoop       1308 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/hadoop-policy.xml
360728653    4 -r-x------   1 yarn     hadoop        119 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/slaves
360728654    4 -r-x------   1 yarn     hadoop        254 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/topology_mappings.data
360728655    4 -r-x------   1 yarn     hadoop       1000 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/ssl-server.xml
360728656    4 -r-x------   1 yarn     hadoop        951 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/mapred-env.cmd
360728657    4 -r-x------   1 yarn     hadoop       2697 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/ssl-server.xml.example
360728658    4 -r-x------   1 yarn     hadoop        945 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/taskcontroller.cfg
360728659    8 -r-x------   1 yarn     hadoop       5511 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/kms-site.xml
360728660    8 -r-x------   1 yarn     hadoop       4113 Sep 12 14:16 ./__spark_conf__/__hadoop_conf__/mapred-queues.xml.template
352325563  124 -r-x------   1 yarn     hadoop     123087 Sep 12 14:16 ./__spark_conf__/__spark_hadoop_conf__.xml
352325564    4 -r-x------   1 yarn     hadoop       2575 Sep 12 14:16 ./__spark_conf__/__spark_conf__.properties
broken symlinks(find -L . -maxdepth 5 -type l -ls):
End of LogType:directory.info.This log file belongs to a running container (container_e06_1536672003321_0065_01_000001) and so may not be complete.
*******************************************************************************

End of LogType:prelaunch.err.This log file belongs to a running container (container_e06_1536672003321_0065_01_000001) and so may not be complete.
******************************************************************************

Container: container_e06_1536672003321_0065_01_000001 on spark-worker-1.c.mozn-location.internal:45454
LogAggregationType: LOCAL
======================================================================================================
LogType:stdout
LogLastModifiedTime:Wed Sep 12 14:16:58 +0000 2018
LogLength:1596
LogContents:
Using connection file '/tmp/kernel-915100ad-6520-416c-b01d-8d7f8dd73344_chknmm6h.json' instead of '/home/elyra/.local/share/jupyter/runtime/kernel-915100ad-6520-416c-b01d-8d7f8dd73344.json'
Signal socket bound to host: 0.0.0.0, port: 34636
JSON Payload 'b'{"shell_port": 52748, "iopub_port": 43518, "stdin_port": 58806, "control_port": 42488, "hb_port": 50091, "ip": "0.0.0.0", "key": "afe8afab-0b1d-408e-9c69-d4812550fc4c", "transport": "tcp", "signature_scheme": "hmac-sha256", "kernel_name": "", "pid": "2617", "pgid": "2569", "comm_port": 34636}'
Encrypted Payload 'b'xXseh4YIaBIjHK40EaJYxpeu0HoUetzbK7D9SGaZMbM7jCqE2Yk5ctbJsl9wlJQq/+JTW86mPhXQc3IDOcaGupugD141PZA5SNX4q/zOM/fjSQFzSAlc02fywPr3wW6TLp//ZSCJfJD5cWFX4I0y2xWJFwU7foalmKXREk52F+bgFWJ3cL5NxKML8GzaiEWRICPffVimPVG0b1UhgXyi+9ya64lFlJ9U+kpuOYqgEgkhmxstTlu/5f2u3w47CHomw1N4TqviMxM0RAiXZRfcyyIXpkF4JZzAS3ZucXaEuHDf++/XuZcdHl2Hz0ACoqF5T2/8pXKhk58l1tK81Pgjl0pcpWmXTtsaJrhHoz9FQpZ7qbLxKWd9Yt/cvrGTfpjNuxC1+olNqqMwsUMAKbjBrA=='
/opt/anaconda3/lib/python3.6/site-packages/IPython/paths.py:68: UserWarning: IPython parent '/home' is not a writable location, using a temp directory.
  " using a temp directory.".format(parent))
NOTE: When using the `ipython kernel` entry point, Ctrl-C will not work.

To exit, you will have to explicitly quit this process, by either sending
"quit" from a client, or using Ctrl-\ in UNIX-like environments.

To read more about this, see https://github.com/ipython/ipython/issues/2049

To connect another client to this kernel, use:
    --existing /tmp/kernel-915100ad-6520-416c-b01d-8d7f8dd73344_chknmm6h.json
End of LogType:stdout.This log file belongs to a running container (container_e06_1536672003321_0065_01_000001) and so may not be complete.
***********************************************************************

Container: container_e06_1536672003321_0065_01_000001 on spark-worker-1.c.mozn-location.internal:45454
LogAggregationType: LOCAL
======================================================================================================
LogType:stderr
LogLastModifiedTime:Wed Sep 12 14:16:57 +0000 2018
LogLength:1649
LogContents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/yarn/local/usercache/elyra/filecache/397/__spark_libs__8613768646819207724.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.6.5.0-292/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/09/12 14:16:55 INFO SignalUtils: Registered signal handler for TERM
18/09/12 14:16:55 INFO SignalUtils: Registered signal handler for HUP
18/09/12 14:16:55 INFO SignalUtils: Registered signal handler for INT
18/09/12 14:16:55 INFO SecurityManager: Changing view acls to: yarn,elyra
18/09/12 14:16:55 INFO SecurityManager: Changing modify acls to: yarn,elyra
18/09/12 14:16:55 INFO SecurityManager: Changing view acls groups to: 
18/09/12 14:16:55 INFO SecurityManager: Changing modify acls groups to: 
18/09/12 14:16:55 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(yarn, elyra); groups with view permissions: Set(); users  with modify permissions: Set(yarn, elyra); groups with modify permissions: Set()
18/09/12 14:16:56 INFO ApplicationMaster: Preparing Local resources
18/09/12 14:16:57 INFO ApplicationMaster: ApplicationAttemptId: appattempt_1536672003321_0065_000001
18/09/12 14:16:57 INFO ApplicationMaster: Starting the user application in a separate Thread
18/09/12 14:16:57 INFO ApplicationMaster: Waiting for spark context initialization...
End of LogType:stderr.This log file belongs to a running container (container_e06_1536672003321_0065_01_000001) and so may not be complete.
***********************************************************************

Container: container_e06_1536672003321_0065_01_000001 on spark-worker-1.c.mozn-location.internal:45454
LogAggregationType: LOCAL
======================================================================================================
LogType:prelaunch.out
LogLastModifiedTime:Wed Sep 12 14:16:54 +0000 2018
LogLength:100
LogContents:
Setting up env variables
Setting up job resources
Copying debugging information
Launching container
End of LogType:prelaunch.out.This log file belongs to a running container (container_e06_1536672003321_0065_01_000001) and so may not be complete.
******************************************************************************

Container: container_e06_1536672003321_0065_01_000001 on spark-worker-1.c.mozn-location.internal:45454
LogAggregationType: LOCAL
======================================================================================================
LogType:launch_container.sh
LogLastModifiedTime:Wed Sep 12 14:16:54 +0000 2018
LogLength:6385
LogContents:
#!/bin/bash

set -o pipefail -e
export PRELAUNCH_OUT="/hadoop/yarn/log/application_1536672003321_0065/container_e06_1536672003321_0065_01_000001/prelaunch.out"
exec >"${PRELAUNCH_OUT}"
export PRELAUNCH_ERR="/hadoop/yarn/log/application_1536672003321_0065/container_e06_1536672003321_0065_01_000001/prelaunch.err"
exec 2>"${PRELAUNCH_ERR}"
echo "Setting up env variables"
export SPARK_YARN_STAGING_DIR="hdfs://spark-master.c.mozn-location.internal:8020/user/elyra/.sparkStaging/application_1536672003321_0065"
export PATH="/opt/anaconda3/bin/python:/opt/anaconda3/bin:/usr/lib64/qt-3.3/bin:/usr/java/jdk1.8.0_181-amd64/bin:/usr/java/jdk1.8.0_181-amd64/jre/bin:/opt/anaconda3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/elyra/.local/bin:/home/elyra/bin"
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/usr/hdp/2.6.5.0-292/hadoop/conf"}
export MAX_APP_ATTEMPTS="2"
export JAVA_HOME=${JAVA_HOME:-"/usr/java/jdk1.8.0_181-amd64"}
export LANG="en_US.UTF-8"
export APP_SUBMIT_TIME_ENV="1536761812710"
export NM_HOST="spark-worker-1.c.mozn-location.internal"
export PYSPARK_PYTHON="/opt/anaconda3/bin/python"
export LOGNAME="elyra"
export JVM_PID="$$"
export PWD="/hadoop/yarn/local/usercache/elyra/appcache/application_1536672003321_0065/container_e06_1536672003321_0065_01_000001"
export PYTHONHASHSEED="0"
export LOCAL_DIRS="/hadoop/yarn/local/usercache/elyra/appcache/application_1536672003321_0065"
export PYTHONPATH="/opt/anaconda3/lib/python3.6/site-packages/:/usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip:$PWD/pyspark.zip:$PWD/py4j-0.10.6-src.zip"
export APPLICATION_WEB_PROXY_BASE="/proxy/application_1536672003321_0065"
export NM_HTTP_PORT="8042"
export LOG_DIRS="/hadoop/yarn/log/application_1536672003321_0065/container_e06_1536672003321_0065_01_000001"
export NM_AUX_SERVICE_mapreduce_shuffle="AAA0+gAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
"
export NM_PORT="45454"
export PYSPARK_GATEWAY_SECRET="thisjustblabalabala"
export USER="elyra"
export HADOOP_YARN_HOME=${HADOOP_YARN_HOME:-"/usr/hdp/2.6.5.0-292/hadoop-yarn"}
export CLASSPATH="$PWD:$PWD/__spark_conf__:$PWD/__spark_libs__/*:/usr/hdp/2.6.5.0-292/hadoop/conf:/usr/hdp/2.6.5.0-292/hadoop/*:/usr/hdp/2.6.5.0-292/hadoop/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:/usr/hdp/current/ext/hadoop/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/2.6.5.0/hadoop/lib/hadoop-lzo-0.6.0.2.6.5.0.jar:/etc/hadoop/conf/secure:/usr/hdp/current/ext/hadoop/*:$PWD/__spark_conf__/__hadoop_conf__"
export HADOOP_TOKEN_FILE_LOCATION="/hadoop/yarn/local/usercache/elyra/appcache/application_1536672003321_0065/container_e06_1536672003321_0065_01_000001/container_tokens"
export NM_AUX_SERVICE_spark_shuffle=""
export SPARK_USER="elyra"
export LOCAL_USER_DIRS="/hadoop/yarn/local/usercache/elyra/"
export HADOOP_HOME="/usr/hdp/2.6.5.0-292/hadoop"
export PYTHONUSERBASE="/opt/anaconda3"
export HOME="/home/"
export NM_AUX_SERVICE_spark2_shuffle=""
export CONTAINER_ID="container_e06_1536672003321_0065_01_000001"
export MALLOC_ARENA_MAX="4"
echo "Setting up job resources"
ln -sf "/hadoop/yarn/local/usercache/elyra/filecache/395/pyspark.zip" "pyspark.zip"
ln -sf "/hadoop/yarn/local/usercache/elyra/filecache/397/__spark_libs__8613768646819207724.zip" "__spark_libs__"
ln -sf "/hadoop/yarn/local/usercache/elyra/filecache/399/launch_ipykernel.py" "launch_ipykernel.py"
ln -sf "/hadoop/yarn/local/usercache/elyra/filecache/396/py4j-0.10.6-src.zip" "py4j-0.10.6-src.zip"
ln -sf "/hadoop/yarn/local/usercache/elyra/filecache/398/__spark_conf__.zip" "__spark_conf__"
echo "Copying debugging information"
# Creating copy of launch script
cp "launch_container.sh" "/hadoop/yarn/log/application_1536672003321_0065/container_e06_1536672003321_0065_01_000001/launch_container.sh"
chmod 640 "/hadoop/yarn/log/application_1536672003321_0065/container_e06_1536672003321_0065_01_000001/launch_container.sh"
# Determining directory contents
echo "ls -l:" 1>"/hadoop/yarn/log/application_1536672003321_0065/container_e06_1536672003321_0065_01_000001/directory.info"
ls -l 1>>"/hadoop/yarn/log/application_1536672003321_0065/container_e06_1536672003321_0065_01_000001/directory.info"
echo "find -L . -maxdepth 5 -ls:" 1>>"/hadoop/yarn/log/application_1536672003321_0065/container_e06_1536672003321_0065_01_000001/directory.info"
find -L . -maxdepth 5 -ls 1>>"/hadoop/yarn/log/application_1536672003321_0065/container_e06_1536672003321_0065_01_000001/directory.info"
echo "broken symlinks(find -L . -maxdepth 5 -type l -ls):" 1>>"/hadoop/yarn/log/application_1536672003321_0065/container_e06_1536672003321_0065_01_000001/directory.info"
find -L . -maxdepth 5 -type l -ls 1>>"/hadoop/yarn/log/application_1536672003321_0065/container_e06_1536672003321_0065_01_000001/directory.info"
echo "Launching container"
exec /bin/bash -c "LD_LIBRARY_PATH="/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64:$LD_LIBRARY_PATH" $JAVA_HOME/bin/java -server -Xmx1024m -Djava.io.tmpdir=$PWD/tmp -Dhdp.version=2.6.5.0 -Dspark.yarn.app.container.log.dir=/hadoop/yarn/log/application_1536672003321_0065/container_e06_1536672003321_0065_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class 'org.apache.spark.deploy.PythonRunner' --primary-py-file launch_ipykernel.py --arg '/home/elyra/.local/share/jupyter/runtime/kernel-915100ad-6520-416c-b01d-8d7f8dd73344.json' --arg '--RemoteProcessProxy.response-address' --arg '10.132.0.4:56820' --arg '--RemoteProcessProxy.port-range' --arg '0..0' --arg '--RemoteProcessProxy.spark-context-initialization-mode' --arg 'lazy' --properties-file $PWD/__spark_conf__/__spark_conf__.properties 1> /hadoop/yarn/log/application_1536672003321_0065/container_e06_1536672003321_0065_01_000001/stdout 2> /hadoop/yarn/log/application_1536672003321_0065/container_e06_1536672003321_0065_01_000001/stderr"
End of LogType:launch_container.sh.This log file belongs to a running container (container_e06_1536672003321_0065_01_000001) and so may not be complete.
************************************************************************************

@kevin-bates both exist within the logging:

I agree with the log stack I got last time; it was a mistake on my side from playing with some configuration related to PYSPARK_DRIVER_PYTHON=ipython.

lresende commented 6 years ago

@ziedbouf Thanks for your patience working through these issues. I suspect you are trying to use HDP, which only supports Python 2.7.x, with Anaconda 3.6 and the Py4j from that Anaconda env. I would recommend trying a vanilla environment that uses only the HDP environment, as we describe here, so that PySpark and Py4j come from the HDP distribution, at least to rule out a Python mismatch or a Py4j incompatibility.

kevin-bates commented 6 years ago

You're extremely close. So the spark context never finishes initialization? Are you able to run a pyspark app outside of EG?

At any rate, let me explain how kernel launching works. We (EG) essentially leverage the base framework so it helps to know, in general, how kernel launches work.

When jupyter gets a request to start a kernel, the user provides the name of the kernelspec directory that will be used. In there, jupyter expects to find a kernel.json file that contains, at a minimum, 'display_name' and 'argv' entries. The display name is what Notebook uses in the kernels list. argv is essentially the command that is run. You'll notice that it takes a connection file name. (Btw, the curly-braced values in the argv are substitutions that are filled by Jupyter and EG.) For remote kernels, we also provide a response-address parameter as well as potential port-range specifications. The important one is the response-address.

Prior to invoking the command specified by argv, the jupyter framework will add each entry in the env stanza to the environment which will be in place when the argv command is performed. EG will also add any KERNEL_ values to the env as well - the main ones being KERNEL_ID and KERNEL_USERNAME.

Spark-based kernel launches typically require more massaging and parameter setup, so they use a run.sh script which does that kind of stuff. The thing to point out in run.sh is that the kernel launcher script (launch_ipykernel.py) is what is passed to spark-submit. This is because we want the launcher to be the kernel process. Since the launcher embeds the target kernel (which is why launchers are typically written in the same language as the kernel), it can perform communication with the kernel itself. This is how we can get away with not requiring kernel updates in order for EG to use a given kernel. The launcher, when started, creates 5 local ports and constructs the equivalent of a connection file. It also constructs a 6th port that it listens on for out-of-band commands from EG. This information is then returned to the response-address, where EG is listening to receive that information. Your logs indicate all of that is happening fine.
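
To make that handshake concrete, here is a minimal sketch (mirroring but not reproducing the actual launch_ipykernel.py) of how a launcher could pick its ports and return the connection information to the response address. The return_connection_info name comes from the traceback shown earlier; the helper, the exact field names, and the wire format are assumptions for illustration only.

import json
import socket

def _bind_random_port():
    # Bind to port 0 so the OS assigns a free port. EG would honor a configured
    # port-range here instead; this sketch also keeps the sockets open, whereas
    # the real launcher releases the ports for the embedded kernel to bind.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("0.0.0.0", 0))
    return s, s.getsockname()[1]

def return_connection_info(response_ip, response_port):
    # Five kernel ports plus the 6th "comm" port for out-of-band commands.
    sockets, ports = zip(*[_bind_random_port() for _ in range(6)])
    info = {
        "shell_port": ports[0], "iopub_port": ports[1], "stdin_port": ports[2],
        "control_port": ports[3], "hb_port": ports[4], "comm_port": ports[5],
        "ip": "0.0.0.0", "transport": "tcp",
    }
    # Connect back to the gateway; a ConnectionRefusedError here is exactly the
    # failure seen in the YARN stdout logs when EG is not reachable at that address.
    with socket.create_connection((response_ip, response_port), timeout=10) as conn:
        conn.sendall(json.dumps(info).encode("utf-8"))
    return info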

Once the kernel has started, communication occurs between EG and the kernel directly, except in the cases for kernel interrupts, restarts, or shutdown. In those cases, EG sends a message to the 6th port that is listened to by the launcher. The launcher then performs the action (interrupt or shutdown) by signalling the embedded kernel thread directly.
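
As a rough sketch of that control path (the message format below, e.g. {"signum": 2}, is hypothetical and not necessarily what EG actually sends), the launcher side could look like this:

import json
import os

def listen_for_control_messages(comm_socket):
    # comm_socket is the already-bound "6th port"; EG connects to it when it
    # needs to interrupt or shut down the remote kernel.
    comm_socket.listen(1)
    while True:
        conn, _ = comm_socket.accept()
        with conn:
            request = json.loads(conn.recv(4096).decode("utf-8"))
        signum = request.get("signum", 0)
        if signum:
            # Forward the signal to this process, which hosts the embedded kernel.
            os.kill(os.getpid(), signum)
        if request.get("shutdown"):
            break

The launcher would run this on a daemon thread so the embedded kernel itself is never blocked.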

Another thing the launcher does, based on the --RemoteProcessProxy.spark-context-initialization-mode parameter, is create a spark context (or not, if the value is 'none'). This typically takes a few seconds.
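
For illustration, a 'lazy' initialization might look roughly like the sketch below, assuming pyspark is importable on the driver; the real launcher's handling of the various modes may differ in detail.

import threading

spark = None

def initialize_spark_session():
    # Created in the background so the kernel can begin serving requests while
    # the (often slow) context/session creation is still in flight.
    global spark
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()

def maybe_start_spark(init_mode):
    if init_mode == "none":
        return None
    thread = threading.Thread(target=initialize_spark_session, daemon=True)
    thread.start()
    return thread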

So, we essentially leverage the complete framework for launching kernels. Where we diverge is that EG recognizes the process_proxy stanza in the kernelspec. When it starts the kernel, it will use an instance of the process-proxy class to control the lifecycle of the kernel - which essentially abstracts the process member variable used by the framework. This is how we can support various resource managers in a pluggable way.
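
Purely as an illustration of that abstraction (the class and method names below are hypothetical, not EG's actual interface), a process proxy is essentially a stand-in for the Popen object the framework would otherwise hold:

class HypotheticalYarnProcessProxy:
    """Stand-in for the 'process' the Jupyter framework thinks it launched."""

    def launch_process(self, kernel_cmd, **kwargs):
        # Launch kernel_cmd (run.sh -> spark-submit) and remember how to find
        # the resulting YARN application.
        self.application_id = None  # discovered by querying the resource manager
        return self

    def poll(self):
        # None while the remote application is still running, an exit code otherwise.
        return None

    def send_signal(self, signum):
        # Interrupts are relayed to the launcher's comm port rather than signalled locally.
        pass

    def kill(self):
        # Ask the resource manager to terminate the application.
        pass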

We also add the various RemoteProcessProxy parameters to help facilitate remote behavior and other enterprise kinds of things - like port-range restrictions, etc.

I would recommend taking a look at the System Architecture section of the docs.

ziedbouf commented 6 years ago

Thanks @kevin-bates, it's starting to make sense for me. So, just to answer your question regarding running pyspark outside EG: yes, I am using Zeppelin in parallel and it's working fine with the same Python path /opt/anaconda3/bin/python.

@lresende I agree on this, as I made some modifications to get things running on Python 3, starting with fixing the topology_script.py.

In case I go with Python 2, do you think the following kernel.json is fine?

[elyra@spark-master ~]$ cat /usr/local/share/jupyter/kernels/spark_python_yarn_cluster/kernel.json 
{
  "language": "python",
  "display_name": "Spark - Python (YARN Cluster Mode)",
  "process_proxy": {
    "class_name": "enterprise_gateway.services.processproxies.yarn.YarnClusterProcessProxy"
  },
  "env": {
    "SPARK_HOME": "/usr/hdp/current/spark2-client",
    "PYSPARK_PYTHON": "/opt/anaconda3/bin/python",
    "PYTHONPATH": "/lib/python2.7/site-packages/:/usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip",
    "SPARK_OPTS": "--master yarn --deploy-mode cluster --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID} --conf spark.yarn.am.waitTime=1d --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.appMasterEnv.PYSPARK_GATEWAY_SECRET=thisjustblabalabala --conf spark.yarn.appMasterEnv.PYTHONUSERBASE=/opt/anaconda3 --conf spark.yarn.appMasterEnv.PYTHONPATH=/lib/python2.7/site-packages/:/usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip --conf spark.yarn.appMasterEnv.PATH= /opt/anaconda3/bin/python:$PATH",
    "LAUNCH_OPTS": ""
  },
  "argv": [
    "/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/bin/run.sh",
    "{connection_file}",
    "--RemoteProcessProxy.response-address",
    "{response_address}",
    "--RemoteProcessProxy.port-range",
    "{port_range}",
    "--RemoteProcessProxy.spark-context-initialization-mode",
    "lazy"
  ]
}

Also, does this mean that our scripts should be written in Python 2, or can I pass a variable to specify which execution environment to use, similar to Zeppelin's zeppelin.pyspark.python = /opt/anaconda3/bin/python?

lresende commented 6 years ago

@ziedbouf I would still modify it to use Anaconda 2: "PYSPARK_PYTHON": "/opt/anaconda2/bin/python"

ziedbouf commented 6 years ago

In this case I should go through the installation of Anaconda 2, as I didn't install it in the first place. I will go through it. Just one more question: since I still run Python 3 in my notebooks, must I write everything in py3 instead of py2? If so, which environment variable do you advise configuring?

lresende commented 6 years ago

I want to start adding customizations from a working environment, and in HDP, which is based on Python 2.x, the vanilla configuration should work. After that, we can start introducing customizations, such as adding Anaconda 3, and validate that it still works. The issue might end up being a limitation of HDP, which in this case would be a Python version mismatch.

ziedbouf commented 6 years ago

Sorry, I closed the issue by mistake. So @lresende, here is the first run using the default configuration with Python 2 - kernel.json:

{
  "language": "python",
  "display_name": "Spark - Python (YARN Cluster Mode)",
  "process_proxy": {
    "class_name": "enterprise_gateway.services.processproxies.yarn.YarnClusterProcessProxy"
  },
  "env": {
    "SPARK_HOME": "/usr/hdp/current/spark2-client",
    "PYSPARK_PYTHON": "/opt/anaconda2/bin/python",
    "PYTHONPATH": "/opt/anaconda2/lib/python2.7/site-packages:/usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip",
    "SPARK_OPTS": "--master yarn --deploy-mode cluster --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID} --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.appMasterEnv.PYTHONUSERBASE=/home/yarn/.local --conf spark.yarn.appMasterEnv.PYTHONPATH=/opt/anaconda2/lib/python2.7/site-packages:/usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip --conf spark.yarn.appMasterEnv.PATH=/opt/anaconda2/bin:$PATH",
    "LAUNCH_OPTS": ""
  },
  "argv": [
    "/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/bin/run.sh",
    "{connection_file}",
    "--RemoteProcessProxy.response-address",
    "{response_address}",
    "--RemoteProcessProxy.port-range",
    "{port_range}",
    "--RemoteProcessProxy.spark-context-initialization-mode",
    "lazy"
  ]
}

I want to note that the only difference between the default kernel.json and the one I use is the following:

Default: 
"PYTHONPATH": "${HOME}/.local/lib/python2.7/
 spark.yarn.appMasterEnv.PYTHONPATH=${HOME}/.local/lib/python2.7/site-packages
Local: 
 "PYTHONPATH": "/opt/anaconda2/lib/python2.7/site-packages
spark.yarn.appMasterEnv.PYTHONPATH=/opt/anaconda2/lib/python2.7/site-packages:

As I was expecting, this raised an error related to PYSPARK_GATEWAY_SECRET:

Using connection file '/tmp/kernel-73c700f8-a03d-45ad-aa04-c96fa82ef9c6_5xFFnQ.json' instead of '/home/elyra/.local/share/jupyter/runtime/kernel-73c700f8-a03d-45ad-aa04-c96fa82ef9c6.json'
Signal socket bound to host: 0.0.0.0, port: 53516
JSON Payload '{"stdin_port": 37306, "pgid": "4536", "ip": "0.0.0.0", "pid": "4583", "control_port": 39802, "hb_port": 39581, "signature_scheme": "hmac-sha256", "key": "394b7d07-2c3d-4e71-9da2-9175a659cf1c", "comm_port": 53516, "kernel_name": "", "shell_port": 58451, "transport": "tcp", "iopub_port": 59037}
Encrypted Payload '0eCmaI4Jmz1vuY6hPnLL5MDVzOWRkqVGR641cblKs6jIv2CNxx5eylkXns0wi3kwkjDnJ+gpdEGZLlnwmvYHZqXXOntHFXoRA2LFDSjBTdJF0RVAriSdrwrtf6jsGE4/Og78+fAJhcHAd8u1zYKpNsblGdq5e4yaYAwxZaqrYICn6k73sqEAqEi7TzrVjmwrKpRGoIh3UmA0RIKxS2o+wCusJ9fcXf0/zKbB7wl5oNTizydqDR1F2OlRjZsdBAjI6q1wJ2DJf3UVj+3vPx97vIaelrIildHlK7xEb3vwYIekgV4GRNSFKtyMsL5PZhnQNdIR65mWfTjy5mzEkwyWV9Vec7Cr8/e7KHL3r7uB5Yccn3KNZElD9Rrq4k+Z+1nRoBSyrZ41ty4lniCh5G2fug==
/opt/anaconda2/lib/python2.7/site-packages/IPython/paths.py:69: UserWarning: IPython parent '/home' is not a writable location, using a temp directory.
  " using a temp directory.".format(parent))
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/opt/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/opt/anaconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "launch_ipykernel.py", line 62, in initialize_spark_session
    spark = SparkSession.builder.getOrCreate()
  File "/opt/anaconda2/lib/python2.7/site-packages/pyspark/sql/session.py", line 173, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/opt/anaconda2/lib/python2.7/site-packages/pyspark/context.py", line 343, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/opt/anaconda2/lib/python2.7/site-packages/pyspark/context.py", line 115, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "/opt/anaconda2/lib/python2.7/site-packages/pyspark/context.py", line 292, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/opt/anaconda2/lib/python2.7/site-packages/pyspark/java_gateway.py", line 47, in launch_gateway
    gateway_secret = os.environ["PYSPARK_GATEWAY_SECRET"]
  File "/opt/anaconda2/lib/python2.7/UserDict.py", line 40, in __getitem__
    raise KeyError(key)
KeyError: 'PYSPARK_GATEWAY_SECRET'

As per @kevin-bates' recommendation, I attached the following appMasterEnv variable to the kernel: --conf spark.yarn.appMasterEnv.PYSPARK_GATEWAY_SECRET=just_a_secret_key

No error shows up in the app's error log, but I got the following:

(screenshot attached, 2018-09-13)
kevin-bates commented 6 years ago

Given that we've never had to set up any of this PYSPARK_GATEWAY_ stuff, try not setting PYSPARK_GATEWAY_PORT or PYSPARK_GATEWAY_SECRET. Absence of PYSPARK_GATEWAY_PORT takes a different branch when launching the java_gateway - one that we must be taking in our environments.
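
To paraphrase the branch in question (a simplified sketch, not the actual pyspark source), the decision looks roughly like this: if PYSPARK_GATEWAY_PORT is present, PySpark attaches to a JVM gateway that Spark has already started (and, in the Spark versions that require it, also reads PYSPARK_GATEWAY_SECRET, which is the KeyError in the earlier traceback); otherwise it spawns its own spark-submit JVM.

import os

def choose_gateway_branch(environ=os.environ):
    # Simplified paraphrase of pyspark's launch_gateway decision, for illustration only.
    if "PYSPARK_GATEWAY_PORT" in environ:
        # PythonRunner already started the JVM gateway; attach to it.
        return ("attach",
                int(environ["PYSPARK_GATEWAY_PORT"]),
                environ.get("PYSPARK_GATEWAY_SECRET"))
    # No port in the environment: pyspark launches a fresh JVM gateway itself.
    return ("spawn", None, None)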

Could you then attach the Enterprise Gateway log (from startup through the issue) and the YARN application logs? If tar/zip is easier, that's fine. Also, please include the kernel.json and run.sh files used. I know you've posted them, but it's easier to "touch" the files. (I'm sure you know what I mean.)

Thanks.

ziedbouf commented 6 years ago

@kevin-bates please find the logs as requested, including the YARN log, the EG logs, and the kernel configuration.

spark_python_yarn_cluster_kernel_logs.tar.gz

kevin-bates commented 6 years ago

@ziedbouf - thank you so much for the complete set of files - its extremely helpful to see the entire picture.

Enterprise Gateway is working completely as expected, and this confirms it's purely a spark context creation issue. What is confusing to me is why you're encountering this and we have never seen it. I can't determine (due to lack of Spark knowledge) whether this "java gateway" is always "in play" for python sessions - I suspect it is. If it is, and given we do not ever deal with PYSPARK_GATEWAY_SECRET, then that would imply we do not have PYSPARK_GATEWAY_PORT in the env of the spark-submit per my previous post.

I'm wondering if it might be helpful to add print(os.environ) to the launch_ipykernel.py script just prior to creating the context. This output should go to the stdout file in the YARN logs and may help us better determine what is going on.
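
For example (a hypothetical placement; the exact spot in launch_ipykernel.py may vary):

import os
import pprint

# Dump the environment the driver actually sees; in cluster mode this lands in
# the container's stdout file under the YARN application logs.
pprint.pprint(dict(os.environ))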

It's also a bit odd that we don't see any log messages in the EG log regarding the kernel's death. There should be some attempts at auto-restart since the exception should terminate the launcher. Hmm, actually, it's terminating the thread, but I believe the kernel would still be running, sans a spark context. So there's something funky there, but that's a complete side effect of the issue at hand - why the heck we can't start a spark context.

Have you tried running spark-submit directly? You might need to massage some of the following...


exec /usr/hdp/current/spark2-client/bin/spark-submit --master yarn --deploy-mode cluster --name 471785be-cdf8-47d0-82b1-a134063ecc09 --conf spark.yarn.am.waitTime=1d --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.appMasterEnv.PYTHONUSERBASE=/opt/anaconda3 --conf spark.yarn.appMasterEnv.PYTHONPATH=/opt/anaconda3/lib/python3.6/site-packages/:/usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip --conf spark.yarn.appMasterEnv.PATH=/opt/anaconda3/bin/python:/opt/anaconda3/bin:/usr/lib64/qt-3.3/bin:/usr/java/jdk1.8.0_181-amd64/bin:/usr/java/jdk1.8.0_181-amd64/jre/bin:/opt/anaconda3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/elyra/.local/bin:/home/elyra/bin /usr/local/share/jupyter/kernels/spark_python_yarn_cluster/scripts/launch_ipykernel.py /home/elyra/.local/share/jupyter/runtime/kernel-471785be-cdf8-47d0-82b1-a134063ecc09.json --RemoteProcessProxy.response-address 10.132.0.4:56333 --RemoteProcessProxy.port-range 0..0 --RemoteProcessProxy.spark-context-initialization-mode lazy
ziedbouf commented 6 years ago
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:498)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:345)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply$mcV$sp(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$5.run(ApplicationMaster.scala:815)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:814)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:259)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:839)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
Caused by: org.apache.spark.SparkUserAppException: User application exited with 1
at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:102)
at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)

There were no logs to access at first, but following this article from Hortonworks I was able to get some logs :/

Log Type: stdout

Log Upload Time: Fri Sep 14 17:22:05 +0000 2018

Log Length: 520

Using connection file '/tmp/kernel-471785be-cdf8-47d0-82b1-a134063ecc09_2gqh1c9o.json' instead of '/home/elyra/.local/share/jupyter/runtime/kernel-471785be-cdf8-47d0-82b1-a134063ecc09.json'
Signal socket bound to host: 0.0.0.0, port: 44190
Traceback (most recent call last):
  File "launch_ipykernel.py", line 320, in <module>
    lower_port, upper_port)
  File "launch_ipykernel.py", line 143, in return_connection_info
    s.connect((response_ip, response_port))
ConnectionRefusedError: [Errno 111] Connection refused

Log Type: prelaunch.err

Log Upload Time: Fri Sep 14 17:22:01 +0000 2018

Log Length: 0

Log Type: prelaunch.out

Log Upload Time: Fri Sep 14 17:22:01 +0000 2018

Log Length: 100

Setting up env variables
Setting up job resources
Copying debugging information
Launching container

Log Type: stderr

Log Upload Time: Fri Sep 14 17:22:05 +0000 2018

Log Length: 3899

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/yarn/local/usercache/root/filecache/67/__spark_libs__4683774330004883086.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.6.5.0-292/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/09/14 17:22:02 INFO SignalUtils: Registered signal handler for TERM
18/09/14 17:22:02 INFO SignalUtils: Registered signal handler for HUP
18/09/14 17:22:02 INFO SignalUtils: Registered signal handler for INT
18/09/14 17:22:02 INFO SecurityManager: Changing view acls to: yarn,root
18/09/14 17:22:02 INFO SecurityManager: Changing modify acls to: yarn,root
18/09/14 17:22:02 INFO SecurityManager: Changing view acls groups to: 
18/09/14 17:22:02 INFO SecurityManager: Changing modify acls groups to: 
18/09/14 17:22:02 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(yarn, root); groups with view permissions: Set(); users  with modify permissions: Set(yarn, root); groups with modify permissions: Set()
18/09/14 17:22:03 INFO ApplicationMaster: Preparing Local resources
18/09/14 17:22:04 INFO ApplicationMaster: ApplicationAttemptId: appattempt_1536934368460_0022_000001
18/09/14 17:22:04 INFO ApplicationMaster: Starting the user application in a separate Thread
18/09/14 17:22:04 INFO ApplicationMaster: Waiting for spark context initialization...
18/09/14 17:22:05 ERROR ApplicationMaster: User application exited with status 1
18/09/14 17:22:05 INFO ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: User application exited with status 1)
18/09/14 17:22:05 ERROR ApplicationMaster: Uncaught exception: 
org.apache.spark.SparkException: Exception thrown in awaitResult: 
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
    at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:345)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply$mcV$sp(ApplicationMaster.scala:260)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$5.run(ApplicationMaster.scala:815)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
    at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:814)
    at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:259)
    at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:839)
    at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
Caused by: org.apache.spark.SparkUserAppException: User application exited with 1
    at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:102)
    at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:721)
18/09/14 17:22:05 INFO ShutdownHookManager: Shutdown hook called

It seems that somehow I might need to increase it more?

kevin-bates commented 6 years ago

Increase what? The launch timeout? If so, no, 100 should be plenty. We rarely see more than 30.

What did you change? This implies you're going backwards ...

Signal socket bound to host: 0.0.0.0, port: 44190
Traceback (most recent call last):
  File "launch_ipykernel.py", line 320, in <module>
    lower_port, upper_port)
  File "launch_ipykernel.py", line 143, in return_connection_info
    s.connect((response_ip, response_port))
ConnectionRefusedError: [Errno 111] Connection refused

since you were able to return the connection info previously...

Signal socket bound to host: 0.0.0.0, port: 53516
JSON Payload '{"stdin_port": 37306, "pgid": "4536", "ip": "0.0.0.0", "pid": "4583", "control_port": 39802, "hb_port": 39581, "signature_scheme": "hmac-sha256", "key": "394b7d07-2c3d-4e71-9da2-9175a659cf1c", "comm_port": 53516, "kernel_name": "", "shell_port": 58451, "transport": "tcp", "iopub_port": 59037}
Encrypted Payload '0eCmaI4Jmz1vuY6hPnLL5MDVzOWRkqVGR641cblKs6jIv2CNxx5eylkXns0wi3kwkjDnJ+gpdEGZLlnwmvYHZqXXOntHFXoRA2LFDSjBTdJF0RVAriSdrwrtf6jsGE4/Og78+fAJhcHAd8u1zYKpNsblGdq5e4yaYAwxZaqrYICn6k73sqEAqEi7TzrVjmwrKpRGoIh3UmA0RIKxS2o+wCusJ9fcXf0/zKbB7wl5oNTizydqDR1F2OlRjZsdBAjI6q1wJ2DJf3UVj+3vPx97vIaelrIildHlK7xEb3vwYIekgV4GRNSFKtyMsL5PZhnQNdIR65mWfTjy5mzEkwyWV9Vec7Cr8/e7KHL3r7uB5Yccn3KNZElD9Rrq4k+Z+1nRoBSyrZ41ty4lniCh5G2fug==
/opt/anaconda2/lib/python2.7/site-packages/IPython/paths.py:69: UserWarning: IPython parent '/home' is not a writable location, using a temp directory.
  " using a temp directory.".format(parent))

and now you're probably getting a timeout-based exception in the EG log because it never got the information regarding the 5 ports.

ziedbouf commented 6 years ago

Hi @kevin-bates, sorry for the late reply. I misunderstood your comment last time, so I killed the jupyter kernel and tried to run the exec command.

Anyway, it didn't work: the job got killed and nothing really showed up to help understand why things got stuck. So I listened to @lresende and re-ran everything on a fresh cluster using the Ansible playbook, and things work fine. Note that I am using Python 2 instead of Python 3, so most likely it was a compatibility issue.

(screenshot attached, 2018-09-16)

Thanks for the support @kevin-bates and @lresende. Just one more question: which configuration should I go with in order to set up py3 instead of py2?

import platform
platform.python_version()
lresende commented 6 years ago

While browsing another mailing list, it seems that PYSPARK_GATEWAY_SECRET is a recent change in Spark related to CVE-2018-1334. We might need to update the EG code to support the change in recent Spark releases.

ziedbouf commented 6 years ago

Thanks @lresende. Can I help with this? Any advice on where to start, as I am still exploring the overall stack?

ziedbouf commented 6 years ago

Also, one more thing: the py4j changes. Does this mean it would be better to generate the kernels based on the current version found in $SPARK_HOME/python/lib?

kevin-bates commented 5 years ago

@lresende @ziedbouf - where do we stand on this issue? It looks like python 2 worked, but python 3 doesn't. Not sure how the PYSPARK_GATEWAY_SECRET stuff comes into play at this point since python 2 worked.

Just doing some housekeeping and would like to know if this issue can be closed.

ziedbouf commented 5 years ago

@kevin-bates that was an issue of using Python 3 instead of Python 2 with YARN, and due to some shifts in the infrastructure I didn't have the chance to tackle the issue in a deeper fashion.

Also, something I miss from the YARN documentation in general: there is no clear documentation related to attaching kernels to the YARN execution environment, which led to a lot of confusion during the debugging/testing phase.

kevin-bates commented 5 years ago

@ziedbouf - thanks for the update. Regarding...

Also, something I miss from the YARN documentation in general: there is no clear documentation related to attaching kernels to the YARN execution environment, which led to a lot of confusion during the debugging/testing phase.

Are you speaking of the YARN portion of the Enterprise Gateway documentation or the Hadoop YARN documentation itself? If the Enterprise Gateway documentation, could you please provide more details, or perhaps even a pull request containing the appropriate changes?

Thanks.

ziedbouf commented 5 years ago

@kevin-bates I mean the YARN documentation is not as trivial as I was expecting regarding the process of how to attach notebooks to YARN clusters.

kevin-bates commented 5 years ago

The Hadoop YARN docs shouldn't be covering anything regarding Notebooks. However, the YARN portion of the EG documentation may be missing something. I'm trying to understand what that might be so we might be able to fix our docs if they're lacking information.

kevin-bates commented 5 years ago

@ziedbouf could you please provide a link to the yarn documentation you're referring to. I'd like to better understand where the disconnect is. Thanks.

ziedbouf commented 5 years ago

@kevin-bates I mean the YARN documentation (not related to Jupyter Enterprise Gateway) in general does not include any details related to the kernel integration.

I think most of the resources out there don't outline how Jupyter kernels work in general or how to create one. The following post seems to be a good start for grasping the idea of kernels in the Jupyter environment, and I think, @kevin-bates, that it might be useful to create a series of blog posts as part of the Jupyter Enterprise Gateway initiative to explain how kernels work in general.

kevin-bates commented 5 years ago

@ziedbouf - thank you for the clarification. There is no reason the YARN documentation should mention anything about jupyter kernels. The launching of jupyter kernels using YARN as a resource manager is completely unique to what Enterprise Gateway enables. As far as YARN is concerned, the kernel is just another application.

The enabling technology for running remote kernels launched from EG is the Kernel Launchers covered in the System Architecture section of the docs. (The reason I don't include a link to the Launchers section is because there are a number of changes pending to that section in PR #534 - and a link to the existing docs will break once merged.)

I agree that enhancing our docs (along with a blog post) to include the items necessary for creating a launcher would be helpful.

Are you looking to add support for another kernel type other than Python, R or Scala? If so, you might be a good candidate for documenting your experience. :smiley:

This issue has evolved into something completely different from initializing the spark context and I'm inclined to close this issue unless anyone objects.