jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/

spark python YARN client kernel, Timeout waiting for kernel_info reply #600

Closed glorysdj closed 5 years ago

glorysdj commented 5 years ago

Hi, I've configured Jupyter Enterprise Gateway with the Spark Python YARN client kernel, but I always get a timeout warning and cannot execute any cell in the notebook. The debug output is below:

[D 2019-03-13 10:22:50.951 EnterpriseGatewayApp] RemoteMappingKernelManager.start_kernel: spark_python_yarn_client, kernel_username: XXXX [D 2019-03-13 10:22:50.967 EnterpriseGatewayApp] Instantiating kernel 'Spark - Python (YARN Client Mode)' with process proxy: enterprise_gateway.services.processproxies.processproxy.LocalProcessProxy [D 2019-03-13 10:22:50.969 EnterpriseGatewayApp] Starting kernel: ['/usr/local/share/jupyter/kernels/spark_python_yarn_client/bin/run.sh', '--RemoteProcessProxy.kernel-id', '{kernel_id}', '--RemoteProcessProxy.response-address', '{response_address}', '--RemoteProcessProxy.port-range', '0..0', '--RemoteProcessProxy.spark-context-initialization-mode', 'lazy'] [D 2019-03-13 10:22:50.970 EnterpriseGatewayApp] Launching kernel: Spark - Python (YARN Client Mode) with command: ['/usr/local/share/jupyter/kernels/spark_python_yarn_client/bin/run.sh', '--RemoteProcessProxy.kernel-id', '{kernel_id}', '--RemoteProcessProxy.response-address', '{response_address}', '--RemoteProcessProxy.port-range', '0..0', '--RemoteProcessProxy.spark-context-initialization-mode', 'lazy'] [D 2019-03-13 10:22:50.970 EnterpriseGatewayApp] BaseProcessProxy.launch_process() env: {'PATH': '/opt/work/conda/envs/jeg/bin:/opt/jdk8/bin:/bin:/opt/work/hadoop-2.7.2/bin:/bin:/opt/work/conda/envs/jeg/bin:/opt/work/conda/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games', 'KERNEL_USERNAME': 'xxxx', 'KERNEL_WORKING_DIR': '/opt/work', 'SPARK_HOME': '/opt/work/spark-2.4.0-bin-hadoop2.7', 'EG_REMOTE_HOSTS': '172.168.2.165,172.168.2.166,172.168.2.186,172.168.2.187', 'PYSPARK_PYTHON': '/opt/work/conda/envs/jeg/bin/python', 'EG_KERNEL_LAUNCH_TIMEOUT': '600', 'EG_SOCKET_TIMEOUT': '600', 'PYTHONPATH': '/opt/work/conda/envs/jeg/lib/python3.6/site-packages:/opt/work/spark-2.4.0-bin-hadoop2.7/python:/opt/work/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip', 'SPARK_OPTS': '--master yarn --deploy-mode client --name 
${KERNEL_ID:-ERRORNOKERNEL_ID}', 'LAUNCH_OPTS': '', 'KERNEL_GATEWAY': '1', 'KERNEL_ID': '74ce67f1-a8d7-44a6-a952-a5af86363cc6', 'EG_IMPERSONATION_ENABLED': 'False'} [I 2019-03-13 10:22:50.975 EnterpriseGatewayApp] Local kernel launched on '172.168.2.1', pid: 40408, pgid: 40408, KernelID: 74ce67f1-a8d7-44a6-a952-a5af86363cc6, cmd: '['/usr/local/share/jupyter/kernels/spark_python_yarn_client/bin/run.sh', '--RemoteProcessProxy.kernel-id', '{kernel_id}', '--RemoteProcessProxy.response-address', '{response_address}', '--RemoteProcessProxy.port-range', '0..0', '--RemoteProcessProxy.spark-context-initialization-mode', 'lazy']' [D 2019-03-13 10:22:50.978 EnterpriseGatewayApp] Connecting to: tcp://127.0.0.1:57474

Starting IPython kernel for Spark in Yarn Client mode on behalf of user arda

[D 2019-03-13 10:22:50.980 EnterpriseGatewayApp] Connecting to: tcp://127.0.0.1:37504 [I 2019-03-13 10:22:50.984 EnterpriseGatewayApp] Kernel started: 74ce67f1-a8d7-44a6-a952-a5af86363cc6 t --name ${KERNEL_ID:-ERRORNOKERNEL_ID}' /usr/local/share/jupyter/kernels/spark_python_yarn_client/scripts/launch_ipykernel.py '' --RemoteProcessProxy.kernel-id '{kernel_id}' --RemoteProcessProxy.response-address '{response_address}' --RemoteProcessProxy.port-range 0..0 --RemoteProcessProxy.spark-context-initialization-mode lazy RNEL_WORKING_DIR': '/opt/work'}, 'kernel_name': 'spark_python_yarn_client'} ++ exec /opt/work/spark-2.4.0-bin-hadoop2.7/bin/spark-submit --master yarn --deploy-mode client --name 74ce67f1-a8d7-44a6-a952-a5af86363cc6 /usr/local/share/jupyter/kernels/spark_python_yarn_client/scripts/launch_ipykernel.py --RemoteProcessProxy.kernel-id '{kernel_id}' --RemoteProcessProxy.response-address '{response_address}' --RemoteProcessProxy.port-range 0..0 --RemoteProcessProxy.spark-context-initialization-mode lazy [I 190313 10:22:50 web:2246] 201 POST /api/kernels (10.239.47.211) 36.33ms /opt/work/spark-2.4.0-bin-hadoop2.7 http://172.168.2.165:8188/cluster /opt/work/hadoop-2.7.2 /opt/work/hadoop-2.7.2/etc/hadoop /opt/work/hadoop-2.7.2/etc/hadoop [I 190313 10:22:50 web:2246] 200 GET /api/kernels/74ce67f1-a8d7-44a6-a952-a5af86363cc6 (10.239.47.211) 0.67ms [D 2019-03-13 10:22:51.120 EnterpriseGatewayApp] Initializing websocket connection /api/kernels/74ce67f1-a8d7-44a6-a952-a5af86363cc6/channels [W 2019-03-13 10:22:51.125 EnterpriseGatewayApp] No session ID specified [D 2019-03-13 10:22:51.125 EnterpriseGatewayApp] Requesting kernel info from 74ce67f1-a8d7-44a6-a952-a5af86363cc6 [D 2019-03-13 10:22:51.126 EnterpriseGatewayApp] Connecting to: tcp://127.0.0.1:36177 2019-03-13 10:22:52 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable [D 2019-03-13 10:22:53,135.135 launch_ipykernel] Using connection file '/tmp/kernel-{kernel_id}_ij02jnmf.json'. [E 2019-03-13 10:22:53,136.136 launch_ipykernel] Invalid format for response address '{response_address}'. Assuming 'pull' mode... 2019-03-13 10:22:53 INFO SparkContext:54 - Running Spark version 2.4.0 2019-03-13 10:22:53 INFO SparkContext:54 - Submitted application: 74ce67f1-a8d7-44a6-a952-a5af86363cc6 2019-03-13 10:22:53 INFO SecurityManager:54 - Changing view acls to: root 2019-03-13 10:22:53 INFO SecurityManager:54 - Changing modify acls to: root 2019-03-13 10:22:53 INFO SecurityManager:54 - Changing view acls groups to: 2019-03-13 10:22:53 INFO SecurityManager:54 - Changing modify acls groups to: 2019-03-13 10:22:53 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set() 2019-03-13 10:22:53 INFO Utils:54 - Successfully started service 'sparkDriver' on port 48311. 
2019-03-13 10:22:53 INFO SparkEnv:54 - Registering MapOutputTracker 2019-03-13 10:22:53 INFO SparkEnv:54 - Registering BlockManagerMaster 2019-03-13 10:22:53 INFO BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information 2019-03-13 10:22:53 INFO BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up 2019-03-13 10:22:53 INFO DiskBlockManager:54 - Created local directory at /tmp/blockmgr-3d267f51-b196-4cc6-b772-53955bd949d9 2019-03-13 10:22:53 INFO MemoryStore:54 - MemoryStore started with capacity 366.3 MB 2019-03-13 10:22:54 INFO SparkEnv:54 - Registering OutputCommitCoordinator 2019-03-13 10:22:54 INFO log:192 - Logging initialized @2893ms 2019-03-13 10:22:54 INFO Server:351 - jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown 2019-03-13 10:22:54 INFO Server:419 - Started @2980ms 2019-03-13 10:22:54 WARN Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041. 2019-03-13 10:22:54 WARN Utils:66 - Service 'SparkUI' could not bind on port 4041. Attempting port 4042. 2019-03-13 10:22:54 WARN Utils:66 - Service 'SparkUI' could not bind on port 4042. Attempting port 4043. 2019-03-13 10:22:54 WARN Utils:66 - Service 'SparkUI' could not bind on port 4043. Attempting port 4044. 2019-03-13 10:22:54 WARN Utils:66 - Service 'SparkUI' could not bind on port 4044. Attempting port 4045. 2019-03-13 10:22:54 INFO AbstractConnector:278 - Started ServerConnector@355ec79e{HTTP/1.1,[http/1.1]}{0.0.0.0:4045} 2019-03-13 10:22:54 INFO Utils:54 - Successfully started service 'SparkUI' on port 4045. 
2019-03-13 10:22:54 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@720da416{/jobs,null,AVAILABLE,@Spark} 2019-03-13 10:22:54 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@73d7efa8{/jobs/json,null,AVAILABLE,@Spark} 2019-03-13 10:22:54 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7a060513{/jobs/job,null,AVAILABLE,@Spark} 2019-03-13 10:22:54 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4005468b{/jobs/job/json,null,AVAILABLE,@Spark} 2019-03-13 10:22:54 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@c634b8c{/stages,null,AVAILABLE,@Spark} 2019-03-13 10:22:54 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4687e98e{/stages/json,null,AVAILABLE,@Spark} 2019-03-13 10:22:54 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2228a0d2{/stages/stage,null,AVAILABLE,@Spark} 2019-03-13 10:22:54 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2768cf67{/stages/stage/json,null,AVAILABLE,@Spark} 2019-03-13 10:22:54 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@40a1d568{/stages/pool,null,AVAILABLE,@Spark} 2019-03-13 10:22:54 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@145eedf5{/stages/pool/json,null,AVAILABLE,@Spark} 2019-03-13 10:22:54 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1ebb5e1b{/storage,null,AVAILABLE,@Spark} 2019-03-13 10:22:54 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7fc9b95c{/storage/json,null,AVAILABLE,@Spark} 2019-03-13 10:22:54 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@310d2958{/storage/rdd,null,AVAILABLE,@Spark} 2019-03-13 10:22:54 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@525f4bc9{/storage/rdd/json,null,AVAILABLE,@Spark} 2019-03-13 10:22:54 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@16d09c1f{/environment,null,AVAILABLE,@Spark} 2019-03-13 10:22:54 INFO ContextHandler:781 - Started 
o.s.j.s.ServletContextHandler@24e1c4c0{/environment/json,null,AVAILABLE,@Spark} 2019-03-13 10:22:54 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7c7e6eed{/executors,null,AVAILABLE,@Spark} 2019-03-13 10:22:54 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@545773a8{/executors/json,null,AVAILABLE,@Spark} 2019-03-13 10:22:54 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@46f53d59{/executors/threadDump,null,AVAILABLE,@Spark} 2019-03-13 10:22:54 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6c183683{/executors/threadDump/json,null,AVAILABLE,@Spark} 2019-03-13 10:22:54 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@631e34eb{/static,null,AVAILABLE,@Spark} 2019-03-13 10:22:54 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@42e1616f{/,null,AVAILABLE,@Spark} 2019-03-13 10:22:54 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@15634c5b{/api,null,AVAILABLE,@Spark} 2019-03-13 10:22:54 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@10db9a44{/jobs/job/kill,null,AVAILABLE,@Spark} 2019-03-13 10:22:54 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@43538a87{/stages/stage/kill,null,AVAILABLE,@Spark} 2019-03-13 10:22:54 INFO SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://XXX-Gateway:4045 2019-03-13 10:22:55 INFO RMProxy:98 - Connecting to ResourceManager at XXX-Node-065/172.168.2.165:8032 2019-03-13 10:22:55 INFO Client:54 - Requesting a new application from cluster with 4 NodeManagers 2019-03-13 10:22:55 INFO Client:54 - Verifying our application has not requested more than the maximum memory capability of the cluster (200000 MB per container) 2019-03-13 10:22:55 INFO Client:54 - Will allocate AM container, with 896 MB memory including 384 MB overhead 2019-03-13 10:22:55 INFO Client:54 - Setting up container launch context for our AM 2019-03-13 10:22:55 INFO Client:54 - Setting up the launch environment for our AM 
container 2019-03-13 10:22:55 INFO Client:54 - Preparing resources for our AM container 2019-03-13 10:22:55 WARN Client:66 - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME. 2019-03-13 10:22:58 INFO Client:54 - Uploading resource file:/tmp/spark-1e062b0b-da1a-484b-b921-ea20182f4147/spark_libs678588766097818526.zip -> hdfs://XXX-Node-065:9000/user/root/.sparkStaging/application_1551765709575_0108/spark_libs678588766097818526.zip 2019-03-13 10:23:01 INFO Client:54 - Uploading resource file:/opt/work/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip -> hdfs://XXX-Node-065:9000/user/root/.sparkStaging/application_1551765709575_0108/pyspark.zip 2019-03-13 10:23:01 INFO Client:54 - Uploading resource file:/opt/work/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip -> hdfs://XXX-Node-065:9000/user/root/.sparkStaging/application_1551765709575_0108/py4j-0.10.7-src.zip 2019-03-13 10:23:01 INFO Client:54 - Uploading resource file:/tmp/spark-1e062b0b-da1a-484b-b921-ea20182f4147/spark_conf868785633236713899.zip -> hdfs://XXX-Node-065:9000/user/root/.sparkStaging/application_1551765709575_0108/spark_conf.zip 2019-03-13 10:23:01 INFO SecurityManager:54 - Changing view acls to: root 2019-03-13 10:23:01 INFO SecurityManager:54 - Changing modify acls to: root 2019-03-13 10:23:01 INFO SecurityManager:54 - Changing view acls groups to: 2019-03-13 10:23:01 INFO SecurityManager:54 - Changing modify acls groups to: 2019-03-13 10:23:01 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set() 2019-03-13 10:23:02 INFO Client:54 - Submitting application application_1551765709575_0108 to ResourceManager 2019-03-13 10:23:02 INFO YarnClientImpl:273 - Submitted application application_1551765709575_0108 2019-03-13 10:23:02 INFO 
SchedulerExtensionServices:54 - Starting Yarn extension services with app application_1551765709575_0108 and attemptId None 2019-03-13 10:23:03 INFO Client:54 - Application report for application_1551765709575_0108 (state: ACCEPTED) 2019-03-13 10:23:03 INFO Client:54 - client token: N/A diagnostics: N/A ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: default start time: 1552443913249 final status: UNDEFINED tracking URL: http://XXX-Node-065:8188/proxy/application_1551765709575_0108/ user: root 2019-03-13 10:23:04 INFO Client:54 - Application report for application_1551765709575_0108 (state: ACCEPTED) 2019-03-13 10:23:05 INFO Client:54 - Application report for application_1551765709575_0108 (state: ACCEPTED) 2019-03-13 10:23:06 INFO Client:54 - Application report for application_1551765709575_0108 (state: ACCEPTED) 2019-03-13 10:23:07 INFO Client:54 - Application report for application_1551765709575_0108 (state: ACCEPTED) 2019-03-13 10:23:08 INFO Client:54 - Application report for application_1551765709575_0108 (state: RUNNING) 2019-03-13 10:23:08 INFO Client:54 - client token: N/A diagnostics: N/A ApplicationMaster host: 172.168.2.187 ApplicationMaster RPC port: -1 queue: default start time: 1552443913249 final status: UNDEFINED tracking URL: http://XXX-Node-065:8188/proxy/application_1551765709575_0108/ user: root 2019-03-13 10:23:08 INFO YarnClientSchedulerBackend:54 - Application application_1551765709575_0108 has started running. 2019-03-13 10:23:08 INFO Utils:54 - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 44691. 
2019-03-13 10:23:08 INFO NettyBlockTransferService:54 - Server created on XXX-Gateway:44691 2019-03-13 10:23:08 INFO BlockManager:54 - Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy 2019-03-13 10:23:08 INFO BlockManagerMaster:54 - Registering BlockManager BlockManagerId(driver, XXX-Gateway, 44691, None) 2019-03-13 10:23:08 INFO BlockManagerMasterEndpoint:54 - Registering block manager XXX-Gateway:44691 with 366.3 MB RAM, BlockManagerId(driver, XXX-Gateway, 44691, None) 2019-03-13 10:23:08 INFO BlockManagerMaster:54 - Registered BlockManager BlockManagerId(driver, XXX-Gateway, 44691, None) 2019-03-13 10:23:08 INFO BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, XXX-Gateway, 44691, None) 2019-03-13 10:23:08 INFO YarnClientSchedulerBackend:54 - Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> XXX-Node-065, PROXY_URI_BASES -> http://XXX-Node-065:8088/proxy/application_1551765709575_0108), /proxy/application_1551765709575_0108 2019-03-13 10:23:08 INFO JettyUtils:54 - Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /jobs, /jobs/json, /jobs/job, /jobs/job/json, /stages, /stages/json, /stages/stage, /stages/stage/json, /stages/pool, /stages/pool/json, /storage, /storage/json, /storage/rdd, /storage/rdd/json, /environment, /environment/json, /executors, /executors/json, /executors/threadDump, /executors/threadDump/json, /static, /, /api, /jobs/job/kill, /stages/stage/kill. 2019-03-13 10:23:08 INFO JettyUtils:54 - Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /metrics/json. 
2019-03-13 10:23:08 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@592d45b9{/metrics/json,null,AVAILABLE,@Spark} 2019-03-13 10:23:08 INFO YarnSchedulerBackend$YarnSchedulerEndpoint:54 - ApplicationMaster registered as NettyRpcEndpointRef(spark-client://YarnAM) 2019-03-13 10:23:12 INFO YarnSchedulerBackend$YarnDriverEndpoint:54 - Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.168.2.187:50950) with ID 1 2019-03-13 10:23:12 INFO BlockManagerMasterEndpoint:54 - Registering block manager XXX-Node-087:46463 with 366.3 MB RAM, BlockManagerId(1, XXX-Node-087, 46463, None) 2019-03-13 10:23:14 INFO YarnSchedulerBackend$YarnDriverEndpoint:54 - Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.168.2.166:40766) with ID 2 2019-03-13 10:23:14 INFO YarnClientSchedulerBackend:54 - SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8 2019-03-13 10:23:14 INFO BlockManagerMasterEndpoint:54 - Registering block manager XXX-Node-066:40492 with 366.3 MB RAM, BlockManagerId(2, XXX-Node-066, 40492, None) [W 2019-03-13 10:23:51.128 EnterpriseGatewayApp] Timeout waiting for kernel_info reply from 74ce67f1-a8d7-44a6-a952-a5af86363cc6

kevin-bates commented 5 years ago

Hello - thank you for opening the issue.

tl;dr: you've got a misconfigured kernel.json file that is preventing EG from communicating with the kernel.

Right away I see a few anomalies with your kernelspec. You have either not included a process proxy stanza or have explicitly set the process proxy to enterprise_gateway.services.processproxies.processproxy.LocalProcessProxy, yet the other aspects of your launch invocation reference items used by the remote process proxies. As a result, the connection information is neither conveyed to the kernel nor returned to EG.

Since you want to use YARN client mode and have already set EG_REMOTE_HOSTS, you should configure the process proxy as enterprise_gateway.services.processproxies.distributed.DistributedProcessProxy. I suspect you removed the process proxy stanza from spark_python_yarn_client/kernel.json, since that kernelspec comes with the DistributedProcessProxy already configured. There are ways to configure YARN client mode for a local kernel (see Next steps below), but keep in mind that all Spark drivers will then be pinned to your EG server, where resources may not be as plentiful (and it's not the design center for EG).

If you want to use YARN client mode such that your drivers (i.e., kernel processes) are distributed, then you must ensure that the nodes specified via EG_REMOTE_HOSTS have the same set of kernelspec directories present with the same parent directories. In addition, password-less ssh must be configured for the EG user across the nodes. To be honest, configuring YARN cluster mode is much easier because that duplication of kernelspecs is not required.

The message "Timeout waiting for kernel_info reply" is typically just a warning that appears after 10 seconds. However, in this case, due to the previous configuration issues, EG is not communicating with the kernel. That said, the actual timeout can be specified from the client via KERNEL_LAUNCH_TIMEOUT. Since it defaults to 30 seconds or so, you probably encountered a client-enforced timeout. Try increasing KERNEL_LAUNCH_TIMEOUT in the env of your notebook client. EG_KERNEL_LAUNCH_TIMEOUT is what the server uses, and that limit (of 10 minutes) wasn't exceeded.
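As a rough sketch of the client-side change, assuming the notebook client is started from a Python wrapper and picks up KERNEL_LAUNCH_TIMEOUT from its environment as described above. The helper name `client_env` and the value 120 are illustrative only, not EG defaults.

```python
import os


def client_env(base_env, launch_timeout_s):
    """Return a copy of base_env with KERNEL_LAUNCH_TIMEOUT set, in seconds.

    `client_env` is a hypothetical helper; only the variable name
    KERNEL_LAUNCH_TIMEOUT comes from the discussion above.
    """
    if launch_timeout_s <= 0:
        raise ValueError("launch timeout must be positive")
    env = dict(base_env)  # copy so the caller's environment is left untouched
    env["KERNEL_LAUNCH_TIMEOUT"] = str(int(launch_timeout_s))
    return env


# Example: environment for a notebook client that should wait up to 2 minutes.
env = client_env(os.environ, 120)
print(env["KERNEL_LAUNCH_TIMEOUT"])
```

The resulting dictionary can be passed as the `env` argument of `subprocess.Popen` when starting the notebook client, or the same variable can simply be exported in the shell beforehand.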

On the bright side, it looks like the application got submitted. This indicates the spark-submit portion of the invocation did its job, but the kernel launcher couldn't communicate back to EG because a response address wasn't established due to the process proxy misconfiguration. In addition, you might be encountering resource issues on your YARN cluster since I see 4 or 5 port conflicts for the Spark context, implying that many running applications. Of course, this will be a function of your cluster, but, if it's not large, you might need to clear some applications related to getting the kernel configuration to work. Once things get working, this shouldn't be too much of an issue.

Next steps, if you want to use YARN client mode across nodes:

  1. restore the kernel.json to include the DistributedProcessProxy stanza
  2. set/increase the KERNEL_LAUNCH_TIMEOUT on the client
  3. ensure that password-less ssh is configured between the EG node and those specified in EG_REMOTE_HOSTS and /usr/local/share/jupyter/kernels/* exists on those nodes.

No restart of EG should be necessary for those steps.
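Step 3 above can be checked with a small script before retrying the kernel. This is only a sketch: it assumes the OpenSSH client is on the PATH of the EG user, and the names `ssh_check_command`, `check_hosts` and `KERNELSPEC_DIR` are hypothetical helpers, not part of EG. `BatchMode=yes` makes ssh fail instead of prompting, so a password prompt shows up as a failed check.

```python
import subprocess

# Kernelspec directory taken from the paths in the log above.
KERNELSPEC_DIR = "/usr/local/share/jupyter/kernels/spark_python_yarn_client"


def ssh_check_command(host, path=KERNELSPEC_DIR):
    """Build an ssh command that exits 0 only if `path` is a directory on `host`."""
    return [
        "ssh",
        "-o", "BatchMode=yes",      # never prompt; fail if key-based auth is missing
        "-o", "ConnectTimeout=5",   # give up quickly on unreachable hosts
        host,
        "test", "-d", path,
    ]


def check_hosts(remote_hosts, path=KERNELSPEC_DIR, run=subprocess.run):
    """Map each host of a comma-separated EG_REMOTE_HOSTS value to True/False."""
    results = {}
    for host in (h.strip() for h in remote_hosts.split(",")):
        if not host:
            continue
        proc = run(ssh_check_command(host, path), capture_output=True)
        results[host] = proc.returncode == 0
    return results
```

Run as the EG user, `check_hosts(os.environ["EG_REMOTE_HOSTS"])` (after `import os`) returns one entry per host; any `False` means either password-less ssh or the kernelspec directory is missing on that node.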

If you don't want the hassle of distributed mode (with ssh and duplication), but still need YARN client mode, you can replace this portion of your kernel.json file:

    "--RemoteProcessProxy.kernel-id",
    "{kernel_id}",

with this:

    "{connection_file}",

In this case, you would either omit the process proxy stanza entirely or change the referenced process proxy to enterprise_gateway.services.processproxies.processproxy.LocalProcessProxy.
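The argv substitution described above can be sketched in Python as follows. The function name `use_connection_file` is hypothetical, and the sketch only rewrites the list in memory; writing the result back to kernel.json is left out.

```python
def use_connection_file(argv):
    """Replace the '--RemoteProcessProxy.kernel-id', '{kernel_id}' pair
    with a single '{connection_file}' entry, leaving other entries as-is."""
    out = []
    i = 0
    while i < len(argv):
        is_pair = (
            argv[i] == "--RemoteProcessProxy.kernel-id"
            and i + 1 < len(argv)
            and argv[i + 1] == "{kernel_id}"
        )
        if is_pair:
            out.append("{connection_file}")
            i += 2  # skip both the flag and its value
        else:
            out.append(argv[i])
            i += 1
    return out


# argv as it appears in the launch command logged above (port range omitted).
argv = [
    "/usr/local/share/jupyter/kernels/spark_python_yarn_client/bin/run.sh",
    "--RemoteProcessProxy.kernel-id",
    "{kernel_id}",
    "--RemoteProcessProxy.response-address",
    "{response_address}",
]
print(use_connection_file(argv))
```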

Recommended:

  1. move directly to YARN cluster mode where YARN determines the node.

Please check back after you've had a chance to try these options out. Thank you.

glorysdj commented 5 years ago

Hi, thanks for your quick reply. This is the kernel.json. I also suspect a password-less ssh issue. I understand the difference between YARN client and YARN cluster mode, and I am trying both. Thanks for the help.

    {
      "language": "python",
      "display_name": "Spark - Python (YARN Client Mode)",
      "metadata": {
        "process_proxy": {
          "class_name": "enterprise_gateway.services.processproxies.distributed.DistributedProcessProxy"
        }
      },
      "env": {
        "SPARK_HOME": "/opt/work/spark-2.4.0-bin-hadoop2.7",
        "EG_REMOTE_HOSTS": "172.168.2.165,172.168.2.166,172.168.2.186,172.168.2.187",
        "PYSPARK_PYTHON": "/opt/work/conda/envs/jeg/bin/python",
        "EG_KERNEL_LAUNCH_TIMEOUT": "600",
        "EG_SOCKET_TIMEOUT": "600",
        "PYTHONPATH": "/opt/work/conda/envs/jeg/lib/python3.6/site-packages:/opt/work/spark-2.4.0-bin-hadoop2.7/python:/opt/work/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip",
        "SPARK_OPTS": "--master yarn --deploy-mode client --name ${KERNEL_ID:-ERRORNOKERNEL_ID}",
        "LAUNCH_OPTS": ""
      },
      "argv": [
        "/usr/local/share/jupyter/kernels/spark_python_yarn_client/bin/run.sh",
        "--RemoteProcessProxy.kernel-id",
        "{kernel_id}",
        "--RemoteProcessProxy.response-address",
        "{response_address}",
        "--RemoteProcessProxy.port-range",
        "{port_range}",
        "--RemoteProcessProxy.spark-context-initialization-mode",
        "lazy"
      ]
    }

kevin-bates commented 5 years ago

OK - thanks. That kernelspec looks good (although the indentation was probably lost in transfer). However, the log posted previously shows this:

[D 2019-03-13 10:22:50.967 EnterpriseGatewayApp] Instantiating kernel 'Spark - Python (YARN Client Mode)' with process proxy: enterprise_gateway.services.processproxies.processproxy.LocalProcessProxy

I'm now wondering what version of EG you're running. I believe we "moved" the process proxy stanza into the metadata: stanza following the 1.x release. So might you be trying to use EG 1.x with EG 2.x kernelspec definitions? This would explain why your process proxy class name is not getting recognized.
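To make the suspected mismatch concrete, here is a simplified model of the lookup, not EG's actual code. It assumes that a 1.x server only looks for a top-level process_proxy stanza, that a 2.x server only looks under metadata, and that a missing stanza falls back to LocalProcessProxy, which matches the log line quoted above. The function name `process_proxy_class` is hypothetical, and a real 2.x release may handle older kernelspecs differently.

```python
LOCAL_PROXY = "enterprise_gateway.services.processproxies.processproxy.LocalProcessProxy"


def process_proxy_class(kernelspec, eg_major):
    """Return the process proxy class a server of the given major version
    would pick up, under the simplified assumptions stated above."""
    if eg_major >= 2:
        stanza = kernelspec.get("metadata", {}).get("process_proxy", {})
    else:
        stanza = kernelspec.get("process_proxy", {})
    return stanza.get("class_name", LOCAL_PROXY)


# A 2.x-style kernelspec, reduced to the relevant stanza.
spec_2x = {
    "metadata": {
        "process_proxy": {
            "class_name": "enterprise_gateway.services.processproxies.distributed.DistributedProcessProxy"
        }
    }
}

print(process_proxy_class(spec_2x, eg_major=1))  # stanza not found at top level
print(process_proxy_class(spec_2x, eg_major=2))  # stanza found under metadata
```

Under this model a 1.x server silently ignores the metadata stanza and launches the kernel locally, which is consistent with the behaviour seen in the log.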

glorysdj commented 5 years ago

Yes. Wrong EG version. I will re-install 2.0 EG. Thanks.

    jupyter_client              5.2.4  py_3  conda-forge
    jupyter_core                4.4.0  py_0  conda-forge
    jupyter_enterprise_gateway  1.1.1  py_0  conda-forge
    jupyter_kernel_gateway      2.2.0  py_0  conda-forge

@kevin-bates but how do I install EG 2.x?

    conda search jupyter_enterprise_gateway
    Loading channels: done
    Name                        Version  Build   Channel
    jupyter_enterprise_gateway  0.7.0    py_0    conda-forge
    jupyter_enterprise_gateway  0.8.0    py_0    conda-forge
    jupyter_enterprise_gateway  0.9.0    py27_0  conda-forge
    jupyter_enterprise_gateway  0.9.1    py27_0  conda-forge
    jupyter_enterprise_gateway  0.9.2    py27_0  conda-forge
    jupyter_enterprise_gateway  0.9.3    py27_0  conda-forge
    jupyter_enterprise_gateway  0.9.3    py36_0  conda-forge
    jupyter_enterprise_gateway  0.9.4    py27_0  conda-forge
    jupyter_enterprise_gateway  0.9.4    py35_0  conda-forge
    jupyter_enterprise_gateway  0.9.4    py36_0  conda-forge
    jupyter_enterprise_gateway  1.0.0    py27_0  conda-forge
    jupyter_enterprise_gateway  1.0.0    py27_1  conda-forge
    jupyter_enterprise_gateway  1.0.0    py35_0  conda-forge
    jupyter_enterprise_gateway  1.0.0    py35_1  conda-forge
    jupyter_enterprise_gateway  1.0.0    py36_0  conda-forge
    jupyter_enterprise_gateway  1.0.0    py36_1  conda-forge
    jupyter_enterprise_gateway  1.0.0    py_1    conda-forge
    jupyter_enterprise_gateway  1.0.1    py_0    conda-forge
    jupyter_enterprise_gateway  1.1.0    py_0    conda-forge
    jupyter_enterprise_gateway  1.1.1    py_0    conda-forge

kevin-bates commented 5 years ago

Good news - yeah, kernelspecs and EG should be in-sync.

You have a couple options:

  1. Pull the whl and kernelspecs tar file from our releases page for 2.0.0-beta1 (we hope to have 2.0.0 GA shortly): https://github.com/jupyter/enterprise_gateway/releases/tag/v2.0.0-beta.1
  2. Keep 1.1.1 and pull kernelspecs tar from that release page: https://github.com/jupyter/enterprise_gateway/releases/tag/v1.1.1

glorysdj commented 5 years ago

Hi @kevin-bates, I have configured a YARN cluster kernel with version 1.1.1 as shown below, but it failed with the log that follows. I really appreciate your replies. Thanks.

kernel

    {
      "language": "python",
      "display_name": "Spark - Python (YARN Cluster Mode)",
      "process_proxy": {
        "class_name": "enterprise_gateway.services.processproxies.yarn.YarnClusterProcessProxy"
      },
      "env": {
        "SPARK_HOME": "/opt/work/spark-2.4.0-bin-hadoop2.7",
        "PYSPARK_PYTHON": "/opt/work/conda/envs/jeg/bin/python",
        "PYTHONPATH": "/opt/work/conda/envs/jeg/lib/python3.6/site-packages:/opt/work/spark-2.4.0-bin-hadoop2.7/python:/opt/work/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip",
        "SPARK_OPTS": "--master yarn --deploy-mode cluster --name ${KERNEL_ID:-ERRORNOKERNEL_ID} --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.appMasterEnv.PYTHONUSERBASE=/home/${KERNEL_USERNAME}/.local --conf spark.yarn.appMasterEnv.PYTHONPATH=/opt/work/conda/envs/jeg/lib/python3.6/site-packages:/opt/work/spark-2.4.0-bin-hadoop2.7/python:/opt/work/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip --conf spark.yarn.appMasterEnv.PATH=/opt/work/conda/bin:$PATH",
        "LAUNCH_OPTS": ""
      },
      "argv": [
        "/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/bin/run.sh",
        "{connection_file}",
        "--RemoteProcessProxy.response-address",
        "{response_address}",
        "--RemoteProcessProxy.port-range",
        "{port_range}",
        "--RemoteProcessProxy.spark-context-initialization-mode",
        "lazy"
      ]
    }

eg log

exec /opt/work/spark-2.4.0-bin-hadoop2.7/bin/spark-submit --master yarn --deploy-mode cluster --name c10a700f-3f29-4657-bac1-ff46798e9ae3 --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.appMasterEnv.PYTHONUSERBASE=/home/arda/.local --conf spark.yarn.appMasterEnv.PYTHONPATH=/opt/work/conda/envs/jeg/lib/python3.6/site-packages:/opt/work/spark-2.4.0-bin-hadoop2.7/python:/opt/work/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip --conf spark.yarn.appMasterEnv.PATH=/opt/work/conda/bin:/opt/work/conda/envs/jeg/bin:/opt/jdk8/bin:/bin:/opt/work/hadoop-2.7.2/bin:/bin:/opt/work/conda/envs/jeg/bin:/opt/work/conda/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games /usr/local/share/jupyter/kernels/spark_python_yarn_cluster/scripts/launch_ipykernel.py /run/user/1000/jupyter/kernel-c10a700f-3f29-4657-bac1-ff46798e9ae3.json --RemoteProcessProxy.response-address 172.168.2.1:45198 --RemoteProcessProxy.port-range 0..0 --RemoteProcessProxy.spark-context-initialization-mode lazy /opt/work/spark-2.4.0-bin-hadoop2.7 http://172.168.2.165:8188/cluster /opt/work/hadoop-2.7.2 /opt/work/hadoop-2.7.2/etc/hadoop /opt/work/hadoop-2.7.2/etc/hadoop [W 2019-03-13 13:23:06.396 EnterpriseGatewayApp] YARN end-point: 'http://172.168.2.165:8188/cluster' refused the connection. Is the resource manager running? [D 2019-03-13 13:23:06.396 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: 'c10a700f-3f29-4657-bac1-ff46798e9ae3' - retrying... [W 2019-03-13 13:23:06.897 EnterpriseGatewayApp] Query for kernel ID 'c10a700f-3f29-4657-bac1-ff46798e9ae3' failed with exception: <class 'http.client.CannotSendRequest'> - 'Request-sent'. Continuing... [D 2019-03-13 13:23:06.897 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: 'c10a700f-3f29-4657-bac1-ff46798e9ae3' - retrying... 2019-03-13 13:23:07 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable [W 2019-03-13 13:23:07.398 EnterpriseGatewayApp] Query for kernel ID 'c10a700f-3f29-4657-bac1-ff46798e9ae3' failed with exception: <class 'http.client.CannotSendRequest'> - 'Request-sent'. Continuing... [D 2019-03-13 13:23:07.398 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: 'c10a700f-3f29-4657-bac1-ff46798e9ae3' - retrying... [W 2019-03-13 13:23:07.899 EnterpriseGatewayApp] Query for kernel ID 'c10a700f-3f29-4657-bac1-ff46798e9ae3' failed with exception: <class 'http.client.CannotSendRequest'> - 'Request-sent'. Continuing... [D 2019-03-13 13:23:07.900 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: 'c10a700f-3f29-4657-bac1-ff46798e9ae3' - retrying... 2019-03-13 13:23:08 INFO RMProxy:98 - Connecting to ResourceManager at Gondolin-Node-065/172.168.2.165:8032 2019-03-13 13:23:08 INFO Client:54 - Requesting a new application from cluster with 4 NodeManagers [W 2019-03-13 13:23:08.401 EnterpriseGatewayApp] Query for kernel ID 'c10a700f-3f29-4657-bac1-ff46798e9ae3' failed with exception: <class 'http.client.CannotSendRequest'> - 'Request-sent'. Continuing... [D 2019-03-13 13:23:08.401 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: 'c10a700f-3f29-4657-bac1-ff46798e9ae3' - retrying...

and YARN application error: Application application_1551765709575_0115 failed 2 times due to AM Container for appattempt_1551765709575_0115_000002 exited with exitCode: 13


and YARN application error: Application application_1551765709575_0115 failed 2 times due to AM Container for appattempt_1551765709575_0115_000002 exited with exitCode: 13

I've changed to export EG_YARN_ENDPOINT=http://172.168.2.165:8188/ws/v1/cluster but still no luck.

[W 2019-03-13 13:35:24.285 EnterpriseGatewayApp] YARN end-point: 'http://172.168.2.165:8188/ws/v1/cluster' refused the connection. Is the resource manager running?

2019-03-13 13:35:34 INFO Client:54 - Submitting application application_1551765705_0116 to ResourceManager 2019-03-13 13:35:34 INFO YarnClientImpl:273 - Submitted application application_1765709575_0116 2019-03-13 13:35:34 INFO Client:54 - Application report for application_1551765705_0116 (state: ACCEPTED) 2019-03-13 13:35:34 INFO Client:54 - client token: N/A diagnostics: N/A ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: default start time: 1552455465367 final status: UNDEFINED tracking URL: http://Gondolin-Node-065:8188/proxy/application_155176570950116/ user: root 2019-03-13 13:35:34 INFO ShutdownHookManager:54 - Shutdown hook called 2019-03-13 13:35:34 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-93212-0d6f-4d71-8994-ecf1d70109a9 2019-03-13 13:35:34 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-f88a7-54a5-4c89-b87f-fce2c16d789a

glorysdj commented 5 years ago

When I run wget http://172.168.2.165:8188/ws/v1/cluster, I get the response below, so I think the YARN RM is working well.

`

1551765709575 1551765709575 STARTED ACTIVE org.apache.hadoop.yarn.server.resourcemanager.recovery.NullRMStateStore 2.7.2 2.7.2 from b165c4fe8a74265c792ce23f546c64604acf0e41 by jenkins source checksum c63f7cc71b8f63249e35126f0f7492d 2016-01-26T00:16Z 2.7.2 2.7.2 from b165c4fe8a74265c792ce23f546c64604acf0e41 by jenkins source checksum d0fda26633fa762bff87ec759ebe689c 2016-01-26T00:08Z ResourceManager HA is not enabled.

`
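
A quick way to repeat this check from the EG host is to request the ResourceManager's cluster-info resource as JSON and read the reported state. The following is a minimal sketch using only the Python 3 standard library; the function names (`cluster_info_url`, `rm_state`, `fetch_rm_state`) are hypothetical helpers, not part of EG or yarn-api-client.

```python
import json
import urllib.request


def cluster_info_url(host, port):
    """Build the ResourceManager cluster-info URL for a host and port."""
    return "http://{}:{}/ws/v1/cluster/info".format(host, port)


def rm_state(payload):
    """Extract (state, haState) from a cluster-info JSON document."""
    info = json.loads(payload)["clusterInfo"]
    return info.get("state"), info.get("haState")


def fetch_rm_state(host, port, timeout=5):
    """Query the endpoint; return None if the connection fails."""
    request = urllib.request.Request(
        cluster_info_url(host, port), headers={"Accept": "application/json"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return rm_state(response.read().decode("utf-8"))
    except OSError:
        # Covers refused connections, DNS failures and timeouts.
        return None
```

Running `fetch_rm_state(host, port)` on the same machine and under the same user as EG helps separate "the RM is down" from "EG is talking to a different address or port than wget did".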

kevin-bates commented 5 years ago

It's getting late here. Got a couple observations - and will get back to this tomorrow.

  1. /opt/work/conda/envs/jeg/lib/python3.6 Although there are no obvious signs of conflict, this looks like you're using python 3 for EG, but HDP only supports python 2. When I've tried that, I see issues that aren't present here, so I don't think this is an issue right now - just heads up.
  2. Probably need to figure out why this is getting produced: failed with exception: <class 'http.client.CannotSendRequest'> - 'Request-sent'
  3. Can you please produce a pip list to show the versions in use? In particular: yarn-api-client
  4. What is in the stdout and stderr logs for the YARN application application_1551765705_0116?
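
For item 4, the container logs can usually be pulled with the `yarn logs -applicationId <id>` command, assuming log aggregation is enabled on the cluster (otherwise the logs stay on the NodeManager host that ran the container). Below is a small sketch that wraps the command and filters the output down to error lines; the helper names are hypothetical.

```python
import re
import subprocess


def yarn_logs_command(application_id):
    """Return the argv for fetching the aggregated logs of one application."""
    return ["yarn", "logs", "-applicationId", application_id]


def fetch_yarn_logs(application_id):
    """Run the yarn CLI and return its combined stdout/stderr as text."""
    result = subprocess.run(
        yarn_logs_command(application_id),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout


def error_lines(log_text):
    """Keep only lines that look like errors or exceptions."""
    pattern = re.compile(r"\bERROR\b|Exception")
    return [line for line in log_text.splitlines() if pattern.search(line)]
```

Something like `error_lines(fetch_yarn_logs("<application id>"))` then gives a short list to paste into the issue instead of the full log.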

Btw, these messages:

2019-03-13 13:35:34 INFO ShutdownHookManager:54 - Shutdown hook called
2019-03-13 13:35:34 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-93212-0d6f-4d71-8994-ecf1d70109a9
2019-03-13 13:35:34 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-f88a7-54a5-4c89-b87f-fce2c16d789a

are perfectly normal. They essentially occur once the application is accepted/running (I can't recall) but are a function of the spark.yarn.submit.waitAppCompletion=false setting.

glorysdj commented 5 years ago

thanks Kevin

  1. pip list

asn1crypto 0.24.0
attrs 19.1.0
backcall 0.1.0
bcrypt 3.1.4
bleach 3.1.0
certifi 2019.3.9
cffi 1.12.2
chardet 3.0.4
cryptography 2.5
decorator 4.3.2
defusedxml 0.5.0
entrypoints 0.3
idna 2.8
ipykernel 5.1.0
ipython 7.3.0
ipython-genutils 0.2.0
jedi 0.13.3
Jinja2 2.10
jsonschema 3.0.1
jupyter-client 5.2.4
jupyter-core 4.4.0
jupyter-enterprise-gateway 1.1.1
jupyter-kernel-gateway 2.2.0
MarkupSafe 1.1.1
mistune 0.8.4
nb2kg 0.6.0
nbconvert 5.4.1
nbformat 4.4.0
notebook 5.7.5
pandocfilters 1.4.2
paramiko 2.4.2
parso 0.3.4
pexpect 4.6.0
pickleshare 0.7.5
pip 19.0.3
prometheus-client 0.6.0
prompt-toolkit 2.0.9
ptyprocess 0.6.0
pyasn1 0.4.4
pycparser 2.19
pycrypto 2.6.1
Pygments 2.3.1
PyNaCl 1.3.0
pyOpenSSL 19.0.0
pyrsistent 0.14.11
PySocks 1.6.8
python-dateutil 2.8.0
pyzmq 18.0.1
requests 2.21.0
Send2Trash 1.5.0
setuptools 40.8.0
six 1.12.0
terminado 0.8.1
testpath 0.4.2
tornado 6.0.1
traitlets 4.3.2
urllib3 1.24.1
wcwidth 0.1.7
webencodings 0.5.1
wheel 0.33.1
yarn-api-client 0.2.3

  2. stdout and stderr logs for the YARN application

Application application_1551765709575_0117 failed 2 times due to AM Container for appattempt_1551765709575_0117_000002 exited with exitCode: 13
For more detailed output, check application tracking page: http://Gondolin-Node-065:8188/cluster/app/application_1551765709575_0117 Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1551765709575_0117_02_000001
Exit code: 13
Stack trace: ExitCodeException exitCode=13:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 13
Failing this attempt. Failing the application.

Is this exit code 13 related to https://stackoverflow.com/questions/36535411/spark-runs-on-yarn-cluster-exitcode-13/36605869, i.e. is something wrong with /opt/work/spark-2.4.0-bin-hadoop2.7/bin/spark-submit --master yarn --deploy-mode cluster?

I just checked that the Spark example below runs successfully:
/opt/work/spark-2.4.0-bin-hadoop2.7/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster /opt/work/spark-2.4.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.4.0.jar 10

But the spark-submit command below, copied from the EG log, fails with exit code 13 on YARN:
/opt/work/spark-2.4.0-bin-hadoop2.7/bin/spark-submit --master yarn --deploy-mode cluster --name 01e9fe3e-2e31-4b3b-8b60-09ba026dae38 --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.appMasterEnv.PYTHONUSERBASE=/home/arda/.local --conf spark.yarn.appMasterEnv.PYTHONPATH=/opt/work/conda/envs/jeg/lib/python3.6/site-packages:/opt/work/spark-2.4.0-bin-hadoop2.7/python:/opt/work/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip --conf spark.yarn.appMasterEnv.PATH=/opt/work/conda/bin:/opt/work/conda/envs/jeg/bin:/opt/jdk8/bin:/bin:/opt/work/hadoop-2.7.2/bin:/bin:/opt/work/conda/envs/jeg/bin:/opt/work/conda/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games /usr/local/share/jupyter/kernels/spark_python_yarn_cluster/scripts/launch_ipykernel.py /run/user/1000/jupyter/kernel-01e9fe3e-2e31-4b3b-8b60-09ba026dae38.json --RemoteProcessProxy.response-address 172.168.2.1:42067 --RemoteProcessProxy.port-range 0..0 --RemoteProcessProxy.spark-context-initialization-mode lazy

kevin-bates commented 5 years ago

@glorysdj - thank you for running the spark examples and googling - I was going to recommend those steps. I'm assuming the configuration relative to a local[*] master has been ruled out as well.

So there are two issues here. First, we're not able to query for the application Id (exception: <class 'http.client.CannotSendRequest'> - 'Request-sent'). Second, the exit code 13 issue.

For the query issue, please enable this env and restart EG: EG_YARN_LOG_LEVEL=DEBUG. This should produce more logging relative to the yarn-api-client. I would also switch to using the host name in your EG_YARN_ENDPOINT, since 172.168... is a private IP. I say this because I see the YARN logs using node names in their URIs - so perhaps there's some NIC issue here.
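
To make the query issue concrete: the kernelspec submits the Spark application with `--name ${KERNEL_ID}`, so the application can be located in the ResourceManager's `/ws/v1/cluster/apps` response by matching its name against the kernel ID. The sketch below shows that lookup on an already-fetched JSON payload. It is an illustration under that assumption, not EG's actual implementation, and `find_application` is a hypothetical name.

```python
import json


def find_application(apps_payload, kernel_id):
    """Return (id, state) of the application named after the kernel, or None."""
    document = json.loads(apps_payload)
    # Tolerate a missing or null "apps" member (treated here as "no matches").
    apps = (document.get("apps") or {}).get("app", [])
    for app in apps:
        if app.get("name") == kernel_id:
            return app["id"], app.get("state")
    return None
```

If this lookup keeps returning None while spark-submit reports a submitted application, the request is probably not reaching the ResourceManager that accepted the job.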

For the exit code issue... I think we need to eliminate anything python 3 related from the spark settings in kernel.json and bin/run.sh. As a result, I would remove any paths prefixed with /opt/work/conda/envs/jeg/ since that appears to be a python 3 env.
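
As a small illustration of that cleanup, the helper below drops every entry under a given prefix from a colon-separated path value such as PYTHONPATH. It is only a sketch with a hypothetical function name; the trailing slash in the prefix matters so that sibling environments with a longer name are not removed by accident.

```python
def strip_prefix_entries(path_value, prefix, separator=":"):
    """Drop path entries that start with `prefix`; keep the rest in order."""
    kept = [
        entry
        for entry in path_value.split(separator)
        if entry and not entry.startswith(prefix)
    ]
    return separator.join(kept)


# PYTHONPATH value taken from the kernel.json posted earlier in this thread.
pythonpath = (
    "/opt/work/conda/envs/jeg/lib/python3.6/site-packages:"
    "/opt/work/spark-2.4.0-bin-hadoop2.7/python:"
    "/opt/work/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip"
)
cleaned = strip_prefix_entries(pythonpath, "/opt/work/conda/envs/jeg/")
```

The same function can be applied to the PATH value that ends up in `spark.yarn.appMasterEnv.PATH`.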

If that doesn't solve the issue, then you can try to carefully build the spark sample back up to match that of the kernelspec settings (sans /opt/work/conda/envs/jeg). You would need to switch the sample to use pyspark.

Btw, you switched to cluster mode. I'm assuming you didn't get client mode working - correct? If so, let's stick with cluster and move back to client once it is resolved (unless you're happy with just cluster).

Thank you for your patience. Integrating EG with YARN can be challenging.

glorysdj commented 5 years ago

@kevin-bates thanks for the help

first, i tried with EG_YARN_LOG_LEVEL=DEBUG and EG_YARN_ENDPOINT=http://Gondolin-Node-065:8188/ws/v1/cluster

got below error:

[I 190314 08:38:35 base:32] Request http://gondolin-node-065:8088/ws/v1/cluster/apps?startedTimeBegin=1552523915000 [W 2019-03-14 08:38:35.974 EnterpriseGatewayApp] YARN end-point: 'http://Gondolin-Node-065:8188/ws/v1/cluster' refused the connection. Is the resource manager running? [D 2019-03-14 08:38:35.975 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: 'e65c15da-4684-4563-a24a-03a37458c4a4' - retrying... Exception in thread "main" org.apache.spark.SparkException: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment. at org.apache.spark.deploy.SparkSubmitArguments.error(SparkSubmitArguments.scala:657) at org.apache.spark.deploy.SparkSubmitArguments.validateSubmitArguments(SparkSubmitArguments.scala:290) at org.apache.spark.deploy.SparkSubmitArguments.validateArguments(SparkSubmitArguments.scala:251) at org.apache.spark.deploy.SparkSubmitArguments.(SparkSubmitArguments.scala:120) at org.apache.spark.deploy.SparkSubmit$$anon$2$$anon$1.(SparkSubmit.scala:911) at org.apache.spark.deploy.SparkSubmit$$anon$2.parseArguments(SparkSubmit.scala:911) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:81) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) [I 190314 08:38:36 base:32] Request http://gondolin-node-065:8088/ws/v1/cluster/apps?startedTimeBegin=1552523915000

Please note that the request is sent to port 8088, not to port 8188, which is what we configured for YARN: http://gondolin-node-065:8088/ws/v1/cluster/apps?startedTimeBegin=1552523915000
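
One possible explanation, not confirmed in this thread, is that the client code keeps only the host name from EG_YARN_ENDPOINT and falls back to a default port. The lower-cased host in the logged request is at least consistent with `urllib.parse.urlparse(...).hostname`, which lower-cases the host. The sketch below illustrates the effect; `request_base` and its `honor_port` flag are hypothetical and only model the two behaviours.

```python
from urllib.parse import urlparse

# Hadoop's default ResourceManager web port.
DEFAULT_RM_PORT = 8088


def endpoint_parts(endpoint):
    """Split an endpoint URL into (lower-cased host name, port or None)."""
    parsed = urlparse(endpoint)
    return parsed.hostname, parsed.port


def request_base(endpoint, honor_port):
    """Model a client that either keeps or ignores the configured port."""
    host, port = endpoint_parts(endpoint)
    if not honor_port or port is None:
        port = DEFAULT_RM_PORT
    return "http://{}:{}/ws/v1/cluster".format(host, port)
```

With `honor_port=False`, the endpoint `http://Gondolin-Node-065:8188/ws/v1/cluster` turns into a request base on port 8088, which matches the shape of the logged request.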

I do have HADOOP_CONF_DIR and YARN_CONF_DIR set, but I am not sure where these environment variables should be set: in a specific conf file, or just in bash with export ...?
(jeg) root@Gondolin-Gateway:/opt/work# echo $HADOOP_CONF_DIR
/opt/work/hadoop-2.7.2/etc/hadoop
(jeg) root@Gondolin-Gateway:/opt/work# echo $YARN_CONF_DIR
/opt/work/hadoop-2.7.2/etc/hadoop
(jeg) root@Gondolin-Gateway:/opt/work# echo $EG_YARN_ENDPOINT
http://Gondolin-Node-065:8188/ws/v1/cluster
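
A variable that shows up with echo in an interactive shell is only inherited by processes started from that shell after the export, so what matters is the environment of the process that actually runs spark-submit. The check below is a minimal sketch with a hypothetical function name; running it against `os.environ` from inside the launching process (or against the `env` dictionary that EG logs at launch) shows whether either variable is visible there.

```python
import os


def yarn_conf_problems(env):
    """Return a list of problems; an empty list means the settings look usable."""
    names = [
        name for name in ("HADOOP_CONF_DIR", "YARN_CONF_DIR") if env.get(name)
    ]
    if not names:
        return ["neither HADOOP_CONF_DIR nor YARN_CONF_DIR is set"]
    problems = []
    for name in names:
        if not os.path.isdir(env[name]):
            problems.append(
                "{} points to a missing directory: {}".format(name, env[name])
            )
    return problems
```

For example, `yarn_conf_problems(dict(os.environ))` returns an empty list when at least one variable is set and points to an existing directory.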

glorysdj commented 5 years ago

Hi @kevin-bates ,

As you mentioned, we didn't get YARN client mode working. We do prefer YARN cluster mode, which runs the Spark driver in the YARN cluster instead of on the EG server. The deployment I have planned uses JupyterHub so that users can run notebooks on different remote machines, with EG running the Spark jobs in an existing YARN cluster. I know there is a K8s solution, but for now we want to enable this on a YARN cluster. As a first step, we are trying to get EG working with YARN. We are also investigating JupyterHub and different spawners, but the remote spawner and yarnspawner did not work in our first few tests. First things first: let's get EG working with YARN. Thanks for your help; any suggestions will be appreciated.

glorysdj commented 5 years ago

I set the environment variables for spark-submit; now the error is as below. The request still goes to the wrong port, 8088.
[D 2019-03-14 09:15:56.017 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '4220f398-9c8a-4afd-b989-624f32b47292' - retrying...
[I 190314 09:15:56 base:32] Request http://gondolin-node-065:8088/ws/v1/cluster/apps?startedTimeBegin=1552526134000
[W 2019-03-14 09:15:56.518 EnterpriseGatewayApp] Query for kernel ID '4220f398-9c8a-4afd-b989-624f32b47292' failed with exception: <class 'http.client.CannotSendRequest'> - 'Request-sent'. Continuing...

And the exit code 13 error:
Application application_1551765709575_0124 failed 2 times due to AM Container for appattempt_1551765709575_0124_000002 exited with exitCode: 13
For more detailed output, check application tracking page: http://Gondolin-Node-065:8188/cluster/app/application_1551765709575_0124 Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1551765709575_0124_02_000001
Exit code: 13
Stack trace: ExitCodeException exitCode=13:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 13
Failing this attempt. Failing the application.

glorysdj commented 5 years ago

If I change the port for YARN to 8088, the request error is gone. Only the exit code 13 error is still there. I will try a py27 env.

[I 190314 09:49:07 base:32] Request http://gondolin-node-065:8088/ws/v1/cluster/apps?startedTimeBegin=1552528140000 [D 2019-03-14 09:49:07.987 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '878db4c1-410d-4ced-a087-a2e2bd3872f3' - retrying... 2019-03-14 09:49:08 INFO Client:54 - Uploading resource file:/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/scripts/launch_ipykernel.py -> hdfs://Gondolin-Node-065:9000/user/root/.sparkStaging/application_1552527763546_0002/launch_ipykernel.py 2019-03-14 09:49:08 INFO Client:54 - Uploading resource file:/opt/work/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip -> hdfs://Gondolin-Node-065:9000/user/root/.sparkStaging/application_1552527763546_0002/pyspark.zip 2019-03-14 09:49:08 INFO Client:54 - Uploading resource file:/opt/work/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip -> hdfs://Gondolin-Node-065:9000/user/root/.sparkStaging/application_1552527763546_0002/py4j-0.10.7-src.zip [I 190314 09:49:08 base:32] Request http://gondolin-node-065:8088/ws/v1/cluster/apps?startedTimeBegin=1552528140000 [D 2019-03-14 09:49:08.493 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '878db4c1-410d-4ced-a087-a2e2bd3872f3' - retrying... 
2019-03-14 09:49:08 INFO Client:54 - Uploading resource file:/tmp/spark-494c1edc-cd65-4cce-9c13-ce1cc0dce579/spark_conf1254890242240210983.zip -> hdfs://Gondolin-Node-065:9000/user/root/.sparkStaging/application_1552527763546_0002/spark_conf.zip 2019-03-14 09:49:08 INFO SecurityManager:54 - Changing view acls to: root 2019-03-14 09:49:08 INFO SecurityManager:54 - Changing modify acls to: root 2019-03-14 09:49:08 INFO SecurityManager:54 - Changing view acls groups to: 2019-03-14 09:49:08 INFO SecurityManager:54 - Changing modify acls groups to: 2019-03-14 09:49:08 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set() [I 190314 09:49:08 base:32] Request http://gondolin-node-065:8088/ws/v1/cluster/apps?startedTimeBegin=1552528140000 [D 2019-03-14 09:49:08.997 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '878db4c1-410d-4ced-a087-a2e2bd3872f3' - retrying... [I 190314 09:49:09 base:32] Request http://gondolin-node-065:8088/ws/v1/cluster/apps?startedTimeBegin=1552528140000 [D 2019-03-14 09:49:09.500 EnterpriseGatewayApp] ApplicationID not yet assigned for KernelID: '878db4c1-410d-4ced-a087-a2e2bd3872f3' - retrying... 
2019-03-14 09:49:09 INFO Client:54 - Submitting application application_1552527763546_0002 to ResourceManager 2019-03-14 09:49:09 INFO YarnClientImpl:273 - Submitted application application_1552527763546_0002 2019-03-14 09:49:09 INFO Client:54 - Application report for application_1552527763546_0002 (state: ACCEPTED) 2019-03-14 09:49:09 INFO Client:54 - client token: N/A diagnostics: N/A ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: default start time: 1552528281095 final status: UNDEFINED tracking URL: http://Gondolin-Node-065:8088/proxy/application_1552527763546_0002/ user: root 2019-03-14 09:49:09 INFO ShutdownHookManager:54 - Shutdown hook called 2019-03-14 09:49:09 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-494c1edc-cd65-4cce-9c13-ce1cc0dce579 2019-03-14 09:49:09 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-9176fce4-ad61-4435-bc06-2a6406f07d53 [I 190314 09:49:10 base:32] Request http://gondolin-node-065:8088/ws/v1/cluster/apps?startedTimeBegin=1552528140000 [I 2019-03-14 09:49:10.008 EnterpriseGatewayApp] ApplicationID: 'application_1552527763546_0002' assigned for KernelID: '878db4c1-410d-4ced-a087-a2e2bd3872f3', state: ACCEPTED, 10.0 seconds after starting. [I 190314 09:49:10 base:32] Request http://gondolin-node-065:8088/ws/v1/cluster/apps/application_1552527763546_0002 [D 2019-03-14 09:49:10.013 EnterpriseGatewayApp] 19: State: 'ACCEPTED', Host: 'Gondolin-Node-087', KernelID: '878db4c1-410d-4ced-a087-a2e2bd3872f3', ApplicationID: 'application_1552527763546_0002'

[D 2019-03-14 09:59:10.113 EnterpriseGatewayApp] Waiting for KernelID '878db4c1-410d-4ced-a087-a2e2bd3872f3' to send connection info from host 'Gondolin-Node-087' - retrying... [W 2019-03-14 09:59:10.614 EnterpriseGatewayApp] Query for application 'application_1552527763546_0002' state failed with exception: ''ResourceManager' object has no attribute 'cluster_application_state''. Continuing... [E 190314 09:59:10 web:1788] Uncaught exception POST /api/kernels (10.239.47.211) HTTPServerRequest(protocol='http', host='10.239.47.211:6666', method='POST', uri='/api/kernels', version='HTTP/1.1', remote_ip='10.239.47.211') Traceback (most recent call last): File "/opt/work/conda/envs/jeg/lib/python3.6/site-packages/tornado/web.py", line 1699, in _execute result = await result File "/opt/work/conda/envs/jeg/lib/python3.6/site-packages/tornado/gen.py", line 736, in run yielded = self.gen.throw(exc_info) # type: ignore File "/opt/work/conda/envs/jeg/lib/python3.6/site-packages/kernel_gateway/services/kernels/handlers.py", line 71, in post yield super(MainKernelHandler, self).post() File "/opt/work/conda/envs/jeg/lib/python3.6/site-packages/tornado/gen.py", line 729, in run value = future.result() File "/opt/work/conda/envs/jeg/lib/python3.6/site-packages/tornado/gen.py", line 736, in run yielded = self.gen.throw(exc_info) # type: ignore File "/opt/work/conda/envs/jeg/lib/python3.6/site-packages/notebook/services/kernels/handlers.py", line 47, in post kernel_id = yield gen.maybe_future(km.start_kernel(kernel_name=model['name'])) File "/opt/work/conda/envs/jeg/lib/python3.6/site-packages/tornado/gen.py", line 729, in run value = future.result() File "/opt/work/conda/envs/jeg/lib/python3.6/site-packages/tornado/gen.py", line 736, in run yielded = self.gen.throw(exc_info) # type: ignore File "/opt/work/conda/envs/jeg/lib/python3.6/site-packages/enterprise_gateway/services/kernels/remotemanager.py", line 31, in start_kernel kernel_id = yield 
gen.maybe_future(super(RemoteMappingKernelManager, self).start_kernel(args, kwargs)) File "/opt/work/conda/envs/jeg/lib/python3.6/site-packages/tornado/gen.py", line 729, in run value = future.result() File "/opt/work/conda/envs/jeg/lib/python3.6/site-packages/tornado/gen.py", line 736, in run yielded = self.gen.throw(exc_info) # type: ignore File "/opt/work/conda/envs/jeg/lib/python3.6/site-packages/kernel_gateway/services/kernels/manager.py", line 81, in start_kernel kernel_id = yield gen.maybe_future(super(SeedingMappingKernelManager, self).start_kernel(args, kwargs)) File "/opt/work/conda/envs/jeg/lib/python3.6/site-packages/tornado/gen.py", line 729, in run value = future.result() File "/opt/work/conda/envs/jeg/lib/python3.6/site-packages/tornado/gen.py", line 209, in wrapper yielded = next(result) File "/opt/work/conda/envs/jeg/lib/python3.6/site-packages/notebook/services/kernels/kernelmanager.py", line 160, in start_kernel super(MappingKernelManager, self).start_kernel(kwargs) File "/opt/work/conda/envs/jeg/lib/python3.6/site-packages/jupyter_client/multikernelmanager.py", line 110, in start_kernel km.start_kernel(kwargs) File "/opt/work/conda/envs/jeg/lib/python3.6/site-packages/enterprise_gateway/services/kernels/remotemanager.py", line 105, in start_kernel return super(RemoteKernelManager, self).start_kernel(kw) File "/opt/work/conda/envs/jeg/lib/python3.6/site-packages/jupyter_client/manager.py", line 259, in start_kernel kw) File "/opt/work/conda/envs/jeg/lib/python3.6/site-packages/enterprise_gateway/services/kernels/remotemanager.py", line 134, in _launch_kernel return self.process_proxy.launch_process(kernel_cmd, kw) File "/opt/work/conda/envs/jeg/lib/python3.6/site-packages/enterprise_gateway/services/processproxies/yarn.py", line 61, in launch_process self.confirm_remote_startup(kernel_cmd, kw) File "/opt/work/conda/envs/jeg/lib/python3.6/site-packages/enterprise_gateway/services/processproxies/yarn.py", line 156, in confirm_remote_startup 
self.handle_timeout() File "/opt/work/conda/envs/jeg/lib/python3.6/site-packages/enterprise_gateway/services/processproxies/yarn.py", line 201, in handle_timeout if self.query_app_state_by_id(self.application_id) != "RUNNING": File "/opt/work/conda/envs/jeg/lib/python3.6/site-packages/enterprise_gateway/services/processproxies/yarn.py", line 301, in query_app_state_by_id return response.data['state'] AttributeError: 'NoneType' object has no attribute 'data' [E 190314 09:59:10 web:2246] 500 POST /api/kernels (10.239.47.211) 610227.55ms
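
The traceback ends in an AttributeError because `response` is None by the time `response.data['state']` is evaluated, right after the warning that the `ResourceManager` object has no `cluster_application_state` attribute. A defensive version of such a state query could look like the sketch below. It is an illustration only, not EG's code, and it does not establish which yarn-api-client version provides the method.

```python
def query_app_state(resource_mgr, application_id):
    """Return the application's state string, or None if it cannot be read."""
    method = getattr(resource_mgr, "cluster_application_state", None)
    if method is None:
        # The installed client lacks the method, as the warning above reports.
        return None
    response = method(application_id)
    if response is None or getattr(response, "data", None) is None:
        return None
    return response.data.get("state")
```

A caller can then treat None as "state unknown" and report the launch timeout itself instead of failing with an unrelated AttributeError.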

glorysdj commented 5 years ago

changed to python 2

{
  "language": "python",
  "display_name": "Spark - Python (YARN Cluster Mode)",
  "process_proxy": {
    "class_name": "enterprise_gateway.services.processproxies.yarn.YarnClusterProcessProxy"
  },
  "env": {
    "SPARK_HOME": "/opt/work/spark-2.4.0-bin-hadoop2.7",
    "PYSPARK_PYTHON": "/opt/work/conda/envs/jegpy27/bin/python",
    "PYTHONPATH": "/opt/work/conda/envs/jegpy27/lib/python2.7/site-packages:/opt/work/spark-2.4.0-bin-hadoop2.7/python:/opt/work/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip",
    "SPARK_OPTS": "--master yarn --deploy-mode cluster --name ${KERNEL_ID:-ERRORNOKERNEL_ID} --conf spark.yarn.submit.waitAppCompletion=false --conf spark.yarn.appMasterEnv.PYTHONUSERBASE=/home/${KERNEL_USERNAME}/.local --conf spark.yarn.appMasterEnv.PYTHONPATH=/opt/work/conda/envs/jegpy27/lib/python2.7/site-packages:/opt/work/spark-2.4.0-bin-hadoop2.7/python:/opt/work/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip --conf spark.yarn.appMasterEnv.PATH=/opt/work/conda/bin:$PATH",
    "LAUNCH_OPTS": ""
  },
  "argv": [
    "/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/bin/run.sh",
    "{connection_file}",
    "--RemoteProcessProxy.response-address",
    "{response_address}",
    "--RemoteProcessProxy.port-range",
    "{port_range}",
    "--RemoteProcessProxy.spark-context-initialization-mode",
    "lazy"
  ]
}

still same error exitcode 13

Application application_1552527763546_0003 failed 2 times due to AM Container for appattempt_1552527763546_0003_000002 exited with exitCode: 13
For more detailed output, check application tracking page: http://Gondolin-Node-065:8088/cluster/app/application_1552527763546_0003 Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1552527763546_0003_02_000001
Exit code: 13
Stack trace: ExitCodeException exitCode=13:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 13
Failing this attempt. Failing the application.

I found the error below in the log on a YARN slave node. My question: does the PYSPARK_PYTHON interpreter need to be installed on every YARN slave node?
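
In cluster mode the driver runs in a YARN container on one of the cluster nodes, so the PYSPARK_PYTHON path has to resolve on whichever node is picked. One way to check this ahead of time is sketched below, assuming passwordless ssh from the EG host to each node and a path without spaces; the helper names and the idea of passing the node list explicitly are hypothetical.

```python
import subprocess


def remote_check_command(host, interpreter):
    """Return an ssh argv that succeeds only if `interpreter` is executable on `host`."""
    return ["ssh", host, "test", "-x", interpreter]


def hosts_missing_interpreter(hosts, interpreter, run=subprocess.call):
    """Return the hosts where the check command exits with a non-zero status."""
    return [
        host
        for host in hosts
        if run(remote_check_command(host, interpreter)) != 0
    ]
```

Calling `hosts_missing_interpreter(nodes, "/opt/work/conda/envs/jegpy27/bin/python")` with the list of NodeManager hosts returns the nodes that still need the environment installed.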

2019-03-14 10:02:36 INFO SignalUtils:54 - Registered signal handler for TERM 2019-03-14 10:02:36 INFO SignalUtils:54 - Registered signal handler for HUP 2019-03-14 10:02:36 INFO SignalUtils:54 - Registered signal handler for INT 2019-03-14 10:02:37 INFO SecurityManager:54 - Changing view acls to: root 2019-03-14 10:02:37 INFO SecurityManager:54 - Changing modify acls to: root 2019-03-14 10:02:37 INFO SecurityManager:54 - Changing view acls groups to: 2019-03-14 10:02:37 INFO SecurityManager:54 - Changing modify acls groups to: 2019-03-14 10:02:37 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set() 2019-03-14 10:02:37 INFO ApplicationMaster:54 - Preparing Local resources 2019-03-14 10:02:38 INFO ApplicationMaster:54 - ApplicationAttemptId: appattempt_1552527763546_0003_000002 2019-03-14 10:02:38 INFO ApplicationMaster:54 - Starting the user application in a separate Thread 2019-03-14 10:02:38 INFO ApplicationMaster:54 - Waiting for spark context initialization... 
2019-03-14 10:02:38 ERROR ApplicationMaster:91 - User class threw exception: java.io.IOException: Cannot run program "/opt/work/conda/envs/jegpy27/bin/python": error=2, No such file or directory java.io.IOException: Cannot run program "/opt/work/conda/envs/jegpy27/bin/python": error=2, No such file or directory at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048) at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:100) at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:678) Caused by: java.io.IOException: error=2, No such file or directory at java.lang.UNIXProcess.forkAndExec(Native Method) at java.lang.UNIXProcess.(UNIXProcess.java:248) at java.lang.ProcessImpl.start(ProcessImpl.java:134) at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029) ... 
7 more 2019-03-14 10:02:38 INFO ApplicationMaster:54 - Final app status: FAILED, exitCode: 13, (reason: User class threw exception: java.io.IOException: Cannot run program "/opt/work/conda/envs/jegpy27/bin/python": error=2, No such file or directory at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048) at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:100) at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:678) Caused by: java.io.IOException: error=2, No such file or directory at java.lang.UNIXProcess.forkAndExec(Native Method) at java.lang.UNIXProcess.(UNIXProcess.java:248) at java.lang.ProcessImpl.start(ProcessImpl.java:134) at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029) ... 
7 more ) 2019-03-14 10:02:38 ERROR ApplicationMaster:91 - Uncaught exception: org.apache.spark.SparkException: Exception thrown in awaitResult: at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226) at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468) at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305) at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245) at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245) at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:773) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698) at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:772) at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244) at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:797) at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala) Caused by: java.io.IOException: Cannot run program "/opt/work/conda/envs/jegpy27/bin/python": error=2, No such file or directory at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048) at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:100) at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:678) Caused by: java.io.IOException: error=2, No such file or directory at java.lang.UNIXProcess.forkAndExec(Native Method) at java.lang.UNIXProcess.(UNIXProcess.java:248) at java.lang.ProcessImpl.start(ProcessImpl.java:134) at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029) ... 7 more 2019-03-14 10:02:38 INFO ApplicationMaster:54 - Deleting staging directory hdfs://Gondolin-Node-065:9000/user/root/.sparkStaging/application_1552527763546_0003 2019-03-14 10:02:38 INFO ShutdownHookManager:54 - Shutdown hook called

lresende commented 5 years ago

Looks like one of the issues is related to custom yarn port (see #601 / #602)

lresende commented 5 years ago

I found this in the YARN slave log. My question: does PYSPARK_PYTHON need to be set up on each YARN slave node?

Yes, you should have the Anaconda env and any other dependencies available on all nodes.
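A quick way to confirm this on a given node is to check that the interpreter path exists and is executable there. The following is a minimal sketch, not part of EG; the helper name `interpreter_available` is hypothetical, and the script would need to be run on each node manager host by whatever means you normally use.

```python
import os
import sys


def interpreter_available(path):
    """Return True if `path` is an existing, executable file on this node."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


if __name__ == "__main__":
    # Falls back to the running interpreter if PYSPARK_PYTHON is unset.
    target = os.environ.get("PYSPARK_PYTHON", sys.executable)
    status = "ok" if interpreter_available(target) else "MISSING"
    print("{}: {}".format(target, status))
```

A "MISSING" result on a worker would correspond to the `error=2, No such file or directory` failure shown in the ApplicationMaster log above.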

glorysdj commented 5 years ago

@lresende Do you mean that if I want to run EG with a YARN cluster, I need to prepare the Spark distribution and a Python env with Jupyter EG on each YARN node manager?

lresende commented 5 years ago

@lresende Do you mean that if I want to run EG with a YARN cluster, I need to prepare the Spark distribution and a Python env with Jupyter EG on each YARN node manager?

EG performs a spark-submit of a Python script that is started on a YARN worker node, which launches the IPython kernel and executes the Python source code from the notebook. Based on this, you will need the following configuration to run EG + Spark in YARN Cluster mode:

[image]

glorysdj commented 5 years ago

Thanks @lresende . I will try that.

glorysdj commented 5 years ago

I have installed a Python env with Anaconda, IPython, and EG on each YARN node manager. The YARN application no longer exits with code 13 and is now running, but I still get the error below:

[W 2019-03-14 13:24:35.839 EnterpriseGatewayApp] Query for application 'application_1552527763546_0005' state failed with exception: ''ResourceManager' object has no attribute 'cluster_application_state''. Continuing... [E 190314 13:24:35 ioloop:801] Exception in callback <bound method IOLoopKernelRestarter.poll of <jupyter_client.ioloop.restarter.IOLoopKernelRestarter object at 0x7f60842ee990>> Traceback (most recent call last): File "/opt/work/conda/envs/jegpy27/lib/python2.7/site-packages/tornado/ioloop.py", line 1229, in _run return self.callback() File "/opt/work/conda/envs/jegpy27/lib/python2.7/site-packages/jupyter_client/restarter.py", line 93, in poll if not self.kernel_manager.is_alive(): File "/opt/work/conda/envs/jegpy27/lib/python2.7/site-packages/jupyter_client/manager.py", line 453, in is_alive if self.kernel.poll() is None: File "/opt/work/conda/envs/jegpy27/lib/python2.7/site-packages/enterprise_gateway/services/processproxies/yarn.py", line 75, in poll state = self.query_app_state_by_id(self.application_id) File "/opt/work/conda/envs/jegpy27/lib/python2.7/site-packages/enterprise_gateway/services/processproxies/yarn.py", line 301, in query_app_state_by_id return response.data['state'] AttributeError: 'NoneType' object has no attribute 'data'

lresende commented 5 years ago

[W 2019-03-14 13:24:35.839 EnterpriseGatewayApp] Query for application 'application_1552527763546_0005' state failed with exception: ''ResourceManager' object has no attribute 'cluster_application_state''. Continuing...

Ok, in this case, what is the port used by the YARN RM? It should be 8088 until the PR I mentioned above gets merged. Otherwise, it seems that the RM is not returning the expected response.
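To see what the RM actually returns, the application state endpoint that appears in the EG logs in this thread (`/ws/v1/cluster/apps/<application id>/state`) can be queried directly. This is a minimal sketch under assumptions: the function names are hypothetical, the code targets Python 3 (the environment in this thread is Python 2.7), and the `{"state": ...}` response shape is what the Hadoop REST API documents for this endpoint.

```python
import json
from urllib.request import urlopen


def app_state_url(rm_host, app_id, rm_port=8088):
    """Build the RM REST URL for an application's state."""
    return "http://{}:{}/ws/v1/cluster/apps/{}/state".format(rm_host, rm_port, app_id)


def parse_app_state(body):
    """Return the state string, or None if the body is not the expected JSON."""
    try:
        data = json.loads(body)
    except ValueError:
        # Covers non-JSON replies such as an HTML error page.
        return None
    if isinstance(data, dict):
        return data.get("state")
    return None


def fetch_app_state(rm_host, app_id, rm_port=8088, timeout=10):
    """Query the RM over HTTP and return the parsed state (or None)."""
    with urlopen(app_state_url(rm_host, app_id, rm_port), timeout=timeout) as resp:
        return parse_app_state(resp.read().decode("utf-8"))
```

If `fetch_app_state` returns None against the expected host and port, the RM is probably listening elsewhere or replying with something other than JSON.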

glorysdj commented 5 years ago

[W 2019-03-14 13:24:35.839 EnterpriseGatewayApp] Query for application 'application_1552527763546_0005' state failed with exception: ''ResourceManager' object has no attribute 'cluster_application_state''. Continuing...

Ok, in this case, what is the port used by the YARN RM? It should be 8088 until the PR I mentioned above gets merged. Otherwise, it seems that the RM is not returning the expected response.

  1. Now we are using 8088.

and fyi, jupyter and tornado versions:

jupyter_client 5.2.4 py_3 conda-forge
jupyter_core 4.4.0 py_0 conda-forge
jupyter_enterprise_gateway 1.1.1 py_0 conda-forge
jupyter_kernel_gateway 2.2.0 py_0 conda-forge
tornado 5.1.1 py27h14c3975_1000 conda-forge

kevin-bates commented 5 years ago

yarn-api-client needs to be upgraded. There was a time when we capped it at < 0.3.0, so I thought 0.2.3 was correct, but the missing attribute exception indicates an upgrade is required, and setup.py confirms this. Sorry about that.

glorysdj commented 5 years ago

Right now yarn-api-client is at 0.2.3:

yarn-api-client 0.2.3 py_0 conda-forge

What is the right version? 0.3.2?

I have upgraded it to 0.3.2. Now the YARN app is running, but I cannot run any cell in the notebook, and the kernel soon dies and is restarted. Logs attached:

[I 190315 09:39:14 base:32] Request http://gondolin-node-065:8088/ws/v1/cluster/apps?startedTimeBegin=1552613943000 [I 2019-03-15 09:39:14.537 EnterpriseGatewayApp] ApplicationID: 'application_1552527763546_0007' assigned for KernelID: '8d6c1bcb-8fb0-48c3-9ee5-7fbc64b8ebcb', state: ACCEPTED, 11.0 seconds after starting. [I 190315 09:39:14 base:32] Request http://gondolin-node-065:8088/ws/v1/cluster/apps/application_1552527763546_0007 [D 2019-03-15 09:39:14.546 EnterpriseGatewayApp] 21: State: 'ACCEPTED', Host: 'Gondolin-Node-087', KernelID: '8d6c1bcb-8fb0-48c3-9ee5-7fbc64b8ebcb', ApplicationID: 'application_1552527763546_0007' [D 2019-03-15 09:39:20.659 EnterpriseGatewayApp] Received Payload 'x3+pvr1B1NOxx9npUKrz64YBqzE1nai3wVt6JXjxFb7aCDAr8MoKSfZqG8gVJWAyChS+VvsauN6uKTKs0vjY8XLgFUev1KPhkF/8dEnTqpEP/nzg5LZwRwx0KH6ilIHrVWB+LD0xwwEKrQ5ll2cXb0C8qrwjy7Y1MVcWbSKHeUMad9dHYnCJNx+xmRtHMXqPu5F43ipwJEpFE7mWYg+nolH1B4XnpnlH45p3gHPPFWGDtl1tQKBgaf+gIkxI2YRzgvrC5Bskujyp1Ave4I+Cse5RBe1ncAHNb1i13BZkjQMfic5H8qm1bVr6jO5Flvi9LONUmXvilw8PKuOmPdTbIiqFlPP5iqXfJdHnz6bwX3clXmtNpJe1lAquFopmHzLVZZ2uIWCL/7qBnQTM9mTnrA==' [D 2019-03-15 09:39:20.660 EnterpriseGatewayApp] Decrypted Payload '{"stdin_port": 57770, "pgid": "4782", "ip": "0.0.0.0", "pid": "4861", "control_port": 43436, "hb_port": 57878, "signature_scheme": "hmac-sha256", "key": "25599fac-5b8a-41a7-8cb4-99f2b0da9c34", "comm_port": 42483, "kernel_name": "", "shell_port": 53018, "transport": "tcp", "iopub_port": 33993}' [D 2019-03-15 09:39:20.660 EnterpriseGatewayApp] Connect Info received from the launcher is as follows '{u'stdin_port': 57770, u'pgid': u'4782', u'ip': u'0.0.0.0', u'pid': u'4861', u'control_port': 43436, u'hb_port': 57878, u'signature_scheme': u'hmac-sha256', u'key': u'25599fac-5b8a-41a7-8cb4-99f2b0da9c34', u'comm_port': 42483, u'kernel_name': u'', u'shell_port': 53018, u'transport': u'tcp', u'iopub_port': 33993}' [D 2019-03-15 09:39:20.660 EnterpriseGatewayApp] Host assigned to the Kernel is: 'Gondolin-Node-087' 
'172.168.2.187' [D 2019-03-15 09:39:20.661 EnterpriseGatewayApp] Established gateway communication to: 172.168.2.187:42483 for KernelID '8d6c1bcb-8fb0-48c3-9ee5-7fbc64b8ebcb' [D 2019-03-15 09:39:20.661 EnterpriseGatewayApp] Updated pid to: 4861 [D 2019-03-15 09:39:20.661 EnterpriseGatewayApp] Updated pgid to: 4782 [D 2019-03-15 09:39:20.666 EnterpriseGatewayApp] Received connection info for KernelID '8d6c1bcb-8fb0-48c3-9ee5-7fbc64b8ebcb' from host 'Gondolin-Node-087': {u'stdin_port': 57770, u'ip': '172.168.2.187', u'control_port': 43436, u'hb_port': 57878, u'signature_scheme': u'hmac-sha256', u'key': u'25599fac-5b8a-41a7-8cb4-99f2b0da9c34', u'comm_port': 42483, u'kernel_name': u'', u'shell_port': 53018, u'transport': u'tcp', u'iopub_port': 33993}... [D 2019-03-15 09:39:20.669 EnterpriseGatewayApp] Connecting to: tcp://172.168.2.187:43436 [D 2019-03-15 09:39:20.671 EnterpriseGatewayApp] Connecting to: tcp://172.168.2.187:33993 [I 2019-03-15 09:39:20.673 EnterpriseGatewayApp] Kernel started: 8d6c1bcb-8fb0-48c3-9ee5-7fbc64b8ebcb [D 2019-03-15 09:39:20.673 EnterpriseGatewayApp] Kernel args: {'kernel_name': u'spark_python_yarn_cluster', 'env': {'PATH': '/opt/work/conda/envs/jegpy27/bin:/opt/work/hadoop-2.7.2/bin:/bin:/opt/jdk1.8.0_152/bin:/bin:/opt/work/conda/envs/jegpy27/bin:/opt/work/conda/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games', u'KERNEL_WORKING_DIR': u'/opt/work', u'KERNEL_USERNAME': u'arda'}} [I 190315 09:39:20 web:2162] 201 POST /api/kernels (10.239.47.211) 16793.17ms [I 190315 09:39:20 web:2162] 200 GET /api/kernels/8d6c1bcb-8fb0-48c3-9ee5-7fbc64b8ebcb (10.239.47.211) 1.47ms [D 2019-03-15 09:39:20.798 EnterpriseGatewayApp] Initializing websocket connection /api/kernels/8d6c1bcb-8fb0-48c3-9ee5-7fbc64b8ebcb/channels [W 2019-03-15 09:39:20.802 EnterpriseGatewayApp] No session ID specified [D 2019-03-15 09:39:20.802 EnterpriseGatewayApp] Requesting kernel info from 8d6c1bcb-8fb0-48c3-9ee5-7fbc64b8ebcb [D 
2019-03-15 09:39:20.803 EnterpriseGatewayApp] Connecting to: tcp://172.168.2.187:53018 [D 2019-03-15 09:39:21.117 EnterpriseGatewayApp] activity on 8d6c1bcb-8fb0-48c3-9ee5-7fbc64b8ebcb: status [D 2019-03-15 09:39:21.121 EnterpriseGatewayApp] activity on 8d6c1bcb-8fb0-48c3-9ee5-7fbc64b8ebcb: status [D 2019-03-15 09:39:21.124 EnterpriseGatewayApp] Received kernel info: {u'status': u'ok', u'language_info': {u'mimetype': u'text/x-python', u'nbconvert_exporter': u'python', u'name': u'python', u'pygments_lexer': u'ipython2', u'version': u'2.7.15', u'file_extension': u'.py', u'codemirror_mode': {u'version': 2, u'name': u'ipython'}}, u'implementation': u'ipython', u'implementation_version': u'5.8.0', u'protocol_version': u'5.1', u'banner': u'Python 2.7.15 | packaged by conda-forge | (default, Feb 28 2019, 04:00:11) \nType "copyright", "credits" or "license" for more information.\n\nIPython 5.8.0 -- An enhanced Interactive Python.\n? -> Introduction and overview of IPython\'s features.\n%quickref -> Quick reference.\nhelp -> Python\'s own help system.\nobject? 
-> Details about \'object\', use \'object??\' for extra details.\n', u'help_links': [{u'url': u'https://docs.python.org/2.7', u'text': u'Python Reference'}, {u'url': u'https://ipython.org/documentation.html', u'text': u'IPython Reference'}, {u'url': u'https://docs.scipy.org/doc/numpy/reference/', u'text': u'NumPy Reference'}, {u'url': u'https://docs.scipy.org/doc/scipy/reference/', u'text': u'SciPy Reference'}, {u'url': u'https://matplotlib.org/contents.html', u'text': u'Matplotlib Reference'}, {u'url': u'http://docs.sympy.org/latest/index.html', u'text': u'SymPy Reference'}, {u'url': u'https://pandas.pydata.org/pandas-docs/stable/', u'text': u'pandas Reference'}]} [I 2019-03-15 09:39:21.124 EnterpriseGatewayApp] Adapting to protocol v5.1 for kernel 8d6c1bcb-8fb0-48c3-9ee5-7fbc64b8ebcb [D 2019-03-15 09:39:21.127 EnterpriseGatewayApp] activity on 8d6c1bcb-8fb0-48c3-9ee5-7fbc64b8ebcb: status [I 190315 09:39:21 web:2162] 101 GET /api/kernels/8d6c1bcb-8fb0-48c3-9ee5-7fbc64b8ebcb/channels (10.239.47.211) 330.08ms [D 2019-03-15 09:39:21.128 EnterpriseGatewayApp] Opening websocket /api/kernels/8d6c1bcb-8fb0-48c3-9ee5-7fbc64b8ebcb/channels [D 2019-03-15 09:39:21.129 EnterpriseGatewayApp] Getting buffer for 8d6c1bcb-8fb0-48c3-9ee5-7fbc64b8ebcb [D 2019-03-15 09:39:21.129 EnterpriseGatewayApp] Connecting to: tcp://172.168.2.187:53018 [D 2019-03-15 09:39:21.129 EnterpriseGatewayApp] Connecting to: tcp://172.168.2.187:33993 [D 2019-03-15 09:39:21.129 EnterpriseGatewayApp] Connecting to: tcp://172.168.2.187:57770 [D 2019-03-15 09:39:21.135 EnterpriseGatewayApp] activity on 8d6c1bcb-8fb0-48c3-9ee5-7fbc64b8ebcb: status [D 2019-03-15 09:39:21.140 EnterpriseGatewayApp] activity on 8d6c1bcb-8fb0-48c3-9ee5-7fbc64b8ebcb: status [W 2019-03-15 09:40:39.154 EnterpriseGatewayApp] Query for application state with cmd '['curl', '-X', 'GET', u'http://Gondolin-Node-065:8088/ws/v1/cluster/apps/application_1552527763546_0007/state']' failed with exception: 'No JSON object could be decoded'. 
Continuing... [I 2019-03-15 09:40:39.155 EnterpriseGatewayApp] KernelRestarter: restarting kernel (1/5), new random ports [W 190315 09:40:39 handlers:472] kernel 8d6c1bcb-8fb0-48c3-9ee5-7fbc64b8ebcb restarted [D 2019-03-15 09:40:39.158 EnterpriseGatewayApp] RemoteKernelManager.signal_kernel(9) [D 2019-03-15 09:40:39.158 EnterpriseGatewayApp] YarnClusterProcessProxy.send_signal 9 [W 2019-03-15 09:41:54.919 EnterpriseGatewayApp] Termination of application with cmd '['curl', '-X', 'PUT', '-H', 'Content-Type: application/json', '-d', '{"state": "KILLED"}', u'http://Gondolin-Node-065:8088/ws/v1/cluster/apps/application_1552527763546_0007/state']' failed with exception: 'No JSON object could be decoded'. Continuing... [D 2019-03-15 09:41:54.920 EnterpriseGatewayApp] YarnClusterProcessProxy.kill: kill_app_by_id(application_1552527763546_0007) response: None, confirming app state is not RUNNING

kevin-bates commented 5 years ago

@glorysdj - Yes, you want to be using yarn-api-client of 0.3.2. However, this message here:

[W 2019-03-15 09:40:39.154 EnterpriseGatewayApp] Query for application state with cmd '['curl', '-X', 'GET', u'http://Gondolin-Node-065:8088/ws/v1/cluster/apps/application_1552527763546_0007/state']' failed with exception: 'No JSON object could be decoded'. Continuing...

has not existed in EG since the 1.0.2 release, yet you show yourself running the 1.1.1 release which uses yarn-api-client to get the application state - rather than curl!

I wonder if conda-forge has an issue here?! I remember there being some kind of issue, which I thought was related to yarn-api-client, but now I'm wondering about EG as well!

Is there any way you can issue the following within the appropriate conda env for EG?

pip install --upgrade jupyter_enterprise_gateway

This should also update yarn-api-client to 0.3.2 - which should have been present all along. In fact, looking back at the dependencies for EG 1.0.2, I find a setting of 'yarn-api-client<0.3.0' - which explains why yarn-api-client was old, because EG is old.

If you force EG to 1.1.1, I suspect you'll be good. The rest of the startup looked good, but because EG could not determine the state of the YARN application, it decided that the kernel must not be running, so it enters into the framework's auto-restart cycle.
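The restart behaviour described above can be illustrated with a simplified model. This is not EG's code: the names `poll`, `is_alive`, and `LIVE_STATES` are illustrative, and the only thing taken from the tracebacks in this thread is that liveness is decided by `poll()` returning None, in the style of `subprocess.Popen.poll`. The set of live states is an assumption based on the standard YARN application states.

```python
# YARN application states treated as "still alive" in this sketch (assumption).
LIVE_STATES = {"NEW", "NEW_SAVING", "SUBMITTED", "ACCEPTED", "RUNNING"}


def poll(query_state):
    """Return None while the application looks alive, otherwise False.

    `query_state` is any callable returning a YARN state string or None.
    """
    try:
        state = query_state()
    except Exception:
        # A failed query is indistinguishable from "no state" here.
        state = None
    return None if state in LIVE_STATES else False


def is_alive(query_state):
    """Mirror the `poll() is None` liveness convention."""
    return poll(query_state) is None
```

In this model a query that raises or returns None makes `is_alive` report False even though the YARN application is running, which is the condition that would trigger an auto-restart.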

glorysdj commented 5 years ago

Hi Kevin

pip list

Package Version
asn1crypto 0.24.0
attrs 19.1.0
backports-abc 0.5
backports.shutil-get-terminal-size 1.0.0
bcrypt 3.1.4
bleach 3.1.0
certifi 2019.3.9
cffi 1.12.2
chardet 3.0.4
configparser 3.7.3
cryptography 2.6.1
decorator 4.3.2
defusedxml 0.5.0
entrypoints 0.3
enum34 1.1.6
functools32 3.2.3.post2
futures 3.2.0
idna 2.8
ipaddress 1.0.22
ipykernel 4.10.0
ipython 5.8.0
ipython-genutils 0.2.0
Jinja2 2.10
jsonschema 3.0.1
jupyter-client 5.2.4
jupyter-core 4.4.0
jupyter-enterprise-gateway 1.1.1
jupyter-kernel-gateway 2.2.0
MarkupSafe 1.1.1
mistune 0.8.4
nb2kg 0.7.0.dev0
nbconvert 5.4.1
nbformat 4.4.0
notebook 5.7.6
pandocfilters 1.4.2
paramiko 2.4.2
pathlib2 2.3.3
pexpect 4.6.0
pickleshare 0.7.5
pip 19.0.3
prometheus-client 0.6.0
prompt-toolkit 1.0.15
ptyprocess 0.6.0
pyasn1 0.4.4
pycparser 2.19
pycrypto 2.6.1
Pygments 2.3.1
pykerberos 1.1.14
PyNaCl 1.3.0
pyOpenSSL 19.0.0
pyrsistent 0.14.11
PySocks 1.6.8
python-dateutil 2.8.0
pyzmq 18.0.1
requests 2.21.0
requests-kerberos 0.12.0
scandir 1.10.0
Send2Trash 1.5.0
setuptools 40.8.0
simplegeneric 0.8.1
singledispatch 3.4.0.3
six 1.12.0
terminado 0.8.1
testpath 0.4.2
tornado 5.1.1
traitlets 4.3.2
urllib3 1.24.1
wcwidth 0.1.7
webencodings 0.5.1
wheel 0.33.1
yarn-api-client 0.3.2

but I still get this warning:

[W 2019-03-15 13:18:37.762 EnterpriseGatewayApp] Query for application 'application_1552527763546_0011' state failed with exception: ''ResourceManager' object has no attribute 'cluster_application_state''. Continuing... [E 190315 13:18:37 ioloop:801] Exception in callback <bound method IOLoopKernelRestarter.poll of <jupyter_client.ioloop.restarter.IOLoopKernelRestarter object at 0x7feefc9191d0>> Traceback (most recent call last): File "/opt/work/conda/envs/jegpy27/lib/python2.7/site-packages/tornado/ioloop.py", line 1229, in _run return self.callback() File "/opt/work/conda/envs/jegpy27/lib/python2.7/site-packages/jupyter_client/restarter.py", line 93, in poll if not self.kernel_manager.is_alive(): File "/opt/work/conda/envs/jegpy27/lib/python2.7/site-packages/jupyter_client/manager.py", line 453, in is_alive if self.kernel.poll() is None: File "/opt/work/conda/envs/jegpy27/lib/python2.7/site-packages/enterprise_gateway/services/processproxies/yarn.py", line 75, in poll state = self.query_app_state_by_id(self.application_id) File "/opt/work/conda/envs/jegpy27/lib/python2.7/site-packages/enterprise_gateway/services/processproxies/yarn.py", line 301, in query_app_state_by_id return response.data['state'] AttributeError: 'NoneType' object has no attribute 'data'

kevin-bates commented 5 years ago

I'm going to go insane! This implies an old yarn-api-client again!

[W 2019-03-15 13:18:37.762 EnterpriseGatewayApp] Query for application 'application_1552527763546_0011' state failed with exception: ''ResourceManager' object has no attribute 'cluster_application_state''. Continuing...

Could you please check in the directory /opt/work/conda/envs/jegpy27/lib/python2.7/site-packages/yarn_api_client and look at the __version__ value in __init__.py. It should read, '0.3.2'. Then check resource_manager.py and ensure that line 218 is the start of the cluster_application_state method?
There's something strange happening here.
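The same check can be done from the interpreter that runs EG, which also shows which file on disk is actually being imported. This is a minimal sketch; `inspect_import` is a hypothetical helper, and the only names taken from this thread are `yarn_api_client`, `resource_manager`, `ResourceManager`, and `cluster_application_state`.

```python
import importlib


def inspect_import(package, submodule, class_name, method_name):
    """Report where a package is imported from and whether a method exists.

    Returns None if the package or submodule cannot be imported.
    """
    try:
        pkg = importlib.import_module(package)
        mod = importlib.import_module(package + "." + submodule)
    except ImportError:
        return None
    cls = getattr(mod, class_name, None)
    return {
        "file": getattr(mod, "__file__", None),        # path actually imported
        "version": getattr(pkg, "__version__", None),  # version the code reports
        "has_method": cls is not None and hasattr(cls, method_name),
    }


if __name__ == "__main__":
    print(inspect_import("yarn_api_client", "resource_manager",
                         "ResourceManager", "cluster_application_state"))
```

If `has_method` is False while pip reports 0.3.2, the files under site-packages do not match the recorded metadata.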

glorysdj commented 5 years ago

yes, thanks for your great help. I will check the versions again!

glorysdj commented 5 years ago

Finally... it works! Cheers! Thanks @kevin-bates and @lresende.

Hi @kevin-bates, do you have any notebook examples that can be run on the JEG Spark Python kernel on a YARN cluster?

kevin-bates commented 5 years ago

Fantastic news @glorysdj! Thank you for your patience and persistence.

Could you please share what you had to do to get the correct versions installed and running?

Here's a recap of the issues encountered:

  1. Kernelspecs from 2.x are not compatible with 1.x EG. We could probably add something into 1.x to detect this and fail the request if it finds process-proxy stanzas in the metadata stanza. This scenario should probably be addressed whenever we get around to implementing #195.
  2. Python versions between EG and spark were mixed. Not necessarily an issue but requires careful configuration.
  3. Yarn was configured with a custom port. PR #602
  4. All nodes were not configured with Jupyter-specific requirements.
  5. Wrong versions of EG and yarn-api-client were in use despite what pip list produced. Very strange.

The notebooks I have mostly just check out the Spark context a bit and include cells that help with testing the interactions between EG and the kernel (interrupts, restarts, etc.). Here's a notebook with a few cells that perform some operations with the Spark context and run the Pi sample (in honor of Pi day yesterday :smile:). It also produces a plot. I find I have to run the cell twice to get the pretty picture produced. I suspect that's more an issue with plotting than anything else. If you have conda on your worker nodes, it should work fine; otherwise you'd need to make sure those packages are installed: python - yarn cluster.ipynb.zip

kevin-bates commented 5 years ago

@glorysdj - could you please provide any steps/tricks you had to use to resolve item 5 above?

I think we can go ahead and close this issue unless you feel otherwise.

glorysdj commented 5 years ago

Thanks Kevin. Later I will review the deployment and check all the steps and tricks. And after that, we can close this. Thanks for the patience.

glorysdj commented 5 years ago

As in Kevin's recap, resolving those five items is what got Jupyter Enterprise Gateway and the Spark Python YARN cluster kernel deployed. Thanks for the help. Issue closed.

kevin-bates commented 5 years ago

Thanks @glorysdj - any clues as to why pip list was showing the wrong versions of things and how those were ultimately installed to correct the situation?

glorysdj commented 5 years ago

Thanks @glorysdj - any clues as to why pip list was showing the wrong versions of things and how those were ultimately installed to correct the situation?

It seems that the installation of yarn-api-client had failed, yet pip list reported the later version. What I did was uninstall it and then install the required version, and this solved the problem.

kevin-bates commented 5 years ago

Perfect - thank you. Yeah, I've experienced similar things and try to consistently perform uninstalls.
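One way to spot this kind of mismatch is to compare the version recorded in the package metadata (what pip list reports) with the version the imported code reports about itself. This is a minimal sketch with hypothetical helper names; it assumes Python 3.8+ for `importlib.metadata`, so on the Python 2.7 environment used in this thread something like `pkg_resources` would be needed instead.

```python
import importlib
from importlib import metadata  # Python 3.8+


def installed_versions(dist_name, module_name):
    """Return (metadata_version, code_version); either may be None."""
    try:
        meta_version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        meta_version = None
    try:
        module = importlib.import_module(module_name)
        code_version = getattr(module, "__version__", None)
    except ImportError:
        code_version = None
    return meta_version, code_version


def versions_agree(meta_version, code_version):
    """True only when both versions are known and identical."""
    return meta_version is not None and meta_version == code_version


if __name__ == "__main__":
    meta, code = installed_versions("yarn-api-client", "yarn_api_client")
    print("metadata:", meta, "imported code:", code,
          "agree:", versions_agree(meta, code))
```

When the two disagree, uninstalling the package and installing the required version again, as described above, is the straightforward fix.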