jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/

Kernels are not shutting down in YARN Cluster Mode #511

Closed: kevin-bates closed this issue 5 years ago

kevin-bates commented 5 years ago

@amangarg96 - I'm copying your comment from Issue #394 and using it to create a new issue, since the two are not really related. We can use this issue to get to the bottom of things. Here is that comment...

Hi,

I am using Jupyter Enterprise Gateway in Yarn Cluster Mode, and launching Jupyter Lab server on a remote Linux box.

I am facing the same 'HTTP 403: Failed to authenticate SSHClient with password-less SSH' issue; here is the corresponding log.

[I 2018-12-01 19:45:59.421 EnterpriseGatewayApp] Kernel shutdown: 0836005e-2a6b-4c5a-a825-f49bcdef5f30
[I 2018-12-01 19:49:56.613 EnterpriseGatewayApp] KernelRestarter: restarting kernel (1/5), keep random ports
[W 2018-12-01 19:49:56.613 EnterpriseGatewayApp] Remote kernel (d90ff5df-2621-4a36-a0d3-1be2f8da1450) will not be automatically restarted since there are no clients connected at this time.
[W 2018-12-01 19:49:56.618 EnterpriseGatewayApp] Termination of application 'application_1542713397395_139615' failed with exception: 'Response finished with status: 500. Details: {"RemoteException":{"exception":"WebApplicationException","message":"com.sun.jersey.api.MessageException: A message body reader for Java class org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.AppState, and Java type class org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.AppState, and MIME media type application/octet-stream was not found.\nThe registered message body readers compatible with the MIME media type are:\napplication/octet-stream ->\n com.sun.jersey.core.impl.provider.entity.ByteArrayProvider\n com.sun.jersey.core.impl.provider.entity.FileProvider\n com.sun.jersey.core.impl.provider.entity.InputStreamProvider\n com.sun.jersey.core.impl.provider.entity.DataSourceProvider\n com.sun.jersey.core.impl.provider.entity.RenderedImageProvider\n/ ->\n com.sun.jersey.core.impl.provider.entity.FormProvider\n com.sun.jersey.json.impl.provider.entity.JSONJAXBElementProvider$General\n com.sun.jersey.json.impl.provider.entity.JSONArrayProvider$General\n com.sun.jersey.json.impl.provider.entity.JSONObjectProvider$General\n com.sun.jersey.core.impl.provider.entity.StringProvider\n com.sun.jersey.core.impl.provider.entity.ByteArrayProvider\n com.sun.jersey.core.impl.provider.entity.FileProvider\n com.sun.jersey.core.impl.provider.entity.InputStreamProvider\n com.sun.jersey.core.impl.provider.entity.DataSourceProvider\n com.sun.jersey.core.impl.provider.entity.XMLJAXBElementProvider$General\n com.sun.jersey.core.impl.provider.entity.ReaderProvider\n com.sun.jersey.core.impl.provider.entity.DocumentProvider\n com.sun.jersey.core.impl.provider.entity.SourceProvider$StreamSourceReader\n com.sun.jersey.core.impl.provider.entity.SourceProvider$SAXSourceReader\n com.sun.jersey.core.impl.provider.entity.SourceProvider$DOMSourceReader\n com.sun.jersey.json.impl.provider.entity.JSONRootElementProvider$General\n com.sun.jersey.json.impl.provider.entity.JSONListElementProvider$General\n com.sun.jersey.json.impl.provider.entity.JacksonProviderProxy\n com.sun.jersey.core.impl.provider.entity.XMLRootElementProvider$General\n com.sun.jersey.core.impl.provider.entity.XMLListElementProvider$General\n com.sun.jersey.core.impl.provider.entity.XMLRootObjectProvider$General\n com.sun.jersey.core.impl.provider.entity.EntityHolderReader\n","javaClassName":"javax.ws.rs.WebApplicationException"}}'. Continuing...
[E 2018-12-01 19:49:56.673 EnterpriseGatewayApp] Failed to authenticate SSHClient with password-less SSH
[W 2018-12-01 19:49:56.673 EnterpriseGatewayApp] Remote signal(15) to '-32518' on host '' failed with exception 'HTTP 403: Failed to authenticate SSHClient with password-less SSH'.

I believe this has interfered with the kernel lifecycle, which is why I am getting 'orphan kernels'. An orphan kernel is one for which the spark-submit was performed and the job is in the RUNNING state, but the notebook server no longer interacts with the kernel, so it does not get shut down when the notebook server is shut down.

kevin-bates commented 5 years ago

Hi @amangarg96. The 403 issue you're encountering in YARN Cluster Mode is really a side effect stemming from a failure to shut down the kernel on the first attempt. In those cases, EG resorts to issuing a remote SSH command to kill the process directly. However, had password-less SSH been configured, this particular symptom might never have surfaced. To properly release the resources held by these kernels, you'll need to terminate them manually - typically via the YARN UI - until we get to the bottom of why the initial shutdown fails.
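
If the YARN UI isn't handy, the standard YARN CLI can do the same thing. For example, using the application id from your log (substitute whichever application is orphaned):

    # manually kill the YARN application backing an orphaned kernel
    yarn application -kill application_1542713397395_139615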

The real issue is a failure occurring in the initial shutdown attempt - which uses the YARN client API to terminate the YARN application.

[W 2018-12-01 19:49:56.618 EnterpriseGatewayApp] Termination of application 'application_1542713397395_139615' failed with exception: 'Response finished with status: 500. Details: {"RemoteException":{"exception":"WebApplicationException","message":"com.sun.jersey.api.MessageException: A message body reader for Java class

So this indicates that YARN's REST layer is hitting an exception - it cannot find a suitable message body reader (note the JAXB/XML providers listed in the message) for the request it received.

I would suggest gathering the YARN application logs (mostly stdout and stderr) and seeing if there are more clues there. If you attach those logs, along with the complete Enterprise Gateway log, we can troubleshoot this with others.
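
If log aggregation is enabled on your cluster, those application logs can be pulled with the YARN CLI as well, e.g.:

    # fetch the aggregated container logs (stdout/stderr) for the kernel's application
    yarn logs -applicationId application_1542713397395_139615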

A quick google of "javax.ws.rs.WebApplicationException YARN" yields: https://community.hortonworks.com/questions/226051/yarn-service-rest-api-genericexceptionhandler-weba.html

As a result, you might try issuing the termination using curl: first to reproduce the failure, then again with the content type specified explicitly. We use this call against the yarn-api-client to perform the termination.
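
Something along these lines should exercise the same ResourceManager endpoint (the host, port and application id are placeholders for your environment):

    # 1) reproduce: issue the kill request without an explicit content type
    curl -v -X PUT -d '{"state": "KILLED"}' \
      'http://<rm-host>:8088/ws/v1/cluster/apps/<application-id>/state'

    # 2) retry with the JSON content type declared
    curl -v -X PUT -H 'Content-Type: application/json' -d '{"state": "KILLED"}' \
      'http://<rm-host>:8088/ws/v1/cluster/apps/<application-id>/state'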

It would also be good to know the versions of YARN and Spark you're using, since we don't see these issues.

Thanks.

cc: @lresende

kevin-bates commented 5 years ago

@amangarg96 - have you had a chance to see what's going on in the YARN logs?

amangarg96 commented 5 years ago

I haven't faced this issue again in the last couple of days, and I haven't found the time to try to reproduce it.

I'll take this up on Monday again.

That being said, I'm seeing far fewer orphan kernels, and it would be interesting to figure out what changed from my previous configuration.

amangarg96 commented 5 years ago

Spark version: 2.2.1
YARN version: 2.9

Gateway logs

[I 181207 13:37:45 web:2162] 200 GET /api/kernels/75782334-241c-467a-9614-d273c3aa07c5 (127.0.0.1) 0.36ms
[I 181207 13:37:54 web:2162] 200 GET /api/kernels/5821cf73-1cb9-420a-a3a0-1a02d9080e71 (127.0.0.1) 0.47ms
[I 181207 13:37:54 web:2162] 200 GET /api/kernels/1738bd98-9b94-4240-81fa-17f62798397a (127.0.0.1) 0.48ms
[I 181207 13:37:55 web:2162] 200 GET /api/kernels/75782334-241c-467a-9614-d273c3aa07c5 (127.0.0.1) 0.40ms
[I 181207 13:38:04 web:2162] 200 GET /api/kernels/5821cf73-1cb9-420a-a3a0-1a02d9080e71 (127.0.0.1) 0.49ms
[I 181207 13:38:04 web:2162] 200 GET /api/kernels/1738bd98-9b94-4240-81fa-17f62798397a (127.0.0.1) 0.36ms
[I 181207 13:38:05 web:2162] 200 GET /api/kernels/75782334-241c-467a-9614-d273c3aa07c5 (127.0.0.1) 0.40ms
[I 2018-12-07 13:38:09.581 EnterpriseGatewayApp] Starting buffering for 75782334-241c-467a-9614-d273c3aa07c5:35f612f8-cc2f86a1844685ebea5b8b57
[I 2018-12-07 13:38:09.582 EnterpriseGatewayApp] Starting buffering for 5821cf73-1cb9-420a-a3a0-1a02d9080e71:63b4148d-d829fdc4682a7073802234b9
[I 2018-12-07 13:38:09.584 EnterpriseGatewayApp] Starting buffering for 1738bd98-9b94-4240-81fa-17f62798397a:0c16a4d7-189af2f90d2c5a9d4e723a3a
[I 181207 15:45:32 web:2162] 200 GET /api/kernels/5821cf73-1cb9-420a-a3a0-1a02d9080e71 (127.0.0.1) 0.53ms
[I 181207 15:45:32 web:2162] 200 GET /api/kernelspecs (127.0.0.1) 3.29ms
[I 181207 15:45:32 web:2162] 200 GET /api/kernels/1738bd98-9b94-4240-81fa-17f62798397a (127.0.0.1) 0.36ms
[I 181207 15:45:32 web:2162] 200 GET /api/kernels/75782334-241c-467a-9614-d273c3aa07c5 (127.0.0.1) 0.36ms
[I 181207 15:45:38 web:2162] 200 GET /api/kernelspecs (127.0.0.1) 3.43ms
[I 181207 15:45:39 web:2162] 200 GET /api/kernels/75782334-241c-467a-9614-d273c3aa07c5 (127.0.0.1) 0.39ms
[W 2018-12-07 15:45:39.579 EnterpriseGatewayApp] No session ID specified
[I 2018-12-07 15:45:39.580 EnterpriseGatewayApp] Adapting to protocol v5.1 for kernel 75782334-241c-467a-9614-d273c3aa07c5
[I 181207 15:45:39 web:2162] 101 GET /api/kernels/75782334-241c-467a-9614-d273c3aa07c5/channels (127.0.0.1) 1.65ms
[I 181207 15:46:20 web:2162] 200 GET /api/kernels/5821cf73-1cb9-420a-a3a0-1a02d9080e71 (127.0.0.1) 0.52ms
[I 181207 15:46:20 web:2162] 200 GET /api/kernels/1738bd98-9b94-4240-81fa-17f62798397a (127.0.0.1) 0.39ms
[I 181207 15:46:20 web:2162] 200 GET /api/kernels/75782334-241c-467a-9614-d273c3aa07c5 (127.0.0.1) 0.37ms
[I 181207 16:00:52 web:2162] 200 GET /api/kernels/5821cf73-1cb9-420a-a3a0-1a02d9080e71 (127.0.0.1) 0.49ms
[I 181207 16:00:52 web:2162] 200 GET /api/kernels/1738bd98-9b94-4240-81fa-17f62798397a (127.0.0.1) 0.36ms
[I 181207 16:00:52 web:2162] 200 GET /api/kernels/75782334-241c-467a-9614-d273c3aa07c5 (127.0.0.1) 0.35ms
[I 2018-12-07 17:59:42.520 EnterpriseGatewayApp] Kernel interrupted: 75782334-241c-467a-9614-d273c3aa07c5
[I 181207 17:59:42 web:2162] 204 POST /api/kernels/75782334-241c-467a-9614-d273c3aa07c5/interrupt (127.0.0.1) 0.98ms
[W 2018-12-07 18:33:39.582 EnterpriseGatewayApp] WebSocket ping timeout after 119919 ms.
[I 2018-12-07 18:33:44.584 EnterpriseGatewayApp] Starting buffering for 75782334-241c-467a-9614-d273c3aa07c5:0f4ec928-146e46aa5c57b922be15091a
[I 2018-12-07 19:33:30.080 EnterpriseGatewayApp] Kernel shutdown: 1738bd98-9b94-4240-81fa-17f62798397a
[I 181207 19:33:30 web:2162] 204 DELETE /api/kernels/1738bd98-9b94-4240-81fa-17f62798397a (127.0.0.1) 201.98ms
[W 2018-12-07 19:33:35.645 EnterpriseGatewayApp] Termination of application 'application_1543928474139_36333' failed with exception: 'Response finished with status: 500. Details: {"RemoteException":{"exception":"WebApplicationException","message":"com.sun.jersey.api.MessageException: A message body reader for Java class org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.AppState, and Java type class org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.AppState, and MIME media type application/octet-stream was not found.\nThe registered message body readers compatible with the MIME media type are:\napplication/octet-stream ->\n com.sun.jersey.core.impl.provider.entity.ByteArrayProvider\n com.sun.jersey.core.impl.provider.entity.FileProvider\n com.sun.jersey.core.impl.provider.entity.InputStreamProvider\n com.sun.jersey.core.impl.provider.entity.DataSourceProvider\n com.sun.jersey.core.impl.provider.entity.RenderedImageProvider\n/ ->\n com.sun.jersey.core.impl.provider.entity.FormProvider\n com.sun.jersey.json.impl.provider.entity.JSONJAXBElementProvider$General\n com.sun.jersey.json.impl.provider.entity.JSONArrayProvider$General\n com.sun.jersey.json.impl.provider.entity.JSONObjectProvider$General\n com.sun.jersey.core.impl.provider.entity.StringProvider\n com.sun.jersey.core.impl.provider.entity.ByteArrayProvider\n com.sun.jersey.core.impl.provider.entity.FileProvider\n com.sun.jersey.core.impl.provider.entity.InputStreamProvider\n com.sun.jersey.core.impl.provider.entity.DataSourceProvider\n com.sun.jersey.core.impl.provider.entity.XMLJAXBElementProvider$General\n com.sun.jersey.core.impl.provider.entity.ReaderProvider\n com.sun.jersey.core.impl.provider.entity.DocumentProvider\n com.sun.jersey.core.impl.provider.entity.SourceProvider$StreamSourceReader\n com.sun.jersey.core.impl.provider.entity.SourceProvider$SAXSourceReader\n com.sun.jersey.core.impl.provider.entity.SourceProvider$DOMSourceReader\n com.sun.jersey.json.impl.provider.entity.JSONRootElementProvider$General\n com.sun.jersey.json.impl.provider.entity.JSONListElementProvider$General\n com.sun.jersey.json.impl.provider.entity.JacksonProviderProxy\n com.sun.jersey.core.impl.provider.entity.XMLRootElementProvider$General\n com.sun.jersey.core.impl.provider.entity.XMLListElementProvider$General\n com.sun.jersey.core.impl.provider.entity.XMLRootObjectProvider$General\n com.sun.jersey.core.impl.provider.entity.EntityHolderReader\n","javaClassName":"javax.ws.rs.WebApplicationException"}}'. Continuing...
[E 2018-12-07 19:33:40.782 EnterpriseGatewayApp] Failed to authenticate SSHClient with password-less SSH
[W 2018-12-07 19:33:40.782 EnterpriseGatewayApp] Remote signal(15) to '-10099' on host '10.34.187.37' failed with exception 'HTTP 403: Failed to authenticate SSHClient with password-less SSH'.
[E 2018-12-07 19:33:45.886 EnterpriseGatewayApp] Failed to authenticate SSHClient with password-less SSH
[W 2018-12-07 19:33:45.886 EnterpriseGatewayApp] Remote signal(9) to '-10099' on host '10.34.187.37' failed with exception 'HTTP 403: Failed to authenticate SSHClient with password-less SSH'.
[I 2018-12-07 19:33:45.894 EnterpriseGatewayApp] Kernel shutdown: 75782334-241c-467a-9614-d273c3aa07c5
[I 181207 19:33:45 web:2162] 204 DELETE /api/kernels/75782334-241c-467a-9614-d273c3aa07c5 (127.0.0.1) 15721.17ms

Even though the Gateway log shows a 204 for the DELETE of kernel 75782334-241c-467a-9614-d273c3aa07c5, the kernel has remained in the RUNNING state on YARN since 07-12-2018 (an orphan kernel).

amangarg96 commented 5 years ago

STDERR for this kernel

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/grid/1/yarn/local/usercache/fk-mlp-user/filecache/78/__spark_libs__1048407335554972528.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.4.0.0-169/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/12/07 12:20:48 INFO util.SignalUtils: Registered signal handler for TERM
18/12/07 12:20:48 INFO util.SignalUtils: Registered signal handler for HUP
18/12/07 12:20:48 INFO util.SignalUtils: Registered signal handler for INT
18/12/07 12:20:48 INFO yarn.ApplicationMaster: Preparing Local resources
18/12/07 12:20:49 INFO yarn.ApplicationMaster: ApplicationAttemptId: appattempt_1543928474139_36333_000001
18/12/07 12:20:49 INFO spark.SecurityManager: Changing view acls to: yarn,fk-mlp-user
18/12/07 12:20:49 INFO spark.SecurityManager: Changing modify acls to: yarn,fk-mlp-user
18/12/07 12:20:49 INFO spark.SecurityManager: Changing view acls groups to: 
18/12/07 12:20:49 INFO spark.SecurityManager: Changing modify acls groups to: 
18/12/07 12:20:49 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(yarn, fk-mlp-user); groups with view permissions: Set(); users  with modify permissions: Set(yarn, fk-mlp-user); groups with modify permissions: Set()
18/12/07 12:20:49 INFO yarn.ApplicationMaster: Starting the user application in a separate Thread
18/12/07 12:20:49 INFO yarn.ApplicationMaster: Waiting for spark context initialization...
18/12/07 12:20:50 INFO spark.SparkContext: Running Spark version 2.2.1
18/12/07 12:20:51 INFO spark.SparkContext: Submitted application: 75782334-241c-467a-9614-d273c3aa07c5
18/12/07 12:20:51 INFO spark.SecurityManager: Changing view acls to: yarn,fk-mlp-user
18/12/07 12:20:51 INFO spark.SecurityManager: Changing modify acls to: yarn,fk-mlp-user
18/12/07 12:20:51 INFO spark.SecurityManager: Changing view acls groups to: 
18/12/07 12:20:51 INFO spark.SecurityManager: Changing modify acls groups to: 
18/12/07 12:20:51 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(yarn, fk-mlp-user); groups with view permissions: Set(); users  with modify permissions: Set(yarn, fk-mlp-user); groups with modify permissions: Set()
18/12/07 12:20:51 INFO util.Utils: Successfully started service 'sparkDriver' on port 15422.
18/12/07 12:20:51 INFO spark.SparkEnv: Registering MapOutputTracker
18/12/07 12:20:51 INFO spark.SparkEnv: Registering BlockManagerMaster
18/12/07 12:20:51 INFO storage.BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
18/12/07 12:20:51 INFO storage.BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
18/12/07 12:20:51 INFO storage.DiskBlockManager: Created local directory at /grid/1/yarn/local/usercache/fk-mlp-user/appcache/application_1543928474139_36333/blockmgr-11ba7839-b44b-4081-bb94-3691dfd00218
18/12/07 12:20:51 INFO memory.MemoryStore: MemoryStore started with capacity 5.2 GB
18/12/07 12:20:51 INFO spark.SparkEnv: Registering OutputCommitCoordinator
18/12/07 12:20:51 INFO util.log: Logging initialized @3793ms
18/12/07 12:20:51 INFO ui.JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
18/12/07 12:20:51 INFO server.Server: jetty-9.3.z-SNAPSHOT
18/12/07 12:20:51 INFO server.Server: Started @3864ms
18/12/07 12:20:51 INFO server.AbstractConnector: Started ServerConnector@63577c22{HTTP/1.1,[http/1.1]}{0.0.0.0:17421}
18/12/07 12:20:51 INFO util.Utils: Successfully started service 'SparkUI' on port 17421.
18/12/07 12:20:51 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@301b0930{/jobs,null,AVAILABLE,@Spark}
18/12/07 12:20:51 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3a13a74b{/jobs/json,null,AVAILABLE,@Spark}
18/12/07 12:20:51 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@70154334{/jobs/job,null,AVAILABLE,@Spark}
18/12/07 12:20:51 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@27c20aef{/jobs/job/json,null,AVAILABLE,@Spark}
18/12/07 12:20:51 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@15508b3f{/stages,null,AVAILABLE,@Spark}
18/12/07 12:20:51 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1b177194{/stages/json,null,AVAILABLE,@Spark}
18/12/07 12:20:51 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@504ffbb1{/stages/stage,null,AVAILABLE,@Spark}
18/12/07 12:20:51 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@2284358d{/stages/stage/json,null,AVAILABLE,@Spark}
18/12/07 12:20:51 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5847887f{/stages/pool,null,AVAILABLE,@Spark}
18/12/07 12:20:51 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@18b96628{/stages/pool/json,null,AVAILABLE,@Spark}
18/12/07 12:20:51 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@73bdf687{/storage,null,AVAILABLE,@Spark}
18/12/07 12:20:51 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@2bfb9f20{/storage/json,null,AVAILABLE,@Spark}
18/12/07 12:20:51 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4ccb01f6{/storage/rdd,null,AVAILABLE,@Spark}
18/12/07 12:20:51 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@42962c38{/storage/rdd/json,null,AVAILABLE,@Spark}
18/12/07 12:20:51 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@705b1a74{/environment,null,AVAILABLE,@Spark}
18/12/07 12:20:51 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@2badb16a{/environment/json,null,AVAILABLE,@Spark}
18/12/07 12:20:51 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3ca4e110{/executors,null,AVAILABLE,@Spark}
18/12/07 12:20:51 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6f75f96f{/executors/json,null,AVAILABLE,@Spark}
18/12/07 12:20:51 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@46fe1da6{/executors/threadDump,null,AVAILABLE,@Spark}
18/12/07 12:20:51 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@64b729c{/executors/threadDump/json,null,AVAILABLE,@Spark}
18/12/07 12:20:51 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@236ff732{/static,null,AVAILABLE,@Spark}
18/12/07 12:20:51 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@33be6589{/,null,AVAILABLE,@Spark}
18/12/07 12:20:51 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5fa48b25{/api,null,AVAILABLE,@Spark}
18/12/07 12:20:51 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3fc1cf7b{/jobs/job/kill,null,AVAILABLE,@Spark}
18/12/07 12:20:51 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@766bc438{/stages/stage/kill,null,AVAILABLE,@Spark}
18/12/07 12:20:51 INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started at http://<ip>:<port>
18/12/07 12:20:51 INFO cluster.YarnClusterScheduler: Created YarnClusterScheduler
18/12/07 12:20:51 INFO cluster.SchedulerExtensionServices: Starting Yarn extension services with app application_1543928474139_36333 and attemptId Some(appattempt_1543928474139_36333_000001)
18/12/07 12:20:51 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 49174.
18/12/07 12:20:51 INFO netty.NettyBlockTransferService: Server created on <ip:port>
18/12/07 12:20:51 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
18/12/07 12:20:51 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(driver, <ip>, <port>, None)
18/12/07 12:20:51 INFO storage.BlockManagerMasterEndpoint: Registering block manager 10.34.187.37:49174 with 5.2 GB RAM, BlockManagerId(driver, <ip>, <port>, None)
18/12/07 12:20:51 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(driver, <ip>, <port>, None)
18/12/07 12:20:51 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(driver, <ip>, <port>, None)
18/12/07 12:20:51 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@66f93b78{/metrics/json,null,AVAILABLE,@Spark}
18/12/07 12:20:51 INFO cluster.YarnClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
18/12/07 12:20:51 INFO cluster.YarnClusterScheduler: YarnClusterScheduler.postStartHook done
18/12/07 12:20:51 INFO cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(spark://YarnAM@<ip>:<port2>)
18/12/07 12:20:52 INFO yarn.ApplicationMaster: 
===============================================================================
YARN executor launch context:
  env:
    CLASSPATH -> {{PWD}}<CPS>{{PWD}}/__spark_conf__<CPS>{{PWD}}/__spark_libs__/*<CPS>$HADOOP_CONF_DIR<CPS>/usr/hdp/current/hadoop-client/*<CPS>/usr/hdp/current/hadoop-client/lib/*<CPS>/usr/hdp/current/hadoop-hdfs-client/*<CPS>/usr/hdp/current/hadoop-hdfs-client/lib/*<CPS>/usr/hdp/current/hadoop-yarn-client/*<CPS>/usr/hdp/current/hadoop-yarn-client/lib/*<CPS>$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/usr/hdp/current/hadoop-client/lib/hadoop-lzo-0.6.0.2.4.0.0-169.jar:/etc/hadoop/conf/secure
    SPARK_YARN_STAGING_DIR -> *********(redacted)
    SPARK_USER -> *********(redacted)
    SPARK_YARN_MODE -> true
    PYTHONPATH -> {{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.10.4-src.zip<CPS>{{PWD}}/mlsdk.zip

  command:
    {{JAVA_HOME}}/bin/java \ 
      -server \ 
      -Xmx10240m \ 
      -Djava.io.tmpdir={{PWD}}/tmp \ 
      -Dspark.yarn.app.container.log.dir=<LOG_DIR> \ 
      -XX:OnOutOfMemoryError='kill %p' \ 
      org.apache.spark.executor.CoarseGrainedExecutorBackend \ 
      --driver-url \ 
      spark://CoarseGrainedScheduler@<ip>:<port2> \ 
      --executor-id \ 
      <executorId> \ 
      --hostname \ 
      <hostname> \ 
      --cores \ 
      1 \ 
      --app-id \ 
      application_1543928474139_36333 \ 
      --user-class-path \ 
      file:$PWD/__app__.jar \ 
      1><LOG_DIR>/stdout \ 
      2><LOG_DIR>/stderr

  resources:
    py4j-0.10.4-src.zip -> resource { scheme: "hdfs" host: "krios" port: -1 file: "/user/fk-mlp-user/.sparkStaging/application_1543928474139_36333/py4j-0.10.4-src.zip" } size: 74096 timestamp: 1544165297166 type: FILE visibility: PRIVATE
    mlp.mlc.Linux_debian_8_11.py27.mlc-base-xgboost.tar.gz -> resource { scheme: "hdfs" host: "krios" port: -1 file: "/user/fk-mlp-user/.sparkStaging/application_1543928474139_36333/mlp.mlc.Linux_debian_8_11.py27.mlc-base-xgboost.tar.gz" } size: 1683630994 timestamp: 1544165296954 type: ARCHIVE visibility: PRIVATE
    __spark_conf__ -> resource { scheme: "hdfs" host: "krios" port: -1 file: "/user/fk-mlp-user/.sparkStaging/application_1543928474139_36333/__spark_conf__.zip" } size: 102065 timestamp: 1544165297255 type: ARCHIVE visibility: PRIVATE
    pyspark.zip -> resource { scheme: "hdfs" host: "krios" port: -1 file: "/user/fk-mlp-user/.sparkStaging/application_1543928474139_36333/pyspark.zip" } size: 482687 timestamp: 1544165297102 type: FILE visibility: PRIVATE
    __spark_libs__ -> resource { scheme: "hdfs" host: "krios" port: -1 file: "/user/fk-mlp-user/.sparkStaging/application_1543928474139_36333/__spark_libs__1048407335554972528.zip" } size: 203821116 timestamp: 1544165272761 type: ARCHIVE visibility: PRIVATE
    mlsdk.zip -> resource { scheme: "hdfs" host: "krios" port: -1 file: "/user/fk-mlp-user/notebooks/mlsdk/mlsdk.zip" } size: 113787 timestamp: 1543586447744 type: FILE visibility: PUBLIC

===============================================================================
18/12/07 12:20:52 INFO yarn.YarnRMClient: Registering the ApplicationMaster
18/12/07 12:20:52 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
18/12/07 12:20:52 INFO yarn.ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals

amangarg96 commented 5 years ago

STDERR for kernel

Using connection file '/tmp/kernel-75782334-241c-467a-9614-d273c3aa07c5_PA5ozK.json' instead of '/root/.local/share/jupyter/runtime/kernel-75782334-241c-467a-9614-d273c3aa07c5.json'
Signal socket bound to host: 0.0.0.0, port: 48039
JSON Payload '{"stdin_port": 51236, "pgid": "10099", "ip": "0.0.0.0", "pid": "10151", "control_port": 46639, "hb_port": 23160, "signature_scheme": "hmac-sha256", "key": <key>, "comm_port": 48039, "kernel_name": "", "shell_port": 12652, "transport": "tcp", "iopub_port": 39252}
Encrypted Payload 'tjHpvr/C1ehfUlRDRJZDNmD4Y+T+mYzyWrwa4U9hZfFkfl4dk1aTDyKGYTnuBXRIGmsOBeiDkPeSptJ7ud5+TOq54oPgdLoMDZhcI9J3qvPwo0/3U40SDDbUwCmFuSrugJjvxFXf+boCSomXPuO1LMfI5L23KYPOoUArqZZoIT9BYMZ9RZ9HE+yN9AsXu8sHC3g/WMMWCpD819Zt5l0lElt884K44tZQvsZub2JEzVIKzzmxrG3ASZdNNdofJEVVUMMnwTimfnQZSfICEasAcVI0uwTF66fbe7fi+ATy1i4S2fsh2durHZADtMXZLHOg7857pm5DgIrkgQWq0naIJ+p8CxpJ4p7MJ38o46G+yXmTTTwnC+XO/GE6tQ0Oz2T4Xs9b72MavwGlSYuA2xKtSw==
/grid/1/yarn/local/usercache/fk-mlp-user/appcache/application_1543928474139_36333/container_e238_1543928474139_36333_01_000001/mlp.mlc.Linux_debian_8_11.py27.mlc-base-xgboost.tar.gz/lib/python2.7/site-packages/IPython/paths.py:69: UserWarning: IPython parent '/home' is not a writable location, using a temp directory.
  " using a temp directory.".format(parent))
NOTE: When using the `ipython kernel` entry point, Ctrl-C will not work.

To exit, you will have to explicitly quit this process, by either sending
"quit" from a client, or using Ctrl-\ in UNIX-like environments.

To read more about this, see https://github.com/ipython/ipython/issues/2049

To connect another client to this kernel, use:
    --existing /tmp/kernel-75782334-241c-467a-9614-d273c3aa07c5_PA5ozK.json
kevin-bates commented 5 years ago

Is the second log snippet actually STDOUT? That output is produced by the launcher and goes to stdout on my YARN cluster.

I'm afraid you're going to need to figure out why the exception is happening since I'm unable to reproduce this. Does this happen on EVERY kernel you start in cluster mode - or only intermittently?

The real issue is a failure occurring in the initial shutdown attempt - which uses the YARN client API to terminate the YARN application.

[W 2018-12-01 19:49:56.618 EnterpriseGatewayApp] Termination of application 'application_1542713397395_139615' failed with exception: 'Response finished with status: 500. Details: {"RemoteException":{"exception":"WebApplicationException","message":"com.sun.jersey.api.MessageException: A message body reader for Java class

So this indicates that YARN's REST layer is hitting an exception - it cannot find a suitable message body reader (note the JAXB/XML providers listed in the message) for the request it received.

amangarg96 commented 5 years ago

The second log snippet is from STDERR in the kernel logs.

And no, it doesn't happen for every kernel. That exception is raised only for the orphan kernels, which show up occasionally. Most kernels either reach the FINISHED state after a graceful notebook shutdown, or are killed by the Gateway server when the culling timeout I've configured expires.

kevin-bates commented 5 years ago

Strange. Are you able to determine whether this occurs only on a certain node of your cluster, or whether it might be related to what the kernel was doing? A node difference has bitten me before.

amangarg96 commented 5 years ago

I did switch to a different node for running my Gateway servers, and I haven't seen this issue since, so I should definitely test whether it was a node issue.

Would password-less SSH solve this orphan-kernel issue?

kevin-bates commented 5 years ago

It might. The fallback essentially sends SIGTERM to the kernel's process group; if the process is still detected afterwards, it sends SIGKILL. I'm assuming the process group is correct (it comes from the launcher that is the target of the spark-submit call).
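
In shell terms, the fallback amounts to roughly the following - a sketch rather than EG's exact commands; the pgid and host come from the kernel's connection info (e.g. '-10099' on '10.34.187.37' in your log):

    # send SIGTERM to the kernel's process group on the remote host
    ssh <kernel-host> kill -15 -- -<pgid>
    # if the process is still detected shortly afterwards, force-kill the group
    ssh <kernel-host> kill -9 -- -<pgid>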

That said, we should figure out what's going on, since the REST API is by far the better approach; the SSH path is merely a fallback.

amangarg96 commented 5 years ago

I tried to reproduce the error using curl, and observed the following.

When I used it without a header, i.e. curl -v -X PUT -d '{"state": "KILLED"}' 'http://<ip>:8088/ws/v1/cluster/apps/application_1544747350136_52218/state', I got the same error seen above (the termination-of-application exception):

*   Trying 10.32.77.243...
* TCP_NODELAY set
* Connected to 10.32.77.243 (10.32.77.243) port 8088 (#0)
> PUT /ws/v1/cluster/apps/application_1544747350136_52218/state HTTP/1.1
> Host: 10.32.77.243:8088
> User-Agent: curl/7.60.0
> Accept: */*
> Content-Length: 19
> Content-Type: application/x-www-form-urlencoded
>
* upload completely sent off: 19 out of 19 bytes
< HTTP/1.1 500 Internal Server Error
< Cache-Control: no-cache
< Expires: Tue, 18 Dec 2018 12:22:51 GMT
< Date: Tue, 18 Dec 2018 12:22:51 GMT
< Pragma: no-cache
< Expires: Tue, 18 Dec 2018 12:22:51 GMT
< Date: Tue, 18 Dec 2018 12:22:51 GMT
< Pragma: no-cache
< Content-Type: application/json
< X-Frame-Options: SAMEORIGIN
< Transfer-Encoding: chunked
<
* Connection #0 to host 10.32.77.243 left intact
{"RemoteException":{"exception":"WebApplicationException","javaClassName":"javax.ws.rs.WebApplicationException"}}%

But when I included the header, i.e. curl -v -X PUT -H "Content-Type: application/json" -d '{"state": "KILLED"}' 'http://10.32.77.243:8088/ws/v1/cluster/apps/application_1544747350136_52218/state', it worked:

*   Trying 10.32.77.243...
* TCP_NODELAY set
* Connected to 10.32.77.243 (10.32.77.243) port 8088 (#0)
> PUT /ws/v1/cluster/apps/application_1544747350136_52218/state HTTP/1.1
> Host: 10.32.77.243:8088
> User-Agent: curl/7.60.0
> Accept: */*
> Content-Type: application/json
> Content-Length: 19
>
* upload completely sent off: 19 out of 19 bytes
< HTTP/1.1 202 Accepted
< Cache-Control: no-cache
< Expires: Tue, 18 Dec 2018 12:23:31 GMT
< Date: Tue, 18 Dec 2018 12:23:31 GMT
< Pragma: no-cache
< Expires: Tue, 18 Dec 2018 12:23:31 GMT
< Date: Tue, 18 Dec 2018 12:23:31 GMT
< Pragma: no-cache
< Content-Type: application/json
< X-Frame-Options: SAMEORIGIN
< Location: http://10.32.77.243:8088/ws/v1/cluster/apps/application_1544747350136_52218/state
< Transfer-Encoding: chunked
<
* Connection #0 to host 10.32.77.243 left intact
{"state":"RUNNING"}%

Looking at the requests.put line here, it looks like no headers are being passed. Could this be the issue?

kevin-bates commented 5 years ago

Yes, I think you're on to something.

cc: @lresende