jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/

Error propagation when kernel killed by YARN #819

Closed amangarg96 closed 4 years ago

amangarg96 commented 4 years ago

Description

In YARN, if a container's memory is exhausted, YARN kills the container and the Spark application. For remote kernels launched through EG, this means YARN kills the container but no error is propagated to the JupyterLab server or to the user.

With EG 1.2.0, the status of the kernel on the UI becomes 'Kernel Dead'; with EG 2.1.1, the kernel is restarted when the container is killed by YARN. Since there is no pop-up or error propagation, it becomes a silent restart for the user.

We have users running Hive queries, distributed pyspark jobs and loading large dataframes in Notebooks, so this has become a frequently occurring issue for us.

Is there a way to propagate the error back to the user?

When an explicit kill signal for the Spark application is sent to YARN, I have seen an error raised in the notebook ('Application has been killed by user '). Something similar for this case would be helpful.

Screenshots / Logs

When the Spark application is killed by YARN, this is the log from the YARN UI:

Application application_1588757414349_507688 failed 3 times due to AM Container for appattempt_1588757414349_507688_000003 exited with exitCode: 13
Failing this attempt.Diagnostics: [2020-06-01 15:16:42.956]Exception from container-launch.
Container id: container_e2407_1588757414349_507688_03_000001
Exit code: 13
[2020-06-01 15:16:42.957]Container exited with a non-zero exit code 13. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
rn.driver.memoryOverhead' has been deprecated as of Spark 2.3 and may be removed in the future. Please use the new key 'spark.driver.memoryOverhead' instead.
20/06/01 15:16:40 INFO SecurityManager: Changing view acls to: yarn,aman.garg
20/06/01 15:16:40 INFO SecurityManager: Changing modify acls to: yarn,aman.garg
20/06/01 15:16:40 INFO SecurityManager: Changing view acls groups to:
20/06/01 15:16:40 INFO SecurityManager: Changing modify acls groups to:
20/06/01 15:16:40 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, aman.garg); groups with view permissions: Set(); users with modify permissions: Set(yarn, aman.garg); groups with modify permissions: Set()
20/06/01 15:16:40 INFO ApplicationMaster: Preparing Local resources
20/06/01 15:16:41 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: dfs.datanode.failed.volumes.tolerated; Ignoring.
20/06/01 15:16:41 INFO ApplicationMaster: ApplicationAttemptId: appattempt_1588757414349_507688_000003
20/06/01 15:16:41 INFO ApplicationMaster: Starting the user application in a separate Thread
20/06/01 15:16:41 INFO ApplicationMaster: Waiting for spark context initialization...
20/06/01 15:16:41 WARN SparkConf: The configuration key 'spark.yarn.executor.memoryOverhead' has been deprecated as of Spark 2.3 and may be removed in the future. Please use the new key 'spark.executor.memoryOverhead' instead.
20/06/01 15:16:41 WARN SparkConf: The configuration key 'spark.yarn.driver.memoryOverhead' has been deprecated as of Spark 2.3 and may be removed in the future. Please use the new key 'spark.driver.memoryOverhead' instead.
20/06/01 15:16:42 ERROR ApplicationMaster: User application exited with status 1
20/06/01 15:16:42 INFO ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: User application exited with status 1)
20/06/01 15:16:42 ERROR ApplicationMaster: Uncaught exception:
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:773)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:772)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:797)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
Caused by: org.apache.spark.SparkUserAppException: User application exited with 1
at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:106)
at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:678)
20/06/01 15:16:42 INFO ApplicationMaster: Deleting staging directory hdfs://bheema/user/aman.garg/.sparkStaging/application_1588757414349_507688
20/06/01 15:16:42 INFO ShutdownHookManager: Shutdown hook called
For more detailed output, check the application tracking page: http://prod-fdphadoop-bheema-rm-0001:8088/cluster/app/application_1588757414349_507688 Then click on links to logs of each attempt.
. Failing the application.

Environment

kevin-bates commented 4 years ago

Hi @amangarg96 - thanks for opening this issue - another difficult issue, but an important one.

I suspect the difference in releases here might be in how auto-restart is detected and handled. My thought is that perhaps in 1.x it wasn't handled "properly" while in 2.x it is. However, in this case, perhaps we should not handle auto-restarts at all?

As you know, the framework polls every 3 seconds for the kernel process's existence. On YARN, this is probably a status call via the API against the application. However, that polling occurs from an event loop and isn't tied to a client-side request, so I suspect the best that could happen would be for the client to discover the "dead kernel" and report it that way.
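
Roughly, that liveness check amounts to something like this (just a sketch - `query_app_state` and the surrounding names are illustrative, not the actual process-proxy code):

# Illustrative sketch of the ~3-second liveness check, not the actual
# YarnClusterProcessProxy code. `query_app_state` stands in for whatever
# call asks the YARN ResourceManager for the application's current state.
TERMINAL_STATES = {"FINISHED", "FAILED", "KILLED"}

def poll(application_id, query_app_state):
    """Mimic Popen.poll(): return None while the app is alive, its state once it isn't."""
    state = query_app_state(application_id)  # e.g. GET /ws/v1/cluster/apps/{app-id}
    return None if state not in TERMINAL_STATES else state

Once poll() starts returning a non-None value, the restarter concludes the kernel process has died and, by default, kicks off an auto-restart.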

I think we may want to disable auto-restarts for YARN kernels altogether. I'm not sure about the other process proxies - although the IBM Spectrum Conductor one may want to follow suit (@kjdoyle).

When an explicit kill signal for spark application is sent to YARN, I have seen that an error is raised on Notebooks ('Application has been killed by user ').

Can you elaborate on what actually happens here? I'm having trouble understanding why this too wouldn't be viewed as 'kernel died, trigger auto-restart'. A traceback in the EG log would be helpful to see how it's getting handled (and propagated).

Regarding bandwidth - my time is extremely limited and it seems I'm about the only maintainer dealing with this repository, so I won't be able to look into this. Also, I'll be building the next Notebook release and plan to build EG 2.2 shortly after, so if we could try to address this soon, that would be ideal. I would be happy to guide you should you need assistance.

kjdoyle commented 4 years ago

@kevin-bates the ability to do auto-restarts would be good if it is working, or if we can get it working. I will need to do more testing in Conductor, as we have a concept of restarting the driver (kernel) up to 3 times if it fails.

amangarg96 commented 4 years ago

When an explicit kill signal for spark application is sent to YARN, I have seen that an error is raised on Notebooks ('Application has been killed by user ').

With EG 1.2, if we have a running kernel and I kill the Spark job from the YARN UI, the error is propagated back to the user in the form of a pop-up, which ends with 'Application KILLED by user '. I'll try to reproduce it and share the screenshot.

And if I do the same with EG 2.1, EG does an automatic restart of the kernel. Silently.

And yes, maybe we should have a configurable way to enable/disable auto-restarts. My focus would be on propagating the error when auto-restart is disabled and the application is killed for reasons like the driver (kernel) running out of memory.

kevin-bates commented 4 years ago

I don't think we can have it both ways. If the restarts occur, they will be silent. If they do not occur, we should surface that indication back to the user (since we have it) - although I think it's more likely that the frontend discovers the kernel is dead. That said, I'm open to reviewing a better solution.

@amangarg96 - Your issue stems from OOM. Is that something that can be configured into your kernelspecs for now?

amangarg96 commented 4 years ago

Your issue stems from OOM. Is that something that can be configured into your kernelspecs for now?

What configurations can be set for this? I'm aware of spark.driver.maxResultSize, but that is useful only for restricting the size of the data coming from the executors to the driver. I'm not aware of configs preventing OOMs due to, say, loading large CSV files as pandas dataframes.

Could you point me to the configurations?

amangarg96 commented 4 years ago

When an explicit kill signal for spark application is sent to YARN, I have seen that an error is raised on Notebooks ('Application has been killed by user ')

Regarding this, I'm sorry, but I remembered it incorrectly. What I was thinking of was the case where the kernel launch fails. When a kernel launch is requested and the Spark application is killed (from the YARN UI) before the kernel has launched, there is an 'Error Starting Kernel' pop-up -

Traceback (most recent call last):
  File "/Users/aman.garg/Downloads/hunch_bundle/lib/python3.6/site-packages/tornado/web.py", line 1592, in _execute
    result = yield result
  File "/Users/aman.garg/Downloads/hunch_bundle/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/Users/aman.garg/Downloads/hunch_bundle/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/Users/aman.garg/Downloads/hunch_bundle/lib/python3.6/site-packages/notebook/services/sessions/handlers.py", line 73, in post
    type=mtype))
  File "/Users/aman.garg/Downloads/hunch_bundle/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/Users/aman.garg/Downloads/hunch_bundle/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/Users/aman.garg/Downloads/hunch_bundle/lib/python3.6/site-packages/nb2kg/managers.py", line 397, in create_session
    session_id, path, name, type, kernel_name,
  File "/Users/aman.garg/Downloads/hunch_bundle/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/Users/aman.garg/Downloads/hunch_bundle/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/Users/aman.garg/Downloads/hunch_bundle/lib/python3.6/site-packages/notebook/services/sessions/sessionmanager.py", line 92, in start_kernel_for_session
    self.kernel_manager.start_kernel(path=kernel_path, kernel_name=kernel_name)
  File "/Users/aman.garg/Downloads/hunch_bundle/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/Users/aman.garg/Downloads/hunch_bundle/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/Users/aman.garg/Downloads/hunch_bundle/lib/python3.6/site-packages/nb2kg/managers.py", line 156, in start_kernel
    response = yield fetch_kg(self.kernels_endpoint, method='POST', body=json_body)
  File "/Users/aman.garg/Downloads/hunch_bundle/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/Users/aman.garg/Downloads/hunch_bundle/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/Users/aman.garg/Downloads/hunch_bundle/lib/python3.6/site-packages/nb2kg/managers.py", line 67, in fetch_kg
    response = yield client.fetch(url, **kwargs)
  File "/Users/aman.garg/Downloads/hunch_bundle/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
tornado.httpclient.HTTPClientError: HTTP 500: KernelID: 'e5091a99-a887-47ab-a6bc-d005b89f3f8f', ApplicationID: 'application_1588757414349_587427' unexpectedly found in state 'KILLED' during kernel startup!

With EG 1.2.0 too, the kernel gets restarted if it gets killed due to OOM (or an explicit kill of the Spark application from the YARN UI).

kevin-bates commented 4 years ago

Ok - thanks for the update. So just to clarify - are you finding the two EG releases behaving similarly?

Can you try experimenting with running EG with this command-line (or config) option: --KernelRestarter.restart_limit=0? (I believe '1' will do the same thing.) This will monitor for the kernel process's death but not perform a restart. I'm finding it better than --KernelManager.autorestart=False because that disables the process polling altogether - so the application doesn't really know the kernel has died.
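
If a config file is more convenient than the command line, the equivalent setting (a sketch, assuming a jupyter_enterprise_gateway_config.py on EG's config path) would be:

# jupyter_enterprise_gateway_config.py - config-file equivalent of the
# --KernelRestarter.restart_limit=0 command-line option (sketch).
c = get_config()  # provided to config files by the traitlets config loader

# Keep polling for the kernel process's death, but never attempt a restart.
c.KernelRestarter.restart_limit = 0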

If we find that restart_limit=0 is sufficient, we could then look into how we might go about setting this option on a per kernelspec or perhaps process-proxy basis.

Regarding the adjustment of Spark parameters relative to memory, there appear to be a few options for driver and worker memory, JVM options, etc. See https://spark.apache.org/docs/latest/configuration.html#application-properties and https://spark.apache.org/docs/latest/configuration.html#runtime-environment.

If we need more tuning advice, I can consult with my Spark colleagues, but we should make sure we've checked our options first.

kevin-bates commented 4 years ago

I have an implementation where this could be configured via the process-proxy config stanza - making this a per-kernelspec configurable option. The tricky part is that the restart_limit must be set after the kernel has started. (I hooked the post_start_kernel() method for this.)

A couple more items.

  1. When the restart_limit is set to 0, the logging and messaging imply that auto-restarts have failed simply because the limit has been "exhausted" (even though a restart was never attempted).
  2. In Notebook, the message is as shown in the attached screenshot, while in JupyterLab the kernel name switches to "No kernel!" and nothing more.

I like the Notebook behavior in that it still gives the user the option to attempt a restart - in which case that restart will succeed. I think Lab users would just need to know that 'No kernel!' means the kernel died.

It's too bad that KernelManager.autorestart=False doesn't still monitor for the process's death, and that the messaging is poor when restart_limit=0.

amangarg96 commented 4 years ago

So just to clarify - are you finding the two EG releases behaving similarly?

Yes, both the EG releases are behaving similarly.

I have an implementation where this could be configured via the process-proxy config stanza

I am not familiar with the process-proxy config. What are these configs being used for? Why is it required to set the --KernelRestarter.restart_limit=0 after the kernel has been launched?

the messaging is poor when restart_limit=0.

I tried setting the --KernelRestarter.restart_limit=0 and using it with JupyterLab. I observed the same behaviour on the Lab UI (Kernel state switching to "No kernel!"). On Enterprise Gateway, I saw the following log -

[W 2020-06-10 19:10:34.858 EnterpriseGatewayApp] KernelRestarter: restart failed
[W 2020-06-10 19:10:34.858 EnterpriseGatewayApp] Kernel 619914fd-3f82-4d44-95b7-ef013414c9af died, removing from map.
[E 200610 19:10:34 handlers:492] kernel 619914fd-3f82-4d44-95b7-ef013414c9af restarted failed!

Would it be a good idea to raise an exception when the restart by KernelRestarter fails? Something like self.log_and_raise(http_status_code=404, reason="Kernel found in dead state, and KernelRestart limit reached!")

kevin-bates commented 4 years ago

I am not familiar with the process-proxy config. What are these configs being used for?

I think this capability may have been added in 2.0, but the process_proxy stanza can be extended with additional configuration options that then apply on a per-kernel basis. Things like authorized_users can be added so that only those users can run the given kernel; port ranges can be specified, as well as YARN endpoints, etc.

So I essentially added the ability to specify the restart limit on a per-kernel basis. This way, if you know a given kernel (based on its memory requirements, etc.) is subject to failures that should not result in automatic restarts, you could indicate that via this approach:

  "metadata": {
    "process_proxy": {
      "class_name": "enterprise_gateway.services.processproxies.yarn.YarnClusterProcessProxy",
      "config": {
          "restart_limit": 0
      }
    }
  }

Why is it required to set the --KernelRestarter.restart_limit=0 after the kernel has been launched?

The restarter isn't started until after the kernel has started. As a result, the class instance doesn't check its restart_limit until then. So, if we only wanted this to apply to specific kernels (via the process-proxy config), we need to set the restart_limit on the specific class instance and we do this by overriding post_start_kernel().
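
In rough terms, the hook looks like this (a sketch of the idea only - the actual change differs in its details, and the process-proxy config lookup below is illustrative):

# Sketch: apply a per-kernelspec restart_limit once the restarter exists.
from jupyter_client.ioloop import IOLoopKernelManager

class RestartLimitingKernelManager(IOLoopKernelManager):

    def post_start_kernel(self, **kwargs):
        # The parent starts the restarter, so only after this call does
        # self._restarter exist with its default restart_limit.
        super().post_start_kernel(**kwargs)

        # Hypothetical lookup of the kernelspec's process_proxy config,
        # e.g. "config": {"restart_limit": 0} in kernel.json metadata.
        proxy_config = (
            self.kernel_spec.metadata
            .get("process_proxy", {})
            .get("config", {})
        )
        if self._restarter is not None and "restart_limit" in proxy_config:
            self._restarter.restart_limit = proxy_config["restart_limit"]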

Would it be a good idea, to raise an exception when the restart by KernelRestarter fails? something like self.log_and_raise(http_status_code=404, reason="Kernel found in dead state, and KernelRestart limit reached!")

The auto-restarts are performed by a periodic poll task, deep in jupyter_client, that checks every 3 seconds whether the kernel process is still functioning.

The ZMQChannelsHandler in Notebook is the entity that detects that the restart failed and sends a status message of 'dead' - which seems like the right thing to do. How the frontends interpret this is another matter.
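
On the wire, that is just a kernel-protocol status message whose execution_state is 'dead' - roughly of this shape (header values below are placeholders):

# Rough shape of the status message the frontend receives when the
# restart ultimately fails; header fields are placeholder values.
dead_status = {
    "header": {"msg_type": "status", "msg_id": "<uuid>", "session": "<session-id>"},
    "parent_header": {},
    "metadata": {},
    "content": {"execution_state": "dead"},
}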

Since this would require multiple layers of changes, or another wave of incorporating code directly into EG, I'm not sure it's worth the effort.

amangarg96 commented 4 years ago

Hey Kevin,

I was going through the relevant issues on the JupyterLab and jupyter_client repos, and found this PR. In it, the user is notified only when the auto-restart in jupyter_client is triggered.

[screenshot of the restart notification]

While in Jupyter Lab the kernel name switches to "No kernel!" and nothing more.

We didn't get the pop-up with JupyterLab since we had set --KernelRestarter.restart_limit=0. I tested with JupyterLab version 2.1.4 by setting a non-zero restart limit and it works 😄

kevin-bates commented 4 years ago

I'm not sure what your point is. The issue is that we'd like to avoid auto-restarts altogether AND be notified - in any manner - that the kernel has died. Could you please clarify what this comment is driving at?

Setting that option shouldn't result in a change in front-end behavior, so I'm a little confused. Since the server-side code uses >= restart-limit to determine that it should NOT continue auto-restarts, I'm fairly certain you'll see the same behavior using --KernelRestarter.restart_limit=1.

amangarg96 commented 4 years ago

The issue is that we'd like to avoid auto-restarts altogether AND be notified

The main reason I wanted to avoid auto-restarts was that the user was not being notified that the kernel had died and was being restarted (silent restarts). With this user notification when the auto-restart is triggered, the user will know their kernel died and will be aware of losing the kernel session.

kevin-bates commented 4 years ago

I see - thank you. Are you testing this by issuing the kill request from the YARN application UI as well?

kevin-bates commented 4 years ago

Actually, I just noticed how recent the Lab PR was! I was thinking it had been around for some time, then realized 2.1 is relatively new.

Nice find. I suspect we can go ahead and close this issue then - is that your understanding?

amangarg96 commented 4 years ago

I'm fairly certain you'll see the same behavior using --KernelRestarter.restart_limit=1

In my observations, when I set the restart limit to 1, there is no user notification. The pop-up comes up only if the kernel is in the 'autostarting' state.

If we set the restart_limit to 1, the kernel restart fails before the kernel goes to the 'autostarting' state, and hence there is no user notification. In the Enterprise Gateway logs, it shows [W 2020-06-11 22:40:28.542 EnterpriseGatewayApp] KernelRestarter: restart failed

amangarg96 commented 4 years ago

Are you testing this by issuing the kill request from the YARN application UI as well?

Yes, killing the Spark application from the YARN UI is also handled. The KernelRestarter is triggered in that case as well.

amangarg96 commented 4 years ago

I suspect we can go ahead and close this issue then - is that your understanding?

Yeah, I think we can close this issue for now, since it's difficult to catch the stack trace related to OOM.

Thanks Kevin for all the help. Cheers!

kevin-bates commented 4 years ago

Awesome - thanks @amangarg96. We're getting close to having a pretty cool EG 2.2 release! Loving the async kernel management stuff!

amangarg96 commented 4 years ago

Async kernel management will be huge! Our team is eagerly waiting to take it for a spin 😄