Hi @amangarg96 - thanks for opening this issue - another difficult issue, but an important one.
I suspect the difference in releases here might be in how auto-restart is detected and handled. My thought is that perhaps in 1.x it wasn't handled "properly" while in 2.x it is. However, in this case, perhaps we should not handle auto-restarts at all?
As you know, the framework polls every 3 seconds for the kernel process's existence. On YARN, this is probably a status call via the API against the application. However, that polling occurs from an event loop and isn't tied to a client-side request, so I suspect the best that could happen would be for the client to discover the "dead kernel" and report it that way.
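To make that concrete, here is a rough sketch (not the actual YarnClusterProcessProxy code) of what that poll boils down to. The get_application_state() argument is a hypothetical helper wrapping the YARN ResourceManager REST API (e.g. GET /ws/v1/cluster/apps/{app_id}):

# Illustrative sketch only: how a periodic poll might map a YARN application
# state onto "is the kernel still alive?".
FINAL_STATES = {"FINISHED", "FAILED", "KILLED"}

def poll_kernel(app_id, get_application_state):
    """Return None if the kernel (YARN app) still appears alive, non-None otherwise.

    Mirrors the poll() contract used by process proxies: a None result means
    'process is running'; anything else is treated as the kernel having died,
    which is what the restarter's ~3-second polling reacts to.
    """
    state = get_application_state(app_id)  # e.g. "RUNNING", "KILLED", ...
    if state in FINAL_STATES:
        return state   # signals "dead" to the restarter
    return None        # signals "still running"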
I think we may want to disable auto-restarts for YARN kernels altogether. I'm not sure about the other process proxies - although IBM Spectrum Conductor may want to follow suit (@kjdoyle).
When an explicit kill signal for the Spark application is sent to YARN, I have seen that an error is raised in Notebooks ('Application has been killed by user ').
Can you elaborate on what actually happens here? I'm having trouble understanding why this too wouldn't be viewed as 'kernel died, trigger auto restart'. A traceback in the EG log would be helpful to see how it's getting handled (and propagated).
Regarding bandwidth - my time is extremely limited and it seems I'm about the only maintainer dealing with this repository, so I won't be able to look into this. Also, I'll be building the next Notebook release and plan to build EG 2.2 shortly after, so if we could try to address this soon, that would be ideal. I would be happy to guide you should you need assistance.
@kevin-bates the ability to do auto-restarts would be good if it is working or we can get it working. I will need to do more testing in Conductor, as we have a concept of restarting the driver (kernel) up to 3 times if it fails.
When an explicit kill signal for the Spark application is sent to YARN, I have seen that an error is raised in Notebooks ('Application has been killed by user ').
In this case, with EG 1.2, if we have a running kernel and I kill the Spark job from the YARN UI, the error is propagated back to the user in the form of a pop-up, which ends with 'Application
And if I do the same with EG 2.1, EG does an automatic restart of the kernel. Silently.
And yes, maybe we should have a configurable way to enable/disable auto-restarts. My focus would be on propagating the error when auto-restart is disabled and the application is killed for reasons like the driver (kernel) running OOM.
I don't think we can have it both ways. If the restarts occur, they will be silent. If they do not occur, we should surface that indication back to the user (because we have it) - although I think it's more likely the case that the frontend discovers the kernel is dead. That said, I'm open to reviewing a better solution.
@amangarg96 - Your issue stems from OOM. Is that something that can be configured into your kernelspecs for now?
Your issue stems from OOM. Is that something that can be configured into your kernelspecs for now?
What configurations can be set for this? I'm aware of spark.driver.maxResultSize, but that is useful only for restricting the size of the data coming from the executors to the driver. I'm not aware of configs preventing OOMs due to, say, loading large CSV files as pandas DataFrames.
Could you point me to the configurations?
When an explicit kill signal for the Spark application is sent to YARN, I have seen that an error is raised in Notebooks ('Application has been killed by user ')
Regarding this, I'm sorry but I remembered it incorrectly. What I was thinking of was the case when the kernel launch fails. When a kernel launch is requested and the Spark application is killed (from the YARN UI) before the kernel has been launched, there is an 'Error Starting Kernel' pop-up -
Traceback (most recent call last):
File "/Users/aman.garg/Downloads/hunch_bundle/lib/python3.6/site-packages/tornado/web.py", line 1592, in _execute
result = yield result
File "/Users/aman.garg/Downloads/hunch_bundle/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/Users/aman.garg/Downloads/hunch_bundle/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/Users/aman.garg/Downloads/hunch_bundle/lib/python3.6/site-packages/notebook/services/sessions/handlers.py", line 73, in post
type=mtype))
File "/Users/aman.garg/Downloads/hunch_bundle/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/Users/aman.garg/Downloads/hunch_bundle/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/Users/aman.garg/Downloads/hunch_bundle/lib/python3.6/site-packages/nb2kg/managers.py", line 397, in create_session
session_id, path, name, type, kernel_name,
File "/Users/aman.garg/Downloads/hunch_bundle/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/Users/aman.garg/Downloads/hunch_bundle/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/Users/aman.garg/Downloads/hunch_bundle/lib/python3.6/site-packages/notebook/services/sessions/sessionmanager.py", line 92, in start_kernel_for_session
self.kernel_manager.start_kernel(path=kernel_path, kernel_name=kernel_name)
File "/Users/aman.garg/Downloads/hunch_bundle/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/Users/aman.garg/Downloads/hunch_bundle/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/Users/aman.garg/Downloads/hunch_bundle/lib/python3.6/site-packages/nb2kg/managers.py", line 156, in start_kernel
response = yield fetch_kg(self.kernels_endpoint, method='POST', body=json_body)
File "/Users/aman.garg/Downloads/hunch_bundle/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/Users/aman.garg/Downloads/hunch_bundle/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/Users/aman.garg/Downloads/hunch_bundle/lib/python3.6/site-packages/nb2kg/managers.py", line 67, in fetch_kg
response = yield client.fetch(url, **kwargs)
File "/Users/aman.garg/Downloads/hunch_bundle/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
tornado.httpclient.HTTPClientError: HTTP 500: KernelID: 'e5091a99-a887-47ab-a6bc-d005b89f3f8f', ApplicationID: 'application_1588757414349_587427' unexpectedly found in state 'KILLED' during kernel startup!
With EG 1.2.0 too, the kernel gets restarted if it gets killed due to OOM (or an explicit kill of the Spark application from the YARN UI).
Ok - thanks for the update. So just to clarify - are you finding the two EG releases behaving similarly?
Can you try experimenting with running EG with this command-line (or config) option: --KernelRestarter.restart_limit=0? (I believe '1' will do the same thing.) This will monitor for the kernel process's death, but not perform a restart. I'm finding it better than --KernelManager.autorestart=False, because that disables the process polling altogether - so the application doesn't really know the kernel has died.
If we find that restart_limit=0 is sufficient, we could then look into how we might go about setting this option on a per-kernelspec or perhaps per-process-proxy basis.
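For completeness, here is a minimal sketch of the same setting in a Jupyter config file rather than on the command line (assuming a jupyter_enterprise_gateway_config.py that gets picked up on the config path):

# Equivalent to passing --KernelRestarter.restart_limit=0 at startup.
c = get_config()  # provided by the Jupyter config loader when this file is read
c.KernelRestarter.restart_limit = 0  # keep polling for kernel death, but never auto-restart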
Regarding the adjustment of Spark parameters relative to memory, there appear to be a few options for driver and worker memory, JVM options, etc. See https://spark.apache.org/docs/latest/configuration.html#application-properties and https://spark.apache.org/docs/latest/configuration.html#runtime-environment.
If we need more tuning advice, I can consult with my Spark colleagues, but we should make sure we've checked our options first.
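As a starting point, here is a hedged example of the kind of memory-related settings that could go into a kernelspec's env stanza. The SPARK_OPTS variable name and all values are assumptions/placeholders (shown as a Python dict for readability) - they need tuning per cluster:

# Hypothetical kernelspec "env" values for giving the driver (kernel) more headroom.
kernelspec_env = {
    "SPARK_OPTS": (
        "--conf spark.driver.memory=8g "           # heap for the driver, i.e. the kernel
        "--conf spark.driver.memoryOverhead=2g "   # off-heap headroom before YARN kills the container
        "--conf spark.executor.memory=4g"
    ),
}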
I have an implementation where this could be configured via the process-proxy config stanza - making this a per-kernelspec configurable option. The tricky part is that the restart_limit must be set after the kernel has started. (I hooked the post_start_kernel() method for this.)
A couple more items.
"No kernel!"
and nothing more.I like the Notebook behavior in that it still gives the user to attempt to restart - in which case that restart will succeed. I think Lab users would just need to know that 'No kernel!' means the kernel died.
Its too bad KernelManager.autorestart=False doesn't still monitor for the process's death and the messaging is poor when restart_limit=0.
So just to clarify - are you finding the two EG releases behaving similarly?
Yes, both the EG releases are behaving similarly.
I have an implementation where this could be configured via the process-proxy config stanza
I am not familiar with the process-proxy config. What are these configs being used for? Why is it required to set --KernelRestarter.restart_limit=0 after the kernel has been launched?
the messaging is poor when restart_limit=0.
I tried setting --KernelRestarter.restart_limit=0 and using it with JupyterLab. I observed the same behaviour on the Lab UI (the kernel state switching to "No kernel!"). On Enterprise Gateway, I saw the following log -
[W 2020-06-10 19:10:34.858 EnterpriseGatewayApp] KernelRestarter: restart failed
[W 2020-06-10 19:10:34.858 EnterpriseGatewayApp] Kernel 619914fd-3f82-4d44-95b7-ef013414c9af died, removing from map.
[E 200610 19:10:34 handlers:492] kernel 619914fd-3f82-4d44-95b7-ef013414c9af restarted failed!
Would it be a good idea to raise an exception when the restart by the KernelRestarter fails?
Something like self.log_and_raise(http_status_code=404, reason="Kernel found in dead state, and KernelRestart limit reached!")
I am not familiar with the process-proxy config. What are these configs being used for?
I think this capability may have been added in 2.0, but the process_proxy stanza can be extended with additional configuration options that would then apply on a per-kernel basis. Things like authorized_users can be added so that only those users can run the given kernel; port ranges can be specified, as well as YARN endpoints, etc.
So I essentially added the ability to specify the restart limit on a per-kernel basis. This way, if you know a given kernel (based on its memory requirements, etc.) is subject to failures that should not result in automatic restarts, you could indicate that via this approach:
"metadata": {
"process_proxy": {
"class_name": "enterprise_gateway.services.processproxies.yarn.YarnClusterProcessProxy",
"config": {
"restart_limit": 0
}
}
}
Why is it required to set the --KernelRestarter.restart_limit=0 after the kernel has been launched?
The restarter isn't started until after the kernel has started, so the class instance doesn't check its restart_limit until then. So, if we only wanted this to apply to specific kernels (via the process-proxy config), we need to set the restart_limit on the specific class instance, and we do this by overriding post_start_kernel().
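To make that concrete, here is a rough sketch of what that hook could look like - not EG's actual code. The class name is hypothetical, IOLoopKernelManager is used as a stand-in base, and it assumes the parent class creates self._restarter inside post_start_kernel()/start_restarter() and that the kernelspec carries the process_proxy "config" stanza shown above:

from jupyter_client.ioloop import IOLoopKernelManager

class PerKernelRestartLimitManager(IOLoopKernelManager):
    def post_start_kernel(self, **kwargs):
        super().post_start_kernel(**kwargs)  # the restarter only exists after this call
        pp_config = self.kernel_spec.metadata.get("process_proxy", {}).get("config", {})
        limit = pp_config.get("restart_limit")
        if limit is not None and self._restarter is not None:
            # 0 means: keep polling for the kernel's death, but never auto-restart
            self._restarter.restart_limit = limit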
Would it be a good idea to raise an exception when the restart by the KernelRestarter fails? Something like self.log_and_raise(http_status_code=404, reason="Kernel found in dead state, and KernelRestart limit reached!")
The auto-restarts are performed by a periodic poll task, deep in jupyter_client, that checks every 3 seconds whether the kernel process is still functioning.
The ZMQChannelsHandler in Notebook is the entity that detects that the restart failed and sends a status message of 'dead' - which seems like the right thing to do. How the front ends interpret this is another matter.
Since this would require multiple layers of changes or another wave of incorporating code directly into EG, I'm not sure it's worth the effort.
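For reference, the one server-side hook that already exists for this is jupyter_client's restart callbacks: the KernelRestarter fires a 'dead' event once it gives up (restart_limit exhausted). A rough sketch of registering for it (km here stands for a running kernel's IOLoopKernelManager, or EG's remote equivalent); pushing that signal all the way to the browser is the part that needs the extra layers:

def register_dead_notification(km):
    """km: a running kernel's IOLoopKernelManager (or EG's remote equivalent)."""

    def on_restart_failed():
        # Notebook's websocket handler does roughly this by emitting a 'dead'
        # kernel status message; here we just log to stdout.
        print("kernel died and will not be auto-restarted")

    # 'dead' fires when the restarter exhausts its restart_limit
    km.add_restart_callback(on_restart_failed, event='dead')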
Hey Kevin,
I was going through the relevant issues on the JupyterLab and jupyter_client repos, and found this PR. With it, the user is notified when the auto-restart in jupyter_client is triggered.
While in Jupyter Lab the kernel name switches to "No kernel!" and nothing more.
We didn't get the pop-up with JupyterLab since we had set --KernelRestarter.restart_limit=0. I tested with JupyterLab version 2.1.4 by setting a non-zero restart limit and it works 😄
I'm not sure what your point is. The issue is that we'd like to avoid auto-restarts altogether AND be notified - in any manner - that the kernel has died. Could you please clarify what this comment is driving at?
Setting that option shouldn't result in a change in front-end behavior, so I'm a little confused. Since the server-side code uses >= restart-limit to determine that it should NOT continue auto-restarts, I'm fairly certain you'll see the same behavior using --KernelRestarter.restart_limit=1.
The issue is that we'd like to avoid auto-restarts altogether AND be notified
The main reason I wanted to avoid auto-restarts was that the user was not notified that the kernel had died before it was restarted (silent restarts). With this user notification when the auto-restart is triggered, users will know their kernel died and will be aware of losing the kernel session.
I see - thank you. Are you testing this by issuing the kill request from the YARN application UI as well?
Actually, I just noticed how recent the Lab PR was! I was thinking that had been around for some time, then realized 2.1 is relatively new.
Nice find. I suspect we can go ahead and close this issue then - is that your understanding?
I'm fairly certain you'll see the same behavior using --KernelRestarter.restart_limit=1
In my observations, when I set the restart limit to 1, there is no user notification. The pop-up comes up only if the kernel is in the 'autostarting' state.
If we set the restart_limit to 1, the kernel restart fails before the kernel reaches the autostarting state, and hence there is no user notification.
In the Enterprise Gateway logs, it shows [W 2020-06-11 22:40:28.542 EnterpriseGatewayApp] KernelRestarter: restart failed
Are you testing this by issuing the kill request from the YARN application UI as well?
Yes, killing the Spark application from the YARN UI is also handled. The KernelRestarter is triggered in that case as well.
I suspect we can go ahead and close this issue then - is that your understanding?
Yeah, I think we can close this issue for now, since it's difficult to catch the stack trace related to OOM.
Thanks Kevin for all the help. Cheers!
Awesome - thanks @amangarg96. We're getting close to having a pretty cool EG 2.2 release! Loving the async kernel management stuff!
Async kernel management will be huge! Our team is eagerly waiting to take it for a spin 😄
Description
In YARN, if a container's memory becomes full, YARN kills the container and the Spark application. In the case of remote kernels launched through EG, if the container's memory becomes full, YARN kills the container and there is no error propagation to the JupyterLab server or to the user.
With EG 1.2.0, the status of the kernel on the UI becomes 'Kernel Dead'; with EG 2.1.1, the kernel is restarted when the container is killed by YARN. Since there is no pop-up or error propagation, it becomes a silent restart for the user.
We have users running Hive queries, distributed pyspark jobs and loading large dataframes in Notebooks, so this has become a frequently occurring issue for us.
Is there a way to propagate the error back to the user?
When an explicit kill signal for the Spark application is sent to YARN, I have seen that an error is raised in Notebooks ('Application has been killed by user '). Something similar for this case would be helpful.
Screenshots / Logs
When the Spark application is killed by YARN, this is the log from the YARN UI -
Environment