jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/

Spark driver and executor pods staying alive even after kernel switch to new kernel #1167

Closed vivekbhide closed 1 year ago

vivekbhide commented 2 years ago

Description

While using the Spark Operator (Python) kernel from EG, the Spark driver and executor pods continue to stay alive even after the kernel is switched to a different one in the corresponding notebook. This doesn't seem to be the case with other Kubernetes kernels, such as Python on Kubernetes, where switching to a different kernel kills the pod for the old kernel.

Steps to Reproduce

Setup -

  1. Create a duplicate of Spark Operator (Python) called Custom Spark Operator (Python) on the EG pod. The contents of the kernel directory are identical. The purpose is to be able to switch between similar kernels, which could very well be a real-life scenario (e.g., two Spark kernels with different sets of libraries).

Test Scenario -

  1. Create a notebook with Spark code
  2. Select the kernel Custom Spark Operator (Python)
  3. Execute the notebook
  4. Switch the kernel to Spark Operator (Python)

Expected result -

  1. The driver and executor pods for Custom Spark Operator (Python) should get deleted (meaning the executor pods are deleted and the driver pod transitions to Succeeded, after which it will be cleaned up by garbage collection or manual removal), and a new driver pod should be initiated for Spark Operator (Python)

Actual result -

  1. The pods for Custom Spark Operator (Python) continue to be in the Running state, with a new set of pods getting initiated for Spark Operator (Python)
  2. The driver pod for Custom Spark Operator (Python) shows the shutdown signal in its log, but the pods are not killed:
```
22/10/04 18:03:31 INFO DAGScheduler: Job 7 is finished. Cancelling potential speculative or zombie tasks for this job
22/10/04 18:03:31 INFO TaskSchedulerImpl: Killing all running tasks in stage 7: Stage finished
22/10/04 18:03:31 INFO DAGScheduler: Job 7 finished: showString at NativeMethodAccessorImpl.java:0, took 0.124389 s
22/10/04 18:03:31 INFO CodeGenerator: Code generated in 10.286197 ms
22/10/04 18:09:39 INFO BlockManagerInfo: Removed broadcast_7_piece0 on spark-6898f583a42a0257-driver-svc.enterprise-gateway.svc:7079 in memory (size: 5.2 KiB, free: 413.9 MiB)
22/10/04 18:09:39 INFO BlockManagerInfo: Removed broadcast_7_piece0 on 172.18.24.199:39389 in memory (size: 5.2 KiB, free: 413.9 MiB)
[I 2022-10-04 18:24:20,377.377 launch_ipykernel] server_listener got request: {'signum': 2}
[I 2022-10-04 18:24:20,377.377 launch_ipykernel] server_listener got request: {'shutdown': 1}
```

Observations -

  1. If there is only a driver pod, the pod gets marked as Succeeded when the kernel is switched, regardless of whether it previously had executor pods attached for performing tasks. This can be tested by invoking spark.stop() at the end of the notebook (see the snippet after this list); the spark.stop() call is responsible for killing the executor pods.
  2. If there are executor pods still running, a kernel switch just results in a new driver pod being created for the new kernel, with the old kernel's driver and executor pods left in a zombie state.
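
For reference, the spark.stop() workaround in observation 1 amounts to ending the notebook with an explicit session shutdown. A minimal sketch (the session setup is whatever the kernel already provides):

```python
from pyspark.sql import SparkSession

# Obtain the active session (the Spark kernels typically create one already).
spark = SparkSession.builder.getOrCreate()

# ... notebook work ...

# Explicitly stop the session: this releases the executors, so the executor
# pods are deleted and only the driver pod remains.
spark.stop()
```
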
welcome[bot] commented 2 years ago

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other community members to contribute more effectively. You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:

kevin-bates commented 2 years ago

@vivekbhide - thanks for opening the issue and the previous discussion on our Gitter room.

Since these process proxies derive directly from KubernetesProcessProxy, it's strange that there's a difference in behavior like this. Looking at CustomResourceProcessProxy, however, I do see that its implementation of kill() should probably be named terminate_container_resources() and entirely override the implementation in KubernetesProcessProxy. This is because the kill() method is not always called during the termination of a "kernel": only when the message-based shutdown of the kernel does not yield results soon enough is kill() called. In the "graceful" cases, the container resources are terminated during the cleanup() method, in which the listener is shut down, etc.

I suppose this would imply that the normal shutdown scenario is not occurring in a "graceful" manner and, in those cases, kill() is called; otherwise we'd see a CRD leak there as well. I'm not entirely sure about that theory, but we should definitely rename kill() to terminate_container_resources() and see where that takes us.

@vivekbhide - are you in a position to give that a try? If so, I'd like to just include the implementation of terminate_custom_resource() directly in the terminate_container_resources() (formerly kill()) implementation, since there isn't another caller of it and it would be confusing to have two nearly identically named methods.
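
A minimal sketch of what the suggested rename might look like in crd.py (illustrative only: the module path, attribute names, and the Kubernetes group/version/plural values here are assumptions, not the actual EG source):

```python
from kubernetes import client

from enterprise_gateway.services.processproxies.k8s import KubernetesProcessProxy


class CustomResourceProcessProxy(KubernetesProcessProxy):
    def terminate_container_resources(self) -> bool:
        """Formerly kill(). Overriding this method means the CRD instance
        backing the kernel is deleted on the graceful shutdown path (via
        cleanup()) as well as when a forced kill() is required."""
        # Body of the former terminate_custom_resource(), folded in directly
        # since this was its only caller.
        client.CustomObjectsApi().delete_namespaced_custom_object(
            group="sparkoperator.k8s.io",  # assumed values for the
            version="v1beta2",             # Spark Operator's CRD
            plural="sparkapplications",
            namespace=self.kernel_namespace,    # illustrative attributes
            name=self.kernel_resource_name,
        )
        return True
```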

vivekbhide commented 2 years ago

@kevin-bates - Thank you for the analysis. Let me give it a try and see how it behaves. If I understand correctly, this is what you want me to try:

  1. Rename kill() in CustomResourceProcessProxy to terminate_container_resources()
  2. Include the implementation of terminate_custom_resource() in the newly renamed terminate_container_resources()

Will check and update you on the result.

kevin-bates commented 2 years ago

Yes, that is correct. The second item is more about aesthetics than anything else, so item 1 alone should give us an idea of whether this is the right track.

vivekbhide commented 2 years ago

@kevin-bates - Got it. And I believe following https://jupyter-enterprise-gateway.readthedocs.io/en/latest/contributors/devinstall.html would be the path to follow for trying these changes?

kevin-bates commented 2 years ago

Yes. I think make clean dist enterprise-gateway should do the trick. The last of these targets is what builds the image. If hassles arise due to Scala dependencies, let me know and I can build an EG image from your branch if necessary.

vivekbhide commented 2 years ago

@kevin-bates - Just an update: I made the changes and built EG locally. I was able to deploy it on local Docker Desktop, but when trying to launch the Spark Operator kernel I am getting the error below. I'm checking the root cause; let me know if you have pointers. I am not sure whether this is related to the suggested changes.

```
[W 2022-10-05 14:23:56.283 EnterpriseGatewayApp] Shared namespace has been configured.  All kernels will reside in EG namespace: enterprise-gateway
[I 2022-10-05 14:23:56.298 EnterpriseGatewayApp] SparkOperatorProcessProxy: kernel launched. Kernel image: elyra/kernel-spark-py:dev, KernelID: 9aa71f30-e4c7-4109-84ff-db199050d0c0, cmd: '['/opt/conda/bin/python3.9', '/usr/local/share/jupyter/kernels/custom_spark_python_operator/scripts/launch_custom_resource.py', '--RemoteProcessProxy.kernel-id', '9aa71f30-e4c7-4109-84ff-db199050d0c0', '--RemoteProcessProxy.port-range', '0..0', '--RemoteProcessProxy.response-address', '10.1.0.9:8877', '--RemoteProcessProxy.public-key', 'MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQDF+Lan9asAQdZ6xgoD2XWQOB8Y/76nDlb+hrL3Dkeb0lhquMTV+TOkPXJx2i+eFBY8IZv8awTtbAMMX4fZuaRChtlBnxg2CXvVcNSAzYJwzLIbjf0Y/TCz4L4xrvQQR7f7H0sMu/rmUTKUeTSy9JCWeXMnPwx5drEiPUFRxgPMdwIDAQAB']'
Traceback (most recent call last):
  File "/usr/local/share/jupyter/kernels/custom_spark_python_operator/scripts/launch_custom_resource.py", line 114, in <module>
    launch_custom_resource_kernel(
  File "/usr/local/share/jupyter/kernels/custom_spark_python_operator/scripts/launch_custom_resource.py", line 61, in launch_custom_resource_kernel
    client.CustomObjectsApi().create_namespaced_custom_object(
  File "/opt/conda/lib/python3.9/site-packages/kubernetes/client/api/custom_objects_api.py", line 225, in create_namespaced_custom_object
    return self.create_namespaced_custom_object_with_http_info(group, version, namespace, plural, body, **kwargs)  # noqa: E501
  File "/opt/conda/lib/python3.9/site-packages/kubernetes/client/api/custom_objects_api.py", line 344, in create_namespaced_custom_object_with_http_info
    return self.api_client.call_api(
  File "/opt/conda/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/opt/conda/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/opt/conda/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 391, in request
    return self.rest_client.POST(url,
  File "/opt/conda/lib/python3.9/site-packages/kubernetes/client/rest.py", line 275, in POST
    return self.request("POST", url,
  File "/opt/conda/lib/python3.9/site-packages/kubernetes/client/rest.py", line 234, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Audit-Id': '56a44f9c-afef-42d9-bd17-4fe6faf9cdd3', 'Cache-Control': 'no-cache, private', 'Content-Type': 'text/plain; charset=utf-8', 'X-Content-Type-Options': 'nosniff', 'X-Kubernetes-Pf-Flowschema-Uid': '8eb8854c-8dfa-4ca6-9815-6905ae679fc3', 'X-Kubernetes-Pf-Prioritylevel-Uid': '97bf8420-edf0-4f30-a734-f589ee14c6d3', 'Date': 'Wed, 05 Oct 2022 14:23:56 GMT', 'Content-Length': '19'})
HTTP response body: 404 page not found
```

kevin-bates commented 2 years ago

This should not be related since it's happening during the launch (unless this is the launcher of the "changed-to" kernel I suppose).

I see you're using the shared namespace (EG's namespace). Is that also the case in the original issue?

I'm sorry, but I'm not familiar with CRDs so will need to come up to speed myself. Are there things that need to happen on the k8s server to enable these process proxies?

The issue immediately above seems like a configuration kind of thing (404 on the internal create_namespaced_custom_object_with_http_info() method).

kevin-bates commented 2 years ago

@vivekbhide - your issue is because you must not have the Spark Operator installed in your k8s config, or it's not set up to monitor all namespaces. If I use these instructions to add a spark-operator namespace, then I can get past the 404 issue (which I reproduced prior to running the two helm commands).

We need to improve our docs in this area! At a minimum, add a link to the "Note" in this section: https://jupyter-enterprise-gateway.readthedocs.io/en/latest/contributors/system-architecture.html#process-proxy

vivekbhide commented 2 years ago

@kevin-bates - Thanks, you were right. I was missing the Spark Operator, and after installing it I got past that error too. I will retest the scenario and let you know my results.

vivekbhide commented 2 years ago

@kevin-bates - Implementing the suggested changes has resolved this issue. I can confirm that, on switching the kernel, the old Spark executor pods get deleted, the old driver pod goes to the Succeeded state, and a new driver pod becomes active. Please note that I have kept terminate_custom_resource() as-is, since I haven't tested any other scenarios and am not fully aware of its scope.

kevin-bates commented 2 years ago

Excellent - thank you for checking that out!

Regarding terminate_custom_resource(), it's only used within crd.py and can probably be prefixed with _ (although I'd prefer it just be incorporated into terminate_container_resources()).

I think we should also update the docs with a link to installing the Spark Operator, and try to detect the issue (and suggest its installation) by trapping that error during the CRD launch - but these can be separate issues.
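
A hedged sketch of what that detection might look like at the CRD launch step (the helper name and the group/version/plural values are illustrative, not the actual launcher code):

```python
from kubernetes import client
from kubernetes.client.exceptions import ApiException


def create_spark_application(body: dict, namespace: str) -> dict:
    """Create the SparkApplication custom resource backing a kernel,
    surfacing a helpful message when the Spark Operator is missing."""
    try:
        return client.CustomObjectsApi().create_namespaced_custom_object(
            group="sparkoperator.k8s.io",
            version="v1beta2",
            namespace=namespace,
            plural="sparkapplications",
            body=body,
        )
    except ApiException as exc:
        if exc.status == 404:
            # A 404 here usually means the CRD type is not registered,
            # i.e. the Spark Operator has not been installed in the cluster.
            raise RuntimeError(
                "SparkApplication CRD not found; is the Spark Operator "
                "installed and watching this namespace?"
            ) from exc
        raise
```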

I was away from work yesterday and am away today (and through the weekend), but I will try to create an issue for this additional work soon.

kevin-bates commented 1 year ago

@vivekbhide - are you planning on submitting a pull request to address this issue? If so, it would be appreciated soon. Thank you.

vivekbhide commented 1 year ago

@kevin-bates - Hi Kevin, sorry, but I got held up with something and couldn't get to this earlier. It looks like @kiersten-stokes has already created the PR for this fix, so I'd keep it that way. Thank you @kiersten-stokes!

kevin-bates commented 1 year ago

Excellent - thank you both!