Closed vivekbhide closed 1 year ago
Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other community members contribute more effectively.
You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:
@vivekbhide - thanks for opening the issue and the previous discussion on our Gitter room.
Since these process proxies derive directly from `KubernetesProcessProxy`, it's strange that there's a difference in behavior like this. Looking at `CustomResourceProcessProxy`, however, I do see that its implementation of `kill()` should probably be named `terminate_container_resources()` and override the implementation in `KubernetesProcessProxy` entirely. This is because the `kill()` method is not always called during the termination of a "kernel": it is only called when the message-based shutdown of the kernel does not yield results soon enough. In those "graceful" cases, the container resources are terminated during the `cleanup()` method, in which the listener is shut down, etc.
I suppose this would imply that the normal shutdown scenario is not occurring in a "graceful" manner and, in those cases, `kill()` is called; otherwise we'd see a CRD leak there as well. I'm not entirely sure about that theory, but we should definitely rename `kill()` to `terminate_container_resources()` and see where that takes us.
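The shutdown ordering described here can be sketched roughly as follows. This is a simplified mock, not EG's actual classes; `MockProxy`, `shutdown_kernel`, and the event names are all illustrative:

```python
# Hypothetical, simplified model of the shutdown flow described above.
# In EG, message-based shutdown is attempted first; kill() is only the
# fallback, so resource termination must happen on the cleanup() path too.
class MockProxy:
    def __init__(self, graceful: bool):
        self.graceful = graceful          # does message-based shutdown work?
        self.events = []
        self.resources_terminated = False

    def shutdown_via_message(self) -> bool:
        self.events.append("shutdown-message")
        return self.graceful

    def terminate_container_resources(self):
        # Proposed home for the CRD deletion (formerly kill()-only logic).
        self.events.append("terminate")
        self.resources_terminated = True

    def cleanup(self):
        # Graceful path: shut down the listener, then release resources.
        self.events.append("cleanup")
        self.terminate_container_resources()

    def kill(self):
        # Forceful path: only reached when graceful shutdown times out.
        self.events.append("kill")
        self.terminate_container_resources()


def shutdown_kernel(proxy: MockProxy) -> None:
    if not proxy.shutdown_via_message():
        proxy.kill()       # graceful shutdown didn't take effect in time
    proxy.cleanup()        # cleanup() runs in both cases
```

With the CRD deletion living only in `kill()`, the graceful path (message-based shutdown succeeding, then `cleanup()`) would never release the custom resource, which is the suspected leak.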
@vivekbhide - are you in a position to give that a try? If so, I'd like to just include the implementation of `terminate_custom_resource()` into the `terminate_container_resources()` (formerly `kill()`) implementation directly, since there isn't another caller of it and it would be confusing to have two nearly identically named methods.
@kevin-bates - Thank you for the analysis. Let me give it a try to see how it behaves. If I understand correctly, this is what you want me to try:

1. Rename `kill()` from `CustomResourceProcessProxy` to `terminate_container_resources()`
2. Fold `terminate_custom_resource()` into the newly renamed `terminate_container_resources()`

Will check and update you on the result.
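A rough sketch of what those two changes might look like. The class shape, attribute names, and CRD group/version/plural values below are illustrative, not EG's actual code:

```python
# Hypothetical, simplified shape of the change: the former kill() /
# terminate_custom_resource() logic becomes terminate_container_resources(),
# which overrides the KubernetesProcessProxy implementation so that BOTH
# the graceful cleanup() path and the forceful kill() path delete the CRD.
class CustomResourceProcessProxy:
    # Illustrative CRD coordinates only.
    group, version, plural = "sparkoperator.example.io", "v1beta2", "sparkapplications"

    def __init__(self, resource_name: str, namespace: str):
        self.kernel_resource_name = resource_name
        self.kernel_namespace = namespace
        self.deleted = []  # stands in for delete calls to the Kubernetes API

    def terminate_container_resources(self):
        # Body of the former terminate_custom_resource(), folded in
        # directly since it had no other callers.
        self.deleted.append(
            (self.group, self.version, self.kernel_namespace,
             self.plural, self.kernel_resource_name)
        )
```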
Yes, that is correct. The second item is more about aesthetics than anything else, so item 1 alone should give us an idea of whether this is the right track.
@kevin-bates - got it. And I believe following https://jupyter-enterprise-gateway.readthedocs.io/en/latest/contributors/devinstall.html would be the path to follow for trying these changes?
Yes. I think `make clean dist enterprise-gateway` should do the trick. The last of these is what builds the image. If you find hassles arise due to Scala dependencies, let me know and I can build an EG image from your branch if necessary.
@kevin-bates - Just an update: I made the changes and built EG locally. I'm able to deploy it on local Docker Desktop, but when trying to launch the Spark operator kernel, I get the error below. I'm checking the root cause; let me know if you have pointers. I am not sure if this is related to the suggested changes.
```
[W 2022-10-05 14:23:56.283 EnterpriseGatewayApp] Shared namespace has been configured. All kernels will reside in EG namespace: enterprise-gateway
[I 2022-10-05 14:23:56.298 EnterpriseGatewayApp] SparkOperatorProcessProxy: kernel launched. Kernel image: elyra/kernel-spark-py:dev, KernelID: 9aa71f30-e4c7-4109-84ff-db199050d0c0, cmd: '['/opt/conda/bin/python3.9', '/usr/local/share/jupyter/kernels/custom_spark_python_operator/scripts/launch_custom_resource.py', '--RemoteProcessProxy.kernel-id', '9aa71f30-e4c7-4109-84ff-db199050d0c0', '--RemoteProcessProxy.port-range', '0..0', '--RemoteProcessProxy.response-address', '10.1.0.9:8877', '--RemoteProcessProxy.public-key', 'MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQDF+Lan9asAQdZ6xgoD2XWQOB8Y/76nDlb+hrL3Dkeb0lhquMTV+TOkPXJx2i+eFBY8IZv8awTtbAMMX4fZuaRChtlBnxg2CXvVcNSAzYJwzLIbjf0Y/TCz4L4xrvQQR7f7H0sMu/rmUTKUeTSy9JCWeXMnPwx5drEiPUFRxgPMdwIDAQAB']'
Traceback (most recent call last):
  File "/usr/local/share/jupyter/kernels/custom_spark_python_operator/scripts/launch_custom_resource.py", line 114, in <module>
    launch_custom_resource_kernel(
  File "/usr/local/share/jupyter/kernels/custom_spark_python_operator/scripts/launch_custom_resource.py", line 61, in launch_custom_resource_kernel
    client.CustomObjectsApi().create_namespaced_custom_object(
  File "/opt/conda/lib/python3.9/site-packages/kubernetes/client/api/custom_objects_api.py", line 225, in create_namespaced_custom_object
    return self.create_namespaced_custom_object_with_http_info(group, version, namespace, plural, body, **kwargs)  # noqa: E501
  File "/opt/conda/lib/python3.9/site-packages/kubernetes/client/api/custom_objects_api.py", line 344, in create_namespaced_custom_object_with_http_info
    return self.api_client.call_api(
  File "/opt/conda/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/opt/conda/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/opt/conda/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 391, in request
    return self.rest_client.POST(url,
  File "/opt/conda/lib/python3.9/site-packages/kubernetes/client/rest.py", line 275, in POST
    return self.request("POST", url,
  File "/opt/conda/lib/python3.9/site-packages/kubernetes/client/rest.py", line 234, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Audit-Id': '56a44f9c-afef-42d9-bd17-4fe6faf9cdd3', 'Cache-Control': 'no-cache, private', 'Content-Type': 'text/plain; charset=utf-8', 'X-Content-Type-Options': 'nosniff', 'X-Kubernetes-Pf-Flowschema-Uid': '8eb8854c-8dfa-4ca6-9815-6905ae679fc3', 'X-Kubernetes-Pf-Prioritylevel-Uid': '97bf8420-edf0-4f30-a734-f589ee14c6d3', 'Date': 'Wed, 05 Oct 2022 14:23:56 GMT', 'Content-Length': '19'})
HTTP response body: 404 page not found
```
This should not be related since it's happening during the launch (unless this is the launcher of the "changed-to" kernel I suppose).
I see you're using the shared namespace (EG's namespace). Is that also the case in the original issue?
I'm sorry, but I'm not familiar with CRDs so will need to come up to speed myself. Are there things that need to happen on the k8s server to enable these process proxies?
The issue immediately above seems like a configuration kind of thing (404 on the internal `create_namespaced_custom_object_with_http_info()` method).
@vivekbhide - your issue is because you must not have the Spark Operator installed in your k8s config, or it's not set up to monitor all namespaces. If I use these instructions to add a `spark-operator` namespace, then I can get past the 404 issue (which I reproduced prior to running the two `helm` commands).
We need to improve our docs in this area! At a minimum, add a link to the "Note" in this section: https://jupyter-enterprise-gateway.readthedocs.io/en/latest/contributors/system-architecture.html#process-proxy
@kevin-bates - thanks. You were right: I was missing the Spark operator, and after installing it I got past that error too. I will retest the scenario and let you know my results.
@kevin-bates - Implementing the suggested changes has resolved this issue. I could confirm that, on switching the kernel, the old Spark executor pods get deleted, the old driver pod goes to the `succeeded` state, and a new driver pod becomes active. Please note that I have kept `terminate_custom_resource()` in place, as I haven't tested any other scenario and am not fully aware of its scope.
Excellent - thank you for checking that out!
Regarding `terminate_custom_resource()`, it's only used within crd.py and can probably be prefixed with `_` (although I'd prefer it just be incorporated into `terminate_container_resources()`).
I think we should also update the docs with a link to installing the Spark operator, and try to detect the issue (and suggest its installation) by trapping that error during the CRD launch - but these can be separate issues.
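One possible shape for trapping that error during the CRD launch. This is only a sketch: the real launcher would catch `kubernetes.client.exceptions.ApiException` raised by `create_namespaced_custom_object()`, and the stand-in exception class and advice wording below are ours:

```python
# Stand-in for kubernetes.client.exceptions.ApiException, to keep the
# sketch self-contained; the real client's exception carries a .status field.
class ApiException(Exception):
    def __init__(self, status: int, reason: str = ""):
        super().__init__(f"({status}) Reason: {reason}")
        self.status = status


def create_custom_resource(api_call):
    """Invoke the CRD-creation call, turning a 404 into actionable advice."""
    try:
        return api_call()
    except ApiException as exc:
        if exc.status == 404:
            # A 404 here usually means the operator serving this CRD
            # (e.g. the Spark Operator) is not installed, or is not
            # watching the target namespace.
            raise RuntimeError(
                "Custom resource creation returned 404. Ensure the "
                "corresponding operator (e.g. the Spark Operator) is "
                "installed and configured to watch this namespace."
            ) from exc
        raise
```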
I was away from work yesterday and am out today (and through the weekend), but will try to create an issue for this additional work soon.
@vivekbhide - are you planning on submitting a pull request to address this issue? If so, it would be appreciated soon. Thank you.
@kevin-bates - Hi Kevin, sorry, but I got held up with something and couldn't do it before now. It looks like @kiersten-stokes has already created the PR for this fix, so we can keep it that way. Thank you @kiersten-stokes!
Excellent - thank you both!
Description

While using the `Spark Operator (Python)` kernel from EG, the Spark driver and executor pods continue to stay alive even after the kernel is switched to a different one in the corresponding notebook. This doesn't seem to be the case with other Kubernetes kernels like `Python on Kubernetes`, where switching to a different kernel kills the pod for the old kernel.

Steps to Reproduce

Setup -
- Create a copy of the `Spark Operator (Python)` kernel called `Custom Spark Operator (Python)` on the EG pod. The contents of the kernel directory are identical. The purpose is to be able to switch between similar kernels, which could very well be a real-life scenario (e.g. two Spark kernels with different sets of libraries).

Test Scenario -
1. Open a notebook using the `Custom Spark Operator (Python)` kernel.
2. Switch the kernel to `Spark Operator (Python)`.

Expected result -
- The pods for `Custom Spark Operator (Python)` should get deleted (meaning the executor pods are deleted and the driver pod is transitioned to `succeeded`, and hence will be cleaned up by garbage collection or manual removal), and a new driver pod is initiated for `Spark Operator (Python)`.

Actual result -
- The pods for `Custom Spark Operator (Python)` continue to be in the running state, with a new set of pods getting initiated for `Spark Operator (Python)`.
- The log for `Custom Spark Operator (Python)` shows the shutdown signal but the pods are not killed.

Observations -
- The driver pod does not transition to `succeeded` when the kernel is switched, no matter whether it had executor pods previously attached for doing any tasks. This can be tested by invoking `spark.stop()` at the end of the notebook.
- The `spark.stop()` call is responsible for killing the executor pods.