jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/
Other
617 stars 222 forks source link

851: Add typing hints to process proxies #1129

Closed bloomsa closed 2 years ago

bloomsa commented 2 years ago

Pull Request for issue #851 A couple of notes on my process for making the type hints and an ask for some assistance:

  1. As noted in the inline comments, I could use some help with the typing on ConductorClusterProcessProxy. Mainly around the Conductor API responses as I am not familiar with it nor could I find documentation on it that looked right for the APIs being hit in these methods.
  2. In the case of abstract classes (BaseProcessProxyABC and RemoteProcessProxy) I erred on the side of adding hints only to methods that were not implemented in subclasses. For ex. I added hints for all the _xyz methods in the abstract classes but not methods such as poll, launch_process, etc. I did this because the implementation classes may use different types for arguments or return object than the base abstract classes. Although I am open to changing this approach if it is not best practice. Thanks!
bloomsa commented 2 years ago

looks like I have to fix some things up based on failing tests, I will work on that soon

kevin-bates commented 2 years ago

@bloomsa - this is looking great so far - thank you!

I could use some help with the typing on ConductorClusterProcessProxy. Mainly around the Conductor API responses as I am not familiar with it nor could I find documentation on it that looked right for the APIs being hit in these methods.

Yeah, the ConductorClusterProcessProxy was a contribution and I suppose we could ping that contributor, but I think you'll find it extremely similar to the YarnClusterProcessProxy. Do you have specific examples of where responses are unclear? Since my understanding is that it's just a REST API, the responses are probably JSON (dict), but I suppose str could be used as well.

but not methods such as poll, launch_process, etc. I did this because the implementation classes may use different types for arguments or return object than the base abstract classes.

Hmm. I think it would be good where types diverge from the base class. I would expect them to be the same and perhaps that's a sign of a "smell" we should dig into. Again, any specific examples would be great.

Thanks again for contributing this - it will be extremely helpful to other contributors.

bloomsa commented 2 years ago

@kevin-bates

Yeah, the ConductorClusterProcessProxy was a contribution and I suppose we could ping that contributor, but I think you'll find it extremely similar to the YarnClusterProcessProxy. Do you have specific examples of where responses are unclear? Since my understanding is that it's just a REST API, the responses are probably JSON (dict), but I suppose str could be used as well.

In other processProxy classes like for Docker I was using the actual Docker API model classes, so trying to do that for Conductor was leading me to some dead ends. I will just go with dict or str for these for now

Hmm. I think it would be good where types diverge from the base class. I would expect them to be the same and perhaps that's a sign of a "smell" we should dig into. Again, any specific examples would be great.

Looking again, the only place I see this happening is in launch_process where the implementation classes all return self while the abstract class returns nothing (KubernetesProcessProxy vs abstract base class). Thinking more about it now, is it because the abstract class's implementation is there to be called by the subclasses using super..., as the comments mention?

kevin-bates commented 2 years ago

Hi @bloomsa - good points.

In other processProxy classes like for Docker I was using the actual Docker API model classes, so trying to do that for Conductor was leading me to some dead ends. I will just go with dict or str for these for now

Yeah, I'm having similar issues finding the correct REST API docs - none of which appear to have endpoints like v1/applications or v1/submissions. I think the best of approach would be to go with dict and str for now.

@kjdoyle (I hope you're doing well) - would you happen to know where the corresponding REST API docs for the ConductorClusterProcessProxy happen to reside?

Looking again, the only place I see this happening is in launch_process where the implementation classes all return self while the abstract class returns nothing (KubernetesProcessProxy vs abstract base class). Thinking more about it now, is it because the abstract class's implementation is there to be called by the subclasses using super..., as the comments mention?

While it's true that we require all subclasses to call super() first, I don't think that should preclude us from returning self from the various super() calls - it will always be the same value. (This is one of the benefits of adding type hints - to clarify things like this.) As a result, since the signatures should match, it seems like launch_process would wind up looking something like this...

    async def launch_process(self, kernel_cmd: List[str], **kwargs: Optional[dict[str, Any]]) -> BaseProcessProxyABC:

I would prefer that each implementation of launch_process() indicate that that class instance is returned, but I don't know if that will cause linting or IDEs heartburn. If they appear clean, then let's go with indicating the implementing class rather than BaseProcessProxyABC.

Of course the recipient of launch_process() (in RemoteKernelManager._launch_kernel()) should be typed to receive BaseProcessProxyABC. However, we probably need to be careful about this since the process-proxies masquerade as Popen instances, so I'm not sure we'd even want to indicate the return type on RemoteKernelManager._launch_kernel() - does that make sense?

bloomsa commented 2 years ago

While it's true that we require all subclasses to call super() first, I don't think that should preclude us from returning self from the various super() calls - it will always be the same value. (This is one of the benefits of adding type hints - to clarify things like this.) As a result, since the signatures should match, it seems like launch_process would wind up looking something like this...

    async def launch_process(self, kernel_cmd: List[str], **kwargs: Optional[dict[str, Any]]) -> BaseProcessProxyABC:

I would prefer that each implementation of launch_process() indicate that that class instance is returned, but I don't know if that will cause linting or IDEs heartburn. If they appear clean, then let's go with indicating the implementing class rather than BaseProcessProxyABC.

The BaseProcessProxyABC abstract definition of launch_process does not return self, it returns None (unless I am missing something in the code). The RemoteProcessProxy also does not return anything, only when we get to the actual implementation classes is there a return value, which is the implementing class as you said.

On the topic of indicating that with type hints, I had to do that using Type["ProcessProxyImplementationClassXYZ"] (see here). Looking into it more, I might be able to replace that with 'ProcessProxyImplementationClassXYZ', I'll give that a shot. As of current Python versions, I cannot simply state the return type as ProcessProxyImplementationClassXYZ as the IDE/linter would complain. (ref)

Of course the recipient of launch_process() (in RemoteKernelManager._launch_kernel()) should be typed to receive BaseProcessProxyABC. However, we probably need to be careful about this since the process-proxies masquerade as Popen instances, so I'm not sure we'd even want to indicate the return type on RemoteKernelManager._launch_kernel() - does that make sense? Yes that makes sense, the best way to go about that may be using type hinting (claiming that the processProxy is returned) in combination with a comment indicating the masquerade that is happening

kevin-bates commented 2 years ago

The BaseProcessProxyABC abstract definition of launch_process does not return self, it returns None (unless I am missing something in the code). The RemoteProcessProxy also does not return anything, only when we get to the actual implementation classes is there a return value, which is the implementing class as you said.

Yes, you are correct. My comment prior to this...

I don't think that should preclude us from returning self from the various super() calls - it will always be the same value.

was suggesting that we should return self from the base class onward. This way the type annotations are consistent throughout the class hierarchy.

bloomsa commented 2 years ago

The BaseProcessProxyABC abstract definition of launch_process does not return self, it returns None (unless I am missing something in the code). The RemoteProcessProxy also does not return anything, only when we get to the actual implementation classes is there a return value, which is the implementing class as you said.

Yes, you are correct. My comment prior to this...

I don't think that should preclude us from returning self from the various super() calls - it will always be the same value.

was suggesting that we should return self from the base class onward. This way the type annotations are consistent throughout the class hierarchy.

got it, I misread the original comment, sorry. I'd prefer to leave this PR as strictly adding type hints to existing functions without changing any implementations, but will defer to you on whether you want me to change that here or leave it for a future PR.

bloomsa commented 2 years ago

w.r.t. the failed builds, flake is failing due to the undefined name RemoteKernelManager in the ProcessProxy type hints. Unfortunately when adding type hints, the class being referenced is imported, which is leading to circular imports in this case. I tried to resolve it by using the string 'RemoteKernelManager' instead but it looks like it still doesn't work. I found this as a possible solution but it's a kludge imo :/

kevin-bates commented 2 years ago

I suppose we might need to forgo type hints for the kernel_manager variables.

Or capture the inputs (members of KM that are read) in the constructors and outputs (members of KM that are written) as returned from the methods. Not sure if that's feasible and would rather we just forgo hints on KM in process proxies since the latter could get ugly.

bloomsa commented 2 years ago

I see that the test suite has python 3.7 as the minimum tested version. If we force > 3.7 for the project that makes type hinting easier and more performant (see this S/O discussion) Without doing that, I think leaving the kernel_manager un-hinted is probably the easiest solution for now. I'm not sure it's possible to call out the kernel_manager members that are written/updated as the return type like you mentioned. If it is possible, I agree it could be ugly and confusing for people unfamiliar with the project

kevin-bates commented 2 years ago

I agree we should not type-hint kernel_manager in the process proxies over identifying its members and splitting those out.

Regarding python > 3.7, we need to support 3.7 until its EOL in June of 2023. However, the link you provide appears to say that these annotation improvements exist for version >= 3.7. But it also appears to address the references to a class within that class definition. Are you saying the from __future__ import annotations would resolve the circular reference issue and we could use 'RemoteKernelManager' as the annotation?

bloomsa commented 2 years ago

I agree we should not type-hint kernel_manager in the process proxies over identifying its members and splitting those out.

Regarding python > 3.7, we need to support 3.7 until its EOL in June of 2023. However, the link you provide appears to say that these annotation improvements exist for version >= 3.7. But it also appears to address the references to a class within that class definition. Are you saying the from __future__ import annotations would resolve the circular reference issue and we could use 'RemoteKernelManager' as the annotation?

Sorry, I meant to type >= 3.7. I'm not 100% sure as I haven't done it before, but yes your understanding is the same as mine. from __future__ import annotations works starting in 3.7 (and up). I think it will allow us to use 'RemoteKernelManager' without the circular import. I'll give it a try