jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/

Remove blocking Kernel Startup #86

Closed aazhou1 closed 4 years ago

aazhou1 commented 7 years ago

Single-threaded kernel startup causes the notebook front end to lose its connection to the gateway when many kernels are being started, so the notebook has to re-pull kernel specs from the gateway after Elyra finishes starting the kernels.

kevin-bates commented 7 years ago

I believe this issue is a result of the time it takes to "start" a kernel on a remote cluster. In normal circumstances (local kernels) the duration is on the order of a second (to invoke the process and have the kernel respond). For remote kernels running in yarn-cluster mode, that duration is an order of magnitude longer (10-15 seconds), which surfaces as blocking startups on the front end.

I tried to look into this early on but couldn't make any headway because the Jupyter framework assumes the kernel is communicating immediately on return from the startup logic (so using a thread doesn't really help).

kevin-bates commented 7 years ago

I haven't been able to make any headway on this. The MappingKernelManager is async in that it uses the @gen.coroutine decorator. This "multi kernel manager" instantiates a "single kernel manager", which is responsible for starting the kernel and which, for remote kernels, takes several seconds.

I tried introducing asynchronous behaviors at various levels within this framework - at the point of launch, higher up at the start of the "single kernel manager", or even higher within the "multi kernel manager" sequence - as well as replacing time.sleep with yield gen.sleep. In all cases, any form of yield returns control to the caller as if the call had not occurred, so control is handed back to the user too soon.
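To make that symptom concrete, here is a minimal sketch (the confirm_startup coroutine and start_kernel caller are hypothetical, not actual notebook/jupyter_client code) of why a yield-based wait "returns to the caller" when the caller is not itself a coroutine:

```python
import time

from tornado import gen


@gen.coroutine
def confirm_startup():
    # yield gen.sleep() suspends only this coroutine; the IOLoop keeps
    # servicing other handlers while it waits.
    yield gen.sleep(10)


def start_kernel():
    # time.sleep(10) here would block the entire IOLoop for 10 seconds.
    # Calling the coroutine instead returns an unresolved Future right away,
    # so control goes back to the user before the kernel is actually ready.
    future = confirm_startup()
    return future
```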

In addition, I suspect the current asynchronous intentions are not working as expected, but that has gone unnoticed because regular kernel startups are on the order of one, maybe two, seconds.

Here are some links I found that seem applicable to what we need:

http://blog.trukhanov.net/Running-synchronous-code-on-tornado-asynchronously/
https://stackoverflow.com/questions/18088176/calling-tornado-coroutines-from-synchronous-code
https://stackoverflow.com/questions/13051591/tornado-blocking-asynchronous-requests
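The approaches in those links generally boil down to offloading the blocking work onto a thread pool so the Tornado IOLoop stays responsive. A rough sketch under that assumption (the RemoteKernelLauncher class and _blocking_launch method are hypothetical, not part of any existing API):

```python
from concurrent.futures import ThreadPoolExecutor

from tornado import gen
from tornado.concurrent import run_on_executor


class RemoteKernelLauncher(object):
    # Shared pool; blocking launches run here instead of on the IOLoop thread.
    executor = ThreadPoolExecutor(max_workers=4)

    @run_on_executor
    def _blocking_launch(self, kernel_spec):
        # e.g. submit to YARN and poll until the kernel responds (10-15s).
        pass

    @gen.coroutine
    def launch(self, kernel_spec):
        # The IOLoop can service other requests while the launch runs.
        result = yield self._blocking_launch(kernel_spec)
        raise gen.Return(result)
```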

Placing back on the backlog for now. It would be great if someone else could spend some time with this. I'd be happy to share more about what I tried.

Carreau commented 6 years ago

As promised in another thread, some comments on this. Apologies for the delay; I've been pretty busy with other things (like deleting root's home directory on one of the clusters I manage because I'm stupid...).

I believe there is a path forward that keeps backward compatibility but allows us to slowly migrate to async-almost-everything and stop blocking kernel launching. I'm going to use Python 3 syntax; it should be possible to do the same on Python 2 with @gen.coroutine/yield, but that is less clear for explanation purposes.

1) We need to slowly migrate all the current APIs from def function() to async def function(), and prefer the latter. The nice thing is that an async def function() can be blocking, which is not a problem, but it's hard to make a def function() non-blocking.

For clients that do not run in an event loop and are "quasi sync", you can fake an event loop that just advances everything; see the sketch below for one way to do this.
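As a stand-in for the example referenced above, one minimal way a synchronous client can drive an async def API to completion, assuming a hypothetical start_kernel coroutine:

```python
import asyncio


async def start_kernel():
    # Stands in for a future async kernel-management API call.
    await asyncio.sleep(0.1)
    return "kernel-id"


# A "quasi sync" client with no running event loop can simply spin one up
# to drive the coroutine to completion, then carry on synchronously.
loop = asyncio.new_event_loop()
try:
    kernel_id = loop.run_until_complete(start_kernel())
finally:
    loop.close()
```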

2) We need to provide backward compatibility, that is to say we need to provide, and be able to call, 2 kinds of methods:

In Python 3 at least, you can figure out whether a function is a coroutine using inspect.iscoroutinefunction. If it's sync then we just call it; if it's async we await (or yield from) it.
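A minimal sketch of that dispatch (the call_maybe_async helper is hypothetical):

```python
import inspect


async def call_maybe_async(method, *args, **kwargs):
    # New-style coroutine methods are awaited; legacy synchronous methods
    # are simply called (and may block, which remains acceptable for now).
    if inspect.iscoroutinefunction(method):
        return await method(*args, **kwargs)
    return method(*args, **kwargs)

# e.g. result = await call_maybe_async(kernel_manager.start_kernel)
```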

I suggest the following: for every method def XXX(...) that participates in kernel management, we define:

For every method foo we currently call, check and call in order:

Every time we encounter a non-async function, or a non-async function gets called, emit an extensive deprecation warning that asks users to:

I've been toying with that in https://github.com/jupyter/jupyter_client/pull/402 and https://github.com/jupyter/notebook/pull/4126; I just need to find the right abstraction and pattern to allow the transition to be as seamless as possible for users.

As far as I can tell, we should be able to do a transition where all of the new APIs are compatible with the future ones, with minimal disruption and code change. I believe that with the right tooling, documentation, and examples this should be relatively systematic to do, so it would just be a question of time and outreach to review all the places that need to be updated.

Apologies if this is a bit unclear; I'm still trying to arrange things in my head, and I still need to dive into kernel and enterprise gateway to see how this affects or would simplify them.

kevin-bates commented 6 years ago

@Carreau - thank you for your comment. Yes, I've been monitoring your WIP PRs and agree that it's best to see how the dust settles on those. In addition, @takluyver's WIP PR may essentially eliminate all kernel management from Enterprise Gateway anyway, so we need to wait and see where his effort takes us. I would recommend you coordinate with Thomas, although I suspect you have already done that. Looks like Thomas is accounting for async kernel providers also - all good.

In the meantime, we will continue to monitor these kernel management PRs and contribute where we can. Our primary interest is in preserving the existing functionality we provide and in minimizing any migration challenges that lie ahead when adopting these new architectures.

takluyver commented 6 years ago

> Looks like Thomas is accounting for async kernel providers also - all good.

To be honest, not really. I tried to figure this out, and there are launch_async methods in the API, but it all got too messy, and I've wound up using only the blocking APIs to launch and manage kernels. I have a working async client built on the tornado interface, but not the provider/manager part.

Part of the muddle I got into was that I was planning to build everything directly on asyncio, but then there's a nice ZMQ 'stream' interface built on tornado, and then I started reading about trio, and...

esevan commented 5 years ago

@kevin-bates Would you mind if I ask for an nb2kg image for this patch?

I'd also appreciate any guidance on upgrading jupyter notebook and jupyter_client in the nb2kg Dockerfile.

kevin-bates commented 5 years ago

@esevan - Although I performed validation of the changes via the REST API, an update to the nb2kg image shouldn't be necessary since the actual changes to enable concurrency really take place in EG via #580. The notebook and jupyter_client changes essentially facilitate the changes in EG. If you're finding that not to be the case, please let me know along with how you go about starting kernels concurrently via the NB2KG image and I'd be happy to take a look.

esevan commented 5 years ago

@kevin-bates I'm using this image as my notebook environment. https://github.com/jupyter/enterprise_gateway/blob/master/etc/docker/nb2kg/Dockerfile

I expected that, since this image includes notebook and jupyter_client, it would need to be changed to facilitate #580.

kevin-bates commented 5 years ago

You're correct that the image uses notebook and jupyter_client, but the actual kernel startup and management gets proxied (via NB2KG) to the Enterprise Gateway server. That's where the practical portion of the concurrency changes resides. The NB and JC changes merely turn some existing methods into coroutines; there weren't any additional changes that trigger significant concurrency. The primary change was to replace the time.sleep() calls with yield gen.sleep() within confirm_remote_startup() and handle_timeout(), but to do that, the various calling methods (including superclass methods in NB/JC) had to be decorated with @gen.coroutine.
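A simplified sketch of that pattern (not the actual EG source; poll_interval and _kernel_is_ready are placeholders):

```python
from tornado import gen

poll_interval = 0.5  # seconds, illustrative


class RemoteKernelManagerSketch(object):

    def _kernel_is_ready(self):
        # Placeholder for checking the remote (e.g. YARN) application state.
        return True

    @gen.coroutine
    def confirm_remote_startup(self):
        while not self._kernel_is_ready():
            # Previously: time.sleep(poll_interval), which blocked the whole
            # IOLoop. Yielding gen.sleep() suspends only this coroutine, so
            # other kernels can start concurrently. For this to work, every
            # caller up the stack must also be a @gen.coroutine that yields
            # this call, hence the decorator changes in notebook/jupyter_client.
            yield gen.sleep(poll_interval)
```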

I've attached the 3 dev-built whl files for EG, NB and JC. You must pip install EG before NB since the NB version violates EG's constraints. async-startup-wheels.zip

If you only need the applicable EG image, I've pushed one that can be pulled using: docker pull kbates/enterprise-gateway:async

I hope that helps.

esevan commented 5 years ago

@kevin-bates So you just modified the EG process, which embeds JC and NB. I thought the client side was also changed, but now I get it. Thanks a lot!

esevan commented 5 years ago

I've studied the kernel startup source code as it runs in my environment (k8s) in my free time. Here's a rough diagram I put together as a result; I just want to share it with anyone else studying enterprise_gateway at the source-code level like me. Now I fully understand why I don't need the nb2kg image. (diagram attached)

kevin-bates commented 5 years ago

@esevan - this is excellent!