Add ability to cull idle kernels after specified period

kevin-bates commented 7 years ago

Hello,

This topic is essentially an extension to Issue #96 which led to PR #97 and the introduction of activity monitoring in the JKG.

I've implemented a solution within the JKG that introduces another internal service (idle_culler), which periodically checks the current set of registered kernels to see whether any have been idle for too long; those that have are culled (deleted). The culling behavior leverages the activity monitoring and measures a kernel's idle time as the time since the most recent message to the client (last_message_to_client) or to the kernel (last_message_to_kernel). It could easily be extended to incorporate any of the other activity metrics as well. As @rwhorman mentioned in #96, the typical idle period for a given JKG would be on the order of 12-24 hours (again, thinking Spark jobs and resource consumption). Two options have been introduced:

KernelGatewayApp.cull_idle_kernel_period indicating the allowed idle time in seconds. Default = 0, i.e., off.
KernelGatewayApp.cull_idle_kernel_interval indicating the frequency (seconds) at which to check for idle kernels. Default = 300.

I'd probably add units to the name and use minutes for the period, seconds for the interval.
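A minimal sketch of the idle check described above, for illustration only. The shape of the `activity` dictionary and the function name `find_idle_kernels` are assumptions, not the actual JKG data structures; only the two metric names and the "0 means off" default come from the description.

```python
from datetime import datetime, timedelta, timezone


def find_idle_kernels(activity, idle_period_seconds, now=None):
    """Return IDs of kernels whose most recent message is older than the idle period.

    `activity` is a hypothetical mapping of kernel ID to a dict holding the
    two timestamps named in the proposal.
    """
    if idle_period_seconds <= 0:  # a period of 0 means culling is off
        return []
    now = now or datetime.now(timezone.utc)
    limit = timedelta(seconds=idle_period_seconds)
    idle = []
    for kernel_id, times in activity.items():
        # A kernel counts as active as of its most recent message in either direction.
        last_activity = max(
            times["last_message_to_client"], times["last_message_to_kernel"]
        )
        if now - last_activity > limit:
            idle.append(kernel_id)
    return idle
```

A periodic task would call this once per check interval and delete each kernel it returns.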

Since I'm new to Python (and open source), I wanted to run my general approach past others first. I chose to use a separate module (services/idle_culler), but this could easily be incorporated in the activity module. The primary piece added is an IOLoop instance used to perform the periodic checks. I also added code to capture the time the kernel was culled and its idle duration at the time of its culling. The idea is that we could expose a GET request (/_api/activity/culled or /_api/culled/) to gather this information if that's desired.

I have yet to finish the handler plumbing, nor have I added any specific code to test this. Kinda wanted to make sure I'm barking up the right tree.

Does this sound like something that would be useful to include in the kernel gateway? I'd be happy to submit a PR so you can take a closer look.

Regards, Kevin.

parente commented 7 years ago

@kevin-bates Thanks for sharing that info and the offer to contribute a PR. Let me share some work happening in the notebook project before you do.

https://github.com/jupyter/notebook/pull/1827 recently added an activity tracking API to the notebook server itself. I've been thinking that JKG 2.0 should make a breaking API change (which it already needs to do in other places for notebook 5.0 compatibility), drop its own activity API, and expose the one from the notebook server. This will cut down on the amount of redundant code we have to maintain and increase API compatibility across the projects.

@minrk mentioned wanting to add an idle kernel culler to the notebook as a follow-on to the PR above. It'd be good to sync on the status of that before implementing a culler here. If one gets added to the notebook server, we can simply turn it on in the JKG.

kevin-bates commented 7 years ago

@parente Thank you for your response and the pointer to the similar effort happening in jupyter/notebook relative to activity tracking. It makes perfect sense to cull from notebook provided there's an ability to walk across the set of known kernels. I'll take a deeper look within that project.

I'd prefer we leave this issue open until a direction is taken - if that's okay with the community.

minrk commented 7 years ago

Adding a culler service makes sense. I did want to do this upstream, but didn't get to it in time for 5.0. It would have been very similar to what you have proposed, though I wasn't planning to add an API endpoint for retrieving the history of cull events. I would prefer not to include that, unless there is a strong reason - there are no other examples of historical data retrievable from the notebook API, only current state. If you'd prefer, you can make the PR directly to the notebook package instead, which can land in 5.1.0.

> The primary piece added is an IOLoop instance used to perform the periodic checks.

You shouldn't need this. You can create a PeriodicCallback attached to the existing IOLoop.current() already running in the application.
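As a rough sketch of that suggestion (not the code that was eventually merged), a Tornado PeriodicCallback can be started from code already running on the application's loop. The helper names `make_culler` and `demo` and the stand-in check function are hypothetical, and the demo uses a very short interval so it finishes quickly.

```python
import asyncio

from tornado.ioloop import PeriodicCallback


def make_culler(check_idle, interval_seconds):
    """Wrap a check function in a PeriodicCallback (callback_time is in milliseconds)."""
    return PeriodicCallback(check_idle, interval_seconds * 1000)


async def demo():
    calls = []
    # Stand-in for the real idle check; it only records that it ran.
    culler = make_culler(lambda: calls.append("checked"), 0.05)
    culler.start()  # attaches to the IOLoop that is already running this coroutine
    await asyncio.sleep(0.3)
    culler.stop()
    return len(calls)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In a real server the callback would be started once during startup and left running, with an interval such as the proposed 300 seconds.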

kevin-bates commented 7 years ago

@minrk thank you for your response. I don't have a strong requirement about an endpoint to gather culled kernel information. I was mostly modeling my changes after the activity tracking in JKG. I'm perfectly fine not implementing that.

Thanks for the tip regarding PeriodicCallback - I'll be sure to use the primary IOLoop instance.

Looking at both the notebook and kernel_gateway code, I do have some questions regarding process startup and option handling.

Since it appears that NotebookApp and KernelGatewayApp are two different entrypoints, do we need to duplicate the handling of options (as seems to be the case for allow_origin), or would it be possible to have, say, notebook/kernelmanager.py do this and expose this feature option as --KernelManager.cull_idle_kernel_after_minutes and --KernelManager.cull_idle_kernel_interval_seconds?
Likewise, it seems the construction and start of the PeriodicCallback would need to happen in a once-only fashion in MappingKernelManager.start_kernel(), since there really isn't any shared startup sequence between Notebook and KernelGateway. Is that a correct assessment or am I missing some shared piece? I had expected to see KernelGatewayApp inherit from NotebookApp such that this kind of thing could be placed in NotebookApp.start().

Thanks.

kevin-bates commented 7 years ago

I answered my questions above (hopefully correctly) and submitted PR 2215.

kevin-bates commented 7 years ago

Closing this issue since culling will occur in Notebook 5.next.

rgerkin commented 5 years ago

The culling script didn't work for me (probably my fault) and I didn't have time to debug it, especially as my ~100 users were out of memory and needed a hotfix. I ran this crude bash script to get rid of all of the kernels started in March:

#!/bin/bash

while true; do
    # Find the PID of the oldest process launched by the kernel launcher
    OLDEST=$(pgrep -fo ipykernel_launcher)
    # Stop if no kernel processes remain
    [ -z "$OLDEST" ] && break
    # An empty string unless the process with PID $OLDEST was started in March
    # ... or for some other reason has 'Mar' in the ps output, an edge case I didn't worry about
    STARTED_IN_MARCH=$(ps -f "$OLDEST" | grep Mar | awk '{print $5}')
    if [ -n "$STARTED_IN_MARCH" ]; then
        echo "$OLDEST was started in March.  Killing process..."
        kill "$OLDEST"
    else
        # The oldest kernel process was not started in March, so stop here.
        break
    fi
done

Just posting this here in case someone else wants to modify it for their own emergency purposes.

kevin-bates commented 5 years ago

@rgerkin - Culling has been built into Notebook since the 5.1 release. Since Kernel Gateway depends on Notebook, you can configure it just as you would for Notebook. For example, the following parameters will check every 5 minutes for disconnected kernels that have been idle for 10 hours: --MappingKernelManager.cull_idle_timeout=36000 --MappingKernelManager.cull_interval=300
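Both options take seconds, so the values above are 10 hours and 5 minutes converted. A small sketch of that arithmetic follows; the helper name `cull_flags` is hypothetical, and only the two option names come from the comment above.

```python
def cull_flags(idle_hours, check_minutes):
    """Build the two culling flags from human-friendly units (both options use seconds)."""
    return [
        f"--MappingKernelManager.cull_idle_timeout={int(idle_hours * 3600)}",
        f"--MappingKernelManager.cull_interval={int(check_minutes * 60)}",
    ]


if __name__ == "__main__":
    # Prints the command-line arguments for a 10-hour timeout checked every 5 minutes.
    print(" ".join(cull_flags(10, 5)))
```

The same values can also be set in a config file, for example `c.MappingKernelManager.cull_idle_timeout = 36000` in jupyter_notebook_config.py.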

Because your script terminates the kernel processes without going through the Kernel Manager, I'm a little surprised it didn't trigger the auto-restart functionality built into Jupyter. If you didn't see that behavior, then the kernels were probably orphaned, perhaps due to restarts of JKG.

Btw, if you're having resource issues due to the kernels for all of your users being pinned to the JKG server, you might take a look at Jupyter Enterprise Gateway. JEG is built directly on JKG, but can manage remote kernels distributed across a compute cluster using resource managers like Hadoop YARN, Kubernetes, Docker Swarm, etc. This is the primary reason JEG exists.

rgerkin commented 5 years ago

@kevin-bates Thanks! Can I put those into my jupyterhub_config.py as well and get that behavior on my next hub restart?

kevin-bates commented 5 years ago

I'm not very familiar with hub configurations, but according to this, it looks like you could add these options to c.Spawner.args or configure in the jupyter_notebook_config.py file.

I suppose there's a chance that taking the latter approach may prevent the need for restarting Hub, but I don't know for sure.

SonakshiGrover commented 3 years ago

Hey Everyone. I have been following this thread and have a very basic question!

There are 2 sets of culling parameters - one for Kernels and the other for Kernel Gateways. What exactly is the difference between them, and which do we configure when? A little explanation here would be helpful to me. I read the documentation, but this isn't very clear there.

kevin-bates commented 3 years ago

Hi @SonakshiGrover.

The culling of kernels - which was initially implemented in Kernel Gateway but then moved into the Notebook server - is tied to the MappingKernelManager.cull* options.

The culling of servers is a JupyterHub configuration option that monitors activities against a particular user's spawned notebook server. JupyterHub does not typically (although it could probably be configured to) spawn Gateways since these are inherently "multi-user". Instead, the Hub operator would configure the spawned notebook servers to redirect their kernel operations to a single Kernel (or Enterprise) Gateway. My understanding is that server culling is configured at the Hub level.

If this doesn't clear things up, could you please include references to where you see the culling parameters for Kernel Gateways? (Thanks)

SonakshiGrover commented 3 years ago

@kevin-bates. Thanks for the info!

Sorry for not making my question clear enough. I am referring to the cull options listed here. On this page, there are MappingKernelManager.cull options as well as GatewayKernelManager.cull options. Based on what you explained above, these parameters are at the Notebook server level, but I am not sure of the difference between these 2 sets of options. I think it essentially boils down to the difference between kernels and gateway kernels in Jupyter's mechanics, which I am not sure of. If you could shed some light on this or share some references, that would be very helpful.

kevin-bates commented 3 years ago

Ah - okay - thanks for that additional reference.

The GatewayKernelManager culling options are not applicable, since that particular kernel manager is what redirects requests to the gateway server, where MappingKernelManager does the actual work (and management). Unfortunately, because GatewayKernelManager derives from MappingKernelManager, it shows "inherited" configurable options, which, I agree, is confusing. Since the --help-all output is auto-generated, I'm not aware of a way to prevent the output of inherited configurable traits that are not applicable to the derived class (in this case, GatewayKernelManager).
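A small illustration of why the inherited options show up, using traitlets directly. The two classes are hypothetical stand-ins, not the real MappingKernelManager or GatewayKernelManager.

```python
from traitlets import Int
from traitlets.config import Configurable


class BaseManager(Configurable):
    """Hypothetical stand-in for the manager that actually performs culling."""

    cull_idle_timeout = Int(
        0, help="Idle seconds before a kernel is culled; 0 disables culling."
    ).tag(config=True)


class ForwardingManager(BaseManager):
    """Hypothetical stand-in for a subclass that only forwards requests elsewhere."""


# Configurable traits are collected across the class hierarchy, so the subclass
# reports the trait even though it never uses it.
inherited = ForwardingManager.class_traits(config=True)

if __name__ == "__main__":
    print(sorted(inherited))
    print(ForwardingManager.class_get_help())
```

The generated help for the subclass lists the inherited trait under the subclass's own name, which matches the confusing output described above.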

SonakshiGrover commented 3 years ago

I see. This certainly clears up my doubts! In my use case, I just need to make sure that notebook kernels are culled at regular intervals, and now I understand that I only need to configure the MappingKernelManager.cull options. I was earlier confused about the GatewayKernelManager options and whether they in any way affect the MappingKernelManager.cull options, but now it's clear to me.

Thanks a lot for the clarification @kevin-bates

jupyter-server / kernel_gateway

Add ability to cull idle kernels after specified period #226