@kevin-bates Thanks for sharing that info and the offer to contribute a PR. Let me share some work happening in the notebook project before you do.
https://github.com/jupyter/notebook/pull/1827 recently added an activity tracking API to the notebook server itself. I've been thinking that JKG 2.0 should make a breaking API change (which it already needs to do in other places for notebook 5.0 compatibility), drop its own activity API, and expose the one from the notebook server. This will cut down on the amount of redundant code we have to maintain and increase API compatibility across the projects.
@minrk mentioned wanting to add an idle kernel culler to the notebook as a follow-on to the PR above. It'd be good to sync on the status of that before implementing a culler here. If one gets added to the notebook server, we can simply turn it on in the JKG.
@parente Thank you for your response and the pointer to the similar activity-tracking effort happening in jupyter/notebook. It makes perfect sense to cull from the notebook server, provided there's an ability to walk across the set of known kernels. I'll take a deeper look within that project.
I'd prefer we leave this issue open until a direction is taken - if that's okay with the community.
Adding a culler service makes sense. I did want to do this upstream, but didn't get to it in time for 5.0. It would have been very similar to what you have proposed, though I wasn't planning to add an API endpoint for retrieving the history of cull events. I would prefer not to include that, unless there is a strong reason - there are no other examples of historical data retrievable from the notebook API, only current state. If you'd prefer, you can make the PR directly to the notebook package instead, which can land in 5.1.0.
> The primary piece added is an IOLoop instance used to perform the periodic checks.

You shouldn't need this. You can create a `PeriodicCallback` attached to the existing `IOLoop.current()` already running in the application.
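In other words, something along these lines (a minimal sketch; the callback body is just a placeholder):

```python
from tornado.ioloop import PeriodicCallback

def cull_idle_kernels():
    # Illustrative placeholder: walk the known kernels and shut down any
    # whose idle time exceeds the configured limit.
    pass

# start() schedules the callback on IOLoop.current(), i.e., the loop the
# application is already running; no separate IOLoop instance is needed.
# The interval is given in milliseconds (here, every 5 minutes).
culler = PeriodicCallback(cull_idle_kernels, 5 * 60 * 1000)
culler.start()
```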
@minrk thank you for your response. I don't have a strong requirement for an endpoint to gather culled-kernel information. I was mostly modeling my changes after the activity tracking in JKG. I'm perfectly fine not implementing that.
Thanks for the tip regarding PeriodicCallback - I'll be sure to use the primary IOLoop instance.
Looking at both the notebook and kernel_gateway code, I do have some questions regarding process startup and option handling. Should each application handle these options itself (like `allow_origin`), or would it be possible to have, say, notebook/kernelmanager.py do this and expose the feature via options like `--KernelManager.cull_idle_kernel_after_minutes` and `--KernelManager.cull_idle_kernel_interval_seconds`? It seems the periodic check would need to be kicked off from `MappingKernelManager.start_kernel()`, since there really isn't any shared startup sequence between Notebook and KernelGateway. Is that a correct assessment, or am I missing some shared piece? I had expected to see KernelGatewayApp inherit from NotebookApp such that this kind of thing could be placed in NotebookApp.start(). Thanks.
I answered my questions above (hopefully correctly) and submitted PR 2215.
Closing this issue since culling will occur in Notebook 5.next.
The culling script didn't work for me (probably my fault) and I didn't have time to debug it, especially as my ~100 users were out of memory and needed a hotfix. I ran this crude bash script to get rid of all of the kernels started in March:
```bash
#!/bin/bash
while true; do
    # Find the PID of the oldest process launched by the kernel launcher
    OLDEST=$(pgrep -fo ipykernel_launcher)
    # Stop if no kernel processes remain
    [ -z "$OLDEST" ] && break
    # An empty string unless the process with pid $OLDEST was started in March
    # ... or for some other reason has 'Mar' in the ps output, an edge case I didn't worry about
    STARTED_IN_MARCH=$(ps -fp "$OLDEST" | grep Mar | awk '{print $5}')
    if [ -n "$STARTED_IN_MARCH" ]; then
        echo "$OLDEST was started in March. Killing process..."
        kill "$OLDEST"
    else
        # The oldest kernel process was not started in March, so stop here.
        break
    fi
done
```
Just posting this here in case someone else wants to modify it for their own emergency purposes.
@rgerkin - Culling has been built into Notebook since the 5.0 release. Since Kernel Gateway depends on Notebook, you can configure it just as you would for Notebook. For example, the following parameters will check every 5 minutes for disconnected kernels that have been idle for 10 hours:

`--MappingKernelManager.cull_idle_timeout=36000 --MappingKernelManager.cull_interval=300`
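The same settings can also live in a config file (e.g., `jupyter_notebook_config.py`; the exact file name depends on how you launch the server):

```python
c.MappingKernelManager.cull_idle_timeout = 36000  # seconds of idle time before culling (10 hours)
c.MappingKernelManager.cull_interval = 300        # how often to check, in seconds (5 minutes)
```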
I'm a little surprised your script didn't trigger the auto-restart functionality built into Jupyter, since it doesn't go through the kernel manager to terminate the kernel processes. If you didn't see that behavior, then the kernels were probably orphaned, perhaps due to restarts of JKG.
Btw, if you're having resource issues due to the kernels for all of your users being pinned to the JKG server, you might take a look at Jupyter Enterprise Gateway. JEG is built directly on JKG, but can manage remote kernels distributed across a compute cluster using resource managers like Hadoop YARN, Kubernetes, Docker Swarm, etc. This is the primary reason JEG exists.
@kevin-bates Thanks! Can I put those into my `jupyterhub_config.py` as well and get that behavior on my next hub restart?
I'm not very familiar with hub configurations, but according to this, it looks like you could add these options to `c.Spawner.args` or configure them in the `jupyter_notebook_config.py` file. I suppose there's a chance that taking the latter approach may avoid the need to restart the Hub, but I don't know for sure.
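If it helps, a sketch of the `c.Spawner.args` route in `jupyterhub_config.py` (assuming default spawner behavior; untested on my end):

```python
# Pass the culling options through to each user's spawned notebook server
c.Spawner.args = [
    '--MappingKernelManager.cull_idle_timeout=36000',
    '--MappingKernelManager.cull_interval=300',
]
```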
Hey Everyone. I have been following this thread and have a very basic question!
There are two sets of culling parameters - one for kernels and the other for kernel gateways. What exactly is the difference between them, and which do we configure when? A little explanation here would be helpful to me. I read the documentation, but this isn't very clear there.
Hi @SonakshiGrover.
The culling of kernels - which was initially implemented in Kernel Gateway but then moved into the Notebook server - is tied to the `MappingKernelManager.cull*` options.
The culling of servers is a JupyterHub configuration option that monitors activity on a particular user's spawned notebook server. JupyterHub does not typically spawn Gateways (although it could probably be configured to), since these are inherently "multi-user". Instead, the Hub operator would configure the spawned notebook servers to redirect their kernel operations to a single Kernel (or Enterprise) Gateway. My understanding is that server culling is configured at the Hub level.
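For reference, a sketch of how such server culling is typically wired up on the Hub side, using the jupyterhub-idle-culler package (which packages the older cull_idle_servers.py script; exact permissions depend on your Hub version, so treat this as illustrative):

```python
import sys

# Hub-managed service that periodically culls idle single-user servers
c.JupyterHub.services = [
    {
        'name': 'idle-culler',
        'command': [sys.executable, '-m', 'jupyterhub_idle_culler', '--timeout=3600'],
        'admin': True,  # JupyterHub < 2.0; newer versions grant scopes via load_roles instead
    }
]
```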
If this doesn't clear things up, could you please include references to where you see the culling parameters for Kernel Gateways? (Thanks)
@kevin-bates. Thanks for the info!
Sorry for not making my question clear enough. I am referring to the cull options listed here. On that page, there are `MappingKernelManager.cull*` options as well as `GatewayKernelManager.cull*` options. So, based on what you explained above, these parameters are at the Notebook server level, but I am not sure of the difference between the two sets of options. I think it essentially boils down to the difference between kernels and gateway kernels in Jupyter's mechanics, which I am not sure of. If you could shed some light on this or share some references, that would be very helpful.
Ah - okay - thanks for that additional reference.
The `GatewayKernelManager` culling options are not applicable, since that particular kernel manager merely redirects requests to the gateway server, where `MappingKernelManager` does the actual work (and management). Unfortunately, because `GatewayKernelManager` derives from `MappingKernelManager`, it shows "inherited" configurable options that, I agree, can be confusing. Since `--help-all` output is auto-generated, I'm not aware of a way to prevent the output of inherited configurable traits that are not applicable to the derived class (in this case, `GatewayKernelManager`).
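To make the distinction concrete, here's a sketch of where each knob applies (hostname and values illustrative):

```python
# --- jupyter_kernel_gateway_config.py (on the Gateway server) ---
# MappingKernelManager does the actual kernel management, so culling
# is configured here and takes effect here.
c.MappingKernelManager.cull_idle_timeout = 36000
c.MappingKernelManager.cull_interval = 300

# --- jupyter_notebook_config.py (on the client Notebook server) ---
# The notebook just redirects kernel requests to the gateway;
# GatewayKernelManager's inherited cull options have no effect here.
c.GatewayClient.url = 'http://my-gateway-host:8888'
```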
I see. This certainly clears up my doubts! In my use case, I just need to make sure that notebook kernels are culled at regular intervals, and I now understand that I only need to configure the `MappingKernelManager.cull*` options. I was earlier confused about the `GatewayKernelManager` options and whether they affect the `MappingKernelManager.cull*` options in any way, but now it's clear to me.
Thanks a lot for the clarification @kevin-bates
Hello,
This topic is essentially an extension to Issue #96 which led to PR #97 and the introduction of activity monitoring in the JKG.
I've implemented a solution within the JKG which introduces another internal service (idle_culler) to periodically check the current set of registered kernels to see if they've been idle for too long, at which point they are culled (deleted). The culling behavior leverages the activity monitoring and defines a kernel's idleness by the time since the most recent message to the client (`last_message_to_client`) or kernel (`last_message_to_kernel`). It could easily be extended to incorporate any of the other activity metrics as well. As @rwhorman mentioned in #96, the typical period of a given JKG would be on the order of 12-24 hours (again, thinking spark jobs and resource consumption).

Two options have been introduced:

- `KernelGatewayApp.cull_idle_kernel_period`, indicating the allowed idle time in seconds. Default = 0, i.e., off.
- `KernelGatewayApp.cull_idle_kernel_interval`, indicating the frequency (seconds) at which to check for idle kernels. Default = 300.

Since I'm new to Python (and open source), I wanted to run my general approach past others first. I chose to use a separate module (services/idle_culler), but this could easily be incorporated into the activity module. The primary piece added is an IOLoop instance used to perform the periodic checks. I also added code to capture the time each kernel was culled and its idle duration at culling time, the idea being that we could expose a GET request (/_api/activity/culled or /_api/culled/) to gather this information if that's desired.
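The idle determination is roughly along these lines (a simplified sketch, not the actual implementation; the activity-dict access and names are illustrative):

```python
from datetime import datetime, timezone

def is_idle(values, cull_idle_kernel_period):
    """Illustrative check: has this kernel been quiet longer than the period?

    `values` is assumed to be a kernel's activity dict containing datetime
    entries for 'last_message_to_client' / 'last_message_to_kernel'.
    """
    last = max(
        (t for t in (values.get('last_message_to_client'),
                     values.get('last_message_to_kernel')) if t is not None),
        default=None,
    )
    if last is None:
        return False  # no recorded activity yet; don't cull
    age = (datetime.now(timezone.utc) - last).total_seconds()
    return age > cull_idle_kernel_period
```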
I have yet to finish the handler plumbing, nor have I added any code to test this - I kinda wanted to make sure I'm barking up the right tree first.
Does this sound like something that would be useful to include in the kernel gateway? I'd be happy to submit a PR so you can take a closer look.
Regards, Kevin.