nebari-dev / nebari

🪴 Nebari - your open source data science platform
https://nebari.dev

[ENH] - Make idle culler settings easily configurable and documented how to change #1283

Closed: costrouc closed this issue 1 year ago

costrouc commented 2 years ago

Feature description

Currently, much of the idle culler configuration is hard-coded. @rsignell-usgs raised this as a concern: the current timeout is too short in some cases.
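As a strawman, a setting along these lines in the qhub-config.yaml could work (the key names below are invented for illustration, not an existing schema):

jupyterlab:
  idle_culler:                                 # hypothetical key, does not exist yet
    kernel_cull_idle_timeout: 1800             # seconds before an idle kernel is culled
    terminal_cull_inactive_timeout: 1800       # seconds before an inactive terminal is culled
    server_shutdown_no_activity_timeout: 1800  # shut the server down after this much inactivity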

Value and/or benefit

The default idle timeout does not work for everyone.

Anything else?

No response

viniciusdc commented 2 years ago

Hi @costrouc, do they want a per-user configuration, or are they happy to have it set in the qhub-config?

rsignell-usgs commented 2 years ago

@viniciusdc and @costrouc, we would be happy to set this in the qhub-config.
One of the worst aspects of the timeout being so short is that terminal sessions disappear. Thanks for taking a look!

rsignell-usgs commented 2 years ago

Folks, what would it take to enable this?

This is the top complaint I've heard from ESIP Qhub users.

Even if it weren't configurable and the qhub devs just made it longer, that would be wonderful. Right now it must be 5 minutes, right?

It would be great if dask clusters spun down after 30 minutes, and notebooks after 90 minutes or 3 hours.

Just for comparison: AWS SageMaker Studio Lab, the free notebook offering from AWS, times out after 4 hours for a GPU and 12 hours for a CPU.

iameskild commented 2 years ago

Hi @rsignell-usgs, I will make sure this issue is prioritized for our next sprint (which starts next week). I can't promise it will be configurable from the qhub-config.yaml, but I will work with the team to come up with a workable solution ASAP. Thanks again for the reminder!!

rsignell-usgs commented 2 years ago

Okay, thanks @iameskild. The users will definitely appreciate any improvement in the situation, even if not configurable!

rsignell-usgs commented 1 year ago

@iameskild , I remember you showed me how to (temporarily) override the short culler settings by connecting to some pod and editing a config file, right? After the upgrade from 0.4.3 to 0.4.4, the users are screaming again about the too-short timeout for their servers.

iameskild commented 1 year ago

Hey @rsignell-usgs, for now, you can manually edit the etc-jupyter configmap if you want to make changes to the timeout settings.
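Roughly, the object you would be editing looks like the sketch below. The ConfigMap name is etc-jupyter as mentioned above, but I'm writing the namespace and the data key from memory, so double-check both:

# Sketch only: namespace and data key are assumptions, not verified
apiVersion: v1
kind: ConfigMap
metadata:
  name: etc-jupyter
  namespace: dev                   # assumption: the default QHub deployment namespace
data:
  jupyter_notebook_config.py: |    # assumption: the key holding the Jupyter server config
    # example change: cull idle kernels after 30 minutes
    c.MappingKernelManager.cull_idle_timeout = 30 * 60

Servers started after the edit should pick up the new values.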

I still have to circle back to this when I have more time, but as a quick update: I have been looking into using Terraform's templatefile function to make these values more easily configurable.

viniciusdc commented 1 year ago

This can also be achieved by using overrides in the jupyterhub configuration to change the idle-culling values. Right now, the values that can be changed are those here:

jupyterhub:
  overrides:
    cull:
      users: true

Some values come from the idle-culler extension, and for those, the override method above is currently the only way to update them.
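For example, assuming these keys are passed straight through to the JupyterHub Helm chart's cull block (the names below come from that chart; double-check them against the deployed version):

jupyterhub:
  overrides:
    cull:
      enabled: true    # run the idle-culler service at all
      users: true      # also remove the user record when culling their server
      timeout: 5400    # seconds of inactivity before a user server is culled
      every: 600       # seconds between cull checks
      maxAge: 0        # if > 0, cull servers this old even if active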

rsignell-usgs commented 1 year ago

To change these, I can use k9s to ssh into the hub-** pod and then just edit them?

iameskild commented 1 year ago

@rsignell-usgs yep, just edit the file. You may need to kill the hub pod for the changes to take effect.

rsignell-usgs commented 1 year ago

What is the filename once I've ssh'ed into the hub pod?

rsignell-usgs commented 1 year ago

Here's the workaround recipe that should modify the cull settings (at least until the next qhub/nebari version is deployed):

Just for the record, I set everything to 30 minutes:


    # The interval (in seconds) on which to check for terminals exceeding the
    # inactive timeout value.
    c.TerminalManager.cull_interval = 30 * 60

    # cull_idle_timeout: timeout (in seconds) after which an idle kernel is
    # considered ready to be culled
    c.MappingKernelManager.cull_idle_timeout = 30 * 60

    # cull_interval: the interval (in seconds) on which to check for idle
    # kernels exceeding the cull timeout value
    c.MappingKernelManager.cull_interval = 30 * 60

    # cull_connected: whether to consider culling kernels which have one
    # or more connections
    c.MappingKernelManager.cull_connected = True

    # cull_busy: whether to consider culling kernels which are currently
    # busy running some code
    c.MappingKernelManager.cull_busy = False

    # Shut down the server after N seconds with no kernels or terminals
    # running and no activity.
    c.NotebookApp.shutdown_no_activity_timeout = 30 * 60