pangeo-data / pangeo-cloud-federation

Deployment automation for Pangeo JupyterHubs on AWS, Google, and Azure
https://pangeo.io/cloud.html

Implement dask-gateway cluster limits per-user #693

Open · scottyhq opened this issue 3 years ago

scottyhq commented 3 years ago

The first iteration of dask-gateway allowed easily setting limits per-user. As of 0.7.1 this is no longer the case due to some refactoring; now we can only limit the overall cluster directly, as described in https://github.com/dask/dask-gateway/issues/186#issuecomment-577730825.

I think a top priority is figuring out user limits via kubernetes ResourceQuotas as suggested in that issue. Has anybody looked into this yet? @TomAugspurger, @consideRatio, @salvis2 ?
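For reference, the ResourceQuota route would look something like the sketch below, assuming each user's dask pods land in a dedicated namespace (the namespace name and the limit values here are hypothetical, just to make the shape concrete):

```yaml
# Hypothetical: a quota applied in a per-user namespace, capping the total
# resources of all dask pods (workers + scheduler) that user can run.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: dask-user-quota
  namespace: dask-user-alice   # hypothetical per-user namespace
spec:
  hard:
    requests.cpu: "16"         # at most 16 CPUs requested across all pods
    requests.memory: 64Gi      # at most 64Gi requested across all pods
    pods: "20"                 # at most 20 pods total
```

The catch, as discussed below, is that a plain ResourceQuota applies to a whole namespace, so this only becomes per-user if users get dedicated namespaces.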

A user can currently run code like this accidentally and end up with many clusters and tons of idle workers (until kernel is restarted or they log out). Similarly users can end up using all available workers at the expense of other users.

```python
from dask_gateway import GatewayCluster

cluster = GatewayCluster()
cluster.scale(4)
cluster
```
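As a stopgap for the runaway case above, a user (or an admin hook) can at least enumerate and tear down their stray clusters with the existing `Gateway` client API (`list_clusters()` / `stop_cluster()`); the helper below is a hypothetical convenience wrapper, not part of dask-gateway:

```python
# Hypothetical helper: stop every cluster the gateway reports for this user.
# Uses the real dask-gateway client methods list_clusters() and stop_cluster().
def stop_all_clusters(gateway):
    """Shut down all of this user's running clusters; return their names."""
    stopped = []
    for report in gateway.list_clusters():   # ClusterReport objects
        gateway.stop_cluster(report.name)    # tears down scheduler + workers
        stopped.append(report.name)
    return stopped
```

Usage would be something like `from dask_gateway import Gateway; stop_all_clusters(Gateway())`, but this only cleans up after the fact; it does not prevent a user from over-allocating in the first place.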
TomAugspurger commented 3 years ago

I haven't looked into kubernetes-based limits yet.


consideRatio commented 3 years ago

I have not explored this thoroughly or reached a suggested course of action, but here is a rubber-duck text about what I considered.

Dear rubber duck!

I'm not sure how to go about this in a k8s-native way without having dedicated namespaces for users, which would require a lot of additional work. But is that assumption correct: do we need dedicated namespaces for users?

A ResourceQuota can be applied within a namespace, to a certain scope as described by a scopeSelector; it is defined in the k8s API reference. But it seems it cannot target a set of pods based on a label, as discussed here; apparently it can only target pods with a certain PriorityClass.

Pods can be assigned a specific priorityClassName, so if we create one for each user, they could perhaps be targeted that way... but that is quite a messy solution as well, I think.
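For what it's worth, quota scoping by PriorityClass does exist in recent k8s versions, so the per-user-PriorityClass idea would look roughly like this sketch (all names and limit values hypothetical):

```yaml
# Hypothetical per-user PriorityClass plus a quota scoped to it, so the quota
# counts only pods that set priorityClassName: dask-user-alice.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: dask-user-alice
value: 0                        # neutral scheduling priority; used only for quota scoping
globalDefault: false
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: dask-user-alice-quota
spec:
  hard:
    requests.cpu: "16"
    pods: "20"
  scopeSelector:
    matchExpressions:
      - operator: In
        scopeName: PriorityClass
        values: ["dask-user-alice"]
```

One PriorityClass and one ResourceQuota per user is exactly the messiness described above, but it would avoid per-user namespaces.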