nebari-dev / nebari

🪴 Nebari - your open source data science platform
https://nebari.dev
BSD 3-Clause "New" or "Revised" License

[ENH] - Add kubernetes horizontal autoscaler for conda-store workers based on queue depth #2284

Open dcmcand opened 4 months ago

dcmcand commented 4 months ago

Feature description

Currently conda-store is set to allow 4 simultaneous builds. This becomes a bottleneck once multiple environments are being built at once and presents a scaling challenge. If we set the number of simultaneous builds per worker to 1 and autoscale the workers based on queue depth, we should be able to handle scaling far more gracefully.
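
As a rough illustration of the first half of that idea, limiting each worker to one build would be done in the conda-store worker configuration; a minimal sketch, assuming conda-store's traitlets-based config exposes a concurrency setting (the exact trait name below is an assumption, not confirmed against conda-store):

# conda_store_config.py -- hedged sketch; verify the trait name against conda-store's docs.
# With one build per worker pod, throughput scaling becomes purely a matter of
# adding or removing worker pods based on queue depth.
c.CondaStoreWorker.concurrency = 1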

Value and/or benefit

Having the conda-store workers autoscale based on queue depth will allow larger orgs to take advantage of Nebari without hitting scale bottlenecks.

Anything else?

https://learnk8s.io/scaling-celery-rabbitmq-kubernetes

pt247 commented 4 months ago

Options

We have two options to achieve this:

  1. Horizontal Pod Autoscaler
  2. KEDA (Kubernetes-based Event-driven Autoscaling)

Option#1 Horizontal Pod Autoscaler based on external metrics and a load monitor/watcher.

Ref: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/

The sequence of events:

  1. A build watcher queries the conda-store database every 5 seconds and publishes the total number of queued tasks.
  2. The Horizontal Pod Autoscaler takes this value as an external metric to scale on (a fuller manifest sketch follows this list):

     - type: External
       external:
         metric:
           name: queue_messages_ready
           selector:
             matchLabels:
               queue: "worker_tasks"
         target:
           type: AverageValue   # This needs to change accordingly.
           averageValue: 0
  3. The Horizontal Pod Autoscaler (HPA) creates new worker pods according to the queue depth.
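
For item 2, a fuller sketch of what the HPA manifest could look like; names and thresholds are placeholders, and this assumes an external metrics adapter (for example prometheus-adapter) is installed and serving queue_messages_ready:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: conda-store-worker-hpa         # hypothetical name
  namespace: dev
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nebari-conda-store-worker
  minReplicas: 1
  maxReplicas: 10                      # placeholder upper bound
  metrics:
  - type: External
    external:
      metric:
        name: queue_messages_ready
        selector:
          matchLabels:
            queue: "worker_tasks"
      target:
        type: AverageValue
        averageValue: "1"              # aim for roughly one queued task per worker pod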

Option#2 KEDA (Kubernetes-based Event-driven Autoscaling)

Refs:
https://blogs.halodoc.io/autoscaling-k8s-deployments-with-external-metrics/
https://keda.sh/docs/2.13/scalers/
https://keda.sh/docs/2.13/concepts/external-scalers/
https://keda.sh/docs/2.13/scalers/rabbitmq-queue/
https://keda.sh/docs/2.13/scalers/redis-cluster-lists/
https://keda.sh/docs/2.13/scalers/redis-lists/
https://keda.sh/docs/2.13/scalers/postgresql/

The PostgreSQL scaler allows us to run a query against a database, which means we can simply point it at the existing conda-store database to get the queue depth of pending jobs.
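
As a sanity check, that queue-depth query can be run by hand against the conda-store database before wiring it into a scaler. A sketch using kubectl; the postgres pod name is an assumption, and the secret name/key and the query itself are taken from the POC later in this thread:

# Read the postgres password from the existing Kubernetes secret (stored base64-encoded).
PGPASSWORD=$(kubectl get secret nebari-conda-store-postgresql -n dev \
  -o jsonpath='{.data.postgresql-password}' | base64 -d)
# Run the queue-depth query inside the postgres pod (pod name assumed; adjust to the deployment).
kubectl exec -n dev nebari-conda-store-postgresql-postgresql-0 -- \
  env PGPASSWORD="$PGPASSWORD" psql -U postgres -d conda-store \
  -c "SELECT COUNT(*) FROM build WHERE status IN ('QUEUED', 'BUILDING');"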

Pros and Cons

Option#1

pt247 commented 4 months ago

Should this be part of conda-store?

Regardless of the option we take, this can be moved upstream to conda-store.

pt247 commented 4 months ago

We should agree on these before we start. Please suggest. Thanks.

dcmcand commented 4 months ago

@pt247 Conda-store already has a queue; it uses Redis and Celery. I expect we can pull queue depth from that, so we shouldn't need to deploy extra infrastructure there. The nebari-conda-store-redis-master stateful set is what you are looking for.

I am unfamiliar with KEDA, but it does look promising and has a Redis scaler too. In general I prefer to use built-in solutions as my default, so the horizontal autoscaler was my first thought, but if KEDA allows for better results with less complexity then I can see going with that. KEDA is a CNCF project that seems to be actively maintained, so that is good.
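
Since KEDA's Redis scaler came up, a minimal sketch of what scaling off the existing queue could look like with KEDA's redis-lists trigger pointed at the nebari-conda-store-redis-master service. The list name, port, and auth reference are assumptions: Celery's default broker queue is a Redis list named "celery", but conda-store may use a different name:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: conda-worker-redis-scaler                     # hypothetical name
  namespace: dev
spec:
  scaleTargetRef:
    name: nebari-conda-store-worker
  triggers:
  - type: redis
    metadata:
      address: nebari-conda-store-redis-master:6379   # existing Redis master service; default port assumed
      listName: celery                                # Celery's default queue list; adjust if overridden
      listLength: "1"                                 # target of roughly one pending task per worker
    authenticationRef:
      name: keda-trigger-auth-redis-secret            # hypothetical TriggerAuthentication holding the Redis password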

As to whether this solution belongs in conda-store, I will simply say: it does not. Conda-store allows for horizontal scaling by having a queue with a worker pool. That is where conda-store's responsibility ends. Building specific implementation details for scaling on Nebari into conda-store would cross software boundaries and greatly increase coupling between the projects. That would be moving in the wrong direction. We want to decrease coupling between conda-store and Nebari. conda-store has a method for scaling horizontally; it is on Nebari to implement autoscaling that fits its particular environment.

Adam-D-Lewis commented 4 months ago

I bet the conda-store devs would have comments on this, and it would likely be implemented in conda-store. It seems like this issue should be transferred to the conda-store repo to improve visibility with the conda-store devs.

viniciusdc commented 4 months ago

We want to decrease coupling between conda-store and Nebari. conda-store has a method for scaling horizontally; it is on Nebari to implement autoscaling that fits its particular environment.

I also agree that conda-store already has a sound scaling system; however, we are not taking advantage of it in our own deployment. Having multiple Celery workers is already supported (both Redis and Celery handle the task load balancing by themselves); what we need to discuss is how to handle the worker scaling on our Kubernetes infrastructure.

Right now, scaling is a manual process that depends on someone creating more workers. We need a way to automate it. I initially suggested using the queue depth in Redis to manage this, which would trigger a CRD to change the number of replicas the worker deployment should have.

dcmcand commented 4 months ago

Either KEDA or the horizontal autoscaler would work here, and both can scale automatically based on queue depth. KEDA seems a bit more elegant in its implementation, so I would suggest starting with it to see if it works and, if for some reason it doesn't, falling back to the horizontal autoscaler.

pt247 commented 3 months ago

Notes on POC

Installing KEDA:

helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace dev
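
A quick way to confirm the operator came up and registered its CRDs (plain kubectl, nothing Nebari-specific):

kubectl get crd scaledobjects.keda.sh triggerauthentications.keda.sh
kubectl get pods --namespace dev | grep keda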

ScaledObject spec:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: scaled-conda-worker
  namespace: dev
spec:
  scaleTargetRef:
    kind:          Deployment                               # Optional. Default: Deployment
    name:          nebari-conda-store-worker  # Mandatory. Must be in the same namespace as the ScaledObject
  triggers:
  - type: postgresql
    metadata:
      query: "SELECT COUNT(*) FROM build WHERE status='BUILDING' OR status='QUEUED';"
      targetQueryValue: "0"
      activationTargetQueryValue: "1"
      host: "nebari-conda-store-postgresql"
      userName: "postgres"
      password: "{nebari-conda-store-postgresql}"
      port: "5432"
      dbName: "conda-store"
      sslmode: disable

I have also tried this:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: scaled-conda-worker
  namespace: dev
spec:
  scaleTargetRef:
    kind:          Deployment                               # Optional. Default: Deployment
    name:          nebari-conda-store-worker  # Mandatory. Must be in the same namespace as the ScaledObject
  triggers:
  - type: postgresql
    metadata:
      query: "SELECT COUNT(*) FROM build WHERE status='BUILDING' OR status='QUEUED';"
      targetQueryValue: "0"
      activationTargetQueryValue: "1"
      host: "nebari-conda-store-postgresql.dev.svc.cluster.local"
      passwordFromEnv: PG_PASSWORD
      userName: "postgres"
      port: "5432"
      dbName: "conda-store"
      sslmode: disable

pt247 commented 3 months ago

I am getting the following error:

2024-04-05T18:44:42Z    ERROR    Reconciler error    {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"scaled-conda-worker","namespace":"dev"}, "namespace": "dev", "name": "scaled-conda-worker", "reconcileID": "17f8e76e-7f9d-4e9e-90e4-77dde8a455d4", "error": "error establishing postgreSQL connection: failed to connect to `host=nebari-conda-store-postgresql.dev.svc.cluster.local user=postgres database=conda-store`: server error (FATAL: password authentication failed for user \"postgres\" (SQLSTATE 28P01))"}

viniciusdc commented 3 months ago

Uhm, this is strange behavior; I think something might be missing... I will try to reproduce this on my side as well.

pt247 commented 3 months ago

I have also tried TriggerAuthentication:

apiVersion: v1
kind: Secret
metadata:
  name: conda-pg-credentials
  namespace: dev
type: Opaque
data:
  PG_PASSWORD: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-trigger-auth-conda-secret
  namespace: dev
spec:
  secretTargetRef:
  - parameter: password
    name: conda-pg-credentials
    key: PG_PASSWORD
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: scaled-conda-worker
  namespace: dev
spec:
  scaleTargetRef:
    kind:          Deployment                 # Optional. Default: Deployment
    name:          nebari-conda-store-worker  # Mandatory. Must be in the same namespace as the ScaledObject
  triggers:
  - type: postgresql
    metadata:
      query: "SELECT COUNT(*) FROM build WHERE status='BUILDING' OR status='QUEUED';"
      targetQueryValue: "0"
      activationTargetQueryValue: "1"
      host: "nebari-conda-store-postgresql"
      userName: "postgres"
      port: "5432"
      dbName: "conda-store"
      sslmode: disable
    authenticationRef:
      name: keda-trigger-auth-conda-secret

pt247 commented 3 months ago

This worked:

It turns out that the secret values need to be base64-encoded.

apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: trigger-auth-postgres
  namespace: dev
spec:
  secretTargetRef:
  - parameter: password
    name: nebari-conda-store-postgresql
    key: postgresql-password
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: scaled-conda-worker
  namespace: dev
spec:
  scaleTargetRef:
    kind: Deployment
    name: nebari-conda-store-worker
  triggers:
  - type: postgresql
    metadata:
      query: "SELECT COUNT(*) FROM build WHERE status='BUILDING' OR status='QUEUED';"
      targetQueryValue: "1"
      host: "nebari-conda-store-postgresql"
      userName: "postgres"
      port: "5432"
      dbName: "conda-store"
      sslmode: disable
    authenticationRef:
      name: trigger-auth-postgres
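
To confirm KEDA picked this up and to watch the scaling react while environments build:

# KEDA creates a backing HPA for the ScaledObject; both should show up here.
kubectl get scaledobject,hpa -n dev
# Watch the worker deployment's replica count change as builds are queued.
kubectl get deployment nebari-conda-store-worker -n dev --watch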

pt247 commented 3 months ago

Performance improvements

We try to create 5 conda environments; to the fifth environment we add scikit-learn.

Current develop branch

Time: 5 minutes 11 seconds
Number of conda-store workers: 1

Default KEDA

Time: 4 minutes 29 seconds
Number of conda-store workers scaled to: 2

With minReplicaCount set to 1 (default is 0)

Time: 2 minutes 35 seconds
Number of conda-store workers scaled to: 2

With minReplicaCount set to 1 (default is 0) and a polling interval of 15 seconds (default is 30 seconds)

Time: 4 minutes 14 seconds

With pollingInterval: 5, minReplicaCount: 1, and tracking the BUILDING state as well

  minReplicaCount: 1   # Default: 0
  pollingInterval: 5   # Default: 30 seconds
  cooldownPeriod: 60

Time: 3 minutes 40 seconds
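
For reference, those tuning knobs are top-level fields in the ScaledObject spec, so the best-performing configuration corresponds roughly to this (built on the working spec above; maxReplicaCount is a placeholder):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: scaled-conda-worker
  namespace: dev
spec:
  minReplicaCount: 1    # default is 0; keeps one worker warm
  maxReplicaCount: 5    # placeholder upper bound
  pollingInterval: 5    # default is 30 seconds
  cooldownPeriod: 60    # default is 300 seconds
  scaleTargetRef:
    kind: Deployment
    name: nebari-conda-store-worker
  triggers:
  - type: postgresql
    metadata:
      query: "SELECT COUNT(*) FROM build WHERE status='BUILDING' OR status='QUEUED';"
      targetQueryValue: "1"
      host: "nebari-conda-store-postgresql"
      userName: "postgres"
      port: "5432"
      dbName: "conda-store"
      sslmode: disable
    authenticationRef:
      name: trigger-auth-postgres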

pt247 commented 1 month ago

We need a conda-store worker to be alive to start a Jupyter notebook, as it depends on the NFS share, so we cannot scale conda-store workers to zero. There is little cost benefit to making this change.

Additionally, as observed in the PR comment, we cannot scale the conda-store workers beyond the general node. I have closed the PR, and this ticket can stay in the backlog until we figure out a better way of scaling conda-store workers beyond the general node.