kubecost / cluster-turndown

Automated turndown of Kubernetes clusters on specific schedules.
Apache License 2.0

This feature is buggy and ultimately doesn't work at all. #77

Open mmclane opened 3 months ago

mmclane commented 3 months ago

I may have opened this issue in the wrong place. https://github.com/opencost/opencost/issues/2752

When I try to create a turndown schedule via the UI, it fails. I just tried to create one: it's currently 10:13, and I selected a start time of 10:15 with an end time of 11:00. I get no error when I click Apply, but the schedule never shows up. If I run kubectl get tds, I see the state is ScheduleFailed. If I describe it, however, it says it successfully scheduled the turndown, with no indication of why it failed.
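For reference, this is roughly how I am inspecting the schedule state; tds is the short name registered by the CRD, and the deployment name in the last command is an assumption based on a default Helm install into the kubecost namespace:

# List turndown schedules and their state
kubectl get tds
# Describe one for details (substitute the actual schedule name)
kubectl describe tds <schedule-name>
# The controller logs may explain the failure better than the status does;
# the deployment name here is assumed from a default install
kubectl logs -n kubecost deployment/kubecost-cluster-controller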

If I delete it and create a new one via the following manifest file, it is scheduled successfully.

apiVersion: kubecost.com/v1alpha1
kind: TurndownSchedule
metadata:
  name: test-turndown
  namespace: kubecost
  finalizers:
  - "finalizer.kubecost.com"
spec:
  start: 2024-05-23T14:25:00Z
  end: 2024-05-23T15:15:00Z
  repeat: none
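Applying and verifying the manifest looks roughly like this; the file name is arbitrary, and the jsonpath query assumes the status layout shown in the describe output below:

# Apply the schedule and confirm the controller accepted it
kubectl apply -f test-turndown.yaml
# Should print ScheduleSuccess once the controller has processed it
kubectl get tds test-turndown -o jsonpath='{.status.state}'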

The schedule shows up in the UI, but with the wrong configuration: the UI says Repeat: Daily, even though the manifest clearly sets repeat: none. If I describe the schedule, it shows the following and correctly says Repeat: none.

Name:         test-turndown
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  kubecost.com/v1alpha1
Kind:         TurndownSchedule
Metadata:
  Creation Timestamp:  2024-05-23T14:23:25Z
  Finalizers:
    finalizer.kubecost.com
  Generation:        1
  Resource Version:  88587297
  UID:               aefdf0fc-e27c-41b5-93c1-2d14b41ce440
Spec:
  End:     2024-05-23T15:15:00Z
  Repeat:  none
  Start:   2024-05-23T14:25:00Z
Status:
  Current:               scaledown
  Last Updated:          2024-05-23T14:23:25Z
  Next Scale Down Time:  2024-05-23T14:25:00Z
  Next Scale Up Time:    2024-05-23T15:15:00Z
  Scale Down Id:         a59fe5ec-53c9-4f63-aee4-abad3903a126
  Scale Down Metadata:
    Repeat:     none
    Type:       scaledown
  Scale Up ID:  cfbd3922-5f2a-43cd-a46c-dd9497fd02ce
  Scale Up Metadata:
    Repeat:  none
    Type:    scaleup
  State:     ScheduleSuccess
Events:
  Type    Reason                   Age   From                          Message
  ----    ------                   ----  ----                          -------
  Normal  ScheduleTurndownSuccess  116s  turndown-schedule-controller  Successfully scheduled turndown

The schedule no longer shows in the UI once it starts.

Additionally, once the schedule starts, it successfully creates the cluster-turndown node group and the new node gets added to the cluster. The cluster-controller pod moves over to the new node, but that is as far as it gets. Looking at the logs on the cluster-controller, I see errors and restarts.

2024-05-15T18:25:54Z INF Determined to be running in a cluster. Using in-cluster K8s config.
2024-05-15T18:30:01Z ERR Kubescaler setup failed error="creating a Kubescaler: recommendation service unavailable: failed to execute request: Get \"http://kubecost-cost-analyzer.kubecost:9090/model/savings/requestSizingV2\": context deadline exceeded"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x23f31fe]

goroutine 1 [running]:
main.main()
    /app/cmd/clustercontroller/main.go:237 +0x53e
2024-05-15T18:30:43Z INF Determined to be running in a cluster. Using in-cluster K8s config.
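The panic appears to follow directly from the context-deadline-exceeded error above, so a reasonable sanity check is whether the cost-analyzer service is reachable from inside the cluster at all. A rough probe, using a throwaway pod and the exact URL from the error:

# One-off pod that hits the recommendation endpoint the controller times out on
kubectl run tmp-curl --rm -it --restart=Never \
  --image=curlimages/curl --command -- \
  curl -m 5 "http://kubecost-cost-analyzer.kubecost:9090/model/savings/requestSizingV2"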

I am running this on an EKS cluster.

To Reproduce
Steps to reproduce the behavior:

  1. Install Kubecost
  2. Configure the cluster-controller-service-key secret as described in the documentation
  3. Enable cluster-controller (see the Helm sketch after this list)
  4. Create a TurndownSchedule from a YAML manifest
  5. Wait for the start time to arrive
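For step 3, enabling the controller in my case was just a Helm flag; this sketch assumes the chart still exposes it under clusterController.enabled:

# Enable the cluster controller on an existing Kubecost install
helm upgrade --install kubecost kubecost/cost-analyzer \
  --namespace kubecost \
  --set clusterController.enabled=true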

Expected behavior
The cluster should be turned down.

Which version of OpenCost are you using?
Helm Chart v2.2.4

kwombach12 commented 3 months ago

@mmclane Thanks for flagging this! I am going to spend some time today trying to identify the source of this issue!

mmclane commented 3 months ago

> @mmclane Thanks for flagging this! I am going to spend some time today trying to identify the source of this issue!

Let me know how I might be able to help.