kubecost / cluster-turndown

Automated turndown of Kubernetes clusters on specific schedules.
Apache License 2.0

panic: runtime error: invalid memory address or nil pointer dereference #28

Closed: itninja-hue closed this issue 3 years ago

itninja-hue commented 4 years ago

LOGS:

clusterprovider.go:84] Found ProviderID starting with "aws" and eks nodegroup, using EKS Provider
eksclusterprovider.go:91] [Error] Failed to load service account.
provider.go:48] Found ProviderID starting with "aws" and eks nodegroup, using EKS Provider
validator.go:39] Validating Provider
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x48 pc=0x16df68a]
goroutine 39 [running]:
github.com/kubecost/cluster-turndown/pkg/turndown/provider.(*EKSProvider).GetNodePools(0xc0004f54a0, 0x0, 0x0, 0x0, 0x0, 0x0)
    /app/pkg/turndown/provider/eksprovider.go:52 +0x2a
github.com/kubecost/cluster-turndown/pkg/turndown/provider.validateProvider(0x1fa4480, 0xc0004f54a0, 0x5, 0xc0004f6cc0)
    /app/pkg/turndown/provider/validator.go:15 +0x72
created by github.com/kubecost/cluster-turndown/pkg/turndown/provider.Validate
    /app/pkg/turndown/provider/validator.go:42 +0xbd

Using: https://github.com/kubecost/cluster-turndown/releases/latest/download/cluster-turndown-full.yaml

Cluster: EKS

Theory: maybe this function is returning nil: https://github.com/kubecost/cluster-turndown/blob/fcabe35abe4f2ecea66bf90f4c46bdcece85057e/pkg/cluster/provider/awsclusterprovider.go#L479

mbolt35 commented 4 years ago

@itninja-hue Thanks for the report! It looks like the service account isn't actually loading correctly.

[eksclusterprovider.go:91] [Error] Failed to load service account.

However, the validator shouldn't proceed past that error, which is certainly a bug on our end. To resolve the service key failure, ensure that the cluster-turndown-service-key secret exists in the turndown namespace:

kubectl get secret cluster-turndown-service-key -n turndown

For more information, the following has instructions: https://github.com/kubecost/cluster-turndown#eks--aws-kops-setup
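
If the secret is missing, it can be recreated from the service key file described in that guide. A minimal sketch (the file name service-key.json is an assumption here; use whatever key file the EKS/kops setup instructions produce for you):

# hypothetical key file name; substitute the file from the setup guide
kubectl create secret generic cluster-turndown-service-key \
    --from-file=service-key.json \
    -n turndown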

itninja-hue commented 4 years ago

The secret does exist in the turndown namespace (screenshot attached showing the secret). Can you add more verbosity to the validation logs, for example "service account can't be found", "can't connect", or "missing privileges"? That would be very helpful, :D

BTW: if the cluster-turndown-service-key secret is not found, the pod will enter a crash loop, since the secret is mounted inside the pod as a volume.
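
(For context: a pod that mounts a secret volume cannot start until that secret exists. One Kubernetes-level way to tolerate a missing secret, shown purely as a sketch and not necessarily what this project's shipped manifest does, is to mark the secret volume optional:)

# illustrative pod spec fragment, not the project's actual deployment
volumes:
  - name: service-key
    secret:
      secretName: cluster-turndown-service-key
      optional: true   # pod can start even if the secret is absent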

itninja-hue commented 4 years ago

@mbolt35 I fixed that error (I had some problems with kube-seal hashing). Now I've moved on to this:

clusterprovider.go:84] Found ProviderID starting with "aws" and eks nodegroup, using EKS Provider
eksclusterprovider.go:106] [Error] AccessDeniedException: status code: 403, request id: ...
provider.go:48] Found ProviderID starting with "aws" and eks nodegroup, using EKS Provider
validator.go:39] Validating Provider
panic: runtime error: invalid memory address or nil pointer dereference

It jumps to the panic again. Maybe you should also take note of that when you fix the other bug.

mbolt35 commented 4 years ago

Yeah, the validation was somewhat older/legacy code that just didn't quite stand up to its more modern counterparts.

Another issue that was filed today is related to these changes as well: https://github.com/kubecost/cluster-turndown/issues/27 -- It looks like we need to support the AWS authentication chain rather than directly requiring the env vars to be set. It will still require that a secret exist like you mentioned, but it will proceed even if the secret is empty - we'll let validation communicate the errors, if any. Thanks again for your feedback!
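
For readers following along, a rough sketch of what "supporting the AWS authentication chain" means with aws-sdk-go (illustrative only, not the actual patch; the region and the probe call are placeholders):

package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/eks"
)

func main() {
	// Creating a session without static credentials lets aws-sdk-go walk the
	// default credential chain: env vars, shared credentials file, then the
	// instance/pod IAM role.
	sess, err := session.NewSession(&aws.Config{Region: aws.String("us-east-1")})
	if err != nil {
		log.Fatalf("failed to create AWS session: %v", err)
	}

	// A lightweight call to confirm that whatever credentials the chain found
	// actually work before proceeding.
	client := eks.New(sess)
	if _, err := client.ListClusters(&eks.ListClustersInput{}); err != nil {
		log.Fatalf("credential/permission check failed: %v", err)
	}
	log.Println("AWS credentials resolved via the default chain")
}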

mbolt35 commented 4 years ago

@itninja-hue Let me verify that we've correctly documented the required permissions for EKS. Given that we're using another API, it may require EKS-specific permissions rather than just AutoScaling. I'll investigate more and let you know. The EKS cluster provider works a bit differently than the other providers (initialization makes connections between EKS node groups and autoscaling node groups), and the validation doesn't like the EKS provider being nil.
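
For illustration, this is the kind of guard the validation needs, so that an uninitialized provider is reported instead of dereferenced (hypothetical interface and names; the real types live under pkg/turndown/provider):

package main

import (
	"errors"
	"log"
)

// Provider is a hypothetical stand-in for the turndown provider interface.
type Provider interface {
	GetNodePools() ([]string, error)
}

// validate surfaces an error instead of panicking when the provider never
// finished initializing (for example, because the service account failed to load).
func validate(p Provider) error {
	if p == nil {
		return errors.New("provider was not initialized; check the service account / secret")
	}
	if _, err := p.GetNodePools(); err != nil {
		return err
	}
	return nil
}

func main() {
	if err := validate(nil); err != nil {
		log.Printf("[Error] validation failed: %v", err) // reported, not a SIGSEGV
	}
}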

With regards to what we'll need to address:

What do you think?

itninja-hue commented 4 years ago

Yes, I think that would be perfect :D. If your algorithm requires talking to EKS master nodes it will require more privileges, though; if those aren't set up right, I think that could be kind of dangerous.

mbolt35 commented 4 years ago

The following pull request should address the nil panic: https://github.com/kubecost/cluster-turndown/pull/29

This change should provide better error reporting on initialization failure. The code no longer fails immediately when a secret isn't present (the validation step should catch auth issues).

The new image for this branch is: gcr.io/kubecost1/cluster-turndown:1.3-SNAPSHOT if you want to give it a shot.

In addition, I looked into the EKS specific permission sets, which apparently require more policy updates.

I have not had time to test, but it looks like, in addition to the AutoScalingFullAccess policy (EKS node groups can only resize down to 1, so we still require ASG access to resize the node groups to 0), the EKS-specific policy for "all" would look like the following:

{
    "Effect": "Allow",
    "Action": "eks:*",
    "Resource": "*"
},

After looking through the code, the specific calls used in the implementation appear to be scoped to the following:

{
    "Effect": "Allow",
    "Action": [
        "eks:ListClusters",
        "eks:DescribeCluster",
        "eks:DescribeNodegroup",
        "eks:ListNodegroups",
        "eks:CreateNodegroup",
        "eks:UpdateClusterConfig",
        "eks:UpdateNodegroupConfig",
        "eks:DeleteNodegroup",
        "eks:ListTagsForResource",
        "eks:TagResource",
        "eks:UntagResource"
    ],
    "Resource": "*"
}

The documentation for those is here: https://docs.aws.amazon.com/IAM/latest/UserGuide/list_amazonelasticcontainerserviceforkubernetes.html#amazonelasticcontainerserviceforkubernetes-cluster
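
For anyone assembling this by hand: the scoped statement above sits inside a standard policy document, with AutoScalingFullAccess (or an equivalent ASG statement) attached alongside it. A sketch of the wrapped document:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "eks:ListClusters",
                "eks:DescribeCluster",
                "eks:DescribeNodegroup",
                "eks:ListNodegroups",
                "eks:CreateNodegroup",
                "eks:UpdateClusterConfig",
                "eks:UpdateNodegroupConfig",
                "eks:DeleteNodegroup",
                "eks:ListTagsForResource",
                "eks:TagResource",
                "eks:UntagResource"
            ],
            "Resource": "*"
        }
    ]
}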

itninja-hue commented 4 years ago

I'm going to test it and get back with feedback, but before I do: does your algorithm preserve node labels? I'm defining and managing them at the node group level via the Terraform provider. If not, none of my pods will schedule.

mbolt35 commented 4 years ago

Hey @itninja-hue! Apologies, I missed your last question. When we perform scale-down, we actually resize the node group to 0 instead of deleting it. In other words, any tags that exist on those node groups (and that are configured to propagate to node labels) will continue to show up on the nodes created by that node group on scale-up.
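
To make the mechanism concrete, resizing to zero via the underlying Auto Scaling group looks roughly like this (an illustrative CLI equivalent of what the controller does through the SDK; the group name is a placeholder):

# placeholder ASG name; the turndown controller does this via the AWS SDK
aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name my-eks-nodegroup-asg \
    --min-size 0 --desired-capacity 0

# the node group's configuration (including tags/labels) is untouched, so
# scaling back up restores nodes with the same labels
aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name my-eks-nodegroup-asg \
    --min-size 1 --desired-capacity 1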

Let me know if this doesn't answer your question accurately!

mbolt35 commented 3 years ago

Fix merged