Closed — itninja-hue closed this issue 3 years ago
@itninja-hue Thanks for the report! It looks like the service account isn't actually loading correctly.
[eksclusterprovider.go:91] [Error] Failed to load service account.
However, the validator shouldn't proceed past that error, which is certainly a bug on our end. To resolve the service key failure, ensure that the cluster-turndown-service-key
secret exists in the turndown namespace:
kubectl get secret cluster-turndown-service-key -n turndown
For more information, setup instructions are here: https://github.com/kubecost/cluster-turndown#eks--aws-kops-setup
The secret does exist in the turndown namespace. Can you add more verbosity to the validation logs, for example "service account can't be found", "can't connect", or "missing privileges"? That would be very helpful, :D
BTW: if the secret cluster-turndown-service-key is not found, the pod will enter a CrashLoopBackOff, since the secret is mounted inside the pod as a volume.
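One possible mitigation (an assumption on my part, not something the deployment manifest necessarily does) would be to mark the secret volume as optional, so the kubelet can start the pod even when the secret is absent and validation can report the missing credentials instead of the pod crash-looping. A minimal sketch of the relevant fragment, with a hypothetical volume name:

```yaml
# Hypothetical Deployment fragment: the volume name "turndown-keys" is
# illustrative, not taken from the actual manifest.
volumes:
  - name: turndown-keys
    secret:
      secretName: cluster-turndown-service-key
      optional: true   # pod starts even if the secret is missing;
                       # validation reports the credential error instead
```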
@mbolt35 I fixed that error; I had some problems with kube-seal hashing, and have now moved on to this:
clusterprovider.go:84] Found ProviderID starting with "aws" and eks nodegroup, using EKS Provider
eksclusterprovider.go:106] [Error] AccessDeniedException: status code: 403, request id: ...
provider.go:48] Found ProviderID starting with "aws" and eks nodegroup, using EKS Provider
validator.go:39] Validating Provider
panic: runtime error: invalid memory address or nil pointer dereference
It jumps again to the panic error. Maybe you should also note that when you're fixing the other bug.
Yeah, the validation was somewhat older/legacy code that just didn't quite stand up to its more modern counterparts.
Another issue that was filed today is related to these changes as well: https://github.com/kubecost/cluster-turndown/issues/27 -- It looks like we need to support the AWS authentication chain rather than directly requiring the env vars to be set. It will still require that a secret exist like you mentioned, but it will proceed even if the secret is empty - we'll let validation communicate the errors, if any. Thanks again for your feedback!
@itninja-hue Let me verify that we've correctly documented the required permissions for EKS. Given that we're using another API, it may require EKS-specific permissions rather than just AutoScaling. I'll investigate more and let you know. The EKS cluster provider works a bit differently than the other providers (initialization makes connections between EKS node groups and autoscaling node groups), and the validation doesn't like the EKS provider being nil.
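The nil-provider panic could be guarded along these lines. This is a simplified, hypothetical sketch: `Provider`, `eksProvider`, and `validateProvider` here are stand-ins, not the actual cluster-turndown types. The subtlety it demonstrates is a nil concrete pointer hiding inside a non-nil interface value, which is exactly the shape of the reported SIGSEGV:

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// Provider is a stand-in for the turndown cluster provider interface.
type Provider interface {
	GetNodePools() ([]string, error)
}

// eksProvider is a hypothetical, simplified provider whose client failed
// to initialize (e.g. the service account could not be loaded).
type eksProvider struct {
	client *struct{}
}

func (p *eksProvider) GetNodePools() ([]string, error) {
	if p == nil || p.client == nil {
		return nil, errors.New("EKS provider not initialized (service account load failed?)")
	}
	return []string{"ng-1"}, nil
}

// validateProvider guards against both a nil interface and a nil concrete
// pointer wrapped in a non-nil interface, returning an error instead of
// letting a method call dereference nil.
func validateProvider(p Provider) error {
	if p == nil {
		return errors.New("no provider configured")
	}
	if v := reflect.ValueOf(p); v.Kind() == reflect.Ptr && v.IsNil() {
		return errors.New("provider is a nil pointer; initialization likely failed")
	}
	_, err := p.GetNodePools()
	return err
}

func main() {
	var p *eksProvider // nil pointer, as when initialization fails
	fmt.Println(validateProvider(p))
}
```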
With regards to what we'll need to address:
What do you think?
Yes, I think that will be perfect :D. If your algorithm requires talking to EKS master nodes, it will require more privileges, though; I think that would be, if not set up right, kind of dangerous.
The following pull request should address the nil panic: https://github.com/kubecost/cluster-turndown/pull/29
This change should provide better error reporting on initialization failure. The code no longer fails immediately when a secret isn't present (the validation step should catch auth issues).
The new image for this branch is gcr.io/kubecost1/cluster-turndown:1.3-SNAPSHOT, if you want to give it a shot.
In addition, I looked into the EKS specific permission sets, which apparently require more policy updates.
I have not had time to test, but it looks like, in addition to the AutoScalingFullAccess policy (EKS node groups can only resize down to 1, so we still require ASG access to resize the node groups to 0), the EKS-specific "allow all" policy would look like the following:
{
"Effect": "Allow",
"Action": "eks:*",
"Resource": "*"
},
After looking through the code, the specific calls used in the implementation appear to be scoped to the following:
{
"Effect": "Allow",
"Action": [
"eks:ListClusters",
"eks:DescribeCluster",
"eks:DescribeNodegroup",
"eks:ListNodegroups",
"eks:CreateNodegroup",
"eks:UpdateClusterConfig",
"eks:UpdateNodegroupConfig",
"eks:DeleteNodegroup",
"eks:ListTagsForResource",
"eks:TagResource",
"eks:UntagResource"
],
"Resource": "*"
}
The documentation for those is here: https://docs.aws.amazon.com/IAM/latest/UserGuide/list_amazonelasticcontainerserviceforkubernetes.html#amazonelasticcontainerserviceforkubernetes-cluster
I am going to test it and get back with feedback, but before I do: does your algorithm preserve node labels? I am defining and managing them at the node group level in the Terraform provider. If not, none of my pods will schedule.
Hey @itninja-hue! Apologies, I missed your last question. When we perform scaledown, we actually resize the node group to 0 instead of deleting. In other words, any tags that exist on those node groups (that are configured to propagate to node labels) will continue to show up on the nodes created by that node group on scale up.
Let me know if this doesn't answer your question accurately!
Fix merged
LOGS:
clusterprovider.go:84] Found ProviderID starting with "aws" and eks nodegroup, using EKS Provider
eksclusterprovider.go:91] [Error] Failed to load service account.
provider.go:48] Found ProviderID starting with "aws" and eks nodegroup, using EKS Provider
validator.go:39] Validating Provider
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x48 pc=0x16df68a]
goroutine 39 [running]:
github.com/kubecost/cluster-turndown/pkg/turndown/provider.(*EKSProvider).GetNodePools(0xc0004f54a0, 0x0, 0x0, 0x0, 0x0, 0x0)
    /app/pkg/turndown/provider/eksprovider.go:52 +0x2a
github.com/kubecost/cluster-turndown/pkg/turndown/provider.validateProvider(0x1fa4480, 0xc0004f54a0, 0x5, 0xc0004f6cc0)
    /app/pkg/turndown/provider/validator.go:15 +0x72
created by github.com/kubecost/cluster-turndown/pkg/turndown/provider.Validate
    /app/pkg/turndown/provider/validator.go:42 +0xbd
Using: https://github.com/kubecost/cluster-turndown/releases/latest/download/cluster-turndown-full.yaml
cluster: EKS
Theory: maybe this function is returning nil: https://github.com/kubecost/cluster-turndown/blob/fcabe35abe4f2ecea66bf90f4c46bdcece85057e/pkg/cluster/provider/awsclusterprovider.go#L479