kubernetes / kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management
https://kops.sigs.k8s.io/
Apache License 2.0

Better output from kops rolling-update cluster command #14122

Open UncleEricB opened 2 years ago

UncleEricB commented 2 years ago

/kind feature

1. Describe IN DETAIL the feature/behavior/change you would like to see.

There are multiple reasons a k8s node can be in NeedsUpdate state. I want a more focused explanation of the trigger for nodes in an InstanceGroup being in NeedsUpdate state when kops rolling-update cluster is run, possibly at a verbosity around 4.

The reason for this request is that there are four possible triggers for a node being in a NeedsUpdate state, and the documentation doesn't clearly state how to check each of them. I guess "The instance was created with a specification that is older" refers to Launch Template versions? Maybe "The instance was detached" refers to a cordon Taint?

This will speed up debugging and improve uptime. It will also expand the pool of SREs capable of debugging as not everyone has the same level of kOps/k8s expertise.

2. Feel free to provide a design supporting your feature request.

Preferred Output

$ kops rolling-update cluster cactus-1-23.k8s.sproutsocial.com --state s3://infra-kops-state -v4 ~/sandbox/sprout_development_env/NeedsUpdateChecker
I0812 11:52:07.404391 4005 factory.go:68] state store s3://infra-kops-state
...snip...
I0812 11:52:10.825012 4005 aws_cloud.go:1551] Querying EC2 for all valid zones in region "us-east-1"
I0812 11:52:10.826233 4005 request_logger.go:45] AWS request: ec2/DescribeAvailabilityZones
I0812 11:52:11.322863 4005 aws_cloud.go:629] Listing all Autoscaling groups matching cluster tags
I0812 11:52:11.324043 4005 request_logger.go:45] AWS request: autoscaling/DescribeTags
I0812 11:52:11.841028 4005 request_logger.go:45] AWS request: autoscaling/DescribeAutoScalingGroups
I0812 11:52:12.022521 4005 aws_cloud.go:743] Launch Template Version Specified By ASG: $Latest
I0812 11:52:12.023747 4005 request_logger.go:45] AWS request: ec2/DescribeLaunchTemplates
I0812 11:52:12.141730 4005 aws_cloud.go:762] Launch Template Version used for compare: "3"
I0812 11:52:12.141732 4005 aws_cloud.go:764] InstanceGroup nodes-us-east-1a nodes Launch Template are behind!
I0812 11:52:14.051511 4005 aws_cloud.go:743] Launch Template Version Specified By ASG: $Latest
I0812 11:52:14.051654 4005 request_logger.go:45] AWS request: ec2/DescribeLaunchTemplates
I0812 11:52:14.178106 4005 aws_cloud.go:762] Launch Template Version used for compare: "4"
I0812 11:52:14.178108 4005 aws_cloud.go:765] InstanceGroup nodes-us-east-1b nodes have a Cordon Taint!
I0812 11:52:14.532158 4005 aws_cloud.go:743] Launch Template Version Specified By ASG: $Latest
I0812 11:52:14.532365 4005 request_logger.go:45] AWS request: ec2/DescribeLaunchTemplates
I0812 11:52:14.647179 4005 aws_cloud.go:762] Launch Template Version used for compare: "4"
I0812 11:52:14.647181 4005 aws_cloud.go:766] InstanceGroup nodes-us-east-1d nodes have needs-update annotation
...snip...

--or even--

NAME               STATUS       NEEDUPDATE  READY  MIN  TARGET  MAX  NODES  REASON
master-us-east-1a  Ready        0           1      1    1       1    1
master-us-east-1b  Ready        0           1      1    1       1    1
master-us-east-1d  Ready        0           1      1    1       1    1
nodes-us-east-1a   NeedsUpdate  2           0      2    2       2    2      Launch Template version
nodes-us-east-1b   NeedsUpdate  2           0      2    2       2    2      Cordon Taint
nodes-us-east-1d   NeedsUpdate  2           0      2    2       2    2      kops.k8s.io/needs-update
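For illustration, the table variant with a trailing REASON column could be rendered roughly as below. This is a minimal sketch using only the Go standard library (text/tabwriter); the groupRow type, renderTable, and sampleRows are hypothetical names made up for this example and are not kOps types or functions. The sample values are copied from the proposed output above.

```go
package main

import (
	"fmt"
	"strings"
	"text/tabwriter"
)

// groupRow is a hypothetical, simplified stand-in for the per-InstanceGroup
// data shown by the rolling-update summary. It is not a kOps type.
type groupRow struct {
	Name             string
	Status           string
	NeedUpdate       int
	Ready            int
	Min, Target, Max int
	Nodes            int
	Reason           string // empty when the group does not need an update
}

// renderTable formats the rows as an aligned table with a REASON column.
func renderTable(rows []groupRow) string {
	var sb strings.Builder
	// minwidth 0, tabwidth 8, padding 2, pad with spaces, no flags.
	w := tabwriter.NewWriter(&sb, 0, 8, 2, ' ', 0)
	fmt.Fprintln(w, "NAME\tSTATUS\tNEEDUPDATE\tREADY\tMIN\tTARGET\tMAX\tNODES\tREASON")
	for _, r := range rows {
		fmt.Fprintf(w, "%s\t%s\t%d\t%d\t%d\t%d\t%d\t%d\t%s\n",
			r.Name, r.Status, r.NeedUpdate, r.Ready, r.Min, r.Target, r.Max, r.Nodes, r.Reason)
	}
	w.Flush()
	return sb.String()
}

// sampleRows returns two rows taken from the proposed output in this issue.
func sampleRows() []groupRow {
	return []groupRow{
		{Name: "master-us-east-1a", Status: "Ready", NeedUpdate: 0, Ready: 1,
			Min: 1, Target: 1, Max: 1, Nodes: 1},
		{Name: "nodes-us-east-1a", Status: "NeedsUpdate", NeedUpdate: 2, Ready: 0,
			Min: 2, Target: 2, Max: 2, Nodes: 2, Reason: "Launch Template version"},
	}
}

func main() {
	fmt.Print(renderTable(sampleRows()))
}
```

Leaving Reason empty for up-to-date groups keeps the existing columns unchanged for the common case.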

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

olemarkus commented 1 year ago

I think these are good suggestions, but probably hard to prioritise for most of the maintainers. It should however be low-hanging fruit for new contributors.

johngmyers commented 1 year ago

The places that need this logging:

func (group *CloudInstanceGroup) AdjustNeedUpdate() {

func getCloudGroups(c GCECloud, cluster *kops.Cluster, instancegroups []*kops.InstanceGroup, warnUnmatched bool, nodes []v1.Node) (map[string]*cloudinstances.CloudInstanceGroup, error) {

func awsBuildCloudInstanceGroup(c AWSCloud, cluster *kops.Cluster, ig *kops.InstanceGroup, g *autoscaling.Group, nodeMap map[string]*v1.Node) (*cloudinstances.CloudInstanceGroup, error) {

and any place that assigns the value CloudInstanceStatusNeedsUpdate
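One way to cover every assignment site consistently might be to record a reason at the same moment the status is set. The sketch below is only an assumption about how that could look: the member type, the markNeedsUpdate helper, and the local status constants are hypothetical stand-ins, not the actual cloudinstances types, and the reason strings are the ones proposed earlier in this issue rather than confirmed kOps behavior.

```go
package main

import "fmt"

// Local stand-ins for the status values; the real constants live in kOps.
const (
	statusUpToDate    = "UpToDate"
	statusNeedsUpdate = "NeedsUpdate"
)

// member is a hypothetical, simplified stand-in for one instance in a group.
type member struct {
	ID     string
	Status string
	Reason string // why the instance needs an update; empty otherwise
}

// markNeedsUpdate sets the status and records the trigger in one place, so
// no call site can assign NeedsUpdate without also giving a reason.
func (m *member) markNeedsUpdate(reason string) {
	m.Status = statusNeedsUpdate
	m.Reason = reason
}

// summarize counts how many members need an update for each reason.
func summarize(members []*member) map[string]int {
	counts := map[string]int{}
	for _, m := range members {
		if m.Status == statusNeedsUpdate {
			counts[m.Reason]++
		}
	}
	return counts
}

func main() {
	// Instance IDs here are placeholders for illustration only.
	a := &member{ID: "instance-a", Status: statusUpToDate}
	b := &member{ID: "instance-b", Status: statusUpToDate}
	c := &member{ID: "instance-c", Status: statusUpToDate}

	a.markNeedsUpdate("Launch Template version")
	b.markNeedsUpdate("Launch Template version")

	for reason, n := range summarize([]*member{a, b, c}) {
		fmt.Printf("%d instance(s) need update: %s\n", n, reason)
	}
}
```

Funnelling the assignment through a single helper means the reason is available both for logging and for a summary column, whichever output style is chosen.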

ShivamTyagi12345 commented 1 year ago

/assign

I will take this issue, @olemarkus.

olemarkus commented 1 year ago

Thanks for that.

I suggest writing user-facing text directly to stdout rather than going through klog. The remaining klog lines could go through -v2.

ShivamTyagi12345 commented 1 year ago

@olemarkus I am having difficulty understanding what needs to be done to complete this task. Can you please break it down into steps?

olemarkus commented 1 year ago

The information that users should read should just be printed with fmt.Printf(). The things that are less useful should use e.g. klog.V(2).Infof().
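A rough sketch of that split, kept to the standard library so it runs on its own: the reporter type and its userf/debugf methods are hypothetical, with debugf standing in for klog.V(level).Infof() and the verbosity field standing in for the -v flag. In kOps itself the debug path would presumably stay on klog. The message texts are adapted from the proposed output earlier in this issue.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// reporter separates user-facing output from verbosity-gated detail.
type reporter struct {
	out       io.Writer // user-facing text; stdout in a real CLI
	debug     io.Writer // stand-in for klog's destination
	verbosity int       // stand-in for the -v flag
}

// userf always prints; this is the fmt.Printf() path.
func (r reporter) userf(format string, args ...interface{}) {
	fmt.Fprintf(r.out, format, args...)
}

// debugf prints only at or above the given level, like klog.V(level).Infof().
func (r reporter) debugf(level int, format string, args ...interface{}) {
	if r.verbosity >= level {
		fmt.Fprintf(r.debug, format, args...)
	}
}

// explain reports why a group needs an update, plus optional detail.
func explain(r reporter, group, reason, detail string) {
	r.userf("InstanceGroup %s needs update: %s\n", group, reason)
	r.debugf(2, "%s: %s\n", group, detail)
}

// runExample captures both streams so the behavior is easy to inspect.
func runExample(verbosity int) (userOut, debugOut string) {
	var out, dbg strings.Builder
	r := reporter{out: &out, debug: &dbg, verbosity: verbosity}
	explain(r, "nodes-us-east-1a", "Launch Template version",
		`Launch Template Version used for compare: "3"`)
	return out.String(), dbg.String()
}

func main() {
	r := reporter{out: os.Stdout, debug: os.Stderr, verbosity: 2}
	explain(r, "nodes-us-east-1a", "Launch Template version",
		`Launch Template Version used for compare: "3"`)
}
```

With verbosity 0 only the one-line reason is shown; raising it to 2 adds the supporting detail without changing what users see on stdout.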

vaibhav2107 commented 1 year ago

/remove-lifecycle stale