kubernetes / kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management
https://kops.sigs.k8s.io/
Apache License 2.0

nodeup.sh configured incorrect executable for ExecStart value with --install-systemd-unit #15257

Closed dreemoutloud closed 1 year ago

dreemoutloud commented 1 year ago

/kind bug

1. What kops version are you running? The command kops version, will display this information. Client version: 1.25.3 (git-v1.25.3)

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

bin/kubectl version --short

Client Version: v1.25.0
Kustomize Version: v4.5.7
Server Version: v1.25.4

3. What cloud provider are you using? AWS

4. What commands did you run? What is the simplest way to reproduce this issue? Created a new instance group using the latest AMI. This AMI happens to have a monitoring solution (Dynatrace) installed in /opt, running as a systemd service. nodeup.sh runs as part of the EC2 user data (accessible on the server as /var/lib/cloud/instance/scripts/nodeup.sh) and makes this call:

echo "Running nodeup"
  # We can't run in the foreground because of https://github.com/docker/docker/issues/23793
  ( cd ${INSTALL_DIR}/bin; ./nodeup --install-systemd-unit --conf=${INSTALL_DIR}/conf/kube_env.yaml --v=8; )

After this, the created systemd service is called to start nodeup and thus kops. However, it fails.

5. What happened after the commands executed? The startup fails with the following message:

Mar 20 08:44:03 ip-10-51-43-72 cloud-init[1345]: I0320 08:44:03.007226    1370 executor.go:186] Executing task "Service/kops-configuration.service": Service: kops-configuration.service
Mar 20 08:44:03 ip-10-51-43-72 cloud-init[1345]: I0320 08:44:03.007374    1370 changes.go:81] Field changed "Definition" actual="<nil>" expected="[Unit]\nDescription=Run kOps bootstrap (nodeup)\nDocumentation=https://github.com/kubernetes/kops\n\n[Service]\nEnvironmentFile=/etc/sysconfig/kops-configuration\nEnvironmentFile=/etc/environment\nExecStart=/opt/dynatrace/oneagent/agent/lib64/oneagentdynamizer --conf=/opt/kops/conf/kube_env.yaml --v=8\nType=oneshot\n\n[Install]\nWantedBy=multi-user.target\n"
Mar 20 08:44:03 ip-10-51-43-72 cloud-init[1345]: I0320 08:44:03.007432    1370 changes.go:81] Field changed "Running" actual="false" expected="true"
Mar 20 08:44:03 ip-10-51-43-72 cloud-init[1345]: I0320 08:44:03.007460    1370 changes.go:81] Field changed "Enabled" actual="<nil>" expected="true"
Mar 20 08:44:03 ip-10-51-43-72 cloud-init[1345]: I0320 08:44:03.007487    1370 changes.go:81] Field changed "ManageState" actual="<nil>" expected="true"
Mar 20 08:44:03 ip-10-51-43-72 cloud-init[1345]: I0320 08:44:03.007513    1370 changes.go:81] Field changed "SmartRestart" actual="<nil>" expected="true"
Mar 20 08:44:03 ip-10-51-43-72 cloud-init[1345]: I0320 08:44:03.007642    1370 files.go:57] Writing file "/lib/systemd/system/kops-configuration.service"
Mar 20 08:44:03 ip-10-51-43-72 cloud-init[1345]: I0320 08:44:03.007765    1370 files.go:113] Changing file mode for "/lib/systemd/system/kops-configuration.service" to -rw-r--r--
Mar 20 08:44:03 ip-10-51-43-72 cloud-init[1345]: I0320 08:44:03.007805    1370 service.go:287] Reloading systemd configuration
Mar 20 08:44:03 ip-10-51-43-72 systemd[1]: Reloading.
Mar 20 08:44:03 ip-10-51-43-72 cloud-init[1345]: I0320 08:44:03.265028    1370 service.go:350] Restarting service "kops-configuration.service"
Mar 20 08:44:03 ip-10-51-43-72 systemd[1]: Starting Run kOps bootstrap (nodeup)...
Mar 20 08:44:03 ip-10-51-43-72 systemd[1]: kops-configuration.service: Main process exited, code=exited, status=64/USAGE
Mar 20 08:44:03 ip-10-51-43-72 oneagentdynamizer[1422]: OneAgentDynamizer - file not found: '--conf=/opt/kops/conf/kube_env.yaml'
Mar 20 08:44:03 ip-10-51-43-72 systemd[1]: kops-configuration.service: Failed with result 'exit-code'.
Mar 20 08:44:03 ip-10-51-43-72 systemd[1]: Failed to start Run kOps bootstrap (nodeup).

Examining the created system files:

# cat /lib/systemd/system/kops-configuration.service 
[Unit]
Description=Run kOps bootstrap (nodeup)
Documentation=https://github.com/kubernetes/kops

[Service]
EnvironmentFile=/etc/sysconfig/kops-configuration
EnvironmentFile=/etc/environment
ExecStart=/opt/dynatrace/oneagent/agent/lib64/oneagentdynamizer --conf=/opt/kops/conf/kube_env.yaml --v=8
Type=oneshot

[Install]
WantedBy=multi-user.target

cat /etc/sysconfig/kops-configuration

AWS_REGION=eu-west-1

So the --install-systemd-unit flag is creating files that specify "/opt/dynatrace/oneagent/agent/lib64/oneagentdynamizer" in the unit file rather than "/opt/kops/bin/nodeup".
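A quick way to confirm this kind of mismatch on a node is to read the ExecStart line out of the unit file and compare it with the expected binary. The following is a minimal sketch, not part of kOps: the function name `exec_start_binary` and the trimmed `UNIT` sample are assumptions made for illustration, and the parsing only handles the simple single-line form shown in this report.

```python
from typing import Optional

# Trimmed copy of the unit file from this report (illustrative sample only).
UNIT = """[Service]
EnvironmentFile=/etc/environment
ExecStart=/opt/dynatrace/oneagent/agent/lib64/oneagentdynamizer --conf=/opt/kops/conf/kube_env.yaml --v=8
Type=oneshot
"""


def exec_start_binary(unit_text: str) -> Optional[str]:
    """Return the executable named on the first ExecStart= line, or None."""
    for line in unit_text.splitlines():
        line = line.strip()
        if line.startswith("ExecStart="):
            command = line[len("ExecStart="):].strip()
            # systemd allows prefix characters such as "-" or "@" before the path.
            command = command.lstrip("-@:+!")
            return command.split()[0] if command else None
    return None


if __name__ == "__main__":
    # Compare what the unit would run with what was expected.
    print(exec_start_binary(UNIT) == "/opt/kops/bin/nodeup")  # prints False
```

On a real node the text would come from /lib/systemd/system/kops-configuration.service rather than an inline string.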

6. What did you expect to happen? The kops-configuration service should start kubelet and the node should join the cluster; the file /lib/systemd/system/kops-configuration.service should specify ExecStart=/opt/kops/bin/nodeup

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

kops get --name $name --state $state -o yaml
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: null
  name: sit.MyEnvironment.MyCompany.co.uk
spec:
  api:
    loadBalancer:
      class: Network
      securityGroupOverride: sg-REDACTED
      subnets:
      - name: eu-west-1a
      - name: eu-west-1b
      - name: eu-west-1c
      type: Internal
  authorization:
    rbac: {}
  awsLoadBalancerController:
    enabled: true
  certManager:
    enabled: true
  channel: stable
  cloudProvider: aws
  clusterAutoscaler:
    cpuRequest: 100m
    enabled: true
    expander: least-waste
  configBase: s3://MyCompany-sit-kops-state/sit.MyEnvironment.MyCompany.co.uk
  containerRuntime: containerd
  containerd: {}
  docker:
    insecureRegistry: docker-MyInstance.MyEnvironment.MyCompany.co.uk:20021
    logDriver: ""
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: t1-AWSAZa-MyEnvironment-k8mast
      name: a
    - encryptedVolume: true
      instanceGroup: t1-AWSAZb-MyEnvironment-k8mast
      name: b
    - encryptedVolume: true
      instanceGroup: t1-AWSAZc-MyEnvironment-k8mast
      name: c
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: t1-AWSAZa-MyEnvironment-k8mast
      name: a
    - encryptedVolume: true
      instanceGroup: t1-AWSAZb-MyEnvironment-k8mast
      name: b
    - encryptedVolume: true
      instanceGroup: t1-AWSAZc-MyEnvironment-k8mast
      name: c
    memoryRequest: 100Mi
    name: events
  externalPolicies:
    master:
    - arn:aws:iam::AWSAccountID:policy/MyEnvironment-nodes
    - arn:aws:iam::AWSAccountID:policy/MyEnvironment-node-policy
    - arn:aws:iam::AWSAccountID:policy/node-alb-policy
    node:
    - arn:aws:iam::AWSAccountID:policy/MyEnvironment-node-policy
    - arn:aws:iam::AWSAccountID:policy/MyEnvironment-nodes
    - arn:aws:iam::aws:policy/AutoScalingFullAccess
    - arn:aws:iam::AWSAccountID:policy/node-alb-policy
    - arn:aws:iam::aws:policy/AmazonS3FullAccess
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    enableAdmissionPlugins:
    - PodNodeSelector
    - NamespaceLifecycle
    - LimitRanger
    - ServiceAccount
    - DefaultStorageClass
    - DefaultTolerationSeconds
    - MutatingAdmissionWebhook
    - ValidatingAdmissionWebhook
    - NodeRestriction
    - ResourceQuota
  kubeDNS:
    provider: CoreDNS
  kubeProxy:
    enabled: true
    metricsBindAddress: 0.0.0.0
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  - ::/0
  kubernetesVersion: 1.25.4
  masterPublicName: api.MyEnvironment.MyCompany.co.uk
  metricsServer:
    enabled: true
    insecure: true
  networkCIDR: CIDRrange/21
  networkID: vpc-REDACTED
  networking:
    calico:
      typhaReplicas: 3
  nodeTerminationHandler:
    enableSQSTerminationDraining: true
    enabled: true
    managedASGTag: aws-node-termination-handler/managed
  nonMasqueradeCIDR: REDACTED/10
  sshAccess:
  - CIDRrange
  - CIDRrange
  - CIDRrange
  - CIDRrange
  subnets:
  - cidr: CIDRrange
    id: subnet-REDACTED
    name: eu-west-1a
    type: Private
    zone: eu-west-1a
  - cidr: CIDRrange
    id: subnet-REDACTED
    name: eu-west-1b
    type: Private
    zone: eu-west-1b
  - cidr: CIDRrange
    id: subnet-REDACTED
    name: eu-west-1c
    type: Private
    zone: eu-west-1c
  topology:
    dns:
      type: Private
    masters: private
    nodes: private

---

(Some other Instance group definitions)

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: null
  labels:
    kops.k8s.io/cluster: sit.MyEnvironment.MyCompany.co.uk
  name: sit.MyEnvironment.MyCompany.co.uk
spec:
  additionalUserData:
  - content: |
      REDACTED
  cloudLabels:
    EnvironmentType: 
    HostingEnv: 
    OS Version: Ubuntu20 (focal)
    Product: 
    Software Product: 
    Technical Services: 
    dedicated: amitest
    epaas_class: k8
    epassVar: 
    k8s.io/cluster-autoscaler/enabled: ""
    k8s.io/cluster-autoscaler/sit.MyCompany: ""
  image: REDACTED
  instanceMetadata:
    httpPutResponseHopLimit: 3
    httpTokens: optional
  machineType: t3.medium
  maxSize: 2
  minSize: 1
  nodeLabels:
    dedicated: amitest
    kops.k8s.io/instancegroup: sit.MyEnvironment.MyCompany.co.uk
  role: Node
  securityGroupOverride: sg-REDACTED
  subnets:
  - eu-west-1a
  - eu-west-1b
  taints:
  - dedicated=amitest:NoSchedule

8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else do we need to know? Yes. If you rename the file /opt/dynatrace/oneagent/agent/lib64/oneagentdynamizer to /opt/dynatrace/oneagent/agent/lib64/oneagentdynamizer.hide and rerun the nodeup command, it works... We have run fsck to confirm there are no inode or filesystem corruptions.

hakman commented 1 year ago

kOps uses /proc/self/exe to find the running binary path, which is pretty much what Go also uses for os.Executable. It seems that Dynatrace intercepts exec calls and runs everything through its agent instead, which changes the binary path.
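To illustrate the mechanism, here is a minimal Python sketch (not kOps code; the function name `running_binary_path` is made up for this example). On Linux, /proc/self/exe is a symlink to whatever binary the kernel actually loaded for the process, which is not necessarily the name the program was invoked with. Run as a script, this reports the Python interpreter rather than the script path, in the same way that a wrapped process would report the wrapper.

```python
import os
import sys


def running_binary_path() -> str:
    """Return the path the kernel reports for the current process's executable."""
    try:
        # On Linux, /proc/self/exe points at the binary the kernel loaded.
        return os.readlink("/proc/self/exe")
    except OSError:
        # Fallback for platforms without procfs (an assumption for portability).
        return os.path.realpath(sys.executable)


if __name__ == "__main__":
    # The two values can differ: argv[0] is what the caller typed,
    # /proc/self/exe is what is actually executing.
    print("argv[0]:       ", sys.argv[0])
    print("/proc/self/exe:", running_binary_path())
```

Any tool that writes this value into a unit file will record the interposed binary if something else ends up being the loaded executable.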

dreemoutloud commented 1 year ago

That sounds plausible. Does this mean the two programs are therefore simply incompatible? Is there a fix or a workaround?

dreemoutloud commented 1 year ago

Also, has this situation changed since v1.22? We did not experience this issue on that version. (Also, thank you.)

hakman commented 1 year ago

In theory, this could be fixed by passing the binary path as an argument when installing the kops-configuration.service. For this to happen, it would have to be agreed by the other maintainers. As for workarounds, I don't see anything on the kOps side. Maybe there's something in Dynatrace that can disable it for specific binaries or services. Can't say that I'm a fan of their approach here.
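As a rough sketch of that idea (hypothetical, not the kOps implementation, and no such option is confirmed in this thread), the unit text could be built from a path supplied by the caller instead of one discovered at runtime. The names `UNIT_TEMPLATE` and `render_unit` are assumptions for illustration; the unit contents are copied from the file shown in the report.

```python
from typing import Sequence

# Unit layout copied from the report; only ExecStart is filled in.
UNIT_TEMPLATE = """[Unit]
Description=Run kOps bootstrap (nodeup)
Documentation=https://github.com/kubernetes/kops

[Service]
EnvironmentFile=/etc/sysconfig/kops-configuration
EnvironmentFile=/etc/environment
ExecStart={exec_start}
Type=oneshot

[Install]
WantedBy=multi-user.target
"""


def render_unit(binary_path: str, args: Sequence[str]) -> str:
    """Build the unit text from an explicitly supplied binary path."""
    if not binary_path.startswith("/"):
        raise ValueError("ExecStart needs an absolute path")
    exec_start = " ".join([binary_path, *args])
    return UNIT_TEMPLATE.format(exec_start=exec_start)


if __name__ == "__main__":
    print(render_unit("/opt/kops/bin/nodeup",
                      ["--conf=/opt/kops/conf/kube_env.yaml", "--v=8"]))
```

Because the path comes from the caller, an agent that changes what /proc/self/exe reports would no longer affect the generated ExecStart line.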

dreemoutloud commented 1 year ago

Good shout. In tandem with this, I have already contacted Dynatrace support, and they have highlighted the following: https://www.dynatrace.com/support/help/technology-support/application-software/go/support/go-known-limitations#side-effects At present I think it best that we pursue this solution (configuring a custom exclusion in Dynatrace), and I will update you on the results, so there is no need at present to code a binary path argument. Will keep you in the loop.

johngmyers commented 1 year ago

Per Office Hours, we don't have a problem with passing this explicitly to the unit file as long as we have the information.

dreemoutloud commented 1 year ago

Dynatrace have confirmed this is due to their Static Go application monitoring. https://www.dynatrace.com/support/help/shortlink/go-known-limitations#side-effects We have turned this off and the issue is gone. Thank you for your time anyway.