kubernetes / kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management
https://kops.sigs.k8s.io/
Apache License 2.0

Hubble relay cannot reach agents #15625

Closed. zadjadr closed this issue 1 year ago.

zadjadr commented 1 year ago

/kind bug

1. What kops version are you running? The command kops version will display this information.

Client version: 1.26.4 (git-v1.26.4)

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

{
  "clientVersion": {
    "major": "1",
    "minor": "27",
    "gitVersion": "v1.27.3",
    "gitCommit": "25b4e43193bcda6c7328a6d147b1fb73a33f1598",
    "gitTreeState": "archive",
    "buildDate": "2023-06-15T16:18:27Z",
    "goVersion": "go1.20.5",
    "compiler": "gc",
    "platform": "linux/amd64"
  },
  "kustomizeVersion": "v5.0.1",
  "serverVersion": {
    "major": "1",
    "minor": "27",
    "gitVersion": "v1.27.3",
    "gitCommit": "25b4e43193bcda6c7328a6d147b1fb73a33f1598",
    "gitTreeState": "clean",
    "buildDate": "2023-06-14T09:47:40Z",
    "goVersion": "go1.20.5",
    "compiler": "gc",
    "platform": "linux/amd64"
  }
}

3. What cloud provider are you using?

openstack

4. What commands did you run? What is the simplest way to reproduce this issue?

I added Cilium as my CNI and enabled Hubble; here are the settings that might be of interest:

certManager:
  enabled: true
  managed: false
kubeProxy:
  enabled: false
networking:
  cilium:
    clusterName: zcluster.k8s.local
    enablePrometheusMetrics: true
    etcdManaged: true
    enableBPFMasquerade: true
    hubble:
      enabled: true
    enableEncryption: true
    enableL7Proxy: true
    encryptionType: ipsec
    enableNodePort: false
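
For reference, a change like this is usually rolled out with the standard kOps workflow; a minimal sketch, assuming the cluster name used here:

# Add the networking.cilium settings above to the cluster spec
kops edit cluster zcluster.k8s.local

# Preview and apply the change, then roll the nodes so the new CNI config takes effect
kops update cluster zcluster.k8s.local --yes
kops rolling-update cluster zcluster.k8s.local --yes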

5. What happened after the commands executed?

Before creating a new Certificate (see 9):

hubble-relay-556f694674-t62k2 hubble-relay level=warning msg="Failed to create gRPC client" address="10.150.35.80:4244" error="context deadline exceeded: connection error: desc = \"transport: authentication handshake failed: x509: certificate is valid for *.zcluster.k8s.local.hubble-grpc.cilium.io, not nodes-es1-xpgdr4.zcluster-k8s-local.hubble-grpc.cilium.io\"" hubble-tls=true next-try-in=1h30m0s peer=zcluster.k8s.local/nodes-es1-xpgdr4 subsys=hubble-relay

hubble-relay-556f694674-t62k2 hubble-relay level=warning msg="Failed to create gRPC client" address="10.150.33.93:4244" error="context deadline exceeded: connection error: desc = \"transport: authentication handshake failed: x509: certificate is valid for *.zcluster.k8s.local.hubble-grpc.cilium.io, not control-plane-es1-5h61eg.zcluster-k8s-local.hubble-grpc.cilium.io\"" hubble-tls=true next-try-in=1h30m0s peer=zcluster.k8s.local/control-plane-es1-5h61eg subsys=hubble-relay

hubble-relay-556f694674-t62k2 hubble-relay level=warning msg="Failed to create gRPC client" address="10.150.65.23:4244" error="context deadline exceeded: connection error: desc = \"transport: authentication handshake failed: x509: certificate is valid for *.zcluster.k8s.local.hubble-grpc.cilium.io, not control-plane-ix1-4s9cfv.zcluster-k8s-local.hubble-grpc.cilium.io\"" hubble-tls=true next-try-in=1h30m0s peer=zcluster.k8s.local/control-plane-ix1-4s9cfv subsys=hubble-relay

hubble-relay-556f694674-t62k2 hubble-relay level=warning msg="Failed to create gRPC client" address="10.150.64.47:4244" error="context deadline exceeded" hubble-tls=true next-try-in=1h30m0s peer=zcluster.k8s.local/nodes-ix1-rh0gdr subsys=hubble-relay

hubble-relay-556f694674-t62k2 hubble-relay level=warning msg="Failed to create gRPC client" address="10.150.96.142:4244" error="context deadline exceeded: connection error: desc = \"transport: authentication handshake failed: x509: certificate is valid for *.zcluster.k8s.local.hubble-grpc.cilium.io, not control-plane-ix2-uojd4q.zcluster-k8s-local.hubble-grpc.cilium.io\"" hubble-tls=true next-try-in=1h30m0s peer=zcluster.k8s.local/control-plane-ix2-uojd4q subsys=hubble-relay

hubble-relay-556f694674-t62k2 hubble-relay level=warning msg="Failed to create gRPC client" address="10.150.99.169:4244" error="context deadline exceeded" hubble-tls=true next-try-in=1h30m0s peer=zcluster.k8s.local/nodes-ix2-voavso subsys=hubble-relay
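
The SAN mismatch in these errors can be confirmed directly against an agent's Hubble peer port; a minimal sketch, assuming openssl is available on a host that can reach port 4244 (the IP is one of the agent addresses from the logs above):

# Show the subject alternative names of the certificate served on the hubble peer port
openssl s_client -connect 10.150.35.80:4244 -showcerts </dev/null 2>/dev/null \
  | openssl x509 -noout -text | grep -A1 "Subject Alternative Name"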

Before adding the security group (see 9):

hubble-relay-556f694674-rswjg hubble-relay level=info msg=Connecting address="10.150.64.47:4244" hubble-tls=true peer=zcluster.k8s.local/nodes-ix1-rh0gdr subsys=hubble-relay
hubble-relay-556f694674-rswjg hubble-relay level=warning msg="Failed to create gRPC client" address="10.150.99.169:4244" error="context deadline exceeded" hubble-tls=true next-try-in=2m40s peer=zcluster.k8s.local/nodes-ix2-voavso subsys=hubble-relay

6. What did you expect to happen?

I expected hubble-relay to be able to connect without any issues.

The certificate should include the correct DNS names.

The other missing puzzle piece was that port 4244 (TCP/UDP) needs to be opened in the security groups of all nodes when Hubble is activated. So for now, I've created an additional security group cilium.hubble.zcluster.k8s.local.
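
A security group like that can be created with the OpenStack CLI; a minimal sketch, assuming the group name and network CIDR used in this cluster:

# Create a security group for the hubble peer service
openstack security group create cilium.hubble.zcluster.k8s.local \
  --description "hubble peer service (TCP 4244) between cluster nodes"

# Allow TCP 4244 ingress from the cluster network CIDR
openstack security group rule create cilium.hubble.zcluster.k8s.local \
  --protocol tcp --dst-port 4244 --remote-ip 10.150.0.0/16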

7. Please provide your cluster manifest.

Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: zcluster.k8s.local
spec:
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  kubeDNS:
    provider: CoreDNS
    nodeLocalDNS:
      enabled: true
      memoryRequest: 5Mi
      cpuRequest: 25m
  certManager:
    # Needed for networking.cilium.hubble
    enabled: true
    defaultIssuer: letsencrypt-staging
    managed: false
  cloudConfig:
    openstack:
      blockStorage:
        bs-version: v3
        clusterName: zcluster.k8s.local
        ignore-volume-az: false
        csiTopologySupport: true
        # Needed to overwrite the image provided by kOps for Kubernetes 1.27.3
        # Can be removed when kOps updates
        csiPluginImage: registry.k8s.io/provider-os/cinder-csi-plugin:1.27.1@sha256:2eac35583ad803d5576663d8a2841653023ae39a3f6aa7d2cbd633b5aeabe998
      loadbalancer:
        floatingNetwork: provider
        floatingNetworkID: XXX
        method: ROUND_ROBIN
        provider: amphora
        useOctavia: true
      monitor:
        delay: 15s
        maxRetries: 3
        timeout: 10s
      router:
        dnsServers: 208.67.222.222,208.67.220.220,1.1.1.1
        externalNetwork: provider
  cloudControllerManager:
    clusterName: zcluster.k8s.local
    # Needed to overwrite the image provided by kOps for Kubernetes 1.27.3
    # Can be removed when kOps updates
    image: registry.k8s.io/provider-os/openstack-cloud-controller-manager:v1.27.1@sha256:8ed6967effb4ab4cf0ae2eabadacb24be804d5c2de2fd393ead97c47d8949485
  cloudProvider: openstack
  configBase: swift://kops/zcluster.k8s.local
  etcdClusters:
    - cpuRequest: 200m
      etcdMembers:
        - instanceGroup: control-plane-ix1
          name: ix1
          volumeType: high-iops
        - instanceGroup: control-plane-ix2
          name: ix2
          volumeType: high-iops
        - instanceGroup: control-plane-es1
          name: es1
          volumeType: high-iops
      memoryRequest: 100Mi
      name: main
      manager:
        backupRetentionDays: 14
    - cpuRequest: 100m
      etcdMembers:
        - instanceGroup: control-plane-ix1
          name: ix1
          volumeType: high-iops
        - instanceGroup: control-plane-ix2
          name: ix2
          volumeType: high-iops
        - instanceGroup: control-plane-es1
          name: es1
          volumeType: high-iops
      memoryRequest: 100Mi
      name: events
      manager:
        backupRetentionDays: 7
    - cpuRequest: 100m
      etcdMembers:
        - instanceGroup: control-plane-ix1
          name: ix1
          volumeType: high-iops
        - instanceGroup: control-plane-ix2
          name: ix2
          volumeType: high-iops
        - instanceGroup: control-plane-es1
          name: es1
          volumeType: high-iops
      manager:
        backupRetentionDays: 7
        env:
          - name: ETCD_AUTO_COMPACTION_MODE
            value: revision
          - name: ETCD_AUTO_COMPACTION_RETENTION
            value: "2500"
      memoryRequest: 100Mi
      name: cilium
  iam:
    allowContainerRegistry: true
    legacy: false
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
    - 0.0.0.0/0
    - ::/0
  kubernetesVersion: 1.27.3
  networkCIDR: 10.150.0.0/16
  nodePortAccess:
    - 10.150.0.0/16
  kubeProxy:
    enabled: false
  networking:
    cilium:
      clusterName: zcluster.k8s.local
      enablePrometheusMetrics: true
      etcdManaged: true
      enableBPFMasquerade: true
      hubble:
        enabled: true
      enableEncryption: true
      enableL7Proxy: true
      encryptionType: ipsec
      enableNodePort: false

  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
    - 0.0.0.0/0
    - ::/0
  subnets:
    - cidr: 10.150.32.0/19
      name: es1
      type: Private
      zone: es1
    - cidr: 10.150.64.0/19
      name: ix1
      type: Private
      zone: ix1
    - cidr: 10.150.96.0/19
      name: ix2
      type: Private
      zone: ix2
    - cidr: 10.150.0.0/22
      name: utility-es1
      type: Utility
      zone: es1
    - cidr: 10.150.4.0/22
      name: utility-ix1
      type: Utility
      zone: ix1
    - cidr: 10.150.8.0/22
      name: utility-ix2
      type: Utility
      zone: ix2
  topology:
    dns:
      type: Private
    masters: private
    nodes: private

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: zcluster.k8s.local
  annotations:
    openstack.kops.io/osVolumeBoot: "true"
    openstack.kops.io/osVolumeSize: "30"
  name: control-plane-ix1
spec:
  image: Ubuntu 22.04 Jammy Jellyfish - Latest
  machineType: s1.small
  maxSize: 1
  minSize: 1
  role: Master
  # apply OS security upgrades, avoiding rebooting when possible
  updatePolicy: automatic
  # replace the instance every month
  maxInstanceLifetime: 730h
  additionalSecurityGroups:
    - cilium.etcd.masters.zcluster.k8s.local
    - cilium.hubble.zcluster.k8s.local
  subnets:
    - ix1

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: zcluster.k8s.local
  annotations:
    openstack.kops.io/osVolumeBoot: "true"
    openstack.kops.io/osVolumeSize: "30"
  name: control-plane-ix2
spec:
  image: Ubuntu 22.04 Jammy Jellyfish - Latest
  machineType: s1.small
  maxSize: 1
  minSize: 1
  role: Master
  # apply OS security upgrades, avoiding rebooting when possible
  updatePolicy: automatic
  # replace the instance every month
  maxInstanceLifetime: 730h
  additionalSecurityGroups:
    - cilium.etcd.masters.zcluster.k8s.local
    - cilium.hubble.zcluster.k8s.local
  subnets:
    - ix2

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: zcluster.k8s.local
  annotations:
    openstack.kops.io/osVolumeBoot: "true"
    openstack.kops.io/osVolumeSize: "30"
  name: control-plane-es1
spec:
  image: Ubuntu 22.04 Jammy Jellyfish - Latest
  machineType: s1.small
  maxSize: 1
  minSize: 1
  role: Master
  # apply OS security upgrades, avoiding rebooting when possible
  updatePolicy: automatic
  # replace the instance every month
  maxInstanceLifetime: 730h
  additionalSecurityGroups:
    - cilium.etcd.masters.zcluster.k8s.local
    - cilium.hubble.zcluster.k8s.local
  subnets:
    - es1

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: zcluster.k8s.local
  annotations:
    openstack.kops.io/osVolumeBoot: "true"
    openstack.kops.io/osVolumeSize: "30"
  name: nodes-ix1
spec:
  rollingUpdate:
    # openstack cloud provider does not support surging
    maxSurge: 1
    minAvailable: 1
  # apply OS security upgrades, avoiding rebooting when possible
  updatePolicy: automatic
  # replace the instance every month
  maxInstanceLifetime: 730h
  image: Ubuntu 22.04 Jammy Jellyfish - Latest
  machineType: s1.micro
  autoscale: true
  maxSize: 1
  minSize: 1
  role: Node
  additionalSecurityGroups:
    - cilium.hubble.zcluster.k8s.local
  subnets:
    - ix1

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: zcluster.k8s.local
  annotations:
    openstack.kops.io/osVolumeBoot: "true"
    openstack.kops.io/osVolumeSize: "30"
  name: nodes-ix2
spec:
  rollingUpdate:
    # openstack cloud provider does not support surging yet
    maxSurge: 1
    minAvailable: 1
  image: Ubuntu 22.04 Jammy Jellyfish - Latest
  machineType: s1.micro
  autoscale: true
  maxSize: 2
  minSize: 1
  role: Node
  additionalSecurityGroups:
    - cilium.hubble.zcluster.k8s.local
  subnets:
    - ix2

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: zcluster.k8s.local
  annotations:
    openstack.kops.io/osVolumeBoot: "true"
    openstack.kops.io/osVolumeSize: "30"
  name: nodes-es1
spec:
  rollingUpdate:
    # openstack cloud provider does not support surging yet
    maxSurge: 1
    minAvailable: 1
  image: Ubuntu 22.04 Jammy Jellyfish - Latest
  machineType: s1.micro
  autoscale: true
  maxSize: 1
  minSize: 1
  role: Node
  additionalSecurityGroups:
    - cilium.hubble.zcluster.k8s.local
  subnets:
    - es1

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: zcluster.k8s.local
  name: bastions
spec:
  image: Ubuntu 22.04 Jammy Jellyfish - Latest
  instanceMetadata:
    httpPutResponseHopLimit: 1
    httpTokens: required
  machineType: m1.micro
  maxSize: 1
  minSize: 1
  role: Bastion
  subnets:
    - es1
    - ix1
    - ix2

---
apiVersion: kops.k8s.io/v1alpha2
kind: SSHCredential
metadata:
  labels:
    kops.k8s.io/cluster: zcluster.k8s.local
  name: admin
spec:
  publicKey: ssh-ed25519 XXX

8. Please run the commands with the most verbose logging by adding the -v 10 flag. Paste the logs into this report, or into a gist and provide the gist link here.

9. Anything else we need to know?

I fixed this by recreating the Certificate with the new DNS name added, and by creating a new security group that allows TCP ingress on port 4244 for all nodes within the network CIDR.

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  labels:
    addon.kops.k8s.io/name: networking.cilium.io
    app.kubernetes.io/managed-by: kops
    k8s-app: cilium
    role.kubernetes.io/networking: "1"
  name: hubble-server-certs
  namespace: kube-system
spec:
  dnsNames:
    - "*.zcluster.k8s.local.hubble-grpc.cilium.io"
    - "*.zcluster-k8s-local.hubble-grpc.cilium.io" # new
  issuerRef:
    kind: Issuer
    name: networking.cilium.io
  secretName: hubble-server-certs
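
After updating the Certificate, cert-manager has to re-issue the secret, and both the agents and the relay have to reload it; a minimal sketch, assuming the default kube-system resource names (the manifest filename is hypothetical):

# Apply the updated Certificate (file name is illustrative)
kubectl -n kube-system apply -f hubble-server-certs.yaml

# Delete the old secret so cert-manager issues a fresh one with the new DNS name
kubectl -n kube-system delete secret hubble-server-certs

# Restart the agents and relay so they pick up the re-issued certificate
kubectl -n kube-system rollout restart daemonset/cilium deployment/hubble-relay
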
zetaab commented 1 year ago

UDP 4244 should not be needed; see https://docs.cilium.io/en/latest/operations/system_requirements/#firewall-requirements

At least the following firewall changes are needed:

zetaab commented 1 year ago

Firewall changes: https://github.com/kubernetes/kops/pull/15635

I cannot say anything about that certificate as I am not using Cilium. Ping @olemarkus, any ideas why it uses a different format there? I am still on vacation so I do not have access to an OpenStack environment; I will need to wait until next month to look into that cert issue.

zadjadr commented 1 year ago

Thanks @zetaab