hashicorp / consul-helm

Helm chart to install Consul and other associated components.
Mozilla Public License 2.0
419 stars, 386 forks

Consul Helm Sync Catalog Crashloopback because ReadinessProbe #869

Closed kholisrag closed 3 years ago

kholisrag commented 3 years ago

Overview of the Issue

consul-sync-catalog crash-loops because Kubernetes detects that the readiness probe returns a 500 response.

Reproduction Steps

  1. Run helm install with the following values.yml:
    client:
      enabled: false
    global:
      acls:
        bootstrapToken:
          secretKey: null
          secretName: null
        createReplicationToken: false
        manageSystemACLs: false
        replicationToken:
          secretKey: null
          secretName: null
      datacenter: dc1
      domain: consul
      enableConsulNamespaces: false
      enablePodSecurityPolicies: false
      enabled: true
      federation:
        createFederationSecret: false
        enabled: false
      gossipEncryption:
        secretKey: key
        secretName: consul-gossip-encryption-key
      image: hashicorp/consul:1.9.3
      imageEnvoy: envoyproxy/envoy-alpine:v1.16.0
      imageK8S: hashicorp/consul-k8s:0.24.0
      imagePullSecrets: []
      name: consul
      openshift:
        enabled: false
      tls:
        caCert:
          secretKey: null
          secretName: null
        caKey:
          secretKey: null
          secretName: null
        enableAutoEncrypt: true
        enabled: true
        httpsOnly: true
        serverAdditionalDNSSANs: []
        serverAdditionalIPSANs: []
        verify: true
    server:
      affinity: |
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: {{ template "consul.name" . }}
                  release: "{{ .Release.Name }}"
                  component: server
              topologyKey: kubernetes.io/hostname
      annotations: null
      bootstrapExpect: null
      connect: true
      disruptionBudget:
        enabled: true
        maxUnavailable: null
      enabled: '-'
      exposeGossipAndRPCPorts: true
      extraConfig: |
        {}
      extraEnvironmentVars: {}
      extraLabels: null
      extraVolumes: []
      image: null
      nodeSelector: null
      ports:
        serflan:
          port: 8301
      priorityClassName: ""
      replicas: 3
      resources:
        limits:
          cpu: 500m
          memory: 500Mi
        requests:
          cpu: 100m
          memory: 100Mi
      securityContext:
        fsGroup: 1000
        runAsGroup: 1000
        runAsNonRoot: true
        runAsUser: 100
      service:
        annotations: null
      storage: 1Gi
      storageClass: ebs-sc-gp3
      tolerations: null
      updatePartition: 0
    syncCatalog:
      aclSyncToken:
        secretKey: null
        secretName: null
      addK8SNamespaceSuffix: true
      affinity: null
      consulNodeName: k8s-sync
      consulPrefix: null
      consulWriteInterval: null
      default: true
      enabled: true
      image: null
      k8sAllowNamespaces:
      - '*'
      k8sDenyNamespaces:
      - kube-system
      - kube-public
      k8sPrefix: null
      k8sSourceNamespace: null
      k8sTag: null
      logLevel: info
      nodePortSyncType: ExternalFirst
      nodeSelector: null
      priorityClassName: ""
      resources:
        limits:
          cpu: 50m
          memory: 50Mi
        requests:
          cpu: 50m
          memory: 50Mi
      syncClusterIPServices: false
      toConsul: true
      toK8S: true
      tolerations: null
  2. View error

Logs

```
[GET /health/ready] Error getting leader status: Get "https://consul-server:8501/v1/status/leader": x509: certificate signed by unknown authority
[GET /health/ready] Error getting leader status: Get "https://consul-server:8501/v1/status/leader": x509: certificate signed by unknown authority
[GET /health/ready] Error getting leader status: Get "https://consul-server:8501/v1/status/leader": x509: certificate signed by unknown authority
[GET /health/ready] Error getting leader status: Get "https://consul-server:8501/v1/status/leader": x509: certificate signed by unknown authority
[GET /health/ready] Error getting leader status: Get "https://consul-server:8501/v1/status/leader": x509: certificate signed by unknown authority
2021-03-18T12:50:41.724Z [WARN] to-consul/sink: error registering service: node-name=k8s-sync service-name={{redacted}} service="&{ {{redacted}} [k8s] map[external-k8s-ns:dev external-source:kubernetes port-default:15000] 30252 {{redacted}} map[] {0 0} false 0 0 }" err="Put "https://consul-server:8501/v1/catalog/register": x509: certificate signed by unknown authority"
2021-03-18T12:50:41.727Z [WARN] to-consul/sink: error registering service: node-name=k8s-sync service-name={{redacted}} service="&{ {{redacted}} [k8s] map[external-k8s-ns:dev external-source:kubernetes port-http:80] 32016 {{redacted}} map[] {0 0} false 0 0 }" err="Put "https://consul-server:8501/v1/catalog/register": x509: certificate signed by unknown authority"
2021-03-18T12:50:41.730Z [WARN] to-consul/sink: error registering service: node-name=k8s-sync service-name={{redacted}} service="&{ {{redacted}} [k8s] map[external-k8s-ns:dev external-source:kubernetes port-default:15000] 31230 {{redacted}} map[] {0 0} false 0 0 }" err="Put "https://consul-server:8501/v1/catalog/register": x509: certificate signed by unknown authority"
2021-03-18T12:50:42.026Z [WARN] to-consul/sink: error registering service: node-name=k8s-sync service-name={{redacted}} service="&{ {{redacted}} [k8s] map[external-k8s-ns:dev external-source:kubernetes port-udp:80] 32623 {{redacted}} map[] {0 0} false 0 0 }" err="Put "https://consul-server:8501/v1/catalog/register": x509: certificate signed by unknown authority"
2021-03-18T12:50:42.029Z [WARN] to-consul/sink: error registering service: node-name=k8s-sync service-name={{redacted}} service="&{ {{redacted}} [k8s] map[external-k8s-ns:dev external-source:kubernetes port-http:80] 30911 {{redacted}} map[] {0 0} false 0 0 }" err="Put "https://consul-server:8501/v1/catalog/register": x509: certificate signed by unknown authority"
2021-03-18T12:50:42.031Z [WARN] to-consul/sink: error registering service: node-name=k8s-sync service-name={{redacted}} service="&{ {{redacted}} [k8s] map[external-k8s-ns:dev external-source:kubernetes port-http:80] 31678 {{redacted}} map[] {0 0} false 0 0 }" err="Put "https://consul-server:8501/v1/catalog/register": x509: certificate signed by unknown authority"
```

The Consul servers are working normally and the UI can be accessed:

NAME              READY   STATUS    RESTARTS   AGE
consul-server-0   1/1     Running   0          10h
consul-server-1   1/1     Running   0          10h
consul-server-2   1/1     Running   0          4d11h

We use AWS EKS 1.16.

Expected behavior

Consul services can be synced to Kubernetes without modifying the CoreDNS config.


kschoche commented 3 years ago

Hi @petrukngantuk, thanks for filing this issue! It looks like the readiness probe is failing because catalog-sync is unable to reach a healthy server in a cluster where quorum has already been established. Could you provide a bit more information about your cluster, as well as the output of kubectl get pods showing that the other server nodes are online and healthy?

kholisrag commented 3 years ago

Added @kschoche

kschoche commented 3 years ago

Hi @petrukngantuk - thanks for updating the issue! The health endpoint for sync-catalog issues a Consul client API call to check on the state of the Consul cluster, and I noticed that you have clients disabled. When the sync catalog pod gets scheduled on a node that doesn't have a Consul agent running, it can't complete the API call through the client, so it never becomes healthy. I was able to reproduce the issue on my end with your yaml file, and enabling clients should get you up and running! Please let me know if that helps!
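
As a rough sketch of that workaround, and assuming nothing else in the posted values.yml needs to change, the override would look something like this:

```yaml
# Workaround sketch (assumption: every other value stays as posted in the issue).
# Runs Consul client agents so sync-catalog can make its API call through a client.
client:
  enabled: true
```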

kholisrag commented 3 years ago

@kschoche I didn't want to enable the client, btw 😁

kschoche commented 3 years ago

Hi @petrukngantuk, I've created a fix that addresses this issue and should let you use sync+autoencrypt with clients disabled. It is in master now. For reference, here is the PR with the changes: https://github.com/hashicorp/consul-helm/pull/891
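
For anyone landing here later, the combination in question boils down to roughly the following. This is only the subset of the values.yml posted in this issue that the discussion singles out, and it has not been verified as a minimal reproduction:

```yaml
# Subset of the values posted in this issue (not a verified minimal reproduction).
client:
  enabled: false            # no Consul client agents
global:
  tls:
    enabled: true
    enableAutoEncrypt: true # the "autoencrypt" part of sync+autoencrypt
syncCatalog:
  enabled: true             # the "sync" part
```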

I'll go ahead and close this one out as fixed. If you run into any problems, feel free to let me know! Cheers.