projectcontour / contour

Contour is a Kubernetes ingress controller using Envoy proxy.
https://projectcontour.io
Apache License 2.0

envoy container in CrashLoopBackOff : error initializing configuration #3264

Closed: vinzo99 closed this issue 3 years ago

vinzo99 commented 3 years ago

Hi

We are deploying Contour (v1.11.0), with Envoy (v1.16.2) as a DaemonSet, using the following YAML templates: https://github.com/projectcontour/contour/blob/release-1.11/examples/render/contour.yaml

We only applied minor changes to fit our configuration (such as pointing to our local image repository, adding RBAC privileges, etc.).

When firing up the Helm installation, the Envoy pod fails with CrashLoopBackOff, with the following error in the envoy container:

[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:305] initializing epoch 0 (base id=0, hot restart version=11.104)
[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:307] statically linked extensions:
[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:309]   envoy.thrift_proxy.filters: envoy.filters.thrift.rate_limit, envoy.filters.thrift.router
[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:309]   envoy.tracers: envoy.dynamic.ot, envoy.lightstep, envoy.tracers.datadog, envoy.tracers.dynamic_ot, envoy.tracers.lightstep, envoy.tracers.opencensus, envoy.tracers.xray, envoy.tracers.zipkin, envoy.zipkin
[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:309]   envoy.retry_priorities: envoy.retry_priorities.previous_priorities
[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:309]   envoy.internal_redirect_predicates: envoy.internal_redirect_predicates.allow_listed_routes, envoy.internal_redirect_predicates.previous_routes, envoy.internal_redirect_predicates.safe_cross_scheme
[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:309]   envoy.thrift_proxy.protocols: auto, binary, binary/non-strict, compact, twitter
[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:309]   envoy.resource_monitors: envoy.resource_monitors.fixed_heap, envoy.resource_monitors.injected_resource
[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:309]   envoy.filters.network: envoy.client_ssl_auth, envoy.echo, envoy.ext_authz, envoy.filters.network.client_ssl_auth, envoy.filters.network.direct_response, envoy.filters.network.dubbo_proxy, envoy.filters.network.echo, envoy.filters.network.ext_authz, envoy.filters.network.http_connection_manager, envoy.filters.network.kafka_broker, envoy.filters.network.local_ratelimit, envoy.filters.network.mongo_proxy, envoy.filters.network.mysql_proxy, envoy.filters.network.postgres_proxy, envoy.filters.network.ratelimit, envoy.filters.network.rbac, envoy.filters.network.redis_proxy, envoy.filters.network.rocketmq_proxy, envoy.filters.network.sni_cluster, envoy.filters.network.sni_dynamic_forward_proxy, envoy.filters.network.tcp_proxy, envoy.filters.network.thrift_proxy, envoy.filters.network.zookeeper_proxy, envoy.http_connection_manager, envoy.mongo_proxy, envoy.ratelimit, envoy.redis_proxy, envoy.tcp_proxy
[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:309]   envoy.quic_client_codec: quiche
[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:309]   envoy.udp_listeners: quiche_quic_listener, raw_udp_listener
[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:309]   envoy.compression.decompressor: envoy.compression.gzip.decompressor
[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:309]   envoy.filters.listener: envoy.filters.listener.http_inspector, envoy.filters.listener.original_dst, envoy.filters.listener.original_src, envoy.filters.listener.proxy_protocol, envoy.filters.listener.tls_inspector, envoy.listener.http_inspector, envoy.listener.original_dst, envoy.listener.original_src, envoy.listener.proxy_protocol, envoy.listener.tls_inspector
[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:309]   envoy.quic_server_codec: quiche
[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:309]   envoy.filters.udp_listener: envoy.filters.udp.dns_filter, envoy.filters.udp_listener.udp_proxy
[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:309]   envoy.retry_host_predicates: envoy.retry_host_predicates.omit_canary_hosts, envoy.retry_host_predicates.omit_host_metadata, envoy.retry_host_predicates.previous_hosts
[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:309]   envoy.compression.compressor: envoy.compression.gzip.compressor
[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:309]   envoy.dubbo_proxy.filters: envoy.filters.dubbo.router
[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:309]   envoy.thrift_proxy.transports: auto, framed, header, unframed
[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:309]   envoy.dubbo_proxy.protocols: dubbo
[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:309]   envoy.health_checkers: envoy.health_checkers.redis
[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:309]   envoy.bootstrap: envoy.extensions.network.socket_interface.default_socket_interface
[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:309]   envoy.dubbo_proxy.route_matchers: default
[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:309]   envoy.transport_sockets.downstream: envoy.transport_sockets.alts, envoy.transport_sockets.quic, envoy.transport_sockets.raw_buffer, envoy.transport_sockets.tap, envoy.transport_sockets.tls, raw_buffer, tls
[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:309]   envoy.dubbo_proxy.serializers: dubbo.hessian2
[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:309]   envoy.upstreams: envoy.filters.connection_pools.http.generic, envoy.filters.connection_pools.http.http, envoy.filters.connection_pools.http.tcp
[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:309]   envoy.resolvers: envoy.ip
[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:309]   envoy.clusters: envoy.cluster.eds, envoy.cluster.logical_dns, envoy.cluster.original_dst, envoy.cluster.static, envoy.cluster.strict_dns, envoy.clusters.aggregate, envoy.clusters.dynamic_forward_proxy, envoy.clusters.redis
[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:309]   envoy.transport_sockets.upstream: envoy.transport_sockets.alts, envoy.transport_sockets.quic, envoy.transport_sockets.raw_buffer, envoy.transport_sockets.tap, envoy.transport_sockets.tls, envoy.transport_sockets.upstream_proxy_protocol, raw_buffer, tls
[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:309]   envoy.guarddog_actions: envoy.watchdog.abort_action, envoy.watchdog.profile_action
[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:309]   envoy.http.cache: envoy.extensions.http.cache.simple
[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:309]   envoy.filters.http: envoy.buffer, envoy.cors, envoy.csrf, envoy.ext_authz, envoy.fault, envoy.filters.http.adaptive_concurrency, envoy.filters.http.admission_control, envoy.filters.http.aws_lambda, envoy.filters.http.aws_request_signing, envoy.filters.http.buffer, envoy.filters.http.cache, envoy.filters.http.cdn_loop, envoy.filters.http.compressor, envoy.filters.http.cors, envoy.filters.http.csrf, envoy.filters.http.decompressor, envoy.filters.http.dynamic_forward_proxy, envoy.filters.http.dynamo, envoy.filters.http.ext_authz, envoy.filters.http.fault, envoy.filters.http.grpc_http1_bridge, envoy.filters.http.grpc_http1_reverse_bridge, envoy.filters.http.grpc_json_transcoder, envoy.filters.http.grpc_stats, envoy.filters.http.grpc_web, envoy.filters.http.gzip, envoy.filters.http.header_to_metadata, envoy.filters.http.health_check, envoy.filters.http.ip_tagging, envoy.filters.http.jwt_authn, envoy.filters.http.local_ratelimit, envoy.filters.http.lua, envoy.filters.http.oauth, envoy.filters.http.on_demand, envoy.filters.http.original_src, envoy.filters.http.ratelimit, envoy.filters.http.rbac, envoy.filters.http.router, envoy.filters.http.squash, envoy.filters.http.tap, envoy.grpc_http1_bridge, envoy.grpc_json_transcoder, envoy.grpc_web, envoy.gzip, envoy.health_check, envoy.http_dynamo_filter, envoy.ip_tagging, envoy.local_rate_limit, envoy.lua, envoy.rate_limit, envoy.router, envoy.squash
[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:309]   envoy.access_loggers: envoy.access_loggers.file, envoy.access_loggers.http_grpc, envoy.access_loggers.tcp_grpc, envoy.file_access_log, envoy.http_grpc_access_log, envoy.tcp_grpc_access_log
[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:309]   envoy.grpc_credentials: envoy.grpc_credentials.aws_iam, envoy.grpc_credentials.default, envoy.grpc_credentials.file_based_metadata
[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:309]   envoy.stats_sinks: envoy.dog_statsd, envoy.metrics_service, envoy.stat_sinks.dog_statsd, envoy.stat_sinks.hystrix, envoy.stat_sinks.metrics_service, envoy.stat_sinks.statsd, envoy.statsd
[2021-01-19 14:41:26.115][1][info][main] [source/server/server.cc:309]   envoy.udp_packet_writers: udp_default_writer, udp_gso_batch_writer
[2021-01-19 14:41:26.123][1][info][main] [source/server/server.cc:325] HTTP header map info:
[2021-01-19 14:41:26.124][1][warning][runtime] [source/common/runtime/runtime_features.cc:31] Unable to use runtime singleton for feature envoy.http.headermap.lazy_map_min_size
[2021-01-19 14:41:26.124][1][warning][runtime] [source/common/runtime/runtime_features.cc:31] Unable to use runtime singleton for feature envoy.http.headermap.lazy_map_min_size
[2021-01-19 14:41:26.125][1][warning][runtime] [source/common/runtime/runtime_features.cc:31] Unable to use runtime singleton for feature envoy.http.headermap.lazy_map_min_size
[2021-01-19 14:41:26.125][1][warning][runtime] [source/common/runtime/runtime_features.cc:31] Unable to use runtime singleton for feature envoy.http.headermap.lazy_map_min_size
[2021-01-19 14:41:26.125][1][info][main] [source/server/server.cc:328]   request header map: 608 bytes: :authority,:method,:path,:protocol,:scheme,accept,accept-encoding,access-control-request-method,authorization,cache-control,cdn-loop,connection,content-encoding,content-length,content-type,expect,grpc-accept-encoding,grpc-timeout,if-match,if-modified-since,if-none-match,if-range,if-unmodified-since,keep-alive,origin,pragma,proxy-connection,referer,te,transfer-encoding,upgrade,user-agent,via,x-client-trace-id,x-envoy-attempt-count,x-envoy-decorator-operation,x-envoy-downstream-service-cluster,x-envoy-downstream-service-node,x-envoy-expected-rq-timeout-ms,x-envoy-external-address,x-envoy-force-trace,x-envoy-hedge-on-per-try-timeout,x-envoy-internal,x-envoy-ip-tags,x-envoy-max-retries,x-envoy-original-path,x-envoy-original-url,x-envoy-retriable-header-names,x-envoy-retriable-status-codes,x-envoy-retry-grpc-on,x-envoy-retry-on,x-envoy-upstream-alt-stat-name,x-envoy-upstream-rq-per-try-timeout-ms,x-envoy-upstream-rq-timeout-alt-response,x-envoy-upstream-rq-timeout-ms,x-forwarded-client-cert,x-forwarded-for,x-forwarded-proto,x-ot-span-context,x-request-id
[2021-01-19 14:41:26.125][1][info][main] [source/server/server.cc:328]   request trailer map: 128 bytes: 
[2021-01-19 14:41:26.125][1][info][main] [source/server/server.cc:328]   response header map: 424 bytes: :status,access-control-allow-credentials,access-control-allow-headers,access-control-allow-methods,access-control-allow-origin,access-control-expose-headers,access-control-max-age,age,cache-control,connection,content-encoding,content-length,content-type,date,etag,expires,grpc-message,grpc-status,keep-alive,last-modified,location,proxy-connection,server,transfer-encoding,upgrade,vary,via,x-envoy-attempt-count,x-envoy-decorator-operation,x-envoy-degraded,x-envoy-immediate-health-check-fail,x-envoy-ratelimited,x-envoy-upstream-canary,x-envoy-upstream-healthchecked-cluster,x-envoy-upstream-service-time,x-request-id
[2021-01-19 14:41:26.125][1][info][main] [source/server/server.cc:328]   response trailer map: 152 bytes: grpc-message,grpc-status
[2021-01-19 14:41:26.126][1][info][main] [source/server/server.cc:448] admin address: 127.0.0.1:9001
[2021-01-19 14:41:26.128][1][info][main] [source/server/server.cc:583] runtime: layers:
  - name: base
    static_layer:
      {}
  - name: admin
    admin_layer:
      {}
[2021-01-19 14:41:26.128][1][info][config] [source/server/configuration_impl.cc:95] loading tracing configuration
[2021-01-19 14:41:26.128][1][info][config] [source/server/configuration_impl.cc:70] loading 0 static secret(s)
[2021-01-19 14:41:26.128][1][info][config] [source/server/configuration_impl.cc:76] loading 2 cluster(s)
[2021-01-19 14:41:26.129][1][critical][main] [source/server/server.cc:102] error initializing configuration '/config/envoy.json': envoy::api::v2::Path must refer to an existing path in the system: '/config/resources/sds/xds-tls-certificate.json' does not exist
[2021-01-19 14:41:26.129][1][info][main] [source/server/server.cc:731] exiting
envoy::api::v2::Path must refer to an existing path in the system: '/config/resources/sds/xds-tls-certificate.json' does not exist

This error wasn't occurring with older versions.

Just for information, the contour-certgen job has run successfully, and the Contour pods are up and running.

Can you please advise?

Thanks

sunjayBhatia commented 3 years ago

The contour bootstrap init container that generates /config/resources/sds/xds-tls-certificate.json may have failed or not run.

Do you have any error/status information for that, and does your config match https://github.com/projectcontour/contour/blob/78c434fcd9aa8e08f12ee5def7c0e215c0c805c5/examples/contour/03-envoy.yaml#L98 ?
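
For reference, commands along these lines should surface the init container's status and logs (a generic sketch; substitute your own namespace and Envoy pod name):

kubectl -n <namespace> describe pod <envoy-pod>                 # shows the envoy-initconfig state and exit code
kubectl -n <namespace> logs <envoy-pod> -c envoy-initconfig     # output from the bootstrap init container
kubectl -n <namespace> logs <envoy-pod> -c envoy --previous     # logs from the last crashed envoy container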

vinzo99 commented 3 years ago

Here is the full describe output for the pod:

Name:         envoy-jq658
Namespace:    vbertell
Priority:     0
Node:         douzeasrclsuster-edge-02/172.16.1.7
Start Time:   Tue, 19 Jan 2021 13:59:30 +0000
Labels:       app=envoy
              controller-revision-hash=b7746ff5b
              pod-template-generation=1
Annotations:  kubernetes.io/psp: privileged
              prometheus.io/path: /stats/prometheus
              prometheus.io/port: 8002
              prometheus.io/scrape: true
              seccomp.security.alpha.kubernetes.io/pod: docker/default
Status:       Running
IP:           172.16.1.7
IPs:
  IP:           172.16.1.7
Controlled By:  DaemonSet/envoy
Init Containers:
  envoy-initconfig:
    Container ID:  docker://46ba7dc39f4f8243107706df8bccc559db9dd900de09b33230d71fcff6194a31
    Image:         xxxxx/projectcontour/contour:v1.11.0
    Image ID:      docker-pullable://rxxxxx/projectcontour/contour@sha256:a0f9675ae2f1d8204e036ae2a73e4b1c79be19f1b02bb7478bd77b17251179b0
    Port:          <none>
    Host Port:     <none>
    Command:
      contour
    Args:
      bootstrap
      /config/envoy.json
      --xds-address=contour
      --xds-port=8001
      --xds-resource-version=v3
      --resources-dir=/config/resources
      --envoy-cafile=/certs/ca.crt
      --envoy-cert-file=/certs/tls.crt
      --envoy-key-file=/certs/tls.key
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 19 Jan 2021 13:59:35 +0000
      Finished:     Tue, 19 Jan 2021 13:59:35 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      CONTOUR_NAMESPACE:  vbertell (v1:metadata.namespace)
    Mounts:
      /certs from envoycert (ro)
      /config from envoy-config (rw)
Containers:
  shutdown-manager:
    Container ID:  docker://df9c6a5d5d43c4d86d30cf1a29f4bf0c8e721b2a0e2dd026daa9808e70c3dad3
    Image:         xxxxx/projectcontour/contour:v1.11.0
    Image ID:      docker-pullable://xxxxx/projectcontour/contour@sha256:a0f9675ae2f1d8204e036ae2a73e4b1c79be19f1b02bb7478bd77b17251179b0
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/contour
    Args:
      envoy
      shutdown-manager
    State:          Running
      Started:      Tue, 19 Jan 2021 13:59:36 +0000
    Ready:          True
    Restart Count:  0
    Liveness:       http-get http://:8090/healthz delay=3s timeout=1s period=10s #success=1 #failure=3
    Environment:    <none>
    Mounts:         <none>
  envoy:
    Container ID:  docker://0d5497db623463b9920c1cd55cb486810964b59afe69184d280248f5e9dba8a8
    Image:         xxxxx/cesp-envoy:1.16.2-1-2-ic
    Image ID:      docker-pullable://xxxxx/cesp-envoy@sha256:0486f32009dac92457ee88b005c6c57574d66f513046af43f895cdc5e6d18eb5
    Ports:         80/TCP, 443/TCP
    Host Ports:    80/TCP, 443/TCP
    Command:
      envoy
    Args:
      -c
      /config/envoy.json
      --service-cluster $(CONTOUR_NAMESPACE)
      --service-node $(ENVOY_POD_NAME)
      --log-level info
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 20 Jan 2021 08:30:01 +0000
      Finished:     Wed, 20 Jan 2021 08:30:01 +0000
    Ready:          False
    Restart Count:  222
    Readiness:      http-get http://:8002/ready delay=3s timeout=1s period=4s #success=1 #failure=3
    Environment:
      CONTOUR_NAMESPACE:  vbertell (v1:metadata.namespace)
      ENVOY_POD_NAME:     envoy-jq658 (v1:metadata.name)
    Mounts:
      /certs from envoycert (rw)
      /config from envoy-config (rw)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  envoy-config:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  envoycert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  envoycert
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  is_edge=true
Tolerations:     is_edge=true:NoExecute
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/network-unavailable:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/pid-pressure:NoSchedule
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule
Events:
  Type     Reason   Age                     From                               Message
  ----     ------   ----                    ----                               -------
  Normal   Pulled   12m (x220 over 18h)     kubelet, douzeasrclsuster-edge-02  Container image "xxxxx/cesp-envoy:1.16.2-1-2-ic" already present on machine
  Warning  BackOff  2m50s (x5141 over 18h)  kubelet, douzeasrclsuster-edge-02  Back-off restarting failed container

The init container matches the suggested YAML configuration, as you can see above, unless you spot any error in the configuration?

The kubectl logs -c envoy-initconfig output for the pod is empty; maybe there is a different way to access those logs, or a way to increase the debug level?

Thanks

youngnick commented 3 years ago

Hi @vinzo99, sorry you have this problem, it's definitely not good!

The key error here appears to be '/config/resources/sds/xds-tls-certificate.json' not existing. That file is part of the system we use to secure the communication between Contour and Envoy.

That system requires the Contour namespace (ie projectcontour) to have the secrets contourcert, envoycert, and cacert. These secrets are created by the contour-certgen Job, which runs a container that creates them.

I'd start by checking if the secrets are present, and if the contour-certgen Job ran. (You can use kubectl get job -n projectcontour for this).
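
For example (this assumes the stock projectcontour namespace and the v1.11.0 example manifests; substitute your own namespace and job name if you installed elsewhere):

kubectl get job -n projectcontour
kubectl describe job -n projectcontour contour-certgen-v1.11.0
kubectl get secrets -n projectcontour contourcert envoycert cacert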

vinzo99 commented 3 years ago

Hi @youngnick, I just checked what you suggested:

1°) the job has run successfully:

NAME                      COMPLETIONS   DURATION   AGE
contour-certgen-v1.11.0   1/1           5s         62s

2°) here are the job details:

Name:           contour-certgen-v1.11.0
Namespace:      vbertell
Selector:       controller-uid=5febfb2c-0f0a-4851-9978-17baa4501312
Labels:         app=contour-certgen
                controller-uid=5febfb2c-0f0a-4851-9978-17baa4501312
                job-name=contour-certgen-v1.11.0
Annotations:    <none>
Parallelism:    1
Completions:    1
Start Time:     Mon, 25 Jan 2021 07:57:16 +0000
Completed At:   Mon, 25 Jan 2021 07:57:21 +0000
Duration:       5s
Pods Statuses:  0 Running / 1 Succeeded / 0 Failed
Pod Template:
  Labels:           app=contour-certgen
                    controller-uid=5febfb2c-0f0a-4851-9978-17baa4501312
                    job-name=contour-certgen-v1.11.0
  Service Account:  contour-certgen
  Containers:
   contour:
    Image:      registry1-docker-io.repo.lab.pl.alcatel-lucent.com/projectcontour/contour:v1.11.0
    Port:       <none>
    Host Port:  <none>
    Command:
      contour
      certgen
      --kube
      --incluster
      --overwrite
      --secrets-format=compact
      --namespace=$(CONTOUR_NAMESPACE)
    Environment:
      CONTOUR_NAMESPACE:   (v1:metadata.namespace)
    Mounts:               <none>
  Volumes:                <none>
Events:
  Type    Reason            Age   From            Message
  ----    ------            ----  ----            -------
  Normal  SuccessfulCreate  73s   job-controller  Created pod: contour-certgen-v1.11.0-6gtd6

3°) the contourcert and envoycert secrets are there:

NAME                          TYPE                                  DATA   AGE
contourcert                   kubernetes.io/tls                     3      18s
envoycert                     kubernetes.io/tls                     3      18s

but the cacert secret does NOT exist; this is most probably related?

Thanks!

vinzo99 commented 3 years ago

@youngnick meanwhile I also tried to manually perform what the contour-certgen job does, following these directions: https://projectcontour.io/docs/main/grpc-tls-howto/ (btw, I also had to manually create the certs/ and _integration/ directories, and manually touch the _integration/cert-contour.ext and _integration/cert-envoy.ext files in order to successfully run the openssl commands)

Still no cacert secret; I finally managed to manually create it using old directions (that may be deprecated...): kubectl create secret -n vbertell generic cacert --from-file=./certs/cacert.pem

All secrets are now here:

NAME                          TYPE                                  DATA   AGE
cacert                        Opaque                                1      7m18s
contourcert                   Opaque                                3      17m
envoycert                     Opaque                                3      17m

The Envoy pod status is still CrashLoopBackOff, with the same errors.

stevesloka commented 3 years ago

The ca.crt should be embedded in the envoycert secret and the contourcert secret. Can you confirm you have these?

$ kubectl describe secret envoycert -n projectcontour                                                                                                                                                                                                                                                           
Name:         envoycert
Namespace:    projectcontour
Labels:       app=contour
Annotations:  <none>

Type:  kubernetes.io/tls

Data
====
tls.key:  1675 bytes
ca.crt:   1139 bytes
tls.crt:  1265 bytes
 $ kubectl exec -it -n projectcontour envoy-78hrf -c envoy cat /certs/ca.crt                                                                                                                                                                                                                                     
-----BEGIN CERTIFICATE-----
<certData>
-----END CERTIFICATE-----

vinzo99 commented 3 years ago

@stevesloka sure:

# kubectl -n vbertell describe secret envoycert
Name:         envoycert
Namespace:    vbertell
Labels:       <none>
Annotations:  
Type:         Opaque

Data
====
ca.crt:   1188 bytes
tls.crt:  1066 bytes
tls.key:  1675 bytes

I can't exec cat /certs/ca.crt since the envoy container has crashed in the envoy-mlhl9 pod, obviously:

# kubectl -n vbertell exec -it envoy-mlhl9 -c envoy cat /certs/ca.crt
error: unable to upgrade connection: container not found ("envoy")

stevesloka commented 3 years ago

Could you try killing the Envoy pod and letting it restart? At one point, some folks did see an issue where the Envoy pod would try to start before the secrets were ready in the shared volume (but that shouldn't happen, because the bootstrap is done in an initContainer).
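
For example, deleting it by label works because the DaemonSet will recreate the pod (this assumes the app=envoy label from the example manifests; substitute your own namespace):

kubectl -n <namespace> delete pod -l app=envoy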

vinzo99 commented 3 years ago

I already tried that, same result. And you're right: since the job is performed in the initContainer, the envoy container should end up getting started, which it does not; it keeps trying to restart indefinitely. See here, after 3+ days:

Events:
  Type     Reason   Age                     From                               Message
  ----     ------   ----                    ----                               -------
  Warning  BackOff  89s (x21618 over 3d6h)  kubelet, douzeasrclsuster-edge-02  Back-off restarting failed container

vinzo99 commented 3 years ago

Hi, any hints on this issue? Thanks

youngnick commented 3 years ago

Hi @vinzo99, you can see that @skriss has put this one in "Needs Investigation" in our project board. That means that one of us will need to try and reproduce the issue to see if we can figure out what's causing it.

The way this whole setup works is:

So, I have a couple of questions for you:

vinzo99 commented 3 years ago

Hi @youngnick, regarding your 2 questions:

1°) the pods can indeed create emptyDir volumes. I just made sure of that by completing this short example: https://kubernetes.io/docs/tasks/configure-pod-container/configure-volume-storage/

2°) I killed the Envoy pod using this command:

# kubectl -n vbertell get pod
NAME                       READY   STATUS             RESTARTS   AGE
contour-6b85456c49-4xm4r   1/1     Running            0          15d
contour-6b85456c49-mxbrp   1/1     Running            1          15d
envoy-nj78p                1/2     CrashLoopBackOff   3290       11d

# kubectl -n vbertell delete pod envoy-nj78p
pod "envoy-nj78p" deleted

# kubectl -n vbertell get pod
NAME                       READY   STATUS             RESTARTS   AGE
contour-6b85456c49-4xm4r   1/1     Running            0          15d
contour-6b85456c49-mxbrp   1/1     Running            1          15d
envoy-2s28h                1/2     CrashLoopBackOff   2          32s

I'm not sure what we can do to monitor the bootstrap process, apart from creating a dummy pod that recreates all the actions performed by envoy-initconfig, or maybe starting envoy-initconfig using a custom Contour image with a shell, for debug purposes...

Thanks!

stevesloka commented 3 years ago

@vinzo99 I wonder if we're chasing the wrong thing. One item that might be an issue in your environment is the default hostPorts in the Envoy DaemonSet. Does your environment allow that? Maybe remove those references and see if the pod starts.

I would expect some sort of related error, but I'm just trying to think of what else it might be.

If not, the other path we could try is removing the initContainer & certgen and pushing the bits manually.

vinzo99 commented 3 years ago

@stevesloka

FYI: when I first tried to deploy Contour with the default YAML files, I faced the following issue when launching the envoy DaemonSet:

Events:
  Type     Reason        Age                From                  Message
  ----     ------        ----               ----                  -------
  Warning  FailedCreate  3s (x13 over 24s)  daemonset-controller  Error creating: pods "envoy-" is forbidden: unable to validate against any pod security policy: [spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used spec.containers[1].hostPort: Invalid value: 80: Host port 80 is not allowed to be used. Allowed ports: [] spec.containers[1].hostPort: Invalid value: 443: Host port 443 is not allowed to be used. Allowed ports: []]

The envoy DaemonSet was not even starting at this point, due to my cluster's RBAC environment.

I quickly solved this issue by adding the following rules to the contour ClusterRole:

- apiGroups:
  - extensions
  resourceNames:
  - privileged
  resources:
  - podsecuritypolicies
  verbs:
  - use

which leads us to the current state, with the envoy DaemonSet trying to start, then the Envoy container crashing with CrashLoopBackOff.

Not sure this is what you meant though.

Thanks!

stevesloka commented 3 years ago

@vinzo99 yup this might be it, but let's confirm. =)

In the examples, the Envoy DaemonSet which deploys the Envoy pods has two hostPort entries (https://github.com/projectcontour/contour/blob/main/examples/contour/03-envoy.yaml#L72 & https://github.com/projectcontour/contour/blob/main/examples/contour/03-envoy.yaml#L76).

Can you remove those and see if your pod spins up properly? It may not work, because we need to swap the service values around, but that will tell us what the problem is and then where to go.
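
For reference, a rough sketch of the envoy container's ports section with just the hostPort keys dropped (this shows the 8080/8443 containerPort values from the example manifest; keep whatever containerPort values your own chart uses):

        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        - containerPort: 8443
          name: https
          protocol: TCP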

Thanks!

vinzo99 commented 3 years ago

@stevesloka I removed both hostPort lines and get the exact same result.

Thanks!

stevesloka commented 3 years ago

Are you using Pod Security Policies? Can you share any information about your cluster? It seems, like @youngnick suggested, that something with the initContainer isn't working to create this default config. Let me see if I can pick out the bits into a ConfigMap, have you apply that, and see if you can get it working.

vinzo99 commented 3 years ago

@stevesloka the cluster has 2 main Pod Security Policies, restricted and privileged:

# kubectl get podsecuritypolicy
NAME              PRIV    CAPS        SELINUX    RUNASUSER          FSGROUP     SUPGROUP    READONLYROOTFS   VOLUMES
privileged        true    *           RunAsAny   RunAsAny           RunAsAny    RunAsAny    false            *
restricted        false               RunAsAny   MustRunAsNonRoot   MustRunAs   MustRunAs   false            configMap,emptyDir,projected,secret,downwardAPI,persistentVolumeClaim,hostPath

# kubectl describe podsecuritypolicy restricted
Name:  restricted

Settings:
  Allow Privileged:                        false
  Allow Privilege Escalation:              false
  Default Add Capabilities:                <none>
  Required Drop Capabilities:              ALL
  Allowed Capabilities:                    <none>
  Allowed Volume Types:                    configMap,emptyDir,projected,secret,downwardAPI,persistentVolumeClaim,hostPath
  Allow Host Network:                      false
  Allow Host Ports:                        <none>
  Allow Host PID:                          false
  Allow Host IPC:                          false
  Read Only Root Filesystem:               false
  SELinux Context Strategy: RunAsAny       
    User:                                  <none>
    Role:                                  <none>
    Type:                                  <none>
    Level:                                 <none>
  Run As User Strategy: MustRunAsNonRoot   
    Ranges:                                <none>
  FSGroup Strategy: MustRunAs              
    Ranges:                                1-65535
  Supplemental Groups Strategy: MustRunAs  
    Ranges:                                1-65535

# kubectl describe podsecuritypolicy privileged
Name:  privileged

Settings:
  Allow Privileged:                       true
  Allow Privilege Escalation:             true
  Default Add Capabilities:               <none>
  Required Drop Capabilities:             <none>
  Allowed Capabilities:                   *
  Allowed Volume Types:                   *
  Allow Host Network:                     true
  Allow Host Ports:                       0-65535
  Allow Host PID:                         true
  Allow Host IPC:                         true
  Read Only Root Filesystem:              false
  SELinux Context Strategy: RunAsAny      
    User:                                 <none>
    Role:                                 <none>
    Type:                                 <none>
    Level:                                <none>
  Run As User Strategy: RunAsAny          
    Ranges:                               <none>
  FSGroup Strategy: RunAsAny              
    Ranges:                               <none>
  Supplemental Groups Strategy: RunAsAny  
    Ranges:                               <none>

The following rules have been added to the contour ClusterRole in order to grant access to ports etc.:

- apiGroups:
  - extensions
  resourceNames:
  - privileged
  resources:
  - podsecuritypolicies
  verbs:
  - use

In any case, the emptyDir volume type is allowed even without the privileged rules.

Hope this helps...

Thanks!

stevesloka commented 3 years ago

I just spun up a minikube cluster with PSP enabled and I had to change a few things to get this to work:

I didn't need to modify the ClusterRole as you did; it just worked without.

Which Helm chart did you use? I can try to recreate that. I never use the Helm chart, just the examples, but I want to double-check that setup (maybe it's different from the contour repo).

I can put together the files as well to avoid the initContainer, but wanted to double-check the helm chart bits.

vinzo99 commented 3 years ago

@stevesloka a few inputs:

I tried adding a securityContext block in the Envoy DaemonSet (btw I used runAsUser: 65534 instead of 1000; I believe 65534 is the right ID to choose if you want to preserve the same ID in the whole environment). No improvement.

Regarding the other suggestions: unfortunately I need the Envoy pods to act as listeners for the ingress controller, and therefore to listen on 80 / 443 on the edge nodes, which are now the defaults in the Contour charts. This is the very reason why we need Contour, btw. I believe what works for you in a single-node minikube system with non-root ports (switching to 8080 / 8443, removing hostPort) will not suit our configuration, unless I'm missing a point.

In order to achieve a configuration without a K8S LoadBalancer, with host networking (as explained here: https://projectcontour.io/docs/v1.11.0/deploy-options/#host-networking), I used the example charts provided here: https://github.com/projectcontour/contour/blob/release-1.11/examples/render/contour.yaml

I had to make slight modifications to the charts, such as:

Those modifications have been successfully tested on the same cluster, on Envoy/Contour versions released before the introduction of the new certgen process a few months back.

Thanks!

youngnick commented 3 years ago

Thanks @vinzo99. I still think it's likely that the files are not getting created properly by the bootstrap.

I thought of a way to check this, which is not ideal, but should work to let you check what the Envoy container is seeing.

In the Envoy daemonset, replace the command and args sections with this:

        args:
        - -c
        - "sleep 86400"
        command:
        - /bin/bash

This will just run a sleeping bash job instead of trying to run Envoy. Then you should be able to kubectl exec in and have a look around. (kubectl exec -t <envoy pod> -c envoy -- /bin/bash will get you an interactive shell.)

The things we need to know to find more about this are:

If the /config/resources directory is empty, then something is preventing the contour bootstrap command from outputting its files correctly. Can you check that the bootstrap container command and args look like this?

        args:
        - bootstrap
        - /config/envoy.json
        - --xds-address=contour
        - --xds-port=8001
        - --xds-resource-version=v3
        - --resources-dir=/config/resources
        - --envoy-cafile=/certs/ca.crt
        - --envoy-cert-file=/certs/tls.crt
        - --envoy-key-file=/certs/tls.key
        command:
        - contour

The key one is the --resources-dir arg; without it, the bootstrap won't attempt to create those files.

vinzo99 commented 3 years ago

Hi @youngnick, thanks for your suggestions!

I just replaced this part with the suggested one in the Envoy DaemonSet, in order to start a standard shell with a 24hr sleep instead of the envoy command, and get access to the container:

#      - args:
#        - -c
#        - /config/envoy.json
#        - --service-cluster $(CONTOUR_NAMESPACE)
#        - --service-node $(ENVOY_POD_NAME)
#        - --log-level info
#        command:
#        - envoy
      - args:
        - -c
        - "sleep 86400"
        command:
        - /bin/bash

The envoy pod starts. For some reason I am not able to log into the container (the kubectl command returns right away), but I can still run single commands, which basically show that resources/ has permission issues:

# kubectl -n vbertell exec -t envoy-8d86c -c envoy -- ls /config/resources
ls: cannot open directory '/config/resources': Permission denied
command terminated with exit code 2

# kubectl -n vbertell exec -t envoy-8d86c -c envoy -- ls -l /config
total 8
-rw-r--r--. 1 root root 1873 Feb 15 07:32 envoy.json
drwxr-x---. 3 root root 4096 Feb 15 07:32 resources

I am able to touch a file in /config/, which goes to show that the emptyDir volume behaves as expected:

# kubectl -n vbertell exec -t envoy-8d86c -c envoy -- touch /config/toto

# kubectl -n vbertell exec -t envoy-8d86c -c envoy -- ls -l /config
total 8
-rw-r--r--. 1 root  root  1873 Feb 15 07:32 envoy.json
drwxr-x---. 3 root  root  4096 Feb 15 07:32 resources
-rw-r--r--. 1 envoy envoy    0 Feb 15 07:53 toto

but not in /config/resources/:

# kubectl -n vbertell exec -t envoy-8d86c -c envoy -- touch /config/resources/toto
touch: cannot touch '/config/resources/toto': Permission denied
command terminated with exit code 1

I believe this directory is created by contour-certgen, right?

I also checked the bootstrap part, which seems correct:

      initContainers:
      - args:
        - bootstrap
        - /config/envoy.json
        - --xds-address=contour
        - --xds-port=8001
        - --xds-resource-version=v3
        - --resources-dir=/config/resources
        - --envoy-cafile=/certs/ca.crt
        - --envoy-cert-file=/certs/tls.crt
        - --envoy-key-file=/certs/tls.key
        command:
        - contour

Thanks!

youngnick commented 3 years ago

Thanks for that @vinzo99. I think you may have missed the -i on the kubectl exec command; that would give you an interactive shell (as opposed to just a terminal with -t). So it should be kubectl exec -it.

I'll check the permissions for the created directory; this sounds promising, in that it's something about the directory creation that's the problem.

Edit: Yes, I can see that this is an "envoy is not running as the root user" problem. The initContainer runs as root, but Envoy is running as the user envoy, which doesn't have access to the /config/resources directory. You can see this from the listing where you touch the toto file.

I'd rather not make the /config/resources directory world-readable, but is there any way you could make sure that the initContainer runs as the same user as the envoy container? I think that will get you working for now, while I look into the permissions.
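
A minimal sketch of what that could look like on the envoy-initconfig initContainer (the UID/GID values here are placeholders; use whatever non-root user your Envoy image actually runs as):

      initContainers:
      - name: envoy-initconfig
        # ... image, command, args and volumeMounts unchanged ...
        securityContext:
          runAsNonRoot: true
          runAsUser: 101    # placeholder: match your Envoy image's user
          runAsGroup: 101   # placeholder: match your Envoy image's group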

vinzo99 commented 3 years ago

Hi @youngnick!

I followed your suggestion and added a security context to the initContainer in order to launch it with the same user:group (envoy:envoy, 4444:4444) as the Envoy container itself. That did the trick for the permissions; /config/resources is now created with all the expected files in it:

# kubectl -n vbertell exec -it envoy-l8g7b -c envoy /bin/bash
[envoy@douzeasrclsuster-edge-02 /]$ cd /config/
[envoy@douzeasrclsuster-edge-02 config]$ ls -l
total 8
-rw-r--r--. 1 envoy envoy 1873 Feb 16 13:54 envoy.json
drwxr-x---. 3 envoy envoy 4096 Feb 16 13:54 resources
[envoy@douzeasrclsuster-edge-02 config]$ cd resources/
[envoy@douzeasrclsuster-edge-02 resources]$ ll
total 4
drwxr-x---. 2 envoy envoy 4096 Feb 16 13:54 sds
[envoy@douzeasrclsuster-edge-02 resources]$ cd sds/
[envoy@douzeasrclsuster-edge-02 sds]$ ll
total 8
-rw-r--r--. 1 envoy envoy 210 Feb 16 13:54 xds-tls-certificate.json
-rw-r--r--. 1 envoy envoy 209 Feb 16 13:54 xds-validation-context.json

Now the Envoy pod starts, but the containers on the edge node fail to bind to ports 80 and 443:

[2021-02-16 14:09:30.960][1][warning][config] [source/common/config/grpc_subscription_impl.cc:107] gRPC config for type.googleapis.com/envoy.config.listener.v3.Listener rejected: Error adding/updating listener(s) ingress_http: cannot bind '0.0.0.0:80': Permission denied
ingress_https: cannot bind '0.0.0.0:443': Permission denied

I remember @stevesloka suggested removing the hostPorts in the DaemonSet, which I tried; I still get the error.

Here is the netstat output for envoy on the edge node:

# netstat -anp|grep envoy
tcp        0      0 0.0.0.0:8002            0.0.0.0:*               LISTEN      31621/envoy         
tcp        0      0 127.0.0.1:9001          0.0.0.0:*               LISTEN      31621/envoy         
tcp        0      0 127.0.0.1:43866         127.0.0.1:9001          ESTABLISHED 31621/envoy         
tcp        0      0 139.54.131.84:46454     10.254.91.148:8001      ESTABLISHED 31621/envoy         
tcp        0      0 127.0.0.1:9001          127.0.0.1:43866         ESTABLISHED 31621/envoy         
tcp        0      0 127.0.0.1:9001          127.0.0.1:43896         ESTABLISHED 31621/envoy         
tcp        0      0 127.0.0.1:43896         127.0.0.1:9001          ESTABLISHED 31621/envoy         
unix  2      [ ]         DGRAM                    507776613 31621/envoy          @envoy_domain_socket_parent_0@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
unix  2      [ ]         DGRAM                    507776612 31621/envoy          @envoy_domain_socket_child_0@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

Any hints on this? FYI, since the Envoy pod is now started, I am able to log into the Envoy container, which should make troubleshooting a lot easier.

Thanks!

stevesloka commented 3 years ago

Hey @vinzo99, after taking out the hostPort references, you'll need to also edit the Contour deployment to remove the two args which tell Envoy to bind to ports 80/443; otherwise it will still try.

So your steps are:

  1. Remove host ports
  2. Edit contour deployment to remove 2 args
  3. Change the service to use targetPorts of 8080 & 8443 (the defaults); see the sketch below the list
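
A rough sketch of step 3, mirroring the envoy Service from the quickstart manifests (adapt the name, namespace, and Service type to your chart). For step 2, the two Contour deployment args in question are the ones that set the Envoy listener ports, which in the example deployment are, if I recall correctly, --envoy-service-http-port=80 and --envoy-service-https-port=443 (double-check against your own deployment):

apiVersion: v1
kind: Service
metadata:
  name: envoy
  namespace: projectcontour
spec:
  type: LoadBalancer
  selector:
    app: envoy
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 8080   # Envoy's default insecure listener port
  - name: https
    port: 443
    protocol: TCP
    targetPort: 8443   # Envoy's default secure listener port
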
vinzo99 commented 3 years ago

Hi @stevesloka!

Like I said, I have no other choice but to use ports 80 and 443, using Contour+Envoy as an ingress controller without a K8S LoadBalancer.

I still get Permission denied when binding to 80 and 443 on the edge node:

[2021-02-17 11:22:09.582][1][warning][config] [source/common/config/grpc_subscription_impl.cc:107] gRPC config for type.googleapis.com/envoy.config.listener.v3.Listener rejected: Error adding/updating listener(s) ingress_http: cannot bind '0.0.0.0:80': Permission denied
ingress_https: cannot bind '0.0.0.0:443': Permission denied

Just to clear any doubts, I installed an older working version (Envoy 1.14.1 + Contour 1.4.0) on the same cluster; there is no binding issue, as you can see here:

# netstat -anp|grep envoy
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      14612/envoy         
tcp        0      0 0.0.0.0:443             0.0.0.0:*               LISTEN      14612/envoy         
tcp        0      0 0.0.0.0:8002            0.0.0.0:*               LISTEN      14612/envoy         
tcp        0      0 127.0.0.1:9001          0.0.0.0:*               LISTEN      14612/envoy         
tcp        0      0 139.54.131.84:45470     10.254.223.244:8001     ESTABLISHED 14612/envoy         
tcp        0      0 127.0.0.1:42434         127.0.0.1:9001          ESTABLISHED 14612/envoy         
tcp        0      0 127.0.0.1:9001          127.0.0.1:42434         ESTABLISHED 14612/envoy         
unix  2      [ ]         DGRAM                    509537593 14612/envoy          @envoy_domain_socket_parent_0@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
unix  2      [ ]         DGRAM                    509537592 14612/envoy          @envoy_domain_socket_child_0@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

I compared the configurations between the old and current charts; apart from a few unrelated changes (cacert, adding metrics, etc.) they are all the same regarding ports.

I believe something else is preventing Envoy from binding to 80 and 443 in this specific configuration. Not sure if this is related to the security context added as a workaround, as suggested by @youngnick, to execute the initContainer as the envoy user.

Thanks!

youngnick commented 3 years ago

Not being able to bind ports 80 and 443 is either going to be related to the security context, or something weird going on with the hostPort thing. If you can post your Envoy DaemonSet YAML, we can take a look; without that, I'm not sure how much more we will be able to help.

vinzo99 commented 3 years ago

@youngnick sure!

The configuration is based on the template https://github.com/projectcontour/contour/blob/release-1.11/examples/render/contour.yaml, plus the following:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: envoy
  name: envoy
#  namespace: projectcontour
  namespace: {{ .Release.Namespace }}
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 10%
  selector:
    matchLabels:
      app: envoy
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8002"
        prometheus.io/path: "/stats/prometheus"
      labels:
        app: envoy
    spec:
      containers:
      - command:
        - /bin/contour
        args:
          - envoy
          - shutdown-manager
#        image: docker.io/projectcontour/contour:v1.11.0
        image: {{ .Values.global.registry1 }}/projectcontour/contour:{{ .Values.contour }}
        imagePullPolicy: IfNotPresent
        lifecycle:
          preStop:
            exec:
              command:
                - /bin/contour
                - envoy
                - shutdown
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8090
          initialDelaySeconds: 3
          periodSeconds: 10
        name: shutdown-manager
      - args:
        - -c
        - /config/envoy.json
        - --service-cluster $(CONTOUR_NAMESPACE)
        - --service-node $(ENVOY_POD_NAME)
        - --log-level info
        command:
        - envoy
#        image: docker.io/envoyproxy/envoy:v1.16.2
        image: {{ .Values.global.registry }}/{{ .Values.imageRepo }}:{{ .Values.imageTag }}-ic
        imagePullPolicy: IfNotPresent
        name: envoy
        env:
        - name: CONTOUR_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: ENVOY_POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        ports:
        - containerPort: 80
          hostPort: 80
          name: http
          protocol: TCP
        - containerPort: 443
          hostPort: 443
          name: https
          protocol: TCP
        readinessProbe:
          httpGet:
            path: /ready
            port: 8002
          initialDelaySeconds: 3
          periodSeconds: 4
        volumeMounts:
          - name: envoy-config
            mountPath: /config
          - name: envoycert
            mountPath: /certs
        lifecycle:
          preStop:
            httpGet:
              path: /shutdown
              port: 8090
              scheme: HTTP
      initContainers:
      - args:
        - bootstrap
        - /config/envoy.json
        - --xds-address=contour
        - --xds-port=8001
        - --xds-resource-version=v3
        - --resources-dir=/config/resources
        - --envoy-cafile=/certs/ca.crt
        - --envoy-cert-file=/certs/tls.crt
        - --envoy-key-file=/certs/tls.key
        command:
        - contour
#        image: docker.io/projectcontour/contour:v1.11.0
        image: {{ .Values.global.registry1 }}/projectcontour/contour:{{ .Values.contour }}
        imagePullPolicy: IfNotPresent
        name: envoy-initconfig
        volumeMounts:
        - name: envoy-config
          mountPath: /config
        - name: envoycert
          mountPath: /certs
          readOnly: true
        env:
        - name: CONTOUR_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
##### DEBUG
        securityContext:
          runAsNonRoot: true
          runAsUser: 4444
          runAsGroup: 4444
##### /DEBUG

      automountServiceAccountToken: false
      serviceAccountName: envoy
      terminationGracePeriodSeconds: 300
      volumes:
        - name: envoy-config
          emptyDir: {}
        - name: envoycert
          secret:
            secretName: envoycert
      restartPolicy: Always
############# ADDON
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
# see https://projectcontour.io/docs/v1.11.0/deploy-options/#host-networking
      nodeSelector: {is_edge: 'true'}
      tolerations:
      - key: 'is_edge'
        operator: 'Equal'
        value: 'true'
        effect: 'NoExecute'
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: contour
              topologyKey: kubernetes.io/hostname
############# /ADDON

I also suspect the issue is related to the security context workaround.

Thanks!

youngnick commented 3 years ago

Ah, I think that you may need to actually tell the security context that the pod needs to bind to low ports. This can be done either by having the pod run as privileged, or by adding the CAP_NET_BIND_SERVICE capability:

        securityContext:
          runAsNonRoot: true
          runAsUser: 4444
          runAsGroup: 4444
          capabilities:
            add:
            - NET_BIND_SERVICE

When you are not running as root, you can't bind to ports below 1024 without that capability.

Note that the PSP you've set up will permit this binding, because it allows all capabilities, but it won't add them for you. That's what the securityContext does.

Edit: A great gotcha with capabilities is that you have to drop the CAP_ prefix from the name used everywhere else when you refer to them in Kubernetes config.

vinzo99 commented 3 years ago

Hi @youngnick

Thanks for your suggestion. I tried it, but unfortunately I still get the same error.

I guess the capability needs to be added to the envoy container, which is the one trying to bind 80 and 443, and not to envoy-initconfig, which is run as 4444. I tried setting a second securityContext at the envoy container level with NET_BIND_SERVICE; same error.

I also tried setting a global securityContext for the whole DaemonSet (which should apply to all containers); same result.

Since this issue appears to come from the securityContext workaround, maybe we can try a different approach: do you have any hint on a fix that would make envoy-initconfig create /config/resources with sufficient permissions, and allow us to run the containers as 65534 like we used to? In our process we might not know the envoy user ID when we generate the Helm charts anyway, so the securityContext solution is OK as a workaround but not for production.

Thanks!

youngnick commented 3 years ago

We can change the contour bootstrap command to create the /config/resources directory as 777, which should solve your problem, I think. Normally I'd be concerned about setting secret-holding directories to that mode, but in this case the actual files in that directory are pointers to the actual secrets (which are mounted in from Kubernetes Secrets). So I think it should be okay.
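
For illustration only (this is an approximation from memory, not the literal output of contour bootstrap; the resource name and exact layout may differ): the generated SDS file is a small JSON document that just points Envoy at the certificate files already mounted from the envoycert Secret, roughly:

{
  "resources": [
    {
      "@type": "type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.Secret",
      "name": "contour_xds_tls_certificate",
      "tls_certificate": {
        "certificate_chain": { "filename": "/certs/tls.crt" },
        "private_key": { "filename": "/certs/tls.key" }
      }
    }
  ]
}

So opening up the directory mode wouldn't expose any key material beyond what is already readable from the mounted Secret volume.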

The change should definitely explain why we do that and refer back to this issue though.

abhide commented 3 years ago

@youngnick I have a draft PR out (https://github.com/projectcontour/contour/pull/3390). I will test out the changes with a local build and update here in a day or two.

benmoss commented 3 years ago

Yep, I'm also running into this issue, because I'm using the https://hub.docker.com/r/bitnami/envoy image instead of the envoyproxy/envoy one, and so my envoy doesn't run as root.

vinzo99 commented 3 years ago

Thanks! The fix being available in the main branch implies it should also be in the next freeze, release-1.14, right?

sunjayBhatia commented 3 years ago

Thanks! The fix being available in the main branch implies it should also be in the next freeze, release-1.14, right?

Yes, that is correct.