istio / istio

Connect, secure, control, and observe services.
https://istio.io
Apache License 2.0

Crash and Failure to Restart East West Gateway in Istio Multicluster Setup Due to Duplicate topology.istio.io/cluster Label Values in destinationRules #50353

Closed omni52 closed 4 months ago

omni52 commented 5 months ago

Bug Description

Environment: Istio Multicluster Setup

Issue: In an Istio multicluster setup, creating DestinationRules whose subsets reference the topology.istio.io/cluster label can crash the East West Gateway if two subsets inadvertently carry the same topology.istio.io/cluster value. The crash occurs some time after the rule is applied and is accompanied by a cryptic message saying that config was received from the XDS server "but was rejected". Redeploying the East West Gateway does not help: it fails to start until the problematic DestinationRule is removed. The root cause is not obvious from the error, which makes the problem hard for operators to diagnose and fix.

Symptoms:

  1. East West Gateway crashes with a vague error message related to an update rejection.
  2. The gateway fails to restart post-collision, necessitating the removal of the offending DestinationRule to resume normal operation.
  3. The envoy_lds_update_rejected and envoy_cds_update_rejected metrics for job=istio-eastwestgateway indicate the presence of the problematic artifact.

Expected Behavior: Istio should proactively detect and prevent the creation or updating of DestinationRules with duplicate topology.istio.io/cluster label values, thereby avoiding crashes and restart failures of the East West Gateway.

Steps to Reproduce:

  1. Set up an Istio multicluster environment.
  2. Create DestinationRules with subsets referencing the topology.istio.io/cluster label.
  3. Inadvertently set duplicate values for the topology.istio.io/cluster label in these rules.

Suggested Resolution: Implement validation checks at the time of DestinationRule creation or update to identify and prevent the use of duplicate topology.istio.io/cluster label values. Improve the error messages so that they clearly identify the cause when config is rejected because of a DestinationRule configuration problem.
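
As a rough illustration of the kind of check being requested, the following is a minimal Python sketch that scans a DestinationRule spec (already parsed into a dict) for subsets that repeat a name or an identical label set. It is not Istio's validation code; the function name `find_duplicate_subsets` is hypothetical, and whether a repeated label set on its own should be treated as invalid is an assumption made here for illustration.

```python
from collections import Counter


def find_duplicate_subsets(spec):
    """Return (duplicate subset names, duplicate label sets) for a DestinationRule spec dict."""
    subsets = spec.get("subsets") or []
    # Count how often each subset name occurs.
    name_counts = Counter(str(s.get("name", "")) for s in subsets)
    # Count how often each exact label set occurs (order-independent).
    label_counts = Counter(
        tuple(sorted((s.get("labels") or {}).items())) for s in subsets
    )
    dup_names = sorted(n for n, c in name_counts.items() if c > 1)
    dup_labels = [dict(labels) for labels, c in label_counts.items() if c > 1]
    return dup_names, dup_labels


# Hypothetical input: two subsets with the same name and the same cluster label.
spec = {
    "host": "*.foobar.svc.cluster.local",
    "subsets": [
        {"name": "cluster-a", "labels": {"topology.istio.io/cluster": "cluster-a"}},
        {"name": "cluster-a", "labels": {"topology.istio.io/cluster": "cluster-a"}},
    ],
}
dup_names, dup_labels = find_duplicate_subsets(spec)
print("duplicate names:", dup_names)
print("duplicate label sets:", dup_labels)
```

A check like this could run in a CI step or pre-apply hook, so that an offending rule is caught before it reaches the cluster.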

Additional Context: This issue is critical as it not only disrupts the normal operation of the East West Gateway but also hampers the ability to quickly diagnose and resolve the configuration error. Providing a more robust validation mechanism and clearer error messages will significantly improve the operator experience and stability of Istio multicluster setups.

Version

$ istioctl version
# we run multiple versions of Istio (but the bug exists in all of them, and also when we run only one)
client version: 1.18.5
pilot version: 1.19.8
istiod version: 1.20.4
istiod version: 1.18.6
$ kubectl version
Server Version: 1.26.8

howardjohn commented 5 months ago

Having the actual logs and the DestinationRule would be pretty useful for resolving this.

omni52 commented 5 months ago

Hi @howardjohn, no problem. I searched Loki for the period in which we discovered the issue. The logs on the EW GW looked like this:

2024-03-23T22:59:47.417470Z warn    Envoy proxy is NOT ready: config received from XDS server, but was rejected: cds updates: 417 successful, 0 rejected; lds updates: 0 successful, 419 rejected
2024-03-23T22:59:47.187699Z warn    Envoy proxy is NOT ready: config received from XDS server, but was rejected: cds updates: 417 successful, 0 rejected; lds updates: 0 successful, 419 rejected
2024-03-23T22:59:47.009448Z warn    Envoy proxy is NOT ready: config received from XDS server, but was rejected: cds updates: 415 successful, 0 rejected; lds updates: 0 successful, 417 rejected
2024-03-23T22:59:34.924871Z warn    Envoy proxy is NOT ready: config received from XDS server, but was rejected: cds updates: 417 successful, 0 rejected; lds updates: 0 successful, 419 rejected
2024-03-23T22:59:32.424731Z warn    Envoy proxy is NOT ready: config received from XDS server, but was rejected: cds updates: 417 successful, 0 rejected; lds updates: 0 successful, 419 rejected
2024-03-23T22:59:32.196816Z warn    Envoy proxy is NOT ready: config received from XDS server, but was rejected: cds updates: 417 successful, 0 rejected; lds updates: 0 successful, 419 rejected
2024-03-23T22:59:32.023010Z warn    Envoy proxy is NOT ready: config received from XDS server, but was rejected: cds updates: 415 successful, 0 rejected; lds updates: 0 successful, 417 rejected
2024-03-23T22:59:17.420138Z warn    Envoy proxy is NOT ready: config received from XDS server, but was rejected: cds updates: 417 successful, 0 rejected; lds updates: 0 successful, 419 rejected
2024-03-23T22:59:17.169451Z warn    Envoy proxy is NOT ready: config received from XDS server, but was rejected: cds updates: 417 successful, 0 rejected; lds updates: 0 successful, 419 rejected
2024-03-23T22:59:17.029038Z warn    Envoy proxy is NOT ready: config received from XDS server, but was rejected: cds updates: 415 successful, 0 rejected; lds updates: 0 successful, 417 rejected
2024-03-23T22:59:14.800518Z warn    Envoy proxy is NOT ready: config received from XDS server, but was rejected: cds updates: 415 successful, 0 rejected; lds updates: 0 successful, 417 rejected
2024-03-23T22:59:02.406210Z warn    Envoy proxy is NOT ready: config received from XDS server, but was rejected: cds updates: 417 successful, 0 rejected; lds updates: 0 successful, 419 rejected
2024-03-23T22:59:02.148698Z warn    Envoy proxy is NOT ready: config received from XDS server, but was rejected: cds updates: 417 successful, 0 rejected; lds updates: 0 successful, 419 rejected
2024-03-23T22:59:02.018430Z warn    Envoy proxy is NOT ready: config received from XDS server, but was rejected: cds updates: 415 successful, 0 rejected; lds updates: 0 successful, 417 rejected
2024-03-23T22:58:47.409347Z warn    Envoy proxy is NOT ready: config received from XDS server, but was rejected: cds updates: 417 successful, 0 rejected; lds updates: 0 successful, 419 rejected
2024-03-23T22:58:47.145026Z warn    Envoy proxy is NOT ready: config received from XDS server, but was rejected: cds updates: 417 successful, 0 rejected; lds updates: 0 successful, 419 rejected
2024-03-23T22:58:47.017729Z warn    Envoy proxy is NOT ready: config received from XDS server, but was rejected: cds updates: 415 successful, 0 rejected; lds updates: 0 successful, 417 rejected

and sometimes something like this pops up:

2024-03-23T23:01:19.668119Z warning envoy config external/envoy/source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:138    gRPC config for type.googleapis.com/envoy.config.listener.v3.Listener rejected: Error adding/updating listener(s) 0.0.0.0_15443: error adding listener '0.0.0.0:15443': filter chain '' has the same matching rules defined as ''
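
For anyone triaging similar logs, the readiness warnings above can be reduced to their CDS/LDS counters to see at a glance that every listener update is being rejected while cluster updates succeed. The following is a small Python sketch for that; the regular expression is derived only from the log lines quoted in this thread, and the helper name `parse_xds_counts` is hypothetical.

```python
import re

# Pattern taken from the "Envoy proxy is NOT ready" lines quoted in this thread.
READINESS_RE = re.compile(
    r"cds updates: (\d+) successful, (\d+) rejected; "
    r"lds updates: (\d+) successful, (\d+) rejected"
)


def parse_xds_counts(line):
    """Extract CDS/LDS success and rejection counts from a readiness warning, or None."""
    match = READINESS_RE.search(line)
    if match is None:
        return None
    cds_ok, cds_rejected, lds_ok, lds_rejected = map(int, match.groups())
    return {
        "cds": {"successful": cds_ok, "rejected": cds_rejected},
        "lds": {"successful": lds_ok, "rejected": lds_rejected},
    }


line = (
    "2024-03-23T22:59:47.417470Z warn    Envoy proxy is NOT ready: "
    "config received from XDS server, but was rejected: "
    "cds updates: 417 successful, 0 rejected; "
    "lds updates: 0 successful, 419 rejected"
)
counts = parse_xds_counts(line)
print(counts)
```

A result with `lds.successful == 0` and a growing `lds.rejected` points at a listener-level problem, which matches the listener error on 0.0.0.0:15443 quoted above.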

A DestinationRule that triggers the crash looks like this:

spec:
  exportTo:
  - '*'
  host: '*.foobar.svc.cluster.local'
  subsets:
  - labels:
      topology.istio.io/cluster: cluster-a
    name: cluster-a
  - labels:
      topology.istio.io/cluster: cluster-a
    name: cluster-a

Feel free to ask if you need more information. Regards, uli

howardjohn commented 5 months ago

A few more questions: you have multiple versions; is this from Istiod 1.20?

Is 15443 an AUTO_PASSTHROUGH gateway? What is the Gateway spec?

omni52 commented 5 months ago

Hi, no problem. The gateway is defined as you already guessed:

spec:
  selector:
    istio: eastwestgateway
  servers:
  - hosts:
    - '*.local'
    port:
      name: tls
      number: 15443
      protocol: TLS
    tls:
      mode: AUTO_PASSTHROUGH

We have three istiod / pilots in a parallel setup: 1.18.6, 1.19.8 and 1.20.4. The Gateway's Pod is running 1.20.4 and is connected to an istiod (pilot) of the same version, 1.20.4.

We use the official distroless images for all of them, except for 1.18.6, where we need to debug legacy issues.

We are currently in the upgrade process (1.19.9, 1.20.5 and 1.21.1), so maybe I can retest the issue on 1.21.1 in the next few days. But I didn't find anything about this issue in the latest release notes, so I'm quite sure the behaviour still exists.
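
One plausible reading of the `filter chain '' has the same matching rules defined as ''` error, given the AUTO_PASSTHROUGH Gateway above: if the listener on 15443 gets one SNI-matched filter chain per subset, two subsets with the same name would produce two chains with the same match. The Python sketch below only illustrates that collision. It assumes the SNI follows Istio's `outbound_.<port>_.<subset>_.<host>` naming convention; the service host and port are hypothetical, and this is not how istiod actually builds listeners.

```python
def passthrough_sni(port, subset, host):
    """Build an SNI string following the outbound_.<port>_.<subset>_.<host> convention."""
    return f"outbound_.{port}_.{subset}_.{host}"


def colliding_matches(subset_names, port, host):
    """Return SNI values that would be generated more than once for the given subsets."""
    seen = set()
    collisions = []
    for name in subset_names:
        sni = passthrough_sni(port, name, host)
        if sni in seen:
            collisions.append(sni)
        seen.add(sni)
    return collisions


# Hypothetical service and port; subset names taken from the DestinationRule in this thread.
collisions = colliding_matches(
    ["cluster-a", "cluster-a"], 8080, "web.foobar.svc.cluster.local"
)
print(collisions)
```

With unique subset names the list is empty, which is consistent with the gateway recovering once the offending DestinationRule is removed.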

omni52 commented 5 months ago

The behaviour still exists in 1.21.1, except that the East West Gateway pod no longer hangs during the restart. It now ends up in CrashLoopBackOff, and the last log lines are just some infos and warnings:

2024-04-12T05:42:12.525412Z     warn    Envoy proxy is NOT ready: config received from XDS server, but was rejected: cds updates: 1 successful, 0 rejected; lds updates: 0 successful, 1 rejected
2024-04-12T05:42:12.611462Z     info    Agent draining Proxy for termination
2024-04-12T05:42:12.611449Z     info    Status server has successfully terminated
2024-04-12T05:42:12.616383Z     info    Graceful termination period is 5s, starting...
2024-04-12T05:42:17.618282Z     info    Graceful termination period complete, terminating remaining proxies.
2024-04-12T05:42:17.618354Z     warn    Aborting proxy
2024-04-12T05:42:17.618508Z     info    Envoy aborted normally
2024-04-12T05:42:17.618519Z     warn    Aborted proxy instance
2024-04-12T05:42:17.618526Z     info    Agent has successfully terminated

howardjohn commented 5 months ago

The change is probably from the new startupProbe, so it terminates if it cannot start for a while (I think 10 min).

howardjohn commented 5 months ago

I can reproduce this only if I delete my validation webhook. This is supposed to be rejected by validation; the subset config is illegal.

omni52 commented 4 months ago

I think we can close this; the issue seems to be fixed in 1.22. Thanks a lot.

Added validation checks to reject DestinationRules with duplicate subset names.