While investigating a test flake, I noticed that the cluster was stuck in the warming phase. Normally, it exits this state after 15 seconds (the default initial_fetch_timeout), but the test had an Eventually condition set to 10 seconds. During the investigation, I identified an issue
Kuma responds with a CDS version: ee6f90cd-5733-46d1-8c88-c26edec2fbcd at [2024-11-22 14:29:06.299].
The snapshot CDS/EDS/LDS version changes at [2024-11-22 14:29:06.313].
Envoy requests EDS at [2024-11-22 14:29:06.488], but by this time, the version of CDS/EDS has changed(step 2). As a result, the snapshot contains a resource that was not requested by Envoy. The go-control-planevalidates that all resources in a snapshot are requested, leading to the observed issue.
ADS mode: not responding to request type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment[demo-client]: "meshretry-http_test-server__kuma-3_msvc_80" not listed```
Logs from control-plane and Envoy
[2024-11-22 14:29:06.001] INFO xds.reconcile config has changed {"proxyName": "demo-client", "mesh": "meshretry-http", "versions": ["ee6f90cd-5733-46d1-8c88-c26edec2fbcd", "4ad0b957-601c-48c0-8a1a-4834e5355bf1", "ed1feeae-8ab0-48db-bc3d-06333c599fc5"]}
[2024-11-22 14:29:06.299] INFO xds.status-tracker config requested {"proxyName": "demo-client", "mesh": "meshretry-http", "streamID": 26, "type": "Listener", "resourceVersion": "", "resourceNames": [], "subscriptionID": "a1987ec2-d9fd-49d7-b317-2b47e01d9f1a", "nonce": ""}
[2024-11-22 14:29:06.299] DEBUG xds.server nodeID "meshretry-http.demo-client" requested type.googleapis.com/envoy.config.listener.v3.Listener[] and known map[]. Diff []
[2024-11-22 14:29:06.299] DEBUG xds.server respond type.googleapis.com/envoy.config.listener.v3.Listener[] version "" with version "ed1feeae-8ab0-48db-bc3d-06333c599fc5"
[2024-11-22 14:29:06.299] DEBUG xds.status-tracker config accepted {"proxyName": "demo-client", "mesh": "meshretry-http", "streamID": 26, "type": "Cluster", "resourceVersion": "9a918deb-57db-4d48-89c9-3a2d4b099a5a", "resourceNames": [], "subscriptionID": "a1987ec2-d9fd-49d7-b317-2b47e01d9f1a", "nonce": "1"}
[2024-11-22 14:29:06.299] DEBUG xds.status-tracker config sent {"proxyName": "demo-client", "mesh": "meshretry-http", "streamID": 26, "type": "Listener", "resourceVersion": "ed1feeae-8ab0-48db-bc3d-06333c599fc5", "requestedResourceNames": [], "resourceCount": 8, "subscriptionID": "a1987ec2-d9fd-49d7-b317-2b47e01d9f1a", "nonce": "2"}
[2024-11-22 14:29:06.299] DEBUG xds.server nodeID "meshretry-http.demo-client" requested type.googleapis.com/envoy.config.cluster.v3.Cluster[] and known map[]. Diff []
[2024-11-22 14:29:06.299] DEBUG xds.server respond type.googleapis.com/envoy.config.cluster.v3.Cluster[] version "9a918deb-57db-4d48-89c9-3a2d4b099a5a" with version "ee6f90cd-5733-46d1-8c88-c26edec2fbcd"
[2024-11-22 14:29:06.313] INFO xds.reconcile config has changed {"proxyName": "demo-client", "mesh": "meshretry-http", "versions": ["735f01b7-ed21-4247-bf0d-abace073d887", "3cff12ab-9568-4b44-9710-a648f16c013e", "68dec4d0-291d-4d51-b29f-cfc833302e96"]}
[2024-11-22 14:29:06.484][132][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:857] add/update cluster demo-client starting warming
[2024-11-22 14:29:06.486] DEBUG xds.status-tracker config accepted {"proxyName": "demo-client", "mesh": "meshretry-http", "streamID": 26, "type": "Listener", "resourceVersion": "ed1feeae-8ab0-48db-bc3d-06333c599fc5", "resourceNames": [], "subscriptionID": "a1987ec2-d9fd-49d7-b317-2b47e01d9f1a", "nonce": "2"}
[2024-11-22 14:29:06.487] DEBUG xds.server nodeID "meshretry-http.demo-client" requested type.googleapis.com/envoy.config.listener.v3.Listener[] and known map[]. Diff []
[2024-11-22 14:29:06.487] DEBUG xds.server respond type.googleapis.com/envoy.config.listener.v3.Listener[] version "ed1feeae-8ab0-48db-bc3d-06333c599fc5" with version "68dec4d0-291d-4d51-b29f-cfc833302e96"
[2024-11-22 14:29:06.488] INFO xds.status-tracker config requested {"proxyName": "demo-client", "mesh": "meshretry-http", "streamID": 26, "type": "ClusterLoadAssignment", "resourceVersion": "", "resourceNames": ["demo-client"], "subscriptionID": "a1987ec2-d9fd-49d7-b317-2b47e01d9f1a", "nonce": ""}
[2024-11-22 14:29:06.488] DEBUG xds.server nodeID "meshretry-http.demo-client" requested type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment[demo-client] and known map[]. Diff [demo-client]
[2024-11-22 14:29:06.488] DEBUG xds.server ADS mode: not responding to request type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment[demo-client]: "meshretry-http_test-server__kuma-3_msvc_80" not listed
[2024-11-22 14:29:06.488] DEBUG xds.status-tracker config sent {"proxyName": "demo-client", "mesh": "meshretry-http", "streamID": 26, "type": "Listener", "resourceVersion": "68dec4d0-291d-4d51-b29f-cfc833302e96", "requestedResourceNames": [], "resourceCount": 9, "subscriptionID": "a1987ec2-d9fd-49d7-b317-2b47e01d9f1a", "nonce": "4"}
User impact
A user might notice a delay in endpoint updates. When a new service is added, it can take approximately 15 seconds for the change to be reflected, along with any other changes in the cluster.
What happened?
While investigating a test flake, I noticed that the cluster was stuck in the warming phase. Normally, it exits this state after 15 seconds (the default
initial_fetch_timeout
), but the test had an Eventually condition set to 10 seconds. During the investigation, I identified an issueee6f90cd-5733-46d1-8c88-c26edec2fbcd
at [2024-11-22 14:29:06.299].go-control-plane
validates that all resources in a snapshot are requested, leading to the observed issue.Logs from control-plane and Envoy
User impact
A user might notice a delay in endpoint updates. When a new service is added, it can take approximately 15 seconds for the change to be reflected, along with any other changes in the cluster.