kumahq / kuma

🐻 The multi-zone service mesh for containers, Kubernetes and VMs. Built with Envoy. CNCF Sandbox Project.
https://kuma.io/install
Apache License 2.0

The universal DP is delayed to generate the XDS configuration #11135

Open Icarus9913 opened 3 weeks ago

Icarus9913 commented 3 weeks ago

What happened?

While trying to figure out how the XDS Endpoints data is generated, I found that XDS configuration generation is delayed for universal DPs.

According to the XDS callback chain, XDS generation for a universal DP happens in the gRPC stream's OnStreamRequest callback.

In short, the callback functions are called one by one, in order:

callback {
    ...
    function dataplane_metadata_tracker          // store the DP metadata in cache
    function goroutine dataplane_sync_tracker    // watchdog ontick reconciler to generate XDS configs
    function dataplane_lifecycle                 // create the dataplane object
    ...
}

As the pseudocode shows, the second step runs in a goroutine but relies on the Dataplane object that the third step creates. If the second step tries to generate the XDS configs before the Dataplane exists, it has to wait for the next reconciler tick to do so. (PS: the interval is defined by the environment variable KUMA_XDS_SERVER_DATAPLANE_CONFIGURATION_REFRESH_INTERVAL.)

Actually, we could generate the XDS configuration from metadata.Resource, which is the same resource used to create the Dataplane at the third step. Also, note that the function GetDataplaneResource is never called.
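A rough sketch of that fallback idea, in Go. Everything here is an assumption for illustration: `dataplane`, `metadata`, `store`, and `resolve` are simplified, hypothetical stand-ins and not Kuma's actual types or functions. The point is only that the reconciler could fall back to the resource carried in the metadata when the store lookup misses.

```go
package main

import "fmt"

// dataplane is a hypothetical, simplified Dataplane resource.
type dataplane struct{ Mesh, Name string }

// metadata is a hypothetical stand-in for the DP metadata cached by the
// first callback; Resource is the Dataplane the DP sent, if any.
type metadata struct {
	Resource *dataplane
}

// store is a hypothetical stand-in for the resource store, keyed by "mesh/name".
type store map[string]*dataplane

// resolve picks the Dataplane to generate XDS config from and reports its
// source: the store if the object exists, otherwise the metadata resource.
func resolve(s store, md metadata, key string) (*dataplane, string) {
	if dp, ok := s[key]; ok {
		return dp, "store"
	}
	if md.Resource != nil {
		return md.Resource, "metadata"
	}
	return nil, "none"
}

func main() {
	const key = "default/dp-echo-1"
	md := metadata{Resource: &dataplane{Mesh: "default", Name: "dp-echo-1"}}
	s := store{}

	// Before the lifecycle callback has created the object.
	_, src := resolve(s, md, key)
	fmt.Println("before create:", src)

	// After the lifecycle callback has created the object.
	s[key] = md.Resource
	_, src = resolve(s, md, key)
	fmt.Println("after create:", src)
}
```

With a fallback like this, the first tick would not need to wait for the Dataplane object to appear in the store.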

Icarus9913 commented 3 weeks ago

I also tested it in my local environment.

2024-08-19T01:23:46.141+0800    INFO    xds.dataplaneSyncTracker    sync/dataplane_watchdog.go:160  Dataplane object not found. Can't regenerate XDS configuration. It's expected during Kubernetes namespace termination. If it persists it's a bug.   {"key": {"Mesh":"default","Name":"dp-echo-1"}}

This log was emitted by the second step. Once the third step's callback function had been called, everything worked as expected.

lukidzi commented 3 weeks ago

Triage: Let's reverse the order and create the Dataplane before the sync call