pdettori closed this issue 1 year ago
I am attaching the full log from the syncer: syncer.log
Tested on main, the build provides this git version:
Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.3+kcp-v0.10.0-549-g786ebb892ad93e", GitCommit:"786ebb89", GitTreeState:"clean", BuildDate:"2023-01-25T15:05:23Z", GoVersion:"go1.19.2", Compiler:"gc", Platform:"darwin/arm64"}
I get the same behavior; this time the deployment is deleted after 37 s. Attaching the full syncer log: syncer-main.log
Steps I used to test from main:
$ kind delete cluster --name control && kind create cluster --name control
$ git log
commit 786ebb892ad93ea9228cc736690cbebc642b1fd9 (HEAD -> main, upstream/main)
$ make build
$ PATH=$(pwd)/bin:$PATH # I have also removed previous kcp binaries from /usr/local/bin
$ brew install ko
$ ko build --local --push=false -B ./cmd/syncer -t $(git rev-parse --short HEAD)
$ kind load docker-image ko.local/syncer:$(git rev-parse --short HEAD) --name control
$ kcp start
# on another terminal
$ export KUBECONFIG=$(pwd)/.kcp/admin.kubeconfig
$ kubectl kcp workload sync control --syncer-image ko.local/syncer:$(git rev-parse --short HEAD) -o ${HOME}/syncer-control.yaml
# now start the port forwarder on another terminal
$ SRV_IP=<server IP>
$ ssh -L ${SRV_IP}:9443:${SRV_IP}:6443 -N 127.0.0.1
# now edit the ${HOME}/syncer-control.yaml to change port to 9443 and start the syncer
$ KUBECONFIG=</path/to/kind/kubeconfig> kubectl apply -f ${HOME}/syncer-control.yaml
# bind compute
$ kubectl kcp bind compute root
# deploy test deployment
$ kubectl create deployment --image=gcr.io/kuar-demo/kuard-amd64:blue --port=8080 kuard
# verify started
$ kubectl get deployments
$ KUBECONFIG=</path/to/kind/kubeconfig> kubectl get deployments -A | grep kuard
At this point, break the connection and verify that the deployment is deleted after about 40 s.
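To time the deletion instead of watching by hand, a small poll loop works. This is just a sketch; the `wait_until_gone` helper and the 2 s interval are my own, not part of the original steps:

```shell
#!/bin/sh
# Poll a command until it starts failing, then print how long the
# resource survived. Usage: wait_until_gone <cmd...>
wait_until_gone() {
  start=$(date +%s)
  while "$@" >/dev/null 2>&1; do
    sleep 2
  done
  echo "gone after $(( $(date +%s) - start )) s"
}
```

Running `wait_until_gone kubectl get deployment kuard` against the pCluster right after breaking the connection gives a concrete number for the deletion delay.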
With help from @jmprusi and @davidfestal we were able to find the root cause. In short, the approach I had been using to break the connection did not work correctly: when the syncer starts, it retrieves the virtual workspace URL from kcp, and since kcp was started on port 6443, the syncer ends up talking to kcp on both port 9443 and port 6443. Breaking the connection on port 9443 only breaks the connection used for health checking. Because heartbeats stop arriving, the TMC scheduler sees the SyncTarget as unhealthy and removes the state.workload.kcp.dev/<cluster-id> label from the upstream object and its containing namespace. The syncer still has a working watch on port 6443, so it sees that the object and its containing namespace should be removed downstream, and it does just that.
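In other words, the edited kubeconfig pointed at the forwarded port while the virtual workspace URL returned by kcp still carried the original one. A quick way to spot that kind of mismatch is to compare the ports of the two URLs; a small sketch (the helper and the sample URLs are illustrative, not taken from my logs):

```shell
#!/bin/sh
# Extract the port from a URL; fall back to 443 when none is present.
url_port() {
  port=$(printf '%s\n' "$1" | sed -n 's#^[a-z]*://[^/:]*:\([0-9]*\).*#\1#p')
  echo "${port:-443}"
}

url_port "https://192.168.1.10:9443/clusters/root"      # kubeconfig server (edited)
url_port "https://192.168.1.10:6443/services/syncer"    # virtual workspace URL from kcp
```

If the two ports differ, breaking only one of the connections produces exactly the half-broken state described above.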
To perform the test correctly, as suggested by @davidfestal, I started kcp with the external-hostname argument, passing the IP of the host and setting the port to 9443. This way, communication with the syncer happens only on port 9443, and breaking the connection works as intended. In that setup the scenario described in the issue does not happen, so we can consider this issue resolved.
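For reference, the corrected startup looked roughly like this (flag syntax from memory; check `kcp start --help` for the exact form on your version):

```shell
# Advertise the host IP with the forwarded port so the virtual workspace
# URLs handed out to the syncer also go through 9443, not the local 6443.
$ kcp start --external-hostname=${SRV_IP}:9443
```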
Describe the bug
I have a single pCluster, and I have followed the steps to install a syncer and sync a deployment from a workspace to the pCluster. I have noticed that if I break the network connection between the syncer in the pCluster and the kcp API service, the syncer deletes the deployment on the pCluster. This is the message I see in the syncer log as I break the connection:
When the connection is re-established, the object gets re-created after a few seconds.
I am not sure if this is a bug or a feature. It may not be a big issue for stateless services (e.g., deployments), but it can be a real problem for stateful services.
For example, I have been experimenting with kcp + Crossplane by syncing Crossplane claims to the pCluster where I have installed Crossplane. In one scenario, a claim was creating an RDS database on AWS. Because of this syncer behavior, the RDS database was deleted in AWS every time the syncer lost its connection to the kcp service, which is certainly not desirable. Another case where this behavior is problematic is edge use cases, where loss of connectivity is common and clusters need to maintain local autonomy; that is, workloads need to continue to operate when network connectivity is lost.
Steps To Reproduce
Create a kind pCluster
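The concrete command for this step, matching the one in my comment above, was:

```shell
$ kind create cluster --name control
```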
Download and install kcp v0.10.0 and kubectl plugins
Open a terminal and run the command:
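The command block appears to have been lost in formatting; judging from the equivalent step in my comment above, this was:

```shell
$ kcp start
```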
Open another terminal (on same dir) and run the command:
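Presumably this step pointed kubectl at the kcp admin kubeconfig, as in my comment above:

```shell
$ export KUBECONFIG=$(pwd)/.kcp/admin.kubeconfig
```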
Generate the syncer yaml
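The generation command, following the same step in my comment above (`<syncer image>` is a placeholder; for a v0.10.0 test use the published syncer image for that release):

```shell
$ kubectl kcp workload sync control --syncer-image <syncer image> -o ${HOME}/syncer-control.yaml
```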
Edit syncer file: in the secret with the kubeconfig, change the server URL to use port 9443 instead of 6443.
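For a plain-text kubeconfig, this edit is a one-line substitution that can be scripted; a sketch with sed (my own helper, and if the kubeconfig is stored base64-encoded in the secret you would have to decode, edit, and re-encode instead):

```shell
#!/bin/sh
# Rewrite the API server port from 6443 to 9443 on the kubeconfig's server line.
fix_port() {
  sed 's#^\( *server: https://[^:]*\):6443#\1:9443#'
}
printf '    server: https://10.0.0.5:6443\n' | fix_port
```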
Take note of the IP used in the server URL in the step above, then open another terminal and run the command (you may have to accept the ssh connection to the host and enter your password):
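This command block also seems to have been dropped; the equivalent step in my comment above was:

```shell
$ SRV_IP=<server IP>
$ ssh -L ${SRV_IP}:9443:${SRV_IP}:6443 -N 127.0.0.1
```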
Deploy the syncer on the pCluster
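As in my comment above, deployment is a plain apply against the kind cluster:

```shell
$ KUBECONFIG=</path/to/kind/kubeconfig> kubectl apply -f ${HOME}/syncer-control.yaml
```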
Check syncer is started on pCluster and check if logs are ok
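Something along these lines works for the check; the syncer namespace and deployment names are generated by the plugin, so the placeholders below need to be filled in from the generated yaml:

```shell
$ KUBECONFIG=</path/to/kind/kubeconfig> kubectl get pods -n <syncer namespace>
$ KUBECONFIG=</path/to/kind/kubeconfig> kubectl logs -n <syncer namespace> deployment/<syncer deployment>
```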
Bind compute
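The binding command, as in my comment above:

```shell
$ kubectl kcp bind compute root
```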
Create deployment in kcp
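The test deployment from my comment above:

```shell
$ kubectl create deployment --image=gcr.io/kuar-demo/kuard-amd64:blue --port=8080 kuard
```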
Verify deployment started both in kcp and pCluster
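Verification on both sides, as in my comment above:

```shell
$ kubectl get deployments
$ KUBECONFIG=</path/to/kind/kubeconfig> kubectl get deployments -A | grep kuard
```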
Break the connection (Ctrl+C in the terminal where the port-forwarding command was running)
In about 45 s, the deployment and its containing namespace were deleted on the pCluster.
The namespace containing the deployment was deleted as well:
Expected Behaviour
I expect that the resource on the physical cluster is not deleted when the syncer loses network connectivity to the kcp API service.
Additional Context
I have tested with kcp v0.10.0 on a MacBook Pro M1.