kcp-dev / kcp

Kubernetes-like control planes for form-factors and use-cases beyond Kubernetes and container workloads.
https://kcp.io
Apache License 2.0

bug: Syncer deletes downstream object when connection to kcp API service is lost #2671

Closed: pdettori closed this issue 1 year ago

pdettori commented 1 year ago

Describe the bug

I have a single pCluster, and I have followed the steps to install a syncer and sync a deployment from a workspace to the pCluster. I have noticed that if I break the network connection between the syncer in the pCluster and the kcp API service, the syncer deletes the deployment on the pCluster. This is the message I see in the syncer log when I break the connection:

I0125 02:59:56.398028       1 spec_process.go:168] "Deleting downstream object for upstream object" syncTarget.workspace="root:users:zu:yc:kcp-admin:mycompute" syncTarget.name="control" syncTarget.key="7rzSmzjAZbTuojUVeewmOGHckRDGrJQjwlLare" reconciler="kcp-workload-syncer-spec" key="root:users:zu:yc:kcp-admin:ws1|default/default-token-z9gvc" gvr="/v1, Resource=secrets" workspace="root:users:zu:yc:kcp-admin:ws1" namespace="default" name="default-token-z9gvc" downstream.namespace="kcp-2kqlv8w2cs8k"

When the connection is re-established, the object gets re-created after a few seconds.

I am not sure if this is a bug or a feature. It may not be a big issue for stateless services (e.g., deployments), but it could be a problem for stateful services.

For example, I have been experimenting with kcp + Crossplane by syncing Crossplane claims to the pCluster where I have installed Crossplane. In one scenario, a claim was creating an RDS database on AWS. Because of this syncer behavior, the RDS database got deleted in AWS every time the syncer lost its connection to the kcp service, which is certainly not desirable. Another use case where this is undesirable is edge computing, where loss of connectivity is common and clusters need to maintain local autonomy, that is, workloads need to continue to operate when network connectivity is lost.

Steps To Reproduce

  1. Create a kind pCluster

    kind create cluster --name test
  2. Download and install kcp v0.10.0 and kubectl plugins

  3. Open a terminal and run the command:

    kcp start
  4. Open another terminal (in the same dir) and run the command:

    export KUBECONFIG=.kcp/admin.kubeconfig
  5. Generate the syncer yaml

    kubectl kcp workload sync kind --syncer-image ghcr.io/kcp-dev/kcp/syncer:v0.10.0 -o syncer-kind-main.yaml
  6. Edit the syncer file:

    vim   syncer-kind-main.yaml

    In the secret containing the kubeconfig, change the server URL to use port 9443 instead of 6443 (see the sketch after these steps)

  7. Take note of the IP used in the server URL in the step above, then open another terminal and run the command:

    SRV_IP=<server IP>
    ssh -L ${SRV_IP}:9443:${SRV_IP}:6443 -N 127.0.0.1

    (you may have to accept the ssh connection to the host and enter your password)

  8. Deploy the syncer on the pCluster

    KUBECONFIG=</path/to/kind/kubeconfig> kubectl apply -f "syncer-kind-main.yaml"
  9. Check that the syncer has started on the pCluster and that its logs look OK (SYNCER_NS and SYNCER_POD below are the syncer namespace and pod name reported by the first command)

    KUBECONFIG=</path/to/kind/kubeconfig> kubectl get pods -A | grep kcp-sync
    KUBECONFIG=</path/to/kind/kubeconfig> kubectl logs -n $SYNCER_NS  $SYNCER_POD -f
  10. Bind compute

    kubectl kcp bind compute root
  11. Create deployment in kcp

    kubectl create deployment --image=gcr.io/kuar-demo/kuard-amd64:blue --port=8080 kuard
  12. Verify deployment started both in kcp and pCluster

    kubectl get deployments
    KUBECONFIG=</path/to/kind/kubeconfig> kubectl get deployments -A | grep kuard
  13. Break the connection (Ctrl+C in the terminal where the port-forwarding command was running)

  14. In about 45 s the deployment and its containing namespace were deleted on the pCluster

    kubectl get deployment -n kcp-1e79kpfk7yyx kuard
    Error from server (NotFound): namespaces "kcp-1e79kpfk7yyx" not found

    The namespace containing the deployment was deleted as well:

    kubectl get ns
    NAME                       STATUS   AGE
    default                    Active   95m
    kcp-syncer-kind-2r15tzoq   Active   18m
    kube-node-lease            Active   95m
    kube-public                Active   95m
    kube-system                Active   95m
    local-path-storage         Active   95m
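
For step 6, the only change needed in syncer-kind-main.yaml is the port in the kubeconfig server URL embedded in the secret. A minimal sketch of the edit (the surrounding YAML is omitted and the IP address is a placeholder for the one generated in your file):

    # before (as generated):
    server: https://192.168.1.10:6443
    # after (traffic now goes through the ssh tunnel set up in step 7):
    server: https://192.168.1.10:9443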

Expected Behaviour

I expect that the resource on the physical cluster is not deleted when the syncer loses network connectivity to the kcp API service.

Additional Context

I have tested with kcp v0.10.0 on a MacBook Pro M1.

kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.2", GitCommit:"5835544ca568b757a8ecae5c153f317e5736700e", GitTreeState:"clean", BuildDate:"2022-09-21T14:25:45Z", GoVersion:"go1.19.1", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.3+kcp-v0.10.0", GitCommit:"25254541", GitTreeState:"clean", BuildDate:"2023-01-25T01:30:14Z", GoVersion:"go1.19.2", Compiler:"gc", Platform:"darwin/arm64"}
pdettori commented 1 year ago

I am attaching the full log from the syncer: syncer.log

pdettori commented 1 year ago

Tested on main; the build reports this git version:

Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.3+kcp-v0.10.0-549-g786ebb892ad93e", GitCommit:"786ebb89", GitTreeState:"clean", BuildDate:"2023-01-25T15:05:23Z", GoVersion:"go1.19.2", Compiler:"gc", Platform:"darwin/arm64"}

I get the same behavior; this time the deployment is deleted after 37 s. Attaching the full syncer log: syncer-main.log

pdettori commented 1 year ago

Steps I used to test from main:

$ kind delete cluster --name control && kind create cluster --name control
$ git log
commit 786ebb892ad93ea9228cc736690cbebc642b1fd9 (HEAD -> main, upstream/main)
$ make build
$ PATH=$(pwd)/bin:$PATH  # I have also removed previous kcp binaries from /usr/local/bin
$ brew install ko
$ ko build --local --push=false -B ./cmd/syncer -t $(git rev-parse --short HEAD)
$ kind load docker-image ko.local/syncer:$(git rev-parse --short HEAD) --name control
$ kcp start
# on another terminal
$ export KUBECONFIG=$(pwd)/.kcp/admin.kubeconfig
$ kubectl kcp workload sync control --syncer-image ko.local/syncer:$(git rev-parse --short HEAD) -o ${HOME}/syncer-control.yaml
# now start the port forwarder on another terminal
$ SRV_IP=<server IP>
$ ssh -L ${SRV_IP}:9443:${SRV_IP}:6443 -N 127.0.0.1
# now edit the ${HOME}/syncer-control.yaml to change port to 9443 and start the syncer
$ KUBECONFIG=</path/to/kind/kubeconfig> kubectl apply -f  ${HOME}/syncer-control.yaml
# bind compute
$ kubectl kcp bind compute root
# deploy test deployment
$ kubectl create deployment --image=gcr.io/kuar-demo/kuard-amd64:blue --port=8080 kuard
# verify started 
$ kubectl get deployments
$ KUBECONFIG=</path/to/kind/kubeconfig> kubectl get deployments -A | grep kuard

At this point, break the connection and verify that after about 40 s the deployment is deleted.
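
One convenient way to observe this is to poll the pCluster until the deployment disappears (a sketch; the kubeconfig path is a placeholder as above):

# re-runs the check every 2 s; the kuard line vanishes roughly 40 s after the tunnel is killed
$ KUBECONFIG=</path/to/kind/kubeconfig> watch 'kubectl get deployments -A | grep kuard'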

pdettori commented 1 year ago

With help from @jmprusi and @davidfestal we were able to find the root cause of the issue. In short, the approach I had been using to break the connection did not work correctly: when the syncer starts it retrieves the virtual workspace URL from kcp, and since kcp was started configured with port 6443, the syncer ends up communicating with kcp over both port 9443 and port 6443. Breaking the connection on port 9443 only breaks the connection used for health checking. Because heartbeats are no longer received, the TMC scheduler sees the SyncTarget as unhealthy and removes the state.workload.kcp.dev/<cluster-id> label from the upstream object and its containing namespace. The syncer still has a watch open on port 6443, so it understands that the object and its containing namespace should be removed downstream, and does just that.
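
The mechanism is visible by inspecting the labels on the upstream namespace while the connection is broken (a sketch run against the kcp workspace kubeconfig; the namespace name and the cluster-id suffix of the label are specific to each setup):

# the deployment's namespace initially carries a state.workload.kcp.dev/<cluster-id> label
$ kubectl get ns default -o jsonpath='{.metadata.labels}{"\n"}'
# once the SyncTarget misses heartbeats and is marked unhealthy, that label is removed,
# which the syncer's still-open watch on port 6443 reacts to by deleting the downstream objects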

To perform the test correctly, as suggested by @davidfestal, I started kcp with the external-hostname argument, passing the IP of the host and setting the port to 9443. This way, communication with the syncer happens only on port 9443, and breaking the connection works as intended. In this case the scenario described in the issue does not happen, so we can consider this issue resolved.
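
For reference, the corrected invocation looks roughly like this (a sketch; whether the port belongs in the --external-hostname value or is configured separately is an assumption on my part, based on the setup described above):

# SRV_IP is the host IP also used for the ssh tunnel
$ kcp start --external-hostname ${SRV_IP}:9443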