kcp-dev / kcp

Kubernetes-like control planes for form-factors and use-cases beyond Kubernetes and container workloads.
https://kcp.io
Apache License 2.0

bug: Syncer deletes downstream object when connection to kcp API service is lost #2671

Closed: pdettori closed this issue 1 year ago

pdettori commented 1 year ago

Describe the bug

I have a single pCluster, and I have followed the steps to install a syncer and sync a deployment from a workspace to the pCluster. I have noticed that if I break the network connection between the syncer in the pCluster and the kcp API service, the syncer deletes the deployment on the pCluster. This is the message I see in the syncer log when I break the connection:

I0125 02:59:56.398028       1 spec_process.go:168] "Deleting downstream object for upstream object" syncTarget.workspace="root:users:zu:yc:kcp-admin:mycompute" syncTarget.name="control" syncTarget.key="7rzSmzjAZbTuojUVeewmOGHckRDGrJQjwlLare" reconciler="kcp-workload-syncer-spec" key="root:users:zu:yc:kcp-admin:ws1|default/default-token-z9gvc" gvr="/v1, Resource=secrets" workspace="root:users:zu:yc:kcp-admin:ws1" namespace="default" name="default-token-z9gvc" downstream.namespace="kcp-2kqlv8w2cs8k"

When the connection is re-established, the object gets re-created after a few seconds.

I am not sure if this is a bug or a feature. It may not be a big issue for stateless services (e.g., deployments), but it could be a problem for stateful services.

For example, I have been experimenting with kcp + Crossplane by syncing Crossplane claims to the pCluster where I have installed Crossplane. In one scenario, a claim was creating an RDS database on AWS. Because of this syncer behavior, the RDS database got deleted in AWS every time the syncer lost its connection to the kcp service, which is certainly not desirable. Another use case where this is undesirable is edge computing, where loss of connectivity is common and clusters need to maintain local autonomy, that is, workloads need to continue to operate when network connectivity is lost.

Steps To Reproduce

  1. Create a kind pCluster

    kind create cluster --name test
  2. Download and install kcp v0.10.0 and kubectl plugins

  3. Open a terminal and run the command:

    kcp start
  4. Open another terminal (in the same dir) and run the command:

    export KUBECONFIG=.kcp/admin.kubeconfig
  5. Generate the syncer yaml

    kubectl kcp workload sync kind --syncer-image ghcr.io/kcp-dev/kcp/syncer:v0.10.0 -o syncer-kind-main.yaml
  6. Edit the syncer file:

    vim   syncer-kind-main.yaml

    In the secret containing the kubeconfig, change the server URL to use port 9443 instead of 6443 (see the sketch after these steps)

  7. Take note of the IP used in the server URL in the step above, then open another terminal and run the command:

    SRV_IP=<server IP>
    ssh -L ${SRV_IP}:9443:${SRV_IP}:6443 -N 127.0.0.1

    (you may have to accept the ssh connection to the host and enter your password)

  8. Deploy the syncer on the pCluster

    KUBECONFIG=</path/to/kind/kubeconfig> kubectl apply -f "syncer-kind-main.yaml"
  9. Check that the syncer has started on the pCluster and that its logs look OK (SYNCER_NS and SYNCER_POD below are the syncer namespace and pod name reported by the first command)

    KUBECONFIG=</path/to/kind/kubeconfig> kubectl get pods -A | grep kcp-sync
    KUBECONFIG=</path/to/kind/kubeconfig> kubectl logs -n $SYNCER_NS  $SYNCER_POD -f
  10. Bind compute

    kubectl kcp bind compute root
  11. Create deployment in kcp

    kubectl create deployment --image=gcr.io/kuar-demo/kuard-amd64:blue --port=8080 kuard
  12. Verify deployment started both in kcp and pCluster

    kubectl get deployments
    KUBECONFIG=</path/to/kind/kubeconfig> kubectl get deployments -A | grep kuard
  13. Break the connection (Ctrl+C in the terminal where the port-forwarding command was running)

  14. In about 45 s the deployment and its containing namespace were deleted on the pCluster

    kubectl get deployment -n kcp-1e79kpfk7yyx kuard
    Error from server (NotFound): namespaces "kcp-1e79kpfk7yyx" not found

    The namespace containing the deployment was deleted as well:

    kubectl get ns
    NAME                       STATUS   AGE
    default                    Active   95m
    kcp-syncer-kind-2r15tzoq   Active   18m
    kube-node-lease            Active   95m
    kube-public                Active   95m
    kube-system                Active   95m
    local-path-storage         Active   95m
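
For step 6, the only change needed in syncer-kind-main.yaml is the port in the kubeconfig server URL embedded in the secret. A minimal sketch of the edit (the surrounding YAML is omitted and the IP address is a placeholder for the one generated in your file):

    # before (as generated):
    server: https://192.168.1.10:6443
    # after (traffic now goes through the ssh tunnel set up in step 7):
    server: https://192.168.1.10:9443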

Expected Behaviour

I expect that the resource on the physical cluster is not deleted when the syncer loses network connectivity to the kcp API service.

Additional Context

I have tested with kcp v0.10.0 on a MacBook Pro M1.

kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.2", GitCommit:"5835544ca568b757a8ecae5c153f317e5736700e", GitTreeState:"clean", BuildDate:"2022-09-21T14:25:45Z", GoVersion:"go1.19.1", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.3+kcp-v0.10.0", GitCommit:"25254541", GitTreeState:"clean", BuildDate:"2023-01-25T01:30:14Z", GoVersion:"go1.19.2", Compiler:"gc", Platform:"darwin/arm64"}
pdettori commented 1 year ago

I am attaching the full log from the syncer: syncer.log

pdettori commented 1 year ago

Tested on main; the build reports this git version:

Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.3+kcp-v0.10.0-549-g786ebb892ad93e", GitCommit:"786ebb89", GitTreeState:"clean", BuildDate:"2023-01-25T15:05:23Z", GoVersion:"go1.19.2", Compiler:"gc", Platform:"darwin/arm64"}

I get the same behavior; this time the deployment is deleted after 37 s. Attaching the full syncer log: syncer-main.log

pdettori commented 1 year ago

Steps I used to test from main:

$ kind delete cluster --name control && kind create cluster --name control
$ git log
commit 786ebb892ad93ea9228cc736690cbebc642b1fd9 (HEAD -> main, upstream/main)
$ make build
$ PATH=$(pwd)/bin:$PATH  # I have also removed previous kcp binaries from /usr/local/bin
$ brew install ko
$ ko build --local --push=false -B ./cmd/syncer -t $(git rev-parse --short HEAD)
$ kind load docker-image ko.local/syncer:$(git rev-parse --short HEAD) --name control
$ kcp start
# on another terminal
$ export KUBECONFIG=$(pwd)/.kcp/admin.kubeconfig
$ kubectl kcp workload sync control --syncer-image ko.local/syncer:$(git rev-parse --short HEAD) -o ${HOME}/syncer-control.yaml
# now start the port forwarder on another terminal
$ SRV_IP=<server IP>
$ ssh -L ${SRV_IP}:9443:${SRV_IP}:6443 -N 127.0.0.1
# now edit the ${HOME}/syncer-control.yaml to change port to 9443 and start the syncer
$ KUBECONFIG=</path/to/kind/kubeconfig> kubectl apply -f  ${HOME}/syncer-control.yaml
# bind compute
$ kubectl kcp bind compute root
# deploy test deployment
$ kubectl create deployment --image=gcr.io/kuar-demo/kuard-amd64:blue --port=8080 kuard
# verify started 
$ kubectl get deployments
$ KUBECONFIG=</path/to/kind/kubeconfig> kubectl get deployments -A | grep kuard

At this point, break the connection and verify that after about 40 s the deployment is deleted.
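
One convenient way to observe this is to poll the pCluster until the deployment disappears (a sketch; the kubeconfig path is a placeholder as above):

# re-runs the check every 2 s; the kuard line vanishes roughly 40 s after the tunnel is killed
$ KUBECONFIG=</path/to/kind/kubeconfig> watch 'kubectl get deployments -A | grep kuard'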

pdettori commented 1 year ago

With help from @jmprusi and @davidfestal we were able to find the root cause of the issue. In short, the approach I had been using to break the connection did not work correctly: when the syncer starts it retrieves the virtual workspace URL from kcp, and since kcp was started configured with port 6443, the syncer ends up communicating with kcp over both port 9443 and port 6443. Breaking the connection on port 9443 only breaks the connection used for health checking. Because heartbeats are no longer received, the TMC scheduler sees the SyncTarget as unhealthy and removes the state.workload.kcp.dev/<cluster-id> label from the upstream object and its containing namespace. The syncer still has a watch open on port 6443, so it understands that the object and its containing namespace should be removed downstream, and does just that.
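
The mechanism is visible by inspecting the labels on the upstream namespace while the connection is broken (a sketch run against the kcp workspace kubeconfig; the namespace name and the cluster-id suffix of the label are specific to each setup):

# the deployment's namespace initially carries a state.workload.kcp.dev/<cluster-id> label
$ kubectl get ns default -o jsonpath='{.metadata.labels}{"\n"}'
# once the SyncTarget misses heartbeats and is marked unhealthy, that label is removed,
# which the syncer's still-open watch on port 6443 reacts to by deleting the downstream objects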

To perform the test correctly, as suggested by @davidfestal, I started kcp with the external-hostname argument, passing the IP of the host and setting the port to 9443. This way, communication with the syncer happens only on port 9443, and breaking the connection works as intended. In this case the scenario described in the issue does not happen, so we can consider this issue resolved.
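
For reference, the corrected invocation looks roughly like this (a sketch; whether the port belongs in the --external-hostname value or is configured separately is an assumption on my part, based on the setup described above):

# SRV_IP is the host IP also used for the ssh tunnel
$ kcp start --external-hostname ${SRV_IP}:9443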