loft-sh / vcluster

vCluster - Create fully functional virtual Kubernetes clusters - Each vcluster runs inside a namespace of the underlying k8s cluster. It's cheaper than creating separate full-blown clusters and it offers better multi-tenancy and isolation than regular namespaces.
https://www.vcluster.com
Apache License 2.0

vcluster with plugin deployment enabled fails from 0.9.1 to 0.12.0 #772

Closed anupshandilya closed 1 year ago

anupshandilya commented 2 years ago

What happened?

Hi,

I am opening this issue to request support on what could be done to solve this problem. I appreciate your valuable advice.

I am trying to deploy vcluster 0.12.0 on OpenShift using the Helm chart, with the crd-sync plugin enabled. The syncer container blocks while waiting for the crd-sync plugin to register, and the crd-sync container appears to be unable to connect to vcluster. This state persists, so the liveness/readiness checks on the syncer endpoint fail and the syncer container is restarted repeatedly. The same setup works fine on vcluster 0.8.1; the issue appears from 0.9.x onwards.

$ oc get pods
NAME         READY   STATUS    RESTARTS      AGE
vcluster-0   2/3     Running   1 (43s ago)   4m7s
$ oc logs vcluster-0 -c syncer
I0929 05:48:48.565270       1 start.go:297] Start Plugins Manager...
I0929 05:48:48.567566       1 plugin.go:164] Plugin server listening on localhost:10099
I0929 05:48:48.568167       1 plugin.go:185] Waiting for plugin crd-sync to register...
$ oc logs vcluster-0 -c crd-sync
I0929 05:39:55.605245       1 logr.go:249] plugin: Try creating context...
$ helm ls
NAME            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART           APP VERSION
vcluster        anup            1               2022-09-29 05:38:50.983274855 +0000 UTC deployed        vcluster-0.12.0          
$ helm get values vcluster
USER-SUPPLIED VALUES:
ingress:
  annotations:
    route.openshift.io/termination: passthrough
  apiVersion: networking.k8s.io/v1
  enabled: false
  host: vcluster.apps.openshift.com.net
  ingressClassName: ""
  pathType: ImplementationSpecific
openshift:
  enable: true
plugin:
  crd-sync:
    image: ghcr.io/loft-sh/vcluster-example-crd-sync:latest
    imagePullPolicy: IfNotPresent
    rbac:
      clusterRole:
        extraRules:
        - apiGroups:
          - apiextensions.k8s.io
          resources:
          - customresourcedefinitions
          verbs:
          - get
          - list
          - watch
      role:
        extraRules:
        - apiGroups:
          - example.loft.sh
          resources:
          - cars
          verbs:
          - create
          - delete
          - patch
          - update
          - get
          - list
          - watch
rbac:
  clusterRole:
    create: true
serviceCIDR: 172.30.0.0/16
sync:
  endpoints:
    enabled: true
syncer:
  extraArgs:
  - --disable-fake-kubelets=false
$ oc describe pod vcluster-0
Name:         vcluster-0
Namespace:    anup
Priority:     0
Node:         xxxxxx/10.0.0.19
Start Time:   Thu, 29 Sep 2022 05:39:19 +0000
Labels:       app=vcluster
              controller-revision-hash=vcluster-ffff97f84
              release=vcluster
              statefulset.kubernetes.io/pod-name=vcluster-0
Annotations:  k8s.ovn.org/pod-networks:
                {"default":{"ip_addresses":["10.130.3.252/23"],"mac_address":"0a:58:0a:82:03:fc","gateway_ips":["10.130.2.1"],"ip_address":"10.130.3.252/2...
              k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "ovn-kubernetes",
                    "interface": "eth0",
                    "ips": [
                        "10.130.3.252"
                    ],
                    "mac": "0a:58:0a:82:03:fc",
                    "default": true,
                    "dns": {}
                }]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "ovn-kubernetes",
                    "interface": "eth0",
                    "ips": [
                        "10.130.3.252"
                    ],
                    "mac": "0a:58:0a:82:03:fc",
                    "default": true,
                    "dns": {}
                }]
              openshift.io/scc: anyuid
Status:       Running
IP:           10.130.3.252
IPs:
  IP:           10.130.3.252
Controlled By:  StatefulSet/vcluster
Containers:
  vcluster:
    Container ID:  cri-o://4136c5e85a07c645bb1b27e951e92c543b86aaf0818f82ea440d6a8fdd975e1e
    Image:         rancher/k3s:v1.25.0-k3s1
    Image ID:      rancher/k3s@sha256:afb3b3a49a8d4f411222ff42ee0f6548c97e1f5e2e7407be92b52efa7cee8c79
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
    Args:
      -c
      /bin/k3s server --write-kubeconfig=/data/k3s-config/kube-config.yaml --data-dir=/data --disable=traefik,servicelb,metrics-server,local-storage,coredns --disable-network-policy --disable-agent --disable-cloud-controller --flannel-backend=none --disable-scheduler --kube-controller-manager-arg=controllers=*,-nodeipam,-nodelifecycle,-persistentvolume-binder,-attachdetach,-persistentvolume-expander,-cloud-node-lifecycle,-ttl --kube-apiserver-arg=endpoint-reconciler-type=none --service-cidr=172.30.0.0/16 && true
    State:          Running
      Started:      Thu, 29 Sep 2022 05:39:40 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      memory:  2Gi
    Requests:
      cpu:        200m
      memory:     256Mi
    Environment:  <none>
    Mounts:
      /data from data (rw)
      /etc/rancher from config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nfdgq (ro)
  syncer:
    Container ID:  cri-o://f8037c6622f534417c04b274f73e2f888c4c399b779850d06b86735bc701b51e
    Image:         loftsh/vcluster:0.12.0
    Image ID:      loftsh/vcluster@sha256:18789aaa7f07776f5813d57c6f7cff7eb0aa555d6a49c8f62b23acf54e5b8178
    Port:          <none>
    Host Port:     <none>
    Args:
      --name=vcluster
      --service-account=vc-workload-vcluster
      --plugins=crd-sync
      --default-image-registry=
      --kube-config-context-name=my-vcluster
      --disable-fake-kubelets=false
    State:          Running
      Started:      Thu, 29 Sep 2022 05:57:48 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Thu, 29 Sep 2022 05:54:48 +0000
      Finished:     Thu, 29 Sep 2022 05:57:48 +0000
    Ready:          False
    Restart Count:  6
    Limits:
      cpu:     1
      memory:  512Mi
    Requests:
      cpu:      20m
      memory:   64Mi
    Liveness:   http-get https://:8443/healthz delay=60s timeout=1s period=2s #success=1 #failure=60
    Readiness:  http-get https://:8443/readyz delay=0s timeout=1s period=2s #success=1 #failure=60
    Environment:
      POD_IP:               (v1:status.podIP)
      VCLUSTER_NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /data from data (ro)
      /manifests/coredns from coredns (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nfdgq (ro)
  crd-sync:
    Container ID:   cri-o://0f192ba403edfee3fa5cd92b4b785f20b42730ab613f208a81b628e57c5d08a1
    Image:          ghcr.io/loft-sh/vcluster-example-crd-sync:latest
    Image ID:       ghcr.io/loft-sh/vcluster-example-crd-sync@sha256:b315022ea140659be7d97e4b0dff28c8818254ed34d514ec050a86712ae67c88
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Thu, 29 Sep 2022 05:39:55 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      VCLUSTER_PLUGIN_ADDRESS:  localhost:14000
      VCLUSTER_PLUGIN_NAME:     crd-sync
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nfdgq (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-vcluster-0
    ReadOnly:   false
  config:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  coredns:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      vcluster-coredns
    Optional:  false
  kube-api-access-nfdgq:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                    From                     Message
  ----     ------                  ----                   ----                     -------
  Normal   Scheduled               20m                    default-scheduler        Successfully assigned anup/vcluster-0 to xxxxx by yyyyy
  Normal   SuccessfulAttachVolume  20m                    attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-11c6700d-c810-4dfd-881a-bb5cb5708a40"
  Normal   Pulling                 20m                    kubelet                  Pulling image "rancher/k3s:v1.25.0-k3s1"
  Normal   AddedInterface          20m                    multus                   Add eth0 [10.130.3.252/23] from ovn-kubernetes
  Normal   Started                 20m                    kubelet                  Started container vcluster
  Normal   Pulled                  20m                    kubelet                  Successfully pulled image "rancher/k3s:v1.25.0-k3s1" in 10.134186796s
  Normal   Created                 20m                    kubelet                  Created container vcluster
  Normal   Pulling                 20m                    kubelet                  Pulling image "loftsh/vcluster:0.12.0"
  Normal   Created                 20m                    kubelet                  Created container syncer
  Normal   Pulled                  20m                    kubelet                  Successfully pulled image "loftsh/vcluster:0.12.0" in 9.208709305s
  Normal   Started                 20m                    kubelet                  Started container syncer
  Normal   Pulling                 20m                    kubelet                  Pulling image "ghcr.io/loft-sh/vcluster-example-crd-sync:latest"
  Normal   Pulled                  20m                    kubelet                  Successfully pulled image "ghcr.io/loft-sh/vcluster-example-crd-sync:latest" in 5.331714598s
  Normal   Created                 20m                    kubelet                  Created container crd-sync
  Normal   Started                 20m                    kubelet                  Started container crd-sync
  Warning  Unhealthy               5m43s (x292 over 19m)  kubelet                  Liveness probe failed: Get "https://10.130.3.252:8443/healthz": dial tcp 10.130.3.252:8443: connect: connection refused
  Warning  Unhealthy               43s (x623 over 20m)    kubelet                  Readiness probe failed: Get "https://10.130.3.252:8443/readyz": dial tcp 10.130.3.252:8443: connect: connection refused
$ oc exec -it vcluster-0 -- sh
Defaulted container "vcluster" out of: vcluster, syncer, crd-sync

/ # netstat | grep 10099
tcp        0      0 localhost:47888         localhost:10099         TIME_WAIT
tcp        0      0 localhost:33776         localhost:10099         TIME_WAIT
tcp        0      0 localhost:54666         localhost:10099         TIME_WAIT
tcp        0      0 localhost:46744         localhost:10099         TIME_WAIT
tcp        0      0 localhost:50944         localhost:10099         TIME_WAIT
tcp        0      0 localhost:54662         localhost:10099         TIME_WAIT
tcp        0      0 localhost:42024         localhost:10099         TIME_WAIT
tcp        0      0 localhost:51122         localhost:10099         TIME_WAIT
tcp        0      0 localhost:46730         localhost:10099         TIME_WAIT
tcp        0      0 localhost:51112         localhost:10099         TIME_WAIT
tcp        0      0 localhost:47876         localhost:10099         TIME_WAIT
tcp        0      0 localhost:50960         localhost:10099         TIME_WAIT

Link to crd-sync plugin : https://github.com/loft-sh/vcluster-sdk/tree/main/examples/crd-sync
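
As a rough cross-check of the restart cadence, the syncer's liveness probe settings from the pod description above (delay=60s, period=2s, #failure=60) can be turned into an expected time-to-restart. This is only a back-of-the-envelope sketch: it assumes every probe fails from the first attempt and ignores the 1s timeout and any kubelet scheduling jitter. The variable names are illustrative.

```shell
#!/bin/sh
# Liveness probe settings copied from the syncer container in `oc describe pod`.
initial_delay=60       # delay=60s
period=2               # period=2s
failure_threshold=60   # #failure=60

# Approximate seconds from container start until the kubelet gives up and
# restarts the container, assuming every probe fails.
restart_after=$(( initial_delay + period * failure_threshold ))
echo "approx. seconds until restart: ${restart_after}"
```

The result, 180 seconds, matches the last terminated syncer run in the pod description (started 05:54:48, finished 05:57:48), which is consistent with the restarts being triggered by the failing liveness probe.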

What did you expect to happen?

The crd-sync plugin is deployed successfully, and the vcluster pods are up and in the Running state.

How can we reproduce it (as minimally and precisely as possible)?

wget https://charts.loft.sh/charts/vcluster-0.12.0.tgz
tar xvf vcluster-0.12.0.tgz
helm install vcluster ./vcluster -f vcluster_values.yaml

Anything else we need to know?

It works fine with vcluster version 0.8.1. There I can connect to the vcluster and create a cars instance inside it, which gets synced to the host OpenShift cluster.

Host cluster Kubernetes version

```
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.1", GitCommit:"5e53738b05c16c74fad22e1c1c1c1cc8c0566992", GitTreeState:"clean", BuildDate:"2022-08-29T23:30:12Z", GoVersion:"go1.18.4", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.0+b62823b", GitCommit:"0a57f1f59bda75ea2cf13d9f3b4ac5d202134f2d", GitTreeState:"clean", BuildDate:"2022-08-19T00:38:59Z", GoVersion:"go1.18.4", Compiler:"gc", Platform:"linux/amd64"}
```

Host cluster Kubernetes distribution

``` # Write here ```

vcluster version

```
$ vcluster --version
vcluster version 0.12.0
```

vcluster Kubernetes distribution (k3s (default), k8s, k0s)

``` # k3s and k8s ```

OS and Arch

```
OS: RHEL Linux
Arch: amd64
```

ishankhare07 commented 2 years ago

Hey @anupshandilya, thanks for creating this issue. I suspect this is because the upstream images that the plugin.yaml currently refers to are quite old, and since then we have changed how plugins register with the vcluster server. I used the devspace.yaml to deploy the plugin and run the latest code against the created vcluster, and it seems to be working fine. I suggest you try that for the example.

anupshandilya commented 2 years ago

Hi @ishankhare07

Thanks for your response. In the devspace.yaml of the crd-sync plugin, I see the image of another example, vcluster-example-pull-sycret-sync. Is this fine?

https://github.com/loft-sh/vcluster-sdk/blob/main/examples/crd-sync/devspace.yaml

vars:
  - name: PLUGIN_IMAGE
    value: ghcr.io/loft-sh/vcluster-example-pull-sycret-sync

Did you mean that I should try the above, or use the following image?

ghcr.io/loft-sh/vcluster-example-crd-sync:latest

ishankhare07 commented 2 years ago

Hey @anupshandilya, as long as you're testing against a local cluster it won't matter, because you would be building and running the code from inside the shell opened in the pod. If you want to try this against an upstream cluster, I suggest changing the PLUGIN_IMAGE value to an image under your own Docker Hub handle and executing devspace dev -bd. This builds the latest image on your machine and deploys it to the upstream cluster for you to test.
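
The PLUGIN_IMAGE change described here can be scripted. The following is a minimal sketch, assuming the devspace.yaml keeps the `- name: PLUGIN_IMAGE` line immediately followed by its `value:` line, as in the snippet quoted earlier in this thread. The function name `set_plugin_image` and the image reference `docker.io/your-handle/vcluster-example-crd-sync` are hypothetical placeholders, not names taken from the vcluster-sdk repository.

```shell
#!/bin/sh
# set_plugin_image FILE IMAGE
# Rewrites the `value:` line that directly follows `- name: PLUGIN_IMAGE`.
# The original file is kept as FILE.bak. IMAGE must not contain `|`,
# because `|` is used as the sed delimiter.
set_plugin_image() {
  file="$1"
  image="$2"
  sed -i.bak "/name: PLUGIN_IMAGE/{n;s|value:.*|value: ${image}|;}" "$file"
}

# Example usage (hypothetical image reference, replace with your own):
#   set_plugin_image devspace.yaml docker.io/your-handle/vcluster-example-crd-sync
#   devspace dev -bd
```

Editing the file by hand works just as well; the function only makes the change repeatable when switching between registries.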

anupshandilya commented 2 years ago

hi @ishankhare07

Do you mean that we need to build the latest crd-sync plugin code into an image and use that image for testing?

ishankhare07 commented 2 years ago

Hey, sorry for causing the confusion. You can use either ghcr.io/loft-sh/vcluster-example-crd-sync:latest or ghcr.io/loft-sh/vcluster-example-pull-sycret-sync in the devspace.yaml; it should not matter. What you will be building is the code itself, not the whole container image. Try these steps:

anupshandilya commented 2 years ago

Hi @ishankhare07

I retested with the latest image, ghcr.io/loft-sh/vcluster-example-crd-sync:latest, on OpenShift 4.11 with the vcluster 0.12.0 Helm chart. I am still experiencing the issue as originally described.

On the other hand, the cert-manager-plugin works.

ishankhare07 commented 1 year ago

Hey @anupshandilya, as mentioned earlier, the images you're referring to have not been built with the latest code. If you want to try this example, you'll have to use the devspace method suggested previously.

anupshandilya commented 1 year ago

Hi @ishankhare07

I see the point now. Thanks, I will check.

FabianKramm commented 1 year ago

Closing this for now. If this is still an issue, please feel free to reopen.