loft-sh / vcluster

vCluster - Create fully functional virtual Kubernetes clusters - Each vcluster runs inside a namespace of the underlying k8s cluster. It's cheaper than creating separate full-blown clusters and it offers better multi-tenancy and isolation than regular namespaces.
https://www.vcluster.com
Apache License 2.0
6.31k stars · 403 forks

vCluster creating more trouble than helping (due to different causes) #1787

Open MichaelKora opened 4 months ago

MichaelKora commented 4 months ago

What happened?

I honestly don't know if it is supposed to be this hard, but vcluster is creating more trouble than solutions. I've been working on getting a prod-ready cluster for over a week and it's not working. Right now the cluster is up and running and I connect to it using a NodePort service. The issues:

  1. When I deploy the cluster, it takes over 60 minutes for the vcluster pods to reach a healthy Running state, so that I can communicate with the vcluster.
  2. The coredns pod, though in state Running, is full of errors:
    
```
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: Failed to watch *v1.Service: failed to list *v1.Service: Unauthorized
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: failed to list *v1.Service: Unauthorized
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: Failed to watch *v1.Service: failed to list *v1.Service: Unauthorized
[INFO] plugin/kubernetes: Trace[944124959]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231 (24-May-2024 11:55:04.692) (total time: 16726ms): Trace[944124959]: ---"Objects listed" error: 16726ms (11:55:21.419) Trace[944124959]: [16.726980037s] [16.726980037s] END
[INFO] plugin/kubernetes: Trace[1216864093]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231 (24-May-2024 11:54:50.867) (total time: 30559ms): Trace[1216864093]: ---"Objects listed" error: 30559ms (11:55:21.426) Trace[1216864093]: [30.55934034s] [30.55934034s] END
[INFO] plugin/kubernetes: Trace[1087029931]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231 (24-May-2024 11:54:49.127) (total time: 32304ms): Trace[1087029931]: ---"Objects listed" error: 32304ms (11:55:21.432) Trace[1087029931]: [32.304771091s] [32.304771091s] END
```


3. When connected to the vcluster, the same request returns different responses each time.
E.g. running `kubectl get namespaces` might show 4 namespaces, then 4 again, and then 6, etc.

4. Running Helm against the vcluster is nearly impossible; it times out almost every single time.
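One way to observe the inconsistent responses from item 3 is a quick polling loop. This is just a diagnostic sketch; it assumes the current kubeconfig context points at the vcluster through the NodePort service:

```shell
# Ask the vcluster for the namespace count several times in a row;
# with a consistent control plane the number should not fluctuate.
for i in 1 2 3 4 5; do
  kubectl get namespaces --no-headers | wc -l
  sleep 2
done
```

If the count jumps between runs, the requests are likely being answered by replicas with diverging state rather than by one consistent API server.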

This is all the more frustrating because I was expecting vCluster to be way easier to use.

### What did you expect to happen?

I deployed a cluster and expected the CoreDNS pods to be deployed and become healthy.

### How can we reproduce it (as minimally and precisely as possible)?

```yaml
# vcluster.yaml
exportKubeConfig:
  context: "sharedpool-context"
controlPlane:
  coredns:
    enabled: true
    embedded: false
    deployment:
      replicas: 2
      nodeSelector:
        workload: wk1
  statefulSet:
    highAvailability:
      replicas: 2
    persistence:
      volumeClaim:
        enabled: true
    scheduling:
      nodeSelector:
        workload: wk1
    resources:
      limits:
        ephemeral-storage: 20Gi
        memory: 10Gi
      requests:
        ephemeral-storage: 200Mi
        cpu: 200m
        memory: 256Mi

  proxy:
    bindAddress: "0.0.0.0"
    port: 8443
    extraSANs:
      - XX.XX.XX.XXX
      - YY.YY.YY.YYY
```

```shell
helm upgrade -i my-vcluster vcluster \
  --repo https://charts.loft.sh \
  --namespace vcluster-ns --create-namespace \
  --repository-config='' \
  -f vcluster.yaml \
  --version 0.20.0-beta.5
```

### Anything else we need to know?

I used a NodePort service to connect to the cluster:

```yaml
# nodeport.yaml
apiVersion: v1
kind: Service
metadata:
  name: vcluster-nodeport
  namespace: vcluster-ns
spec:
  selector:
    app: vcluster
    release: shared-pool-vcluster
  ports:
    - name: https
      port: 443
      targetPort: 8443
      protocol: TCP
      nodePort: 31222
  type: NodePort
```
```console
$ helm upgrade -i solr-operator apache-solr/solr-operator --version 0.8.1 -n solr-cloud
Release "solr-operator" does not exist. Installing it now.
Error: failed post-install: 1 error occurred:
        * timed out waiting for the condition

$ helm upgrade -i hz-operator hazelcast/hazelcast-platform-operator -n hz-vc-ns --create-namespace -f operator.yaml
Release "hz-operator" does not exist. Installing it now.
Error: 9 errors occurred:
        * Timeout: request did not complete within requested timeout - context deadline exceeded
        * Timeout: request did not complete within requested timeout - context deadline exceeded
        * Timeout: request did not complete within requested timeout - context deadline exceeded
        * Timeout: request did not complete within requested timeout - context deadline exceeded
        * Timeout: request did not complete within requested timeout - context deadline exceeded
        * Timeout: request did not complete within requested timeout - context deadline exceeded
        * Timeout: request did not complete within requested timeout - context deadline exceeded
        * Timeout: request did not complete within requested timeout - context deadline exceeded
        * Internal error occurred: resource quota evaluation timed out
```

### Host cluster Kubernetes version

```console
$ kubectl version
# paste output here
```

### Host cluster Kubernetes distribution

```
Client Version: v1.29.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.27.11
```

### vcluster version

```console
$ vcluster --version
vcluster version 0.19.5
```

### Vcluster Kubernetes distribution (k3s (default), k8s, k0s)

```
default # I did not specify a specific distribution
```

### OS and Arch

```
OS: talos
Arch: metal-amd64
```

heiko-braun commented 4 months ago

hey @MichaelKora, it's unfortunate that you have to experience these troubles.

One thing that I'd recommend is to use the latest vcluster CLI together with 0.20.0-beta.5. From the description it appears that you are using the 0.19.5 one instead.

Regarding the other issues: it's a bit hard to say from the outset what might be causing them. You seem to be leveraging Talos. What Kubernetes distro is running on top of it?

MichaelKora commented 4 months ago

hey @heiko-braun, 0.19.5 is the latest according to the vcluster CLI:

```console
$ sudo vcluster upgrade
15:55:48 info Current binary is the latest version: 0.19.5
```

I have Talos running there (the default image); it's based on k3s.

heiko-braun commented 4 months ago

Hi @MichaelKora, you can get the latest CLI (the one to be used with 0.20 vcluster.yaml) here: https://github.com/loft-sh/vcluster/releases/tag/v0.20.0-beta.6
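For anyone following along, one way to grab that pre-release is to download the binary straight from the linked release page. This is a sketch: the asset name below assumes the usual `vcluster-<os>-<arch>` naming on vcluster releases; adjust it for your platform.

```shell
# Download the v0.20.0-beta.6 CLI binary from the GitHub release
# (asset name assumed to follow the vcluster-<os>-<arch> convention)
curl -L -o vcluster \
  "https://github.com/loft-sh/vcluster/releases/download/v0.20.0-beta.6/vcluster-linux-amd64"
chmod +x vcluster
sudo mv vcluster /usr/local/bin/vcluster
vcluster --version
```

Note that `vcluster upgrade` only tracks stable releases, which is why it reported 0.19.5 as the latest version above.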

heiko-braun commented 4 months ago

@MichaelKora the hazelcast and solr examples in the description, did you run the commands against the host cluster or the virtual one?

MichaelKora commented 4 months ago

@heiko-braun thanks for your response. I ran the commands against the vcluster; when run against the host cluster, I have no issues.

MichaelKora commented 4 months ago

@heiko-braun when the cluster is being created, the logs show:

```
2024-06-05 14:38:37    INFO    setup/controller_context.go:196    couldn't retrieve virtual cluster version (Get "https://127.0.0.1:6443/version": dial tcp 127.0.0.1:6443: connect: connection refused), will retry in 1 seconds    {"component": "vcluster"}

2024-06-05 14:38:38    INFO    setup/controller_context.go:196    couldn't retrieve virtual cluster version (Get "https://127.0.0.1:6443/version": dial tcp 127.0.0.1:6443: connect: connection refused), will retry in 1 seconds    {"component": "vcluster"}

2024-06-05 14:38:39    INFO    setup/controller_context.go:196    couldn't retrieve virtual cluster version (Get "https://127.0.0.1:6443/version": dial tcp 127.0.0.1:6443: connect: connection refused), will retry in 1 seconds    {"component": "vcluster"}

2024-06-05 14:38:40    INFO    commandwriter/commandwriter.go:126    error retrieving resource lock kube-system/kube-controller-manager: Get "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": dial tcp 127.0.0.1:6443: connect: connection refused    {"component": "vcluster", "component": "controller-manager", "location": "leaderelection.go:332"}
```

and it takes more than 60 minutes before the cluster reaches a healthy state. It seems very odd to me that it takes that long to create a virtual cluster.
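For anyone triaging the same symptom, a rough first-pass sketch from the host cluster side (the pod name below is an assumption based on the chart's StatefulSet naming for a release called `my-vcluster`; substitute your own release name):

```shell
# Inspect the vcluster control-plane pods in the host namespace
kubectl -n vcluster-ns get pods -o wide

# Events often reveal scheduling or volume problems behind long startup times
kubectl -n vcluster-ns describe pod my-vcluster-0

# Check the pod logs for the "connection refused" retries shown above
kubectl -n vcluster-ns logs my-vcluster-0 --tail=100
```

Repeated `connection refused` on `127.0.0.1:6443` means the syncer is up but the embedded API server inside the pod has not come up yet, so the cause is usually visible in the pod's own events and logs rather than in the vcluster config.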

everflux commented 3 months ago

@MichaelKora how many nodes does your host cluster have, and what capacity? do you use network policies?

MichaelKora commented 3 months ago

hey @everflux, I dedicated 2 nodes of the host cluster to the vcluster (8 CPU / 32 GB each). I am not using any restrictive network policies.

everflux commented 3 months ago

This sounds like a setup problem to me, either with the host cluster or vcluster. Did you try to set up one or multiple vclusters? (Check `kubectl get all -n vcluster-ns`, `kubectl get ns`.) I am afraid a GitHub issue might not be the right place to discuss this; perhaps the Slack channel would be better suited.

deniseschannon commented 1 month ago

@MichaelKora Are you still having issues or were you able to resolve them?

MichaelKora commented 4 weeks ago

hey @deniseschannon, yes, I am still having the issue!

MichaelKora commented 4 weeks ago

> This sounds like a setup problem to me, either with the host cluster or vcluster. Did you try to setup one or multiple vclusters? (check kubectl get all -n vcluster-ns, kubectl get ns) I am afraid a github issue might not be the right place to discuss this, perhaps the slack channel would be better suited.

@everflux I have just one vcluster set up.