loft-sh / vcluster

vCluster - Create fully functional virtual Kubernetes clusters - Each vcluster runs inside a namespace of the underlying k8s cluster. It's cheaper than creating separate full-blown clusters and it offers better multi-tenancy and isolation than regular namespaces.
https://www.vcluster.com
Apache License 2.0
6.15k stars 369 forks source link

Starting vcluster takes a long time and API response is slow #1900

Open hr-tsu opened 3 days ago

hr-tsu commented 3 days ago

What happened?

When I run the vcluster create command, the command keeps waiting until the timeout, and then an error occurs.

% vcluster create vlsuter-debug
17:31:04 info Create vcluster vlsuter-debug...
17:31:04 info execute command: helm upgrade vlsuter-debug /var/folders/2h/0dl61nlx07n4fylcxw67sngc0000gp/T/vcluster-0.20.0-beta.5.tgz-2989503029 --kubeconfig /var/folders/2h/0dl61nlx07n4fylcxw67sngc0000gp/T/1793836983 --namespace vcluster --install --repository-config='' --values /var/folders/2h/0dl61nlx07n4fylcxw67sngc0000gp/T/1752636891
17:31:06 done Successfully created virtual cluster vlsuter-debug in namespace vcluster
17:32:06 info Waiting for vcluster to come up...
17:32:06 warn vcluster is waiting, because vcluster pod vlsuter-debug-0 has status: Init:0/3
17:32:17 warn vcluster is waiting, because vcluster pod vlsuter-debug-0 has status: Init:0/3
17:32:27 warn vcluster is waiting, because vcluster pod vlsuter-debug-0 has status: Init:0/3
17:32:38 warn vcluster is waiting, because vcluster pod vlsuter-debug-0 has status: Init:0/3
17:32:49 warn vcluster is waiting, because vcluster pod vlsuter-debug-0 has status: Init:0/3
17:33:00 warn vcluster is waiting, because vcluster pod vlsuter-debug-0 has status: Init:0/3
17:33:11 warn vcluster is waiting, because vcluster pod vlsuter-debug-0 has status: Init:1/3
17:33:22 warn vcluster is waiting, because vcluster pod vlsuter-debug-0 has status: Init:1/3
17:41:06 fatal failed to parse kube config: wait for vcluster: client rate limiter Wait returned an error: context deadline exceeded

However, the process seems to be continuing after the error, and after waiting for 2~3h, coredns are created on the host and the vcluster seems to have started (vcluster statefulsets are also running).

% vcluster list
           NAME        |     NAMESPACE      | STATUS  |    VERSION    | CONNECTED |    AGE
  ---------------------+--------------------+---------+---------------+-----------+-------------
    vcluster-debug     | vcluster-tsuchiya  | Running | 0.20.0-beta.5 |           | 19m25s
2024-07-01 10:26:11 INFO    commandwriter/commandwriter.go:126  Slow SQL (started: 2024-07-01 10:25:12.531404609 +0000 UTC m=+93.976543691) (total time: 59.425698212s):  SELECT * FROM ( SELECT ( SELECT MAX(rkv.id) AS id FROM kine AS rkv), ( SELECT MAX(crkv.prev_revision) AS prev_revision FROM kine AS crkv WHERE crkv.name = 'compact_rev_key'), kv.id AS theid, kv.name, kv.created, kv.deleted, kv.create_revision, kv.prev_revision, kv.lease, kv.value, kv.old_value FROM kine AS kv JOIN ( SELECT MAX(mkv.id) AS id FROM kine AS mkv WHERE mkv.name LIKE ? GROUP BY mkv.name) AS maxkv ON maxkv.id = kv.id WHERE kv.deleted = 0 OR ? ) AS lkv ORDER BY lkv.theid ASC  : [[/registry/health false]] {"component": "vcluster", "component": "kine", "time": "2024-07-01T10:26:11.957183227Z", "level": "info"}
2024-07-01 10:26:11 INFO    commandwriter/commandwriter.go:126  Slow SQL (started: 2024-07-01 10:25:41.706720112 +0000 UTC m=+123.151859217) (total time: 30.25053982s):  SELECT * FROM ( SELECT ( SELECT MAX(rkv.id) AS id FROM kine AS rkv), ( SELECT MAX(crkv.prev_revision) AS prev_revision FROM kine AS crkv WHERE crkv.name = 'compact_rev_key'), kv.id AS theid, kv.name, kv.created, kv.deleted, kv.create_revision, kv.prev_revision, kv.lease, kv.value, kv.old_value FROM kine AS kv JOIN ( SELECT MAX(mkv.id) AS id FROM kine AS mkv WHERE mkv.name LIKE ? GROUP BY mkv.name) AS maxkv ON maxkv.id = kv.id WHERE kv.deleted = 0 OR ? ) AS lkv ORDER BY lkv.theid ASC  : [[/registry/resourcequotas/kube-node-lease/% false]]   {"component": "vcluster", "component": "kine", "time": "2024-07-01T10:26:11.957363891Z", "level": "info"}
2024-07-01 10:26:11 INFO    commandwriter/commandwriter.go:126  Slow SQL (started: 2024-07-01 10:25:40.679653418 +0000 UTC m=+122.124792499) (total time: 31.277724555s):  SELECT ( SELECT MAX(rkv.id) AS id FROM kine AS rkv), COUNT(c.theid) FROM ( SELECT * FROM ( SELECT ( SELECT MAX(rkv.id) AS id FROM kine AS rkv), ( SELECT MAX(crkv.prev_revision) AS prev_revision FROM kine AS crkv WHERE crkv.name = 'compact_rev_key'), kv.id AS theid, kv.name, kv.created, kv.deleted, kv.create_revision, kv.prev_revision, kv.lease, kv.value, kv.old_value FROM kine AS kv JOIN ( SELECT MAX(mkv.id) AS id FROM kine AS mkv WHERE mkv.name LIKE ? GROUP BY mkv.name) AS maxkv ON maxkv.id = kv.id WHERE kv.deleted = 0 OR ? ) AS lkv ORDER BY lkv.theid ASC ) c : [[/registry/controllers/% false]]   {"component": "vcluster", "component": "kine", "time": "2024-07-01T10:26:11.957477047Z", "level": "info"}
2024-07-01 10:26:11 INFO    commandwriter/commandwriter.go:126  Slow SQL (started: 2024-07-01 10:25:31.467582359 +0000 UTC m=+112.912721474) (total time: 40.489952153s):  SELECT ( SELECT MAX(rkv.id) AS id FROM kine AS rkv), COUNT(c.theid) FROM ( SELECT * FROM ( SELECT ( SELECT MAX(rkv.id) AS id FROM kine AS rkv), ( SELECT MAX(crkv.prev_revision) AS prev_revision FROM kine AS crkv WHERE crkv.name = 'compact_rev_key'), kv.id AS theid, kv.name, kv.created, kv.deleted, kv.create_revision, kv.prev_revision, kv.lease, kv.value, kv.old_value FROM kine AS kv JOIN ( SELECT MAX(mkv.id) AS id FROM kine AS mkv WHERE mkv.name LIKE ? GROUP BY mkv.name) AS maxkv ON maxkv.id = kv.id WHERE kv.deleted = 0 OR ? ) AS lkv ORDER BY lkv.theid ASC ) c : [[/registry/clusterroles/% false]]  {"component": "vcluster", "component": "kine", "time": "2024-07-01T10:26:11.957598185Z", "level": "info"}
2024-07-01 10:26:11 INFO    commandwriter/commandwriter.go:126  Slow SQL (started: 2024-07-01 10:25:27.298280599 +0000 UTC m=+108.743419704) (total time: 44.659296202s):  SELECT ( SELECT MAX(rkv.id) AS id FROM kine AS rkv), COUNT(c.theid) FROM ( SELECT * FROM ( SELECT ( SELECT MAX(rkv.id) AS id FROM kine AS rkv), ( SELECT MAX(crkv.prev_revision) AS prev_revision FROM kine AS crkv WHERE crkv.name = 'compact_rev_key'), kv.id AS theid, kv.name, kv.created, kv.deleted, kv.create_revision, kv.prev_revision, kv.lease, kv.value, kv.old_value FROM kine AS kv JOIN ( SELECT MAX(mkv.id) AS id FROM kine AS mkv WHERE mkv.name LIKE ? GROUP BY mkv.name) AS maxkv ON maxkv.id = kv.id WHERE kv.deleted = 0 OR ? ) AS lkv ORDER BY lkv.theid ASC ) c : [[/registry/roles/% false]] {"component": "vcluster", "component": "kine", "time": "2024-07-01T10:26:11.957661015Z", "level": "info"}
2024-07-01 10:26:11 INFO    commandwriter/commandwriter.go:126  Slow SQL (started: 2024-07-01 10:25:30.262912024 +0000 UTC m=+111.708051125) (total time: 41.694783275s):  SELECT ( SELECT MAX(rkv.id) AS id FROM kine AS rkv), COUNT(c.theid) FROM ( SELECT * FROM ( SELECT ( SELECT MAX(rkv.id) AS id FROM kine AS rkv), ( SELECT MAX(crkv.prev_revision) AS prev_revision FROM kine AS crkv WHERE crkv.name = 'compact_rev_key'), kv.id AS theid, kv.name, kv.created, kv.deleted, kv.create_revision, kv.prev_revision, kv.lease, kv.value, kv.old_value FROM kine AS kv JOIN ( SELECT MAX(mkv.id) AS id FROM kine AS mkv WHERE mkv.name LIKE ? GROUP BY mkv.name) AS maxkv ON maxkv.id = kv.id WHERE kv.deleted = 0 OR ? ) AS lkv ORDER BY lkv.theid ASC ) c : [[/registry/namespaces/% false]]    {"component": "vcluster", "component": "kine", "time": "2024-07-01T10:26:11.957749559Z", "level": "info"}
2024-07-01 10:26:11 INFO    commandwriter/commandwriter.go:126  Slow SQL (started: 2024-07-01 10:25:42.577832235 +0000 UTC m=+124.022971337) (total time: 29.380034876s):  SELECT ( SELECT MAX(rkv.id) AS id FROM kine AS rkv), COUNT(c.theid) FROM ( SELECT * FROM ( SELECT ( SELECT MAX(rkv.id) AS id FROM kine AS rkv), ( SELECT MAX(crkv.prev_revision) AS prev_revision FROM kine AS crkv WHERE crkv.name = 'compact_rev_key'), kv.id AS theid, kv.name, kv.created, kv.deleted, kv.create_revision, kv.prev_revision, kv.lease, kv.value, kv.old_value FROM kine AS kv JOIN ( SELECT MAX(mkv.id) AS id FROM kine AS mkv WHERE mkv.name LIKE ? GROUP BY mkv.name) AS maxkv ON maxkv.id = kv.id WHERE kv.deleted = 0 OR ? ) AS lkv ORDER BY lkv.theid ASC ) c : [[/registry/controllerrevisions/% false]]   {"component": "vcluster", "component": "kine", "time": "2024-07-01T10:26:11.957930108Z", "level": "info"}
2024-07-01 10:26:11 INFO    commandwriter/commandwriter.go:126  Slow SQL (started: 2024-07-01 10:25:14.531691988 +0000 UTC m=+95.976831073) (total time: 57.426216029s):  SELECT * FROM ( SELECT ( SELECT MAX(rkv.id) AS id FROM kine AS rkv), ( SELECT MAX(crkv.prev_revision) AS prev_revision FROM kine AS crkv WHERE crkv.name = 'compact_rev_key'), kv.id AS theid, kv.name, kv.created, kv.deleted, kv.create_revision, kv.prev_revision, kv.lease, kv.value, kv.old_value FROM kine AS kv JOIN ( SELECT MAX(mkv.id) AS id FROM kine AS mkv WHERE mkv.name LIKE ? GROUP BY mkv.name) AS maxkv ON maxkv.id = kv.id WHERE kv.deleted = 0 OR ? ) AS lkv ORDER BY lkv.theid ASC  : [[/registry/health false]] {"component": "vcluster", "component": "kine", "time": "2024-07-01T10:26:11.957967405Z", "level": "info"}
2024-07-01 10:26:11 INFO    commandwriter/commandwriter.go:126  Slow SQL (started: 2024-07-01 10:25:31.211477835 +0000 UTC m=+112.656616940) (total time: 40.746481705s):  SELECT ( SELECT MAX(rkv.id) AS id FROM kine AS rkv), COUNT(c.theid) FROM ( SELECT * FROM ( SELECT ( SELECT MAX(rkv.id) AS id FROM kine AS rkv), ( SELECT MAX(crkv.prev_revision) AS prev_revision FROM kine AS crkv WHERE crkv.name = 'compact_rev_key'), kv.id AS theid, kv.name, kv.created, kv.deleted, kv.create_revision, kv.prev_revision, kv.lease, kv.value, kv.old_value FROM kine AS kv JOIN ( SELECT MAX(mkv.id) AS id FROM kine AS mkv WHERE mkv.name LIKE ? GROUP BY mkv.name) AS maxkv ON maxkv.id = kv.id WHERE kv.deleted = 0 OR ? ) AS lkv ORDER BY lkv.theid ASC ) c : [[/registry/priorityclasses/% false]]   {"component": "vcluster", "component": "kine", "time": "2024-07-01T10:26:11.958040577Z", "level": "info"}
2024-07-01 10:26:11 INFO    commandwriter/commandwriter.go:126  Slow SQL (started: 2024-07-01 10:25:41.51519597 +0000 UTC m=+122.960335055) (total time: 30.442772809s):  SELECT ( SELECT MAX(rkv.id) AS id FROM kine AS rkv), COUNT(c.theid) FROM ( SELECT * FROM ( SELECT ( SELECT MAX(rkv.id) AS id FROM kine AS rkv), ( SELECT MAX(crkv.prev_revision) AS prev_revision FROM kine AS crkv WHERE crkv.name = 'compact_rev_key'), kv.id AS theid, kv.name, kv.created, kv.deleted, kv.create_revision, kv.prev_revision, kv.lease, kv.value, kv.old_value FROM kine AS kv JOIN ( SELECT MAX(mkv.id) AS id FROM kine AS mkv WHERE mkv.name LIKE ? GROUP BY mkv.name) AS maxkv ON maxkv.id = kv.id WHERE kv.deleted = 0 OR ? ) AS lkv ORDER BY lkv.theid ASC ) c : [[/registry/clusterrolebindings/% false]]    {"component": "vcluster", "component": "kine", "time": "2024-07-01T10:26:11.95805738Z", "level": "info"}
2024-07-01 10:26:11 INFO    commandwriter/commandwriter.go:126  Slow SQL (started: 2024-07-01 10:25:24.703721241 +0000 UTC m=+106.148860311) (total time: 47.254266965s):  SELECT ( SELECT MAX(rkv.id) AS id FROM kine AS rkv), COUNT(c.theid) FROM ( SELECT * FROM ( SELECT ( SELECT MAX(rkv.id) AS id FROM kine AS rkv), ( SELECT MAX(crkv.prev_revision) AS prev_revision FROM kine AS crkv WHERE crkv.name = 'compact_rev_key'), kv.id AS theid, kv.name, kv.created, kv.deleted, kv.create_revision, kv.prev_revision, kv.lease, kv.value, kv.old_value FROM kine AS kv JOIN ( SELECT MAX(mkv.id) AS id FROM kine AS mkv WHERE mkv.name LIKE ? GROUP BY mkv.name) AS maxkv ON maxkv.id = kv.id WHERE kv.deleted = 0 OR ? ) AS lkv ORDER BY lkv.theid ASC ) c : [[/registry/horizontalpodautoscalers/% false]]  {"component": "vcluster", "component": "kine", "time": "2024-07-01T10:26:11.958066288Z", "level": "info"}
2024-07-01 10:26:11 INFO    commandwriter/commandwriter.go:126  Slow SQL (started: 2024-07-01 10:25:40.315286761 +0000 UTC m=+121.760425838) (total time: 31.642926394s):  SELECT ( SELECT MAX(rkv.id) AS id FROM kine AS rkv), COUNT(c.theid) FROM ( SELECT * FROM ( SELECT ( SELECT MAX(rkv.id) AS id FROM kine AS rkv), ( SELECT MAX(crkv.prev_revision) AS prev_revision FROM kine AS crkv WHERE crkv.name = 'compact_rev_key'), kv.id AS theid, kv.name, kv.created, kv.deleted, kv.create_revision, kv.prev_revision, kv.lease, kv.value, kv.old_value FROM kine AS kv JOIN ( SELECT MAX(mkv.id) AS id FROM kine AS mkv WHERE mkv.name LIKE ? GROUP BY mkv.name) AS maxkv ON maxkv.id = kv.id WHERE kv.deleted = 0 OR ? ) AS lkv ORDER BY lkv.theid ASC ) c : [[/registry/volumeattachments/% false]] {"component": "vcluster", "component": "kine", "time": "2024-07-01T10:26:11.958263937Z", "level": "info"}
2024-07-01 10:26:11 INFO    commandwriter/commandwriter.go:126  Slow SQL (started: 2024-07-01 10:25:16.164139624 +0000 UTC m=+97.609278726) (total time: 55.794088971s):  SELECT ( SELECT MAX(rkv.id) AS id FROM kine AS rkv), COUNT(c.theid) FROM ( SELECT * FROM ( SELECT ( SELECT MAX(rkv.id) AS id FROM kine AS rkv), ( SELECT MAX(crkv.prev_revision) AS prev_revision FROM kine AS crkv WHERE crkv.name = 'compact_rev_key'), kv.id AS theid, kv.name, kv.created, kv.deleted, kv.create_revision, kv.prev_revision, kv.lease, kv.value, kv.old_value FROM kine AS kv JOIN ( SELECT MAX(mkv.id) AS id FROM kine AS mkv WHERE mkv.name LIKE ? GROUP BY mkv.name) AS maxkv ON maxkv.id = kv.id WHERE kv.deleted = 0 OR ? ) AS lkv ORDER BY lkv.theid ASC ) c : [[/registry/pods/% false]]   {"component": "vcluster", "component": "kine", "time": "2024-07-01T10:26:11.958304142Z", "level": "info"}
2024-07-01 10:26:11 INFO    commandwriter/commandwriter.go:126  Slow SQL (started: 2024-07-01 10:25:16.381819263 +0000 UTC m=+97.826958368) (total time: 55.576580865s):  SELECT ( SELECT MAX(rkv.id) AS id FROM kine AS rkv), COUNT(c.theid) FROM ( SELECT * FROM ( SELECT ( SELECT MAX(rkv.id) AS id FROM kine AS rkv), ( SELECT MAX(crkv.prev_revision) AS prev_revision FROM kine AS crkv WHERE crkv.name = 'compact_rev_key'), kv.id AS theid, kv.name, kv.created, kv.deleted, kv.create_revision, kv.prev_revision, kv...

When I try to connect to the created vCluster and create a pod, the API takes quite a long time to respond, which interferes with the creation of the pod.

% kubectl get all
% kubectl apply -f https://k8s.io/examples/application/deployment.yaml
deployment.apps/nginx-deployment created
% date; time kubectl get pod
2024年 7月 2日 火曜日 16時39分53秒 JST
NAME                               READY   STATUS    RESTARTS   AGE
nginx-deployment-cbdccf466-ccz6s   0/1     Pending   0          80s
nginx-deployment-cbdccf466-j4dzh   0/1     Pending   0          80s
kubectl get pod  0.06s user 0.04s system 0% cpu 20.149 total
% date; time kubectl get pod
2024年 7月 2日 火曜日 16時42分56秒 JST
NAME                               READY   STATUS    RESTARTS   AGE
nginx-deployment-cbdccf466-j4dzh   0/1     Pending   0          5m31s
nginx-deployment-cbdccf466-ccz6s   1/1     Running   0          5m31s
kubectl get pod  0.06s user 0.03s system 31% cpu 0.281 total

What did you expect to happen?

In another environment, the response is coming in at this rate, so I would like it to be that way. The following log environment is the Minikube environment.

ubuntu@k8s-worker-2:~$ vcluster create my-vcluster --namespace team-xvcluster create my-vcluster --namespace team-x
06:41:36 info Downloading [command helm]
06:41:37 info Creating namespace team-x
06:41:37 info Detected local kubernetes cluster minikube. Will deploy vcluster with a NodePort & sync real nodes
06:41:37 info Create vcluster my-vcluster...
06:41:37 info execute command: helm upgrade my-vcluster /tmp/vcluster-0.20.0-beta.9.tgz-2881249151 --create-namespace --kubeconfig /tmp/3016070667 --namespace team-x --install --repository-config='' --values /tmp/596304531
06:41:39 done Successfully created virtual cluster my-vcluster in namespace team-x
06:41:41 info Waiting for vcluster to come up...
06:41:55 warn vcluster is waiting, because vcluster pod my-vcluster-0 has status: Init:0/3
06:42:07 warn vcluster is waiting, because vcluster pod my-vcluster-0 has status: Init:1/3
06:42:17 warn vcluster is waiting, because vcluster pod my-vcluster-0 has status: Init:2/3
06:42:41 info Starting proxy container...
06:42:46 done Switched active kube context to vcluster_my-vcluster_team-x_minikube
- Use `vcluster disconnect` to return to your previous kube context
- Use `kubectl get namespaces` to access the vcluster

How can we reproduce it (as minimally and precisely as possible)?

% vcluster create vlsuter-debug

% kubectl apply -f https://k8s.io/examples/application/deployment.yaml deployment.apps/nginx-deployment created

Anything else we need to know?

No response

Host cluster Kubernetes version

```console $ kubectl version # % kubectl version Client Version: v1.30.1 Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3 Server Version: v1.27.8 ```

vcluster version

```console vcluster --version vcluster version 0.20.0-beta.5 ```

VCluster Config

No settings.

``` # My vcluster.yaml / values.yaml here ```
FabianKramm commented 3 days ago

@hr-tsu thanks for creating this issue! Are you using NFS or similar as a backing store for the vCluster? These slow SQL queries show that the underlying storage seems to be slow. To check that you can turn off persistence via:

controlPlane:
  statefulSet:
    persistence:
      volumeClaim:
        enabled: false
hr-tsu commented 1 day ago

@FabianKramm Thanks for the comment. The storage product was in this environment, so I put in the settings you gave me and re-created it, and it was created immediately, just like the Minikube environment.

Shouldn't the vCluster environment use the storage product? Will I keep having the same problem if I use the storage product?