Closed hinewin closed 2 days ago
@hinewin Thanks for raising the issue!
Regarding the 403 on certain resources in named groups (e.g. the `apis/flowcontrol.apiserver.k8s.io` endpoint you posted above): this could be due to missing RBAC permissions. By default, vCluster runs with minimal permissions. They can be increased by activating them in the values.yaml. See: https://www.vcluster.com/docs/syncer/core_resources
Regarding the timeouts: would you mind providing us with more information regarding the utilization of the Kubernetes API server/host cluster? Could it be that it is/was under heavy load?
Hello @johannesfrey, thanks for the quick reply.
Per default, vCluster runs with minimal permissions.
Some of the errors in the log I provided involve endpoints that are already enabled by default, such as `/api/v1/services`, `/api/v1/configmaps`, and `/api/v1/secrets`. For further troubleshooting, I exec'd into a pod and was able to reach these endpoints successfully, so I find that pretty odd.
I am not exactly getting a timeout error on the `kubectl apply` command; however, there is a noticeable delay before the request completes. From the host cluster's perspective, all resources seem to be underutilized.
Sure 🙂 .
I just saw that the host cluster's Kubernetes version is v1.25, which is not supported anymore with vcluster v0.19.X. I don't know if it's feasible for you to either upgrade the host cluster or to use vcluster v0.18.X (where v1.25 is supported). Just to rule out any incompatibility issues beforehand.
Hi @johannesfrey , I've updated the host cluster to the supported version (v1.26.14) but unfortunately still facing the same issue. vcluster-2024-03-26.log
Same issue here:
syncer 2024-03-28 10:25:48 ERROR filters/wrap.go:54 timeout or abort while handling: method=GET URI="/api/v1/services?allowWatchBookmarks=true&resourceVersion=14537&timeout=6m20s&timeoutSeconds=380&watch=true" audit-ID="e3598946-1fff-4190-b8d4-c8a8ccaa0333" {"component": "vcluster"}
with k8s 1.28.4 and vcluster 0.19.5
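For context, the `timeout=6m20s&timeoutSeconds=380` in the logged URI is a timeout the watching client itself requested: client-go's reflector picks a random watch timeout between 5 and 10 minutes, so watches closing around that mark are normal. A quick sketch (not vcluster code) for pulling these parameters out of such a log line:

```python
from urllib.parse import parse_qs, urlparse

uri = ("/api/v1/services?allowWatchBookmarks=true&resourceVersion=14537"
       "&timeout=6m20s&timeoutSeconds=380&watch=true")
params = parse_qs(urlparse(uri).query)

# watch=true marks a long-lived watch request; timeoutSeconds is the
# deadline the client asked the server to close the watch after
# (here 380s = 6m20s).
assert params["watch"] == ["true"]
print(params["timeoutSeconds"][0])
```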
@hinewin @SCLogo would you mind providing the config values you used to create the vclusters, so I can try to reproduce this as closely as possible? Just to see if there are similarities that might help track down the reason. Also, could you check whether the apiserver of the host cluster restarted at some point in time.
@SCLogo is your host k8s cluster also rancher based?
Attached is a file that contains all of the values I use for the vCluster helm chart. vcluster-values.txt
Regarding
if the apiserver of the host cluster did restart at some point in time.
Sorry I'm not quite sure what exactly you mean by this, but none of the control planes restarted on its own. However, I've recently done a rolling restart on each control plane to upgrade its resources as part of troubleshooting this vCluster delay issue. Other than that, the nodes have always been stable.
Thx for the info. Yeah, just trying to connect the dots and rule out whether there was any overload (you ruled this out already) or connection issues to the underlying API server of the host, because the logs look like the watches that the syncer opens to this API server time out or are being aborted.
Also, in your initial post you mentioned the vCluster distro to be k8s but in the values you attached you set the k3sToken
value, which would indicate k3s rather than k8s.
@johannesfrey my mistake, the vCluster Kubernetes distribution is k3s.
Sorry for the late response; here is the values.yaml:
api:
  image: registry.k8s.io/kube-apiserver:v1.28.2
controller:
  image: registry.k8s.io/kube-controller-manager:v1.28.2
coredns:
  image: coredns/coredns:1.11.1
etcd:
  storage:
    className: default
ingress:
  annotations:
    external-dns.alpha.kubernetes.io/hostname: example.site.host.com
  enabled: true
  host: example.site.host.com
  ingressClassName: internal-ingress
mapServices:
  fromHost:
    - from: default/app
      to: default/app
proxy:
  metricsServer:
    nodes:
      enabled: true
    pods:
      enabled: true
sync:
  ingresses:
    enabled: true
  networkpolicies:
    enabled: true
  nodes:
    enabled: true
  persistentvolumeclaims:
    enabled: true
  persistentvolumes:
    enabled: true
  poddisruptionbudgets:
    enabled: true
  storageclasses:
    enabled: true
syncer:
  extraArgs:
    - --tls-san=example.site.host.com
    - --sync-labels="app.kubernetes.io/instance"
The apiserver on the host restarted twice since the cluster has been running.
The host is not Rancher-based; we create host clusters with kOps.
@hinewin @SCLogo Thanks for the info and sorry for the late response.
I assume that the issue still occurs? If so, could you test disabling the Metrics Server Proxy and see if that changes the situation, essentially removing this part from your values:
proxy:
  metricsServer:
    nodes:
      enabled: true
    pods:
      enabled: true
I have a similar issue; I did not yet try to disable the metrics server.
2024-04-17 11:06:53 INFO pod.iss-backend syncer/syncer.go:134 update virtual pod iss/iss-backend, because status has changed {"component": "vcluster"}
2024-04-17 11:07:12 ERROR filters/wrap.go:54 timeout or abort while handling: method=GET URI="/api/v1/namespaces/iss/events?resourceVersion=2237&watch=true" audit-ID="9eb369f2-47bd-4134-ba75-84d920a7722b" {"component": "vcluster"}
2024-04-17 11:07:16 INFO node.opti0 syncer/syncer.go:134 update virtual node opti0, because status has changed {"component": "vcluster"}
...
2024-04-17 11:15:09 INFO node.nuc5 syncer/syncer.go:134 update virtual node nuc5, because status has changed {"component": "vcluster"}
2024-04-17 11:15:28 ERROR handler/handler.go:49 Error while proxying request: context canceled {"component": "vcluster"}
2024-04-17 11:15:28 ERROR filters/timeout.go:142 post-timeout activity - time-elapsed: 30.018µs, GET "/openapi/v3" result: <nil> {"component": "vcluster"}
2024-04-17 11:15:58 INFO node.nuc2 syncer/syncer.go:134 update virtual node nuc2, because status has changed {"component": "vcluster"}
Syncer image: ghcr.io/loft-sh/vcluster:0.19.5
It looks like one can reproducibly trigger this error message (`ERROR filters/wrap.go:54 timeout or abort while handling`) by doing e.g. `logs -f` or `events -w` (essentially streaming via kubectl) and aborting via CTRL-C (SIGINT) while connected to a virtual cluster.
E.g.:
kubectl -n kube-system logs -f -l=k8s-app=kube-dns
vcluster vcluster-0 syncer 2024-04-17 12:46:09 ERROR filters/wrap.go:54 timeout or abort while handling: method=GET URI="/api/v1/namespaces/kube-system/pods/coredns-68559449b6-s46p7/log?container=coredns&follow=true&tailLines=10" audit-ID="1ab56f93-2163-4b0f-be59-085e275df667" {"component": "vcluster"}
kubectl events -w
vcluster vcluster-0 syncer 2024-04-17 12:47:02 ERROR filters/wrap.go:54 timeout or abort while handling: method=GET URI="/api/v1/namespaces/default/events?limit=500&watch=true" audit-ID="b0fb4065-53c9-4e2c-8311-e600eb4b3c50" {"component": "vcluster"}
Will check if this is kind of "expected" (which shouldn't log an ERROR then) or not.
Apart from that, is your virtual cluster running fine or do you perceive other slowness/delays? (@everflux @SCLogo)
I experience latency of tens of seconds when interacting with the API server in the guest cluster, with no load on the host cluster that could explain it.
Another curiosity that might be related: I installed the metrics server in the guest. `top nodes` works, but for `top pods` I get a permission denied error showing the guest namespace from the host's perspective as part of the error message.
Could you try disabling the proxy metrics server. Also, would you mind sharing your values?
I will give it a try once the clusters are no longer in use. Here are my values:
sync:
  ingresses:
    enabled: true
  networkpolicies:
    enabled: false
  # for metricsserver
  nodes:
    enabled: true
    fakeKubeletIPs: false
    syncAllNodes: true
  # makes SCs visible
  hoststorageclasses:
    enabled: true
  persistentvolumes:
    enabled: true
proxy:
  metricsServer:
    nodes:
      enabled: true
    pods:
      enabled: false
syncer:
  storage:
    className: local-path
  extraArgs:
    - --tls-san=student01.example.com
    - --tls-san=student01.student01-vcluster
Unfortunately, even after disabling the Metrics Server Proxy, the issue persists. This deployment took 2 minutes to go through, and I received this error:
W0418 14:28:42.736546 16322 reflector.go:462] vendor/k8s.io/client-go/tools/watch/informerwatcher.go:146: watch of *unstructured.Unstructured ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 11; INTERNAL_ERROR; received from peer") has prevented the request from succeeding
Are those the logs from the syncer? This looks different from the initial error log, which was `ERROR filters/wrap.go:54 timeout or abort while handling`.
I re-tested leaving out the metrics proxy and I could not reproduce the problem with that setup, even quickly spawning 100 pods/deleting them did not show any errors in the syncer log.
@johannesfrey
Are those the logs from the syncer?
My apologies, the log message I sent above was from the CLI. Here is the syncer log, as I am still experiencing the same issue with the metrics proxy disabled. The issue is quite erratic; I am using a script to create and delete resources and still facing the same delay.
Unfortunately leaving out metrics does not seem to solve the problem completely. Here is what I did:
num_pods=100; for ((i=1; i<=$num_pods; i++)); do pod_name="pod-$i"; kubectl run $pod_name --image=php:8-apache; done
kubectl delete pods --all
The following error could be observed, but the roughly 30-second complete hang where the API server does not respond did not happen.
2024-04-20 12:14:22 ERROR filters/wrap.go:54 timeout or abort while handling: method=GET URI="/api/v1/namespaces/default/pods?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dpod-4&resourceVersion=2781&timeoutSeconds=428&watch=true" audit-ID="fd1f53a9-4c1f-4d2c-9fff-f929505d9f76" {"component": "vcluster"}
This is my yaml for that case
sync:
  ingresses:
    enabled: true
  networkpolicies:
    enabled: false
  # for metricsserver
  nodes:
    enabled: true
    fakeKubeletIPs: true
    syncAllNodes: true
  # makes SCs visible
  hoststorageclasses:
    enabled: true
  persistentvolumes:
    enabled: true
If I disable fakeKubeletIPs, I observe the timeouts occurring more often, though that is a subjective impression.
Thanks for testing this out @everflux and for providing additional logs @hinewin. So I guess this problem is twofold:
In your current setup you don't experience any delays right now @everflux?
I am not sure what triggers the unresponsive API server, so I have no clue how to force or reproduce the issue. It feels like a lock-up that is resolved by a timeout or by something restarting.
Any ideas to try are appreciated.
I re-tested with 0.20.0-beta1 and could observe the hang of about 20 seconds again. This is what I see in the log; you can observe the 20-second gap in the syncer from the timestamps.
2024-05-01 10:49:45 INFO controller/controller.go:344 controller node: namespace host# name nuc1: reconcileID "f0967b50-687f-414c-88cf-c9b862f4a331": Reconcile successful {"component": "vcluster-pro"}
2024-05-01 10:49:45 INFO controller/controller.go:344 controller node: namespace name nuc1: reconcileID "0858a8d6-21af-41b6-87ab-78aa145dc805": Reconcile successful {"component": "vcluster-pro"}
2024-05-01 10:50:05 INFO controller/controller.go:315 controller node: namespace host# name opti0: reconcileID "3c39474d-1d61-4ac7-af62-94c113c0907e": Reconciling {"component": "vcluster-pro"}
2024-05-01 10:50:05 INFO controller/controller.go:344 controller node: namespace host# name opti0: reconcileID "3c39474d-1d61-4ac7-af62-94c113c0907e": Reconcile successful {"component": "vcluster-pro"}
2024-05-01 10:50:06 ERROR handler/handler.go:49 Error while proxying request: context canceled {"component": "vcluster-pro"}
2024-05-01 10:50:06 ERROR filters/timeout.go:142 post-timeout activity - time-elapsed: 10.4µs, GET "/openapi/v3" result: <nil> {"component": "vcluster-pro"}
2024-05-01 10:50:06 INFO controller/controller.go:315 controller configmap: namespace code-server name kube-root-ca.crt: reconcileID "e621b770-b89a-47a4-9387-c1584188b469": Reconciling {"component": "vcluster-pro"}
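To spot such stalls mechanically rather than by eyeballing timestamps, one can diff consecutive log-line timestamps. A throwaway helper (my own sketch, assuming the `YYYY-MM-DD HH:MM:SS` prefix format shown above):

```python
from datetime import datetime

def find_gaps(lines, threshold_s=10.0):
    """Return (gap_seconds, line) pairs where the timestamp prefix of a
    log line jumps more than threshold_s seconds past the previous line."""
    gaps, prev = [], None
    for line in lines:
        ts = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        if prev is not None:
            delta = (ts - prev).total_seconds()
            if delta > threshold_s:
                gaps.append((delta, line))
        prev = ts
    return gaps

log = [
    "2024-05-01 10:49:45 INFO controller/controller.go:344 ...",
    "2024-05-01 10:49:45 INFO controller/controller.go:344 ...",
    "2024-05-01 10:50:05 INFO controller/controller.go:315 ...",
]
print(find_gaps(log))  # surfaces the 20s stall between 10:49:45 and 10:50:05
```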
I've noticed a similar limitation (vcluster-k8s 0.19.3). Reverting to the old distro (0.15.3).
I observe the same issue with 0.20.0-beta2
@johannesfrey Is there any specific input you are waiting for as this issue is marked as waiting for details?
I noticed this
2024-05-09 11:18:16 ERROR handler/handler.go:49 Error while proxying request: context canceled {"component": "vcluster"}
2024-05-09 11:18:16 INFO commandwriter/commandwriter.go:126 loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: Error, could not get list of group versions for APIService {"component": "vcluster", "component": "k3s", "location": "controller.go:113"}
2024-05-09 11:18:16 INFO commandwriter/commandwriter.go:126 OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue. {"component": "vcluster", "component": "k3s", "location": "controller.go:126"}
2024-05-09 11:18:16 ERROR filters/timeout.go:142 post-timeout activity - time-elapsed: 2.495µs, GET "/openapi/v3" result: <nil> {"component": "vcluster"}
which lead me to this issue https://github.com/kubernetes/kubernetes/issues/56430 Is this a problem in the host cluster?
@everflux Sorry for the delay. I'm currently occupied with other tasks, but I'll take a look as soon as possible.
I'm also seeing the behavior described in this issue. I made an issue before realizing that this issue existed. I'll leave it open for now as it has some details relevant to my configuration. I am running vcluster version v0.20.0-beta.5 on GKE 1.29 with the k8s distro.
@Interaze Did that fix the issue?
@alextricity25 Would it be possible for you to also share your config values for your virtual cluster.
This issue could be related to https://github.com/kubernetes/kubernetes/issues/123448, which was fixed with Kubernetes versions v1.27.13 and v1.28.9
@johannesfrey Sure thing! Here is my values.yaml
I'm running version v1.29.1-gke.1589018 on the host cluster, and my virtual cluster version is v1.29.0
@alextricity25 v1.29.0 has also that problem can you try with v1.29.3 and above?
Should we test with 1.29.3 as host or vcluster version?
@everflux vCluster version needs to be v1.29.3 or higher
I still had the issue with v1.29.4 (k3s), with the host cluster being v1.28.2. As the linked issue seems to be related to etcd: I believe k3s uses SQLite and not etcd anyway, right? (I used `controlPlane.backingStore.database.embedded.enabled=true`.) I will give it another try with 1.29.5, though.
If it helps, I can use the etcd backend for vcluster as well; should I use embedded etcd in that case or let the chart deploy a separate etcd?
@FabianKramm Thanks for your reply.
I assume you mean the version of the k8s API server on the distribution being used. I have been using the `k8s` distribution with the following helm values for a few weeks now, and we are still experiencing this issue unfortunately. Note that the API server image tag is `v1.29.5`.
distro: {
  k8s: {
    apiServer: {
      image: {
        tag: 'v1.29.5',
      },
    },
    enabled: true,
    resources: {
      requests: {
        cpu: 2,
        memory: '4Gi',
      },
      limits: {
        cpu: 2,
        memory: '4Gi',
      },
    },
  },
},
What should also be considered (already suggested by @everflux) is to configure your virtual cluster according to your needs regarding resource limits and the underlying backing store, especially for high-intensity workloads or e.g. many CRDs. In those cases embedded SQLite (which is the default backing store in v0.20) could be a bottleneck, as it might experience timeouts or throttling (see https://sqlite.org/limits.html, and regarding load towards the apiserver when leveraging many CRDs see https://blog.upbound.io/scaling-kubernetes-to-thousands-of-crds). So if it's feasible for you, you might want to test e.g. etcd as backing store in those cases. Do you also happen to use embedded SQLite as backing store @alextricity25?
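The single-writer nature of SQLite is easy to demonstrate in isolation. The toy benchmark below (plain stdlib `sqlite3`, nothing vcluster-specific, key names are made up) serializes all writes through one database file, which is the property that can turn into throttling under a chatty API server:

```python
import os, sqlite3, tempfile, time

# SQLite permits only one writer at a time: every write in the loop below
# goes through the same file lock, unlike etcd, which is built to serve
# many concurrent clients over the network.
path = os.path.join(tempfile.mkdtemp(), "state.db")
db = sqlite3.connect(path)
db.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")

n = 1000
start = time.monotonic()
for i in range(n):
    db.execute("INSERT INTO kv VALUES (?, ?)", (f"/registry/pods/p{i}", "x" * 256))
db.commit()
elapsed = time.monotonic() - start

rows = db.execute("SELECT COUNT(*) FROM kv").fetchone()[0]
print(f"{rows} sequential writes in {elapsed:.3f}s")
```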
I don't understand why this was closed, as it is not obviously resolved?
Sorry, that's part of the Linear integration.
Sorry for the late reply. I just returned from vacation.
I don't think I'm using embedded sql. I'm using the default helm chart settings with regards to the backing store.
Unfortunately we are still seeing a TON of latency between the host cluster and the vcluster. Some behavior we have noticed lately:
- Resources stuck in `Creating` or `Init` status on the vcluster, but showing as `Ready` on the host cluster
- Resources in `Creating` status in the vcluster that take 3-5 minutes to show up on the host cluster

It's frustrating at times, because restarting deployments on the vcluster ends up taking 10-15 minutes when it should really take less than a minute. We've been restarting the vcluster syncer pod just as a way to "unblock" ourselves after attempting a deployment restart on the vcluster.
Here is a quick summary of my setup again, for reference:
Host cluster: GKE v1.30.1-gke.1261000
vCluster: k8s distro, v1.29.5
helm chart values:
controlPlane:
  distro:
    k8s:
      apiServer:
        image:
          tag: v1.29.5
      enabled: true
      resources:
        limits:
          cpu: 2
          memory: 4Gi
        requests:
          cpu: 2
          memory: 4Gi
      securityContext:
        seccompProfile:
          type: Unconfined
  hostPathMapper:
    enabled: true
  proxy:
    extraSANs:
      - my-host.example.dev
  statefulSet:
    resources:
      limits:
        cpu: "3"
        ephemeral-storage: 8Gi
        memory: 10Gi
      requests:
        cpu: "3"
        ephemeral-storage: 8Gi
        memory: 10Gi
exportKubeConfig:
  server: https://my-host.example.dev
sync:
  fromHost:
    nodes:
      enabled: true
      syncBackChanges: true
  toHost:
    persistentVolumeClaims:
      enabled: true
    serviceAccounts:
      enabled: true
@alextricity25 Thx for your summary. Looking at your vcluster.yaml, you are indeed using embedded SQLite. We also mentioned it in the documentation of the beta release (unfortunately these notes did not find their way into every release note):
Embedded SQLite is now the default backing store
If it's feasible could you try using etcd as backing store via:
controlPlane:
  distro:
    k8s:
      enabled: true
  backingStore:
    etcd:
      deploy:
        enabled: true
I updated the host cluster to 1.29.6 and I think it behaved a little better, though I still encountered freezes (so no fix).
I would like to change as little as possible to pinpoint the cause; would this help validate your assumption as well?
controlPlane:
  distro:
    k3s: # note, k3s
      enabled: true
  backingStore:
    etcd:
      deploy:
        enabled: true
or even
controlPlane:
  distro:
    k3s:
      enabled: true
  backingStore:
    etcd:
      embedded:
        enabled: true
or is that a waste of time?
@everflux yes, I would at least switch the backing store to a more scalable one like etcd.
I could not run a test before the weekend, but I strongly assume that the issue is resolved by the fix for https://github.com/loft-sh/vcluster/issues/1765
Hi,
I am observing the same issue on 0.19.1.
It happens with Helm charts that run workloads making API calls to kube-api. One of them is Crossplane: e.g. installing providers yields errors due to missing RBAC, which throttles the API and brings the vcluster fully down, fixable only by deleting the PVC. Another one where I saw the same behavior was Kyverno. What's strange is that I've only observed this behavior in the last week.
Attached are some logs from the syncer (todays-0 syncer 2024-07-11 15:36:41.txt).
What happened?
We are currently running on vcluster version v0.19.0, and have encountered some issues with accessing the Kubernetes API from the syncer container.
We have been observing a series of timeout errors while making GET requests to various API endpoints from within the syncer container. The logs include messages such as:
2024-03-08 19:50:10 ERROR filters/wrap.go:54 timeout or abort while handling: method=GET URI="/api/v1/endpoints?allowWatchBookmarks=true&resourceVersion=17782778&watch=true" audit-ID="aff4ede2-a83c-4077-8a97-1bb29628aa2a" {"component": "vcluster"}
The majority of the errors involve timeouts or aborts while making requests to watch Kubernetes resources. Additionally, we are also observing a delay while trying to deploy any Kubernetes resources: the `kubectl apply -f` command stalls for a long period before the operation actually gets executed. We suspect that these delays and API access issues might be related.
As per our investigation, we tried to exec into the syncer container and directly access the mentioned endpoints. The endpoints with "API" returned a 200 status without any issues, while "APIS" endpoints returned 403.
"API" endpoint tested:
${APISERVER}/api/v1/persistentvolumes?allowWatchBookmarks=true&resourceVersion=17782782&watch=true
"APIS" Endpoint tested:apis/flowcontrol.apiserver.k8s.io/v1/flowschemas?allowWatchBookmarks=true&resourceVersion=17782783& watch=true
syncer.log
What did you expect to happen?
We would expect the API requests from within the syncer container to be handled without any issues or timeouts. The various Kubernetes endpoints should be accessible and respond in a timely manner, with the correct HTTP response codes.
On deploying YAML files, `kubectl apply -f` should execute seamlessly and promptly, without any noticeable delays.
How can we reproduce it (as minimally and precisely as possible)?
Set up vcluster v0.19.0 and attempt to access Kubernetes API endpoints manually.
Anything else we need to know?
No response
Host cluster Kubernetes version
Host cluster Kubernetes distribution
vcluster version
vCluster Kubernetes distribution (k3s (default), k8s, k0s)
OS and Arch