Closed vishalvrv9 closed 10 months ago
@vishalvrv9 the problem might be sporadic; have you tried to re-run on a clean node?
@ustiugov We initially suspected it was an intermittent issue, but after retrying the vHive infrastructure setup multiple times, we ran into the same problem. Interestingly, it is the same set of pods producing the same set of errors every time.
We have been following the quick setup guide every time. I will try setting up the cluster again today. Is there specific information I can provide to help triage this further?
EDIT: tagged @ustiugov
@anshalshukla can you please try to run a vHive cluster on CloudLab? I guess this is the default mode (default in the quickstart guide of vHive) with Firecracker MicroVMs. Does deploying and invoking a function work?
Hello @ustiugov @anshalshukla We have been trying to further analyze the root cause of this error. After setting up the infrastructure again and letting it run for a while, we checked the pod status; the results are below:
Our usual suspects are:
We checked the logs for autoscaler and activator pods and found some failures. They are mentioned below:
Namespace: knative-serving, Pod: activator-*
{"severity":"ERROR","timestamp":"2023-02-08T06:22:39.12319257Z","logger":"activator","caller":"websocket/connection.go:145","message":"Websocket connection could not be established","commit":"8ab7e69","knative.dev/controller":"activator","knative.dev/pod":"activator-7f7865c9f5-jkld4","error":"dial tcp 10.104.62.23:8080: connect: connection refused","stacktrace":"knative.dev/pkg/websocket.NewDurableConnection.func1\n\tknative.dev/pkg@v0.0.0-20220412134708-e325df66cb51/websocket/connection.go:145\nknative.dev/pkg/websocket.(*ManagedConnection).connect.func1\n\tknative.dev/pkg@v0.0.0-20220412134708-e325df66cb51/websocket/connection.go:226\nk8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1\n\tk8s.io/apimachinery@v0.23.5/pkg/util/wait/wait.go:220\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext\n\tk8s.io/apimachinery@v0.23.5/pkg/util/wait/wait.go:233\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection\n\tk8s.io/apimachinery@v0.23.5/pkg/util/wait/wait.go:226\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\tk8s.io/apimachinery@v0.23.5/pkg/util/wait/wait.go:421\nknative.dev/pkg/websocket.(*ManagedConnection).connect\n\tknative.dev/pkg@v0.0.0-20220412134708-e325df66cb51/websocket/connection.go:223\nknative.dev/pkg/websocket.NewDurableConnection.func2\n\tknative.dev/pkg@v0.0.0-20220412134708-e325df66cb51/websocket/connection.go:163"}
{"severity":"WARNING","timestamp":"2023-02-08T06:22:39.817293228Z","logger":"activator","caller":"handler/healthz_handler.go:36","message":"Healthcheck failed: connection has not yet been established","commit":"8ab7e69","knative.dev/controller":"activator","knative.dev/pod":"activator-7f7865c9f5-jkld4"}
{"severity":"ERROR","timestamp":"2023-02-08T06:22:40.390574263Z","logger":"activator","caller":"websocket/connection.go:192","message":"Failed to send ping message to ws://autoscaler.knative-serving.svc.cluster.local:8080","commit":"8ab7e69","knative.dev/controller":"activator","knative.dev/pod":"activator-7f7865c9f5-jkld4","error":"connection has not yet been established","stacktrace":"knative.dev/pkg/websocket.NewDurableConnection.func3\n\tknative.dev/pkg@v0.0.0-20220412134708-e325df66cb51/websocket/connection.go:192"}
{"severity":"ERROR","timestamp":"2023-02-08T06:22:41.04317903Z","logger":"activator","caller":"websocket/connection.go:145","message":"Websocket connection could not be established","commit":"8ab7e69","knative.dev/controller":"activator","knative.dev/pod":"activator-7f7865c9f5-jkld4","error":"dial tcp 10.104.62.23:8080: connect: connection refused","stacktrace":"knative.dev/pkg/websocket.NewDurableConnection.func1\n\tknative.dev/pkg@v0.0.0-20220412134708-e325df66cb51/websocket/connection.go:145\nknative.dev/pkg/websocket.(*ManagedConnection).connect.func1\n\tknative.dev/pkg@v0.0.0-20220412134708-e325df66cb51/websocket/connection.go:226\nk8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1\n\tk8s.io/apimachinery@v0.23.5/pkg/util/wait/wait.go:220\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext\n\tk8s.io/apimachinery@v0.23.5/pkg/util/wait/wait.go:233\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection\n\tk8s.io/apimachinery@v0.23.5/pkg/util/wait/wait.go:226\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\tk8s.io/apimachinery@v0.23.5/pkg/util/wait/wait.go:421\nknative.dev/pkg/websocket.(*ManagedConnection).connect\n\tknative.dev/pkg@v0.0.0-20220412134708-e325df66cb51/websocket/connection.go:223\nknative.dev/pkg/websocket.NewDurableConnection.func2\n\tknative.dev/pkg@v0.0.0-20220412134708-e325df66cb51/websocket/connection.go:163"}
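For reference, all of the activator errors above point at the same dial target. A small shell sketch (over a trimmed copy of one of the JSON log lines above; fields other than the relevant ones are dropped for brevity) extracts the address, which can then be compared against the ClusterIP of the autoscaler service in knative-serving:

```shell
#!/bin/sh
# Trimmed copy of one activator log line from above.
log='{"severity":"ERROR","message":"Websocket connection could not be established","error":"dial tcp 10.104.62.23:8080: connect: connection refused"}'
# Extract the host:port the activator cannot reach; this should match the
# ClusterIP of the autoscaler service in the knative-serving namespace.
echo "$log" | sed -n 's/.*"error":"dial tcp \([0-9.]*:[0-9]*\).*/\1/p'
```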
Namespace: knative-serving, Pod: autoscaler-*
E0214 07:14:32.756949 1 leaderelection.go:367] Failed to update lock: Operation cannot be fulfilled on leases.coordination.k8s.io "autoscaler-bucket-00-of-01": the object has been modified; please apply your changes to the latest version and try again
E0214 07:14:43.168916 1 leaderelection.go:367] Failed to update lock: Operation cannot be fulfilled on leases.coordination.k8s.io "autoscaler-bucket-00-of-01": the object has been modified; please apply your changes to the latest version and try again
E0214 07:14:53.768180 1 leaderelection.go:367] Failed to update lock: Operation cannot be fulfilled on leases.coordination.k8s.io "autoscaler-bucket-00-of-01": the object has been modified; please apply your changes to the latest version and try again
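These lease errors all name the contended lease object. As a sketch (using a line copied from the autoscaler log above), the lease name can be pulled out and the lease itself then inspected, e.g. with `kubectl get lease -n knative-serving <name> -o yaml`:

```shell
#!/bin/sh
# One autoscaler log line copied from above.
log='E0214 07:14:32.756949 1 leaderelection.go:367] Failed to update lock: Operation cannot be fulfilled on leases.coordination.k8s.io "autoscaler-bucket-00-of-01": the object has been modified; please apply your changes to the latest version and try again'
# Extract the name of the lease that is being contended.
echo "$log" | sed -n 's/.*leases\.coordination\.k8s\.io "\([^"]*\)".*/\1/p'
```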
We also checked the config map "config-domain" (in the knative-serving namespace), since the pod associated with that config seems to be throwing an error. It contains the following:
----
################################
# #
# EXAMPLE CONFIGURATION #
# #
################################
# This block is not actually functional configuration,
# but serves to illustrate the available configuration
# options and document them in a way that is accessible
# to users that `kubectl edit` this config map.
#
# These sample configuration options may be copied out of
# this example block and unindented to be in the data block
# to actually change the configuration.
# Default value for domain.
# Although it will match all routes, it is the least-specific rule so it
# will only be used if no other domain matches.
example.com: |
# These are example settings of domain.
# example.org will be used for routes having app=nonprofit.
example.org: |
  selector:
    app: nonprofit
# Routes having the cluster domain suffix (by default 'svc.cluster.local')
# will not be exposed through Ingress. You can define your own label
# selector to assign that domain suffix to your Route here, or you can set
# the label
# "networking.knative.dev/visibility=cluster-local"
# to achieve the same effect. This shows how to make routes having
# the label app=secret only exposed to the local cluster.
svc.cluster.local: |
  selector:
    app: secret
BinaryData
====
Events: <none>
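Note that, as the comments in the dump say, nothing in the example block is live configuration. For a custom domain to take effect, the key has to sit unindented in the data section. A minimal sketch of what that would look like (example.com here is just a placeholder, not a suggestion for this cluster):

```shell
#!/bin/sh
# Print a minimal functional config-domain manifest. The domain key must be a
# top-level entry under data, not nested inside the commented example block.
cat <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-domain
  namespace: knative-serving
data:
  example.com: ""
EOF
```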
We looked at the logs for kube-scheduler-vhive-vishal-1, kube-controller-manager-vhive-vishal-1, etcd-vhive-vishal-1, and kube-apiserver-vhive-vishal-1 under the kube-system namespace.
We looked at these pods because they were the ones that seemed to be restarting.
Pod: kube-scheduler-vhive-vishal-1, namespace: kube-system
I0211 15:32:06.280005 1 serving.go:348] Generated self-signed cert in-memory
I0211 15:32:06.680186 1 server.go:139] "Starting Kubernetes Scheduler" version="v1.23.16"
I0211 15:32:07.011279 1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0211 15:32:07.011372 1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0211 15:32:07.011386 1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0211 15:32:07.011422 1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0211 15:32:07.011683 1 secure_serving.go:200] Serving securely on 127.0.0.1:10259
I0211 15:32:07.013135 1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0211 15:32:07.013340 1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0211 15:32:07.013409 1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I0211 15:32:07.111964 1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0211 15:32:07.112172 1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0211 15:32:07.112633 1 leaderelection.go:248] attempting to acquire leader lease kube-system/kube-scheduler...
I0211 15:32:07.113770 1 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController
I0211 15:32:23.347139 1 leaderelection.go:258] successfully acquired lease kube-system/kube-scheduler
Pod: kube-controller-manager-vhive-vishal-1, namespace: kube-system
E0214 07:42:27.925393 1 horizontal.go:226] failed to compute desired number of replicas based on listed metrics for Deployment/istio-system/cluster-local-gateway: invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)
I0214 07:42:27.925588 1 event.go:294] "Event occurred" object="istio-system/cluster-local-gateway" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedGetResourceMetric" message="failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)"
I0214 07:42:27.925696 1 event.go:294] "Event occurred" object="istio-system/cluster-local-gateway" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedComputeMetricsReplicas" message="invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)"
E0214 07:42:28.254165 1 horizontal.go:226] failed to compute desired number of replicas based on listed metrics for Deployment/knative-eventing/mt-broker-ingress: invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)
I0214 07:42:28.254603 1 event.go:294] "Event occurred" object="knative-eventing/broker-ingress-hpa" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedGetResourceMetric" message="failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)"
I0214 07:42:28.254663 1 event.go:294] "Event occurred" object="knative-eventing/broker-ingress-hpa" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedComputeMetricsReplicas" message="invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)"
I0214 07:42:31.615500 1 event.go:294] "Event occurred" object="knative-serving/activator" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedGetResourceMetric" message="failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)"
I0214 07:42:31.615582 1 event.go:294] "Event occurred" object="knative-serving/activator" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedComputeMetricsReplicas" message="invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)"
E0214 07:42:31.615899 1 horizontal.go:226] failed to compute desired number of replicas based on listed metrics for Deployment/knative-serving/activator: invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)
E0214 07:42:32.550121 1 horizontal.go:226] failed to compute desired number of replicas based on listed metrics for Deployment/knative-serving/webhook: invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)
I0214 07:42:32.550333 1 event.go:294] "Event occurred" object="knative-serving/webhook" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedGetResourceMetric" message="failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)"
I0214 07:42:32.550397 1 event.go:294] "Event occurred" object="knative-serving/webhook" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedComputeMetricsReplicas" message="invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)"
E0214 07:42:32.992200 1 horizontal.go:226] failed to compute desired number of replicas based on listed metrics for Deployment/istio-system/istio-ingressgateway: invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)
I0214 07:42:32.992327 1 event.go:294] "Event occurred" object="istio-system/istio-ingressgateway" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedGetResourceMetric" message="failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)"
I0214 07:42:32.992577 1 event.go:294] "Event occurred" object="istio-system/istio-ingressgateway" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedComputeMetricsReplicas" message="invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)"
E0214 07:42:32.996529 1 horizontal.go:226] failed to compute desired number of replicas based on listed metrics for Deployment/istio-system/istiod: invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)
I0214 07:42:32.996631 1 event.go:294] "Event occurred" object="istio-system/istiod" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedGetResourceMetric" message="failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)"
I0214 07:42:32.996685 1 event.go:294] "Event occurred" object="istio-system/istiod" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedComputeMetricsReplicas" message="invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)"
E0214 07:42:33.205675 1 horizontal.go:226] failed to compute desired number of replicas based on listed metrics for Deployment/knative-eventing/mt-broker-filter: invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)
I0214 07:42:33.205833 1 event.go:294] "Event occurred" object="knative-eventing/broker-filter-hpa" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedGetResourceMetric" message="failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)"
I0214 07:42:33.205876 1 event.go:294] "Event occurred" object="knative-eventing/broker-filter-hpa" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedComputeMetricsReplicas" message="invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)"
E0214 07:42:33.591723 1 horizontal.go:226] failed to compute desired number of replicas based on listed metrics for Deployment/knative-eventing/eventing-webhook: invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)
I0214 07:42:33.591993 1 event.go:294] "Event occurred" object="knative-eventing/eventing-webhook" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedGetResourceMetric" message="failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)"
I0214 07:42:33.592055 1 event.go:294] "Event occurred" object="knative-eventing/eventing-webhook" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedComputeMetricsReplicas" message="invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)"
Pod: etcd-vhive-vishal-1, namespace: kube-system
{"level":"info","ts":"2023-02-14T07:43:51.605Z","caller":"traceutil/trace.go:171","msg":"trace[1724407267] transaction","detail":"{read_only:false; response_revision:3387309; number_of_response:1; }","duration":"127.686957ms","start":"2023-02-14T07:43:51.477Z","end":"2023-02-14T07:43:51.605Z","steps":["trace[1724407267] 'process raft request' (duration: 101.241909ms)","trace[1724407267] 'compare' (duration: 26.235243ms)"],"step_count":2}
{"level":"info","ts":"2023-02-14T07:43:51.605Z","caller":"traceutil/trace.go:171","msg":"trace[783123863] transaction","detail":"{read_only:false; response_revision:3387310; number_of_response:1; }","duration":"127.524367ms","start":"2023-02-14T07:43:51.477Z","end":"2023-02-14T07:43:51.605Z","steps":["trace[783123863] 'process raft request' (duration: 127.263362ms)"],"step_count":1}
{"level":"warn","ts":"2023-02-14T07:43:55.769Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"122.944399ms","expected-duration":"100ms","prefix":"","request":"header:<ID:6257617818430884549 username:\"kube-apiserver-etcd-client\" auth_revision:1 > txn:<compare:<target:MOD key:\"/registry/leases/kube-system/kube-controller-manager\" mod_revision:3387324 > success:<request_put:<key:\"/registry/leases/kube-system/kube-controller-manager\" value_size:439 >> failure:<request_range:<key:\"/registry/leases/kube-system/kube-controller-manager\" > >>","response":"size:20"}
{"level":"info","ts":"2023-02-14T07:43:55.770Z","caller":"traceutil/trace.go:171","msg":"trace[109153835] transaction","detail":"{read_only:false; response_revision:3387339; number_of_response:1; }","duration":"128.66627ms","start":"2023-02-14T07:43:55.641Z","end":"2023-02-14T07:43:55.770Z","steps":["trace[109153835] 'compare' (duration: 122.826285ms)"],"step_count":1}
{"level":"warn","ts":"2023-02-14T07:44:02.280Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"100.626724ms","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/registry/leases/knative-serving/webhook.webhookcertificates.00-of-01\" ","response":"range_response_count:1 size:553"}
{"level":"info","ts":"2023-02-14T07:44:02.280Z","caller":"traceutil/trace.go:171","msg":"trace[1028192541] range","detail":"{range_begin:/registry/leases/knative-serving/webhook.webhookcertificates.00-of-01; range_end:; response_count:1; response_revision:3387376; }","duration":"100.734311ms","start":"2023-02-14T07:44:02.179Z","end":"2023-02-14T07:44:02.280Z","steps":["trace[1028192541] 'range keys from in-memory index tree' (duration: 100.487699ms)"],"step_count":1}
{"level":"info","ts":"2023-02-14T07:44:03.448Z","caller":"traceutil/trace.go:171","msg":"trace[1989288108] linearizableReadLoop","detail":"{readStateIndex:3502667; appliedIndex:3502667; }","duration":"176.675242ms","start":"2023-02-14T07:44:03.272Z","end":"2023-02-14T07:44:03.448Z","steps":["trace[1989288108] 'read index received' (duration: 176.667313ms)","trace[1989288108] 'applied index is now lower than readState.Index' (duration: 6.576µs)"],"step_count":2}
{"level":"warn","ts":"2023-02-14T07:44:03.628Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"356.128666ms","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/registry/leases/knative-eventing/controller.knative.dev.eventing.pkg.reconciler.pingsource.reconciler.00-of-01\" ","response":"range_response_count:1 size:649"}
{"level":"warn","ts":"2023-02-14T07:44:03.628Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"355.619515ms","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/registry/leases/knative-eventing/eventing-webhook.sinkbindings.00-of-01\" ","response":"range_response_count:1 size:569"}
{"level":"info","ts":"2023-02-14T07:44:03.628Z","caller":"traceutil/trace.go:171","msg":"trace[1385736479] range","detail":"{range_begin:/registry/leases/knative-eventing/controller.knative.dev.eventing.pkg.reconciler.pingsource.reconciler.00-of-01; range_end:; response_count:1; response_revision:3387384; }","duration":"356.236612ms","start":"2023-02-14T07:44:03.272Z","end":"2023-02-14T07:44:03.628Z","steps":["trace[1385736479] 'agreement among raft nodes before linearized reading' (duration: 176.793143ms)","trace[1385736479] 'range keys from in-memory index tree' (duration: 179.263917ms)"],"step_count":2}
Pod: kube-apiserver-vhive-vishal-1, namespace: kube-system
I0214 07:46:31.759011 1 trace.go:205] Trace[1551177797]: "Call validating webhook" configuration:istio-validator-istio-system,webhook:rev.validation.istio.io,resource:networking.istio.io/v1alpha3, Resource=gateways,subresource:,operation:CREATE,UID:08ba3463-cd50-4ce7-8356-eb312be370a0 (14-Feb-2023 07:46:21.757) (total time: 10001ms):
Trace[1551177797]: [10.001103067s] [10.001103067s] END
W0214 07:46:31.759056 1 dispatcher.go:142] Failed calling webhook, failing open rev.validation.istio.io: failed calling webhook "rev.validation.istio.io": failed to call webhook: Post "https://istiod.istio-system.svc:443/validate?timeout=10s": context deadline exceeded
E0214 07:46:31.759115 1 dispatcher.go:149] failed calling webhook "rev.validation.istio.io": failed to call webhook: Post "https://istiod.istio-system.svc:443/validate?timeout=10s": context deadline exceeded
I0214 07:46:31.763365 1 trace.go:205] Trace[1047610551]: "Create" url:/apis/networking.istio.io/v1alpha3/namespaces/istio-system/gateways,user-agent:pilot-discovery/1.12.5,audit-id:7c0e1722-385b-4b61-90da-c258f4ceed05,client:192.168.0.8,accept:application/json, */*,protocol:HTTP/2.0 (14-Feb-2023 07:46:21.756) (total time: 10006ms):
Trace[1047610551]: ---"Object stored in database" 10006ms (07:46:31.763)
Trace[1047610551]: [10.006895846s] [10.006895846s] END
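The apiserver trace shows ten-second timeouts against istiod's validating webhook. As a sketch, the error line above already identifies which webhook and which service URL are failing, so the istiod pods and endpoints in istio-system are the next thing to check:

```shell
#!/bin/sh
# The dispatcher error line from the kube-apiserver log above.
log='E0214 07:46:31.759115 1 dispatcher.go:149] failed calling webhook "rev.validation.istio.io": failed to call webhook: Post "https://istiod.istio-system.svc:443/validate?timeout=10s": context deadline exceeded'
# Which webhook is failing...
echo "$log" | sed -n 's/.*failed calling webhook "\([^"]*\)".*/\1/p'
# ...and which service URL the apiserver could not reach.
echo "$log" | sed -n 's/.*Post "\([^"]*\)".*/\1/p'
```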
I have tried to consolidate as much information about our infrastructure setup as possible. Do you think we are looking in the right direction? If not, could you suggest how we should go about finding the root cause of this issue so we can successfully set up our test cluster?
If you think some other details would be helpful, please let us know.
Hey @vishalvrv9, which type of machine are you using on CloudLab, and with how many nodes? Over the past week I've tried running it multiple times on xl170 (Utah) and m400 (Mass) nodes and haven't faced any issues with the code on the main branch.
Still, I'll go through the logs and see if I can find anything.
Hey @anshalshukla
We have been using our own cluster under NERC (New England Research Cloud) (https://nerc-project.github.io/nerc-docs/). We use two instances of cpu-a.8 as our choice of compute under the Nova OpenStack offerings within NERC. More details on NERC OpenStack compute offerings can be found here.
Our lscpu output looks something like this:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 8
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel Xeon Processor (Cascadelake)
Stepping: 5
CPU MHz: 2893.202
BogoMIPS: 5786.40
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 256 KiB
L1i cache: 256 KiB
L2 cache: 32 MiB
L3 cache: 128 MiB
NUMA node0 CPU(s): 0-7
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; TSX disabled
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall
nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma
cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm
abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority
ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap c
lflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_
clear arch_capabilities
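Since the default vHive mode runs Firecracker MicroVMs on KVM, the key detail in the output above is the vmx flag (and `Virtualization: VT-x`): nested virtualization must be exposed inside the NERC instance. A small sketch of that check, run here against a subset of the flags quoted above (on a real node, read the flags from /proc/cpuinfo and also confirm /dev/kvm exists):

```shell
#!/bin/sh
# Subset of the CPU flags reported above; on a real node use:
#   grep -m1 '^flags' /proc/cpuinfo
flags='tsc_known_freq pni pclmulqdq vmx ssse3 fma'
case " $flags " in
  *" vmx "*) echo "vmx present: KVM (and thus Firecracker) can run" ;;
  *)         echo "vmx missing: Firecracker MicroVMs will not start" ;;
esac
```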
Hope this helps.
@vishalvrv9 do you observe any errors when running the vHive setup scripts?
@vishalvrv9 is the issue still relevant?
@ustiugov No significant errors were found when running the vHive setup scripts. I'm attaching the vhive logs from the remote vm via this comment.
I also tried setting up the vHive cluster with @anshalshukla (following the exact same steps), but we could not find the root cause of why the default-domain pod wasn't coming up.
Please find the logs attached below: containerd.stderr.txt containerd.stdout.txt create_multinode_cluster.stderr.txt create_multinode_cluster.stdout.txt setup_node.stderr.txt setup_node.stdout.txt
@NeHalem90 could you please take a look at the logs?
@vishalvrv9 Is the issue still relevant? Recently, we updated the setup process to use Go instead of Bash scripts, improving the logs; maybe that helps.
@leokondrashov Thanks for the update. I will check on this and report my findings.
I just checked the setup README and it's still Bash scripts.
Are you referring to the setup process under /scripts/cluster/*?
Yes, there is a README in the /scripts/ directory as well.
@vishalvrv9 Please reopen if still relevant
Describe the bug
After following the quick start setup (Setup knative cluster) instructions, we always see the default-domain-* pods in the Error state upon running sudo kubectl get pods --all-namespaces.
The registry-etc-hosts-update pod is also observed to be in the Init:CrashLoopBackOff state.
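A quick way to surface just the failing pods from that command's output (the sample rows below are illustrative, mimicking the states reported in this issue; real names and ages will differ):

```shell
#!/bin/sh
# Sample rows in the shape of `kubectl get pods --all-namespaces` output:
# NAMESPACE  NAME  READY  STATUS  RESTARTS  AGE
pods='knative-serving   default-domain-abc12          0/1   Error                   3   10m
registry          registry-etc-hosts-update-xy  0/1   Init:CrashLoopBackOff   5   10m
kube-system       coredns-xyz                   1/1   Running                 0   10m'
# Print name and status of every pod whose STATUS column is not Running.
printf '%s\n' "$pods" | awk '$4 != "Running" {print $2, $4}'
```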
To Reproduce
Follow the Quick Start Guide to set up a Knative cluster on the hardware configuration below: [8 vCPUs, 16 GB RAM, 300 GB volume, ubuntu-20.04-x86_64]
Expected behavior
All pods should be up and running without any interruptions.
Logs
The tail logs for the above-mentioned pods are attached below:
NameSpace: kube-system; Pod: kube-api-server
NameSpace: knative-serving; Pod: default-domain
NameSpace: knative-serving; Pod: webhook
NameSpace: istio-system; Pod: istiod
Notes
We have attached full logs for the above-mentioned pods below: istio-system-istiod.txt knative-serving-default-domain.txt knative-serving-webhook-logs.txt kube-system-api-server-logs.txt