vhive-serverless / vHive

vHive: Open-source framework for serverless experimentation
MIT License

vHive setup on an on-prem cluster: kube-apiserver and istiod (under namespaces knative-serving and istio-system) are always down #666

Closed vishalvrv9 closed 10 months ago

vishalvrv9 commented 1 year ago

Describe the bug
After following the quick start setup (Setup a Knative Cluster) instructions, we always see the default-domain-* pods in Error state when running sudo kubectl get pods --all-namespaces.

The registry-etc-hosts-update pod is also observed to be stuck in Init:CrashLoopBackOff (screenshot attached: Screenshot 2023-02-06 at 7.26.41 PM).
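For anyone triaging the Init:CrashLoopBackOff, a rough sketch of how the failing init container could be inspected (the pod name suffix, namespace, and init-container name below are placeholders to be substituted from the actual cluster):

# Locate the pod and its namespace
kubectl get pods --all-namespaces | grep registry-etc-hosts-update

# See which init container is crashing and why (events are listed at the bottom)
kubectl -n <namespace> describe pod <registry-etc-hosts-update-pod>

# Tail the crashing init container's logs; --previous shows the last failed run
kubectl -n <namespace> logs <registry-etc-hosts-update-pod> -c <init-container-name> --previous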

To Reproduce
Follow the Quick Start Guide to set up a Knative cluster on the following hardware configuration: [8 vCPUs, 16 GB RAM, 300 GB volume, ubuntu-20.04-x86_64]

Expected behavior
All pods should be up and running without interruption.

Logs
The tail of the logs for the above-mentioned pods is attached below:

Namespace: kube-system; Pod: kube-apiserver

I0207 00:09:55.938980       1 trace.go:205] Trace[426502956]: "Update" url:/apis/coordination.k8s.io/v1/namespaces/knative-serving/leases/net-istio-controller.knative.dev.net-istio.pkg.reconciler.ingress.reconciler.00-of-01,user-agent:Go-http-client/2.0,audit-id:60d1119a-dbaa-46b1-8f90-b7dc98344e36,client:192.168.0.224,accept:application/json, */*,protocol:HTTP/2.0 (07-Feb-2023 00:09:55.371) (total time: 567ms):
Trace[426502956]: ---"Object stored in database" 567ms (00:09:55.938)
Trace[426502956]: [567.525958ms] [567.525958ms] END
I0207 00:09:55.939128       1 trace.go:205] Trace[1379482145]: "GuaranteedUpdate etcd3" type:*coordination.Lease (07-Feb-2023 00:09:55.371) (total time: 567ms):
Trace[1379482145]: ---"Transaction committed" 566ms (00:09:55.938)
Trace[1379482145]: [567.50609ms] [567.50609ms] END
I0207 00:09:55.939256       1 trace.go:205] Trace[974352719]: "Update" url:/apis/coordination.k8s.io/v1/namespaces/knative-eventing/leases/controller.knative.dev.eventing.pkg.reconciler.parallel.reconciler.00-of-01,user-agent:Go-http-client/2.0,audit-id:9341f8e0-a539-462b-8f58-dde97beaf575,client:192.168.0.224,accept:application/json, */*,protocol:HTTP/2.0 (07-Feb-2023 00:09:55.371) (total time: 568ms):
Trace[974352719]: ---"Object stored in database" 567ms (00:09:55.939)
Trace[974352719]: [568.007618ms] [568.007618ms] END
I0207 00:12:47.642471       1 trace.go:205] Trace[970501940]: "Call validating webhook" configuration:istio-validator-istio-system,webhook:rev.validation.istio.io,resource:networking.istio.io/v1alpha3, Resource=gateways,subresource:,operation:CREATE,UID:bae73e75-36fd-4976-bf68-305bbda55a2f (07-Feb-2023 00:12:37.641) (total time: 10000ms):
Trace[970501940]: [10.000856817s] [10.000856817s] END
W0207 00:12:47.642546       1 dispatcher.go:142] Failed calling webhook, failing open rev.validation.istio.io: failed calling webhook "rev.validation.istio.io": failed to call webhook: Post "https://istiod.istio-system.svc:443/validate?timeout=10s": dial tcp 10.110.72.14:443: i/o timeout
E0207 00:12:47.642591       1 dispatcher.go:149] failed calling webhook "rev.validation.istio.io": failed to call webhook: Post "https://istiod.istio-system.svc:443/validate?timeout=10s": dial tcp 10.110.72.14:443: i/o timeout
I0207 00:12:47.647274       1 trace.go:205] Trace[866641448]: "Create" url:/apis/networking.istio.io/v1alpha3/namespaces/istio-system/gateways,user-agent:pilot-discovery/1.12.5,audit-id:3c83f257-213f-4ad7-9940-886fa3271ea5,client:192.168.0.224,accept:application/json, */*,protocol:HTTP/2.0 (07-Feb-2023 00:12:37.640) (total time: 10006ms):
Trace[866641448]: ---"Object stored in database" 10006ms (00:12:47.647)
Trace[866641448]: [10.006845809s] [10.006845809s] END

Namespace: knative-serving; Pod: default-domain-*

W0207 00:05:54.050436       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
{"level":"fatal","ts":1675728364.0815299,"logger":"fallback.default-domain","caller":"default-domain/main.go:199","msg":"Error finding gateway address","error":"Internal error occurred: failed calling webhook \"webhook.serving.knative.dev\": failed to call webhook: Post \"https://webhook.knative-serving.svc:443/defaulting?timeout=10s\": context deadline exceeded","stacktrace":"main.main\n\tknative.dev/serving/cmd/default-domain/main.go:199\nruntime.main\n\truntime/proc.go:250"}

Namespace: knative-serving; Pod: webhook

{"severity":"ERROR","timestamp":"2023-02-07T00:04:04.837264305Z","logger":"webhook.ValidationWebhook","caller":"controller/controller.go:566","message":"Reconcile error","commit":"8ab7e69","knative.dev/pod":"webhook-557f4b554d-hcc9c","knative.dev/traceid":"df258531-e347-44e2-85fb-38058dc2bf7d","knative.dev/key":"validation.webhook.serving.knative.dev","duration":"57.035969ms","error":"failed to update webhook: Operation cannot be fulfilled on validatingwebhookconfigurations.admissionregistration.k8s.io \"validation.webhook.serving.knative.dev\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"knative.dev/pkg/controller.(*Impl).handleErr\n\tknative.dev/pkg@v0.0.0-20220412134708-e325df66cb51/controller/controller.go:566\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tknative.dev/pkg@v0.0.0-20220412134708-e325df66cb51/controller/controller.go:543\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\tknative.dev/pkg@v0.0.0-20220412134708-e325df66cb51/controller/controller.go:491"}
{"severity":"ERROR","timestamp":"2023-02-07T00:04:04.837769639Z","logger":"webhook.DefaultingWebhook","caller":"controller/controller.go:566","message":"Reconcile error","commit":"8ab7e69","knative.dev/pod":"webhook-557f4b554d-hcc9c","knative.dev/traceid":"0b67f2ff-28b1-4f7b-a3b4-df001cd8cd64","knative.dev/key":"knative-serving/webhook-certs","duration":"64.466524ms","error":"failed to update webhook: Operation cannot be fulfilled on mutatingwebhookconfigurations.admissionregistration.k8s.io \"webhook.serving.knative.dev\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"knative.dev/pkg/controller.(*Impl).handleErr\n\tknative.dev/pkg@v0.0.0-20220412134708-e325df66cb51/controller/controller.go:566\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tknative.dev/pkg@v0.0.0-20220412134708-e325df66cb51/controller/controller.go:543\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\tknative.dev/pkg@v0.0.0-20220412134708-e325df66cb51/controller/controller.go:491"}
{"severity":"INFO","timestamp":"2023-02-07T00:04:04.93950614Z","logger":"webhook.DefaultingWebhook","caller":"defaulting/defaulting.go:253","message":"Updating webhook","commit":"8ab7e69","knative.dev/pod":"webhook-557f4b554d-hcc9c","knative.dev/traceid":"e7711d7b-aa7e-44a5-bbdb-de4ae8a4a44d","knative.dev/key":"knative-serving/webhook-certs"}
{"severity":"INFO","timestamp":"2023-02-07T00:04:04.946963467Z","logger":"webhook.ValidationWebhook","caller":"validation/reconcile_config.go:173","message":"Updating webhook","commit":"8ab7e69","knative.dev/pod":"webhook-557f4b554d-hcc9c","knative.dev/traceid":"c80b483b-3f53-4a2c-add5-207a4e875a9b","knative.dev/key":"validation.webhook.serving.knative.dev"}
{"severity":"INFO","timestamp":"2023-02-07T00:04:04.95046257Z","logger":"webhook.ConfigMapWebhook","caller":"configmaps/configmaps.go:168","message":"Webhook is valid","commit":"8ab7e69","knative.dev/pod":"webhook-557f4b554d-hcc9c","knative.dev/traceid":"73e9c744-44ce-417f-b378-f53413dad8a6","knative.dev/key":"config.webhook.serving.knative.dev"}
{"severity":"INFO","timestamp":"2023-02-07T00:04:04.95048909Z","logger":"webhook.ConfigMapWebhook","caller":"controller/controller.go:550","message":"Reconcile succeeded","commit":"8ab7e69","knative.dev/pod":"webhook-557f4b554d-hcc9c","knative.dev/traceid":"73e9c744-44ce-417f-b378-f53413dad8a6","knative.dev/key":"config.webhook.serving.knative.dev","duration":"101.48499ms"}
{"severity":"INFO","timestamp":"2023-02-07T00:04:04.951523915Z","logger":"webhook.DefaultingWebhook","caller":"defaulting/defaulting.go:253","message":"Updating webhook","commit":"8ab7e69","knative.dev/pod":"webhook-557f4b554d-hcc9c","knative.dev/traceid":"78710d05-4ddb-4ca4-b8fc-31b04d2a49c7","knative.dev/key":"webhook.serving.knative.dev"}
{"severity":"INFO","timestamp":"2023-02-07T00:04:04.952116023Z","logger":"webhook.DefaultingWebhook","caller":"controller/controller.go:550","message":"Reconcile succeeded","commit":"8ab7e69","knative.dev/pod":"webhook-557f4b554d-hcc9c","knative.dev/traceid":"e7711d7b-aa7e-44a5-bbdb-de4ae8a4a44d","knative.dev/key":"knative-serving/webhook-certs","duration":"108.86406ms"}
{"severity":"INFO","timestamp":"2023-02-07T00:04:04.952697113Z","logger":"webhook.ValidationWebhook","caller":"controller/controller.go:550","message":"Reconcile succeeded","commit":"8ab7e69","knative.dev/pod":"webhook-557f4b554d-hcc9c","knative.dev/traceid":"c80b483b-3f53-4a2c-add5-207a4e875a9b","knative.dev/key":"validation.webhook.serving.knative.dev","duration":"115.268759ms"}
{"severity":"INFO","timestamp":"2023-02-07T00:04:04.957654588Z","logger":"webhook.DefaultingWebhook","caller":"controller/controller.go:550","message":"Reconcile succeeded","commit":"8ab7e69","knative.dev/pod":"webhook-557f4b554d-hcc9c","knative.dev/traceid":"78710d05-4ddb-4ca4-b8fc-31b04d2a49c7","knative.dev/key":"webhook.serving.knative.dev","duration":"129.366146ms"}
{"severity":"INFO","timestamp":"2023-02-07T00:04:04.965047084Z","logger":"webhook.ValidationWebhook","caller":"validation/reconcile_config.go:173","message":"Updating webhook","commit":"8ab7e69","knative.dev/pod":"webhook-557f4b554d-hcc9c","knative.dev/traceid":"869bec0a-469f-46a4-a2f5-1ca79d483a69","knative.dev/key":"validation.webhook.serving.knative.dev"}
{"severity":"INFO","timestamp":"2023-02-07T00:04:04.972622162Z","logger":"webhook.ValidationWebhook","caller":"controller/controller.go:550","message":"Reconcile succeeded","commit":"8ab7e69","knative.dev/pod":"webhook-557f4b554d-hcc9c","knative.dev/traceid":"869bec0a-469f-46a4-a2f5-1ca79d483a69","knative.dev/key":"validation.webhook.serving.knative.dev","duration":"19.888906ms"}

Namespace: istio-system; Pod: istiod

2023-02-07T00:32:06.726068Z info    rootcertrotator Jitter complete, start rotator.
2023-02-07T00:33:28.051186Z info    validationController    Not ready to switch validation to fail-closed: dummy invalid config not rejected
2023-02-07T00:33:28.051285Z info    validationController    validatingwebhookconfiguration istio-validator-istio-system (failurePolicy=Ignore, resourceVersion=1108) is up-to-date. No change required.
2023-02-07T00:38:38.067694Z info    validationController    Not ready to switch validation to fail-closed: dummy invalid config not rejected
2023-02-07T00:38:38.067781Z info    validationController    validatingwebhookconfiguration istio-validator-istio-system (failurePolicy=Ignore, resourceVersion=1108) is up-to-date. No change required.
2023-02-07T00:43:48.087961Z info    validationController    Not ready to switch validation to fail-closed: dummy invalid config not rejected
2023-02-07T00:43:48.088056Z info    validationController    validatingwebhookconfiguration istio-validator-istio-system (failurePolicy=Ignore, resourceVersion=1108) is up-to-date. No change required.
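The repeated "Not ready to switch validation to fail-closed: dummy invalid config not rejected" message generally means the API server cannot reach istiod's validation webhook, which is consistent with the i/o timeout in the kube-apiserver log above. A quick check, assuming standard Istio labels and service names:

# Is istiod running, and does its service have a ready endpoint?
kubectl -n istio-system get pods -l app=istiod -o wide
kubectl -n istio-system get svc istiod
kubectl -n istio-system get endpoints istiod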

Notes
We have attached the full logs for the above-mentioned pods: istio-system-istiod.txt, knative-serving-default-domain.txt, knative-serving-webhook-logs.txt, kube-system-api-server-logs.txt

ustiugov commented 1 year ago

@vishalvrv9 the problem might be sporadic; have you tried to re-run on a clean node?

vishalvrv9 commented 1 year ago

@ustiugov We initially suspected an intermittent issue, but after retrying the vHive infrastructure setup multiple times, we ran into the same problem. Interestingly, the same set of pods produces the same set of errors every time.

We have been following the quick setup guide every time. I will try setting up the cluster again today. Is there any specific information I can provide to help triage this further?

EDIT: tagged @ustiugov

ustiugov commented 1 year ago

@anshalshukla can you please try to run a vHive cluster on CloudLab? I guess this is the default mode (the default in the vHive quickstart guide) with Firecracker MicroVMs. Do deployment and invocation of a function work?

vishalvrv9 commented 1 year ago

Hello @ustiugov @anshalshukla, we have been trying to further analyze the root cause of this error. After setting up the infrastructure again and letting it run for a while, we checked the pod status; the results are below:

(screenshot attached: Screenshot 2023-02-14 at 2.24.30 AM)

Our usual suspects are:

  1. Pods in namespace knative-serving (as the default domain is failing)
  2. The pods that have been restarting (kube-scheduler-vhive-vishal-1, kube-controller-manager-vhive-vishal-1, kube-apiserver-vhive-vishal-1, etcd-vhive-vishal-1)

For (1),

We checked the logs for autoscaler and activator pods and found some failures. They are mentioned below:

Namespace: knative-serving, Pod: activator-*

{"severity":"ERROR","timestamp":"2023-02-08T06:22:39.12319257Z","logger":"activator","caller":"websocket/connection.go:145","message":"Websocket connection could not be established","commit":"8ab7e69","knative.dev/controller":"activator","knative.dev/pod":"activator-7f7865c9f5-jkld4","error":"dial tcp 10.104.62.23:8080: connect: connection refused","stacktrace":"knative.dev/pkg/websocket.NewDurableConnection.func1\n\tknative.dev/pkg@v0.0.0-20220412134708-e325df66cb51/websocket/connection.go:145\nknative.dev/pkg/websocket.(*ManagedConnection).connect.func1\n\tknative.dev/pkg@v0.0.0-20220412134708-e325df66cb51/websocket/connection.go:226\nk8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1\n\tk8s.io/apimachinery@v0.23.5/pkg/util/wait/wait.go:220\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext\n\tk8s.io/apimachinery@v0.23.5/pkg/util/wait/wait.go:233\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection\n\tk8s.io/apimachinery@v0.23.5/pkg/util/wait/wait.go:226\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\tk8s.io/apimachinery@v0.23.5/pkg/util/wait/wait.go:421\nknative.dev/pkg/websocket.(*ManagedConnection).connect\n\tknative.dev/pkg@v0.0.0-20220412134708-e325df66cb51/websocket/connection.go:223\nknative.dev/pkg/websocket.NewDurableConnection.func2\n\tknative.dev/pkg@v0.0.0-20220412134708-e325df66cb51/websocket/connection.go:163"}
{"severity":"WARNING","timestamp":"2023-02-08T06:22:39.817293228Z","logger":"activator","caller":"handler/healthz_handler.go:36","message":"Healthcheck failed: connection has not yet been established","commit":"8ab7e69","knative.dev/controller":"activator","knative.dev/pod":"activator-7f7865c9f5-jkld4"}
{"severity":"ERROR","timestamp":"2023-02-08T06:22:40.390574263Z","logger":"activator","caller":"websocket/connection.go:192","message":"Failed to send ping message to ws://autoscaler.knative-serving.svc.cluster.local:8080","commit":"8ab7e69","knative.dev/controller":"activator","knative.dev/pod":"activator-7f7865c9f5-jkld4","error":"connection has not yet been established","stacktrace":"knative.dev/pkg/websocket.NewDurableConnection.func3\n\tknative.dev/pkg@v0.0.0-20220412134708-e325df66cb51/websocket/connection.go:192"}
{"severity":"ERROR","timestamp":"2023-02-08T06:22:41.04317903Z","logger":"activator","caller":"websocket/connection.go:145","message":"Websocket connection could not be established","commit":"8ab7e69","knative.dev/controller":"activator","knative.dev/pod":"activator-7f7865c9f5-jkld4","error":"dial tcp 10.104.62.23:8080: connect: connection refused","stacktrace":"knative.dev/pkg/websocket.NewDurableConnection.func1\n\tknative.dev/pkg@v0.0.0-20220412134708-e325df66cb51/websocket/connection.go:145\nknative.dev/pkg/websocket.(*ManagedConnection).connect.func1\n\tknative.dev/pkg@v0.0.0-20220412134708-e325df66cb51/websocket/connection.go:226\nk8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1\n\tk8s.io/apimachinery@v0.23.5/pkg/util/wait/wait.go:220\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext\n\tk8s.io/apimachinery@v0.23.5/pkg/util/wait/wait.go:233\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection\n\tk8s.io/apimachinery@v0.23.5/pkg/util/wait/wait.go:226\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\tk8s.io/apimachinery@v0.23.5/pkg/util/wait/wait.go:421\nknative.dev/pkg/websocket.(*ManagedConnection).connect\n\tknative.dev/pkg@v0.0.0-20220412134708-e325df66cb51/websocket/connection.go:223\nknative.dev/pkg/websocket.NewDurableConnection.func2\n\tknative.dev/pkg@v0.0.0-20220412134708-e325df66cb51/websocket/connection.go:163"}

Namespace: knative-serving, Pod: autoscaler-*

E0214 07:14:32.756949       1 leaderelection.go:367] Failed to update lock: Operation cannot be fulfilled on leases.coordination.k8s.io "autoscaler-bucket-00-of-01": the object has been modified; please apply your changes to the latest version and try again
E0214 07:14:43.168916       1 leaderelection.go:367] Failed to update lock: Operation cannot be fulfilled on leases.coordination.k8s.io "autoscaler-bucket-00-of-01": the object has been modified; please apply your changes to the latest version and try again
E0214 07:14:53.768180       1 leaderelection.go:367] Failed to update lock: Operation cannot be fulfilled on leases.coordination.k8s.io "autoscaler-bucket-00-of-01": the object has been modified; please apply your changes to the latest version and try again
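The activator error above shows it cannot open a websocket to the autoscaler (connection refused on port 8080). A quick way to confirm whether the autoscaler itself is healthy, assuming the standard Knative labels and service name:

# Is the autoscaler pod up, and does its service have endpoints?
kubectl -n knative-serving get pods -l app=autoscaler -o wide
kubectl -n knative-serving get svc autoscaler
kubectl -n knative-serving get endpoints autoscaler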

We also checked the configmap "config-domain" (under the namespace knative-serving) as the pod associated with that config seems to be throwing an error. Upon looking there, we found this:

----
################################
#                              #
#    EXAMPLE CONFIGURATION     #
#                              #
################################

# This block is not actually functional configuration,
# but serves to illustrate the available configuration
# options and document them in a way that is accessible
# to users that `kubectl edit` this config map.
#
# These sample configuration options may be copied out of
# this example block and unindented to be in the data block
# to actually change the configuration.

# Default value for domain.
# Although it will match all routes, it is the least-specific rule so it
# will only be used if no other domain matches.
example.com: |

# These are example settings of domain.
# example.org will be used for routes having app=nonprofit.
example.org: |
  selector:
    app: nonprofit

# Routes having the cluster domain suffix (by default 'svc.cluster.local')
# will not be exposed through Ingress. You can define your own label
# selector to assign that domain suffix to your Route here, or you can set
# the label
#    "networking.knative.dev/visibility=cluster-local"
# to achieve the same effect.  This shows how to make routes having
# the label app=secret only exposed to the local cluster.
svc.cluster.local: |
  selector:
    app: secret

BinaryData
====

Events:  <none>
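The config-domain dump above appears to contain only the commented example block, i.e. no concrete domain has been set, which is consistent with the default-domain job never completing. As a hedged workaround sketch, a domain can be set by hand (the domain below is a placeholder; a real domain or <node-ip>.sslip.io would be used):

# Manually set a default domain instead of relying on the failing default-domain job
kubectl patch configmap config-domain -n knative-serving --type merge \
  -p '{"data":{"example.com":""}}'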

For (2),

We looked at the logs for kube-scheduler-vhive-vishal-1, kube-controller-manager-vhive-vishal-1, etcd-vhive-vishal-1, and kube-apiserver-vhive-vishal-1 under the namespace kube-system, as these were the pods that appeared to be restarting.

Pod: kube-scheduler-vhive-vishal-1, namespace: kube-system

I0211 15:32:06.280005       1 serving.go:348] Generated self-signed cert in-memory
I0211 15:32:06.680186       1 server.go:139] "Starting Kubernetes Scheduler" version="v1.23.16"
I0211 15:32:07.011279       1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0211 15:32:07.011372       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0211 15:32:07.011386       1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0211 15:32:07.011422       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0211 15:32:07.011683       1 secure_serving.go:200] Serving securely on 127.0.0.1:10259
I0211 15:32:07.013135       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0211 15:32:07.013340       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0211 15:32:07.013409       1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I0211 15:32:07.111964       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0211 15:32:07.112172       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0211 15:32:07.112633       1 leaderelection.go:248] attempting to acquire leader lease kube-system/kube-scheduler...
I0211 15:32:07.113770       1 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController
I0211 15:32:23.347139       1 leaderelection.go:258] successfully acquired lease kube-system/kube-scheduler

Pod: kube-controller-manager-vhive-vishal-1, namespace: kube-system

E0214 07:42:27.925393       1 horizontal.go:226] failed to compute desired number of replicas based on listed metrics for Deployment/istio-system/cluster-local-gateway: invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)
I0214 07:42:27.925588       1 event.go:294] "Event occurred" object="istio-system/cluster-local-gateway" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedGetResourceMetric" message="failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)"
I0214 07:42:27.925696       1 event.go:294] "Event occurred" object="istio-system/cluster-local-gateway" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedComputeMetricsReplicas" message="invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)"
E0214 07:42:28.254165       1 horizontal.go:226] failed to compute desired number of replicas based on listed metrics for Deployment/knative-eventing/mt-broker-ingress: invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)
I0214 07:42:28.254603       1 event.go:294] "Event occurred" object="knative-eventing/broker-ingress-hpa" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedGetResourceMetric" message="failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)"
I0214 07:42:28.254663       1 event.go:294] "Event occurred" object="knative-eventing/broker-ingress-hpa" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedComputeMetricsReplicas" message="invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)"
I0214 07:42:31.615500       1 event.go:294] "Event occurred" object="knative-serving/activator" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedGetResourceMetric" message="failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)"
I0214 07:42:31.615582       1 event.go:294] "Event occurred" object="knative-serving/activator" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedComputeMetricsReplicas" message="invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)"
E0214 07:42:31.615899       1 horizontal.go:226] failed to compute desired number of replicas based on listed metrics for Deployment/knative-serving/activator: invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)
E0214 07:42:32.550121       1 horizontal.go:226] failed to compute desired number of replicas based on listed metrics for Deployment/knative-serving/webhook: invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)
I0214 07:42:32.550333       1 event.go:294] "Event occurred" object="knative-serving/webhook" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedGetResourceMetric" message="failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)"
I0214 07:42:32.550397       1 event.go:294] "Event occurred" object="knative-serving/webhook" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedComputeMetricsReplicas" message="invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)"
E0214 07:42:32.992200       1 horizontal.go:226] failed to compute desired number of replicas based on listed metrics for Deployment/istio-system/istio-ingressgateway: invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)
I0214 07:42:32.992327       1 event.go:294] "Event occurred" object="istio-system/istio-ingressgateway" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedGetResourceMetric" message="failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)"
I0214 07:42:32.992577       1 event.go:294] "Event occurred" object="istio-system/istio-ingressgateway" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedComputeMetricsReplicas" message="invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)"
E0214 07:42:32.996529       1 horizontal.go:226] failed to compute desired number of replicas based on listed metrics for Deployment/istio-system/istiod: invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)
I0214 07:42:32.996631       1 event.go:294] "Event occurred" object="istio-system/istiod" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedGetResourceMetric" message="failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)"
I0214 07:42:32.996685       1 event.go:294] "Event occurred" object="istio-system/istiod" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedComputeMetricsReplicas" message="invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)"
E0214 07:42:33.205675       1 horizontal.go:226] failed to compute desired number of replicas based on listed metrics for Deployment/knative-eventing/mt-broker-filter: invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)
I0214 07:42:33.205833       1 event.go:294] "Event occurred" object="knative-eventing/broker-filter-hpa" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedGetResourceMetric" message="failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)"
I0214 07:42:33.205876       1 event.go:294] "Event occurred" object="knative-eventing/broker-filter-hpa" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedComputeMetricsReplicas" message="invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)"
E0214 07:42:33.591723       1 horizontal.go:226] failed to compute desired number of replicas based on listed metrics for Deployment/knative-eventing/eventing-webhook: invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)
I0214 07:42:33.591993       1 event.go:294] "Event occurred" object="knative-eventing/eventing-webhook" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedGetResourceMetric" message="failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)"
I0214 07:42:33.592055       1 event.go:294] "Event occurred" object="knative-eventing/eventing-webhook" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedComputeMetricsReplicas" message="invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)"
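These HPA errors are a separate symptom: "the server could not find the requested resource (get pods.metrics.k8s.io)" means the resource metrics API is not being served at all, typically because metrics-server is missing or unhealthy. A quick check, assuming the standard metrics-server deployment names:

# Is the metrics API registered and available?
kubectl get apiservices v1beta1.metrics.k8s.io

# Is metrics-server running?
kubectl -n kube-system get pods -l k8s-app=metrics-server

# End-to-end check; this fails if the metrics API is unavailable
kubectl top pods -n knative-serving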

Pod: etcd-vhive-vishal-1, namespace: kube-system

{"level":"info","ts":"2023-02-14T07:43:51.605Z","caller":"traceutil/trace.go:171","msg":"trace[1724407267] transaction","detail":"{read_only:false; response_revision:3387309; number_of_response:1; }","duration":"127.686957ms","start":"2023-02-14T07:43:51.477Z","end":"2023-02-14T07:43:51.605Z","steps":["trace[1724407267] 'process raft request'  (duration: 101.241909ms)","trace[1724407267] 'compare'  (duration: 26.235243ms)"],"step_count":2}
{"level":"info","ts":"2023-02-14T07:43:51.605Z","caller":"traceutil/trace.go:171","msg":"trace[783123863] transaction","detail":"{read_only:false; response_revision:3387310; number_of_response:1; }","duration":"127.524367ms","start":"2023-02-14T07:43:51.477Z","end":"2023-02-14T07:43:51.605Z","steps":["trace[783123863] 'process raft request'  (duration: 127.263362ms)"],"step_count":1}
{"level":"warn","ts":"2023-02-14T07:43:55.769Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"122.944399ms","expected-duration":"100ms","prefix":"","request":"header:<ID:6257617818430884549 username:\"kube-apiserver-etcd-client\" auth_revision:1 > txn:<compare:<target:MOD key:\"/registry/leases/kube-system/kube-controller-manager\" mod_revision:3387324 > success:<request_put:<key:\"/registry/leases/kube-system/kube-controller-manager\" value_size:439 >> failure:<request_range:<key:\"/registry/leases/kube-system/kube-controller-manager\" > >>","response":"size:20"}
{"level":"info","ts":"2023-02-14T07:43:55.770Z","caller":"traceutil/trace.go:171","msg":"trace[109153835] transaction","detail":"{read_only:false; response_revision:3387339; number_of_response:1; }","duration":"128.66627ms","start":"2023-02-14T07:43:55.641Z","end":"2023-02-14T07:43:55.770Z","steps":["trace[109153835] 'compare'  (duration: 122.826285ms)"],"step_count":1}
{"level":"warn","ts":"2023-02-14T07:44:02.280Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"100.626724ms","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/registry/leases/knative-serving/webhook.webhookcertificates.00-of-01\" ","response":"range_response_count:1 size:553"}
{"level":"info","ts":"2023-02-14T07:44:02.280Z","caller":"traceutil/trace.go:171","msg":"trace[1028192541] range","detail":"{range_begin:/registry/leases/knative-serving/webhook.webhookcertificates.00-of-01; range_end:; response_count:1; response_revision:3387376; }","duration":"100.734311ms","start":"2023-02-14T07:44:02.179Z","end":"2023-02-14T07:44:02.280Z","steps":["trace[1028192541] 'range keys from in-memory index tree'  (duration: 100.487699ms)"],"step_count":1}
{"level":"info","ts":"2023-02-14T07:44:03.448Z","caller":"traceutil/trace.go:171","msg":"trace[1989288108] linearizableReadLoop","detail":"{readStateIndex:3502667; appliedIndex:3502667; }","duration":"176.675242ms","start":"2023-02-14T07:44:03.272Z","end":"2023-02-14T07:44:03.448Z","steps":["trace[1989288108] 'read index received'  (duration: 176.667313ms)","trace[1989288108] 'applied index is now lower than readState.Index'  (duration: 6.576µs)"],"step_count":2}
{"level":"warn","ts":"2023-02-14T07:44:03.628Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"356.128666ms","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/registry/leases/knative-eventing/controller.knative.dev.eventing.pkg.reconciler.pingsource.reconciler.00-of-01\" ","response":"range_response_count:1 size:649"}
{"level":"warn","ts":"2023-02-14T07:44:03.628Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"355.619515ms","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/registry/leases/knative-eventing/eventing-webhook.sinkbindings.00-of-01\" ","response":"range_response_count:1 size:569"}
{"level":"info","ts":"2023-02-14T07:44:03.628Z","caller":"traceutil/trace.go:171","msg":"trace[1385736479] range","detail":"{range_begin:/registry/leases/knative-eventing/controller.knative.dev.eventing.pkg.reconciler.pingsource.reconciler.00-of-01; range_end:; response_count:1; response_revision:3387384; }","duration":"356.236612ms","start":"2023-02-14T07:44:03.272Z","end":"2023-02-14T07:44:03.628Z","steps":["trace[1385736479] 'agreement among raft nodes before linearized reading'  (duration: 176.793143ms)","trace[1385736479] 'range keys from in-memory index tree'  (duration: 179.263917ms)"],"step_count":2}

Pod: kube-apiserver-vhive-vishal-1, namespace: kube-system

I0214 07:46:31.759011       1 trace.go:205] Trace[1551177797]: "Call validating webhook" configuration:istio-validator-istio-system,webhook:rev.validation.istio.io,resource:networking.istio.io/v1alpha3, Resource=gateways,subresource:,operation:CREATE,UID:08ba3463-cd50-4ce7-8356-eb312be370a0 (14-Feb-2023 07:46:21.757) (total time: 10001ms):
Trace[1551177797]: [10.001103067s] [10.001103067s] END
W0214 07:46:31.759056       1 dispatcher.go:142] Failed calling webhook, failing open rev.validation.istio.io: failed calling webhook "rev.validation.istio.io": failed to call webhook: Post "https://istiod.istio-system.svc:443/validate?timeout=10s": context deadline exceeded
E0214 07:46:31.759115       1 dispatcher.go:149] failed calling webhook "rev.validation.istio.io": failed to call webhook: Post "https://istiod.istio-system.svc:443/validate?timeout=10s": context deadline exceeded
I0214 07:46:31.763365       1 trace.go:205] Trace[1047610551]: "Create" url:/apis/networking.istio.io/v1alpha3/namespaces/istio-system/gateways,user-agent:pilot-discovery/1.12.5,audit-id:7c0e1722-385b-4b61-90da-c258f4ceed05,client:192.168.0.8,accept:application/json, */*,protocol:HTTP/2.0 (14-Feb-2023 07:46:21.756) (total time: 10006ms):
Trace[1047610551]: ---"Object stored in database" 10006ms (07:46:31.763)
Trace[1047610551]: [10.006895846s] [10.006895846s] END
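Both failing webhook calls (istiod and the Knative webhook) originate from the kube-apiserver, which runs on the host network, so they depend on kube-proxy and the CNI correctly routing ClusterIPs from the control-plane host. A hedged sketch of checking that path (the ClusterIP below is the istiod address from the earlier log and will differ on a fresh setup; the CNI component name depends on what the setup scripts installed):

# Are kube-proxy and the CNI pods healthy on every node?
kubectl -n kube-system get pods -o wide | grep -E 'kube-proxy|calico|canal|flannel'

# Can the control-plane host reach the webhook's ClusterIP at all?
curl -sk -m 5 https://10.110.72.14:443/ || echo "ClusterIP not reachable from host"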

I have tried to consolidate as much information about our infrastructure setup as possible. Do you think we are looking in the right direction? If not, could you suggest how we should go about finding the root cause of this issue so that we can successfully set up our test cluster?

If you think any other details would be helpful, please let us know.

anshalshukla commented 1 year ago

Hey @vishalvrv9, which type of machine are you using on CloudLab, and with how many nodes? Over the past week I've tried running it multiple times on xl170 Utah and m400 Mass nodes and haven't faced any issue with the code on the main branch. Still, I'll go through the logs and see if I can find anything.

vishalvrv9 commented 1 year ago

Hey @anshalshukla

We have been using our own cluster under NERC (New England Research Cloud) (https://nerc-project.github.io/nerc-docs/).

We use two instances of cpu-a.8 as our choice of compute under the Nova OpenStack offerings within NERC.

More details on NERC OpenStack compute offerings can be found here.

Our lscpu looks something like this:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       8
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel Xeon Processor (Cascadelake)
Stepping:                        5
CPU MHz:                         2893.202
BogoMIPS:                        5786.40
Virtualization:                  VT-x
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       256 KiB
L1i cache:                       256 KiB
L2 cache:                        32 MiB
L3 cache:                        128 MiB
NUMA node0 CPU(s):               0-7
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Mitigation; TSX disabled
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall
                                  nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma
                                 cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm
                                 abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority
                                  ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap c
                                 lflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_
                                 clear arch_capabilities

Hope this helps.
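One environment-specific thing that may be worth ruling out on nested VMs like these (Hypervisor vendor: KVM): vHive's default Firecracker mode needs /dev/kvm exposed on every worker node. The lscpu output above does show vmx, but a direct check is cheap (a sketch, not necessarily related to the reported failure):

# Nested virtualization sanity check on each node
ls -l /dev/kvm
grep -cE 'vmx|svm' /proc/cpuinfo   # should be greater than 0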

ustiugov commented 1 year ago

@vishalvrv9 do you observe any errors when running the vHive setup scripts?

ustiugov commented 1 year ago

@vishalvrv9 is the issue still relevant?

vishalvrv9 commented 1 year ago

@ustiugov No significant errors were observed when running the vHive setup scripts. I'm attaching the vHive logs from the remote VM to this comment.

I also tried setting up the vHive cluster with @anshalshukla (following the exact same steps), but we could not find the root cause of why the default-domain pod wasn't coming up.

Please find the logs attached below: containerd.stderr.txt, containerd.stdout.txt, create_multinode_cluster.stderr.txt, create_multinode_cluster.stdout.txt, setup_node.stderr.txt, setup_node.stdout.txt

ustiugov commented 1 year ago

@NeHalem90 could you please take a look at the logs?

leokondrashov commented 1 year ago

@vishalvrv9 Is the issue still relevant? Recently, we updated the setup process to use Go instead of Bash scripts, which improves the logs; maybe that helps.

vishalvrv9 commented 1 year ago

@leokondrashov Thanks for the update. I will check on this and report my findings.

I just checked the setup README and it's still Bash scripts.

Are you referring to the setup process under /scripts/cluster/* ?

leokondrashov commented 1 year ago

Are you referring to the setup process under /scripts/cluster/* ?

Yes, there is a README in the /scripts/ directory as well.

leokondrashov commented 10 months ago

@vishalvrv9 Please reopen if this is still relevant.