vhive-serverless / vHive

vHive: Open-source framework for serverless experimentation

Unable to connect worker node to Kubernetes cluster #924

Closed: ymc101 closed this issue 8 months ago

ymc101 commented 9 months ago

Describe the bug: Error connecting the worker node to the Kubernetes cluster when executing the following command:

sudo kubeadm join IP:PORT --token <Token> --discovery-token-ca-cert-hash <Token Hash> > >(tee -a /tmp/vhive-logs/kubeadm_join.stdout) 2> >(tee -a /tmp/vhive-logs/kubeadm_join.stderr >&2)

while following the standard deployment steps in the quickstart guide.

To Reproduce: Set up 1 master and 1 worker node on 2 VMs running on the same computer (using VirtualBox), both running Ubuntu 20.04, and follow the steps in the quickstart guide to "Setup a Serverless (Knative) Cluster" (standard setup, non-stargz).

Expected behaviour: The success message shown in the quickstart guide:

This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.

Logs: Error message after running the above command:

error execution phase preflight: couldn't validate the identity of the API Server: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
To see the stack trace of this error execute with --v=5 or higher

Stack trace:

I0128 19:50:13.533056   35906 join.go:416] [preflight] found NodeName empty; using OS hostname as NodeName
I0128 19:50:13.648308   35906 initconfiguration.go:116] detected and using CRI socket: unix:///var/run/containerd/containerd.sock
[preflight] Running pre-flight checks
I0128 19:50:13.648559   35906 preflight.go:92] [preflight] Running general checks
I0128 19:50:13.648665   35906 checks.go:280] validating the existence of file /etc/kubernetes/kubelet.conf
I0128 19:50:13.648700   35906 checks.go:280] validating the existence of file /etc/kubernetes/bootstrap-kubelet.conf
I0128 19:50:13.648731   35906 checks.go:104] validating the container runtime
I0128 19:50:13.682419   35906 checks.go:329] validating the contents of file /proc/sys/net/bridge/bridge-nf-call-iptables
I0128 19:50:13.682515   35906 checks.go:329] validating the contents of file /proc/sys/net/ipv4/ip_forward
I0128 19:50:13.682590   35906 checks.go:644] validating whether swap is enabled or not
I0128 19:50:13.682645   35906 checks.go:370] validating the presence of executable crictl
I0128 19:50:13.682680   35906 checks.go:370] validating the presence of executable conntrack
I0128 19:50:13.682694   35906 checks.go:370] validating the presence of executable ip
I0128 19:50:13.682720   35906 checks.go:370] validating the presence of executable iptables
I0128 19:50:13.682736   35906 checks.go:370] validating the presence of executable mount
I0128 19:50:13.682751   35906 checks.go:370] validating the presence of executable nsenter
I0128 19:50:13.682766   35906 checks.go:370] validating the presence of executable ebtables
I0128 19:50:13.682784   35906 checks.go:370] validating the presence of executable ethtool
I0128 19:50:13.682797   35906 checks.go:370] validating the presence of executable socat
I0128 19:50:13.682812   35906 checks.go:370] validating the presence of executable tc
I0128 19:50:13.682824   35906 checks.go:370] validating the presence of executable touch
I0128 19:50:13.682850   35906 checks.go:516] running all checks
I0128 19:50:13.697631   35906 checks.go:401] checking whether the given node name is valid and reachable using net.LookupHost
I0128 19:50:13.697778   35906 checks.go:610] validating kubelet version
I0128 19:50:13.803659   35906 checks.go:130] validating if the "kubelet" service is enabled and active
I0128 19:50:13.828442   35906 checks.go:203] validating availability of port 10250
I0128 19:50:13.828671   35906 checks.go:280] validating the existence of file /etc/kubernetes/pki/ca.crt
I0128 19:50:13.828687   35906 checks.go:430] validating if the connectivity type is via proxy or direct
I0128 19:50:13.828725   35906 join.go:533] [preflight] Discovering cluster-info
I0128 19:50:13.828758   35906 token.go:80] [discovery] Created cluster-info discovery client, requesting info from "10.0.2.15:6443"
I0128 19:50:13.833519   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:50:19.741530   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:50:26.153917   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:50:32.158012   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:50:37.815374   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:50:43.456764   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:50:49.491954   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:50:54.591242   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:50:59.826824   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:51:04.973252   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:51:10.430516   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:51:16.210377   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:51:22.431827   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:51:27.754169   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:51:33.326170   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:51:38.804194   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:51:44.508407   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:51:49.933881   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:51:55.375472   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:52:01.396005   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:52:06.724873   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:52:12.030747   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:52:17.578990   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:52:23.435949   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:52:29.731862   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:52:35.172425   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:52:40.622833   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:52:46.754587   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:52:52.065856   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:52:58.364681   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:53:04.411937   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:53:10.198511   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:53:15.241839   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:53:20.480293   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:53:26.391994   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:53:32.855728   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:53:37.979230   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:53:43.872375   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:53:48.966009   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:53:55.004649   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:54:00.462038   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:54:05.723279   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:54:11.536063   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:54:17.353274   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:54:22.777059   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:54:28.412840   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:54:34.215064   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:54:39.601054   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:54:45.030075   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:54:51.218342   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:54:56.766033   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:55:03.093217   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
I0128 19:55:08.544057   35906 token.go:217] [discovery] Failed to request cluster-info, will try again: Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
Get "https://10.0.2.15:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp 10.0.2.15:6443: connect: connection refused
couldn't validate the identity of the API Server
k8s.io/kubernetes/cmd/kubeadm/app/discovery.For
    cmd/kubeadm/app/discovery/discovery.go:45
k8s.io/kubernetes/cmd/kubeadm/app/cmd.(*joinData).TLSBootstrapCfg
    cmd/kubeadm/app/cmd/join.go:534
k8s.io/kubernetes/cmd/kubeadm/app/cmd.(*joinData).InitCfg
    cmd/kubeadm/app/cmd/join.go:544
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join.runPreflight
    cmd/kubeadm/app/cmd/phases/join/preflight.go:97
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
    cmd/kubeadm/app/cmd/phases/workflow/runner.go:234
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
    cmd/kubeadm/app/cmd/phases/workflow/runner.go:421
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
    cmd/kubeadm/app/cmd/phases/workflow/runner.go:207
k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdJoin.func1
    cmd/kubeadm/app/cmd/join.go:181
github.com/spf13/cobra.(*Command).execute
    vendor/github.com/spf13/cobra/command.go:856
github.com/spf13/cobra.(*Command).ExecuteC
    vendor/github.com/spf13/cobra/command.go:974
github.com/spf13/cobra.(*Command).Execute
    vendor/github.com/spf13/cobra/command.go:902
k8s.io/kubernetes/cmd/kubeadm/app.Run
    cmd/kubeadm/app/kubeadm.go:50
main.main
    cmd/kubeadm/kubeadm.go:25
runtime.main
    /usr/local/go/src/runtime/proc.go:250
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1594
error execution phase preflight
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
    cmd/kubeadm/app/cmd/phases/workflow/runner.go:235
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
    cmd/kubeadm/app/cmd/phases/workflow/runner.go:421
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
    cmd/kubeadm/app/cmd/phases/workflow/runner.go:207
k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdJoin.func1
    cmd/kubeadm/app/cmd/join.go:181
github.com/spf13/cobra.(*Command).execute
    vendor/github.com/spf13/cobra/command.go:856
github.com/spf13/cobra.(*Command).ExecuteC
    vendor/github.com/spf13/cobra/command.go:974
github.com/spf13/cobra.(*Command).Execute
    vendor/github.com/spf13/cobra/command.go:902
k8s.io/kubernetes/cmd/kubeadm/app.Run
    cmd/kubeadm/app/kubeadm.go:50
main.main
    cmd/kubeadm/kubeadm.go:25
runtime.main
    /usr/local/go/src/runtime/proc.go:250
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1594
leokondrashov commented 9 months ago

@ymc101, please provide the following information so we can understand the problem:

ymc101 commented 9 months ago

Hi @leokondrashov, the VMs I was previously running have been terminated. I will replicate the setup later and get back to you with the information.

ymc101 commented 9 months ago

Hi @leokondrashov, below is the requested information:

ifconfig -a (Worker VM):

enp0s3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.2.15  netmask 255.255.255.0  broadcast 10.0.2.255
        inet6 fe80::d05:3fef:736a:ea3e  prefixlen 64  scopeid 0x20<link>
        ether 08:00:27:f0:ba:60  txqueuelen 1000  (Ethernet)
        RX packets 620461  bytes 870904986 (870.9 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 96079  bytes 6573845 (6.5 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 555  bytes 46847 (46.8 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 555  bytes 46847 (46.8 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

veth0-1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.255.252  broadcast 0.0.0.0
        inet6 fe80::d0bc:5ff:fe84:7041  prefixlen 64  scopeid 0x20<link>
        ether d2:bc:05:84:70:41  txqueuelen 1000  (Ethernet)
        RX packets 14  bytes 1076 (1.0 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 59  bytes 7193 (7.1 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

veth1-1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.0.5  netmask 255.255.255.252  broadcast 0.0.0.0
        inet6 fe80::b41b:b1ff:fe72:2311  prefixlen 64  scopeid 0x20<link>
        ether b6:1b:b1:72:23:11  txqueuelen 1000  (Ethernet)
        RX packets 14  bytes 1076 (1.0 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 59  bytes 7193 (7.1 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

veth2-1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.0.9  netmask 255.255.255.252  broadcast 0.0.0.0
        inet6 fe80::6076:b5ff:fed5:92ca  prefixlen 64  scopeid 0x20<link>
        ether 62:76:b5:d5:92:ca  txqueuelen 1000  (Ethernet)
        RX packets 14  bytes 1076 (1.0 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 60  bytes 7264 (7.2 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

veth3-1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.0.13  netmask 255.255.255.252  broadcast 0.0.0.0
        inet6 fe80::587b:78ff:feaa:9dd9  prefixlen 64  scopeid 0x20<link>
        ether 5a:7b:78:aa:9d:d9  txqueuelen 1000  (Ethernet)
        RX packets 14  bytes 1076 (1.0 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 60  bytes 7270 (7.2 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

veth4-1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.0.17  netmask 255.255.255.252  broadcast 0.0.0.0
        inet6 fe80::c0cf:3ff:fe99:2c1a  prefixlen 64  scopeid 0x20<link>
        ether c2:cf:03:99:2c:1a  txqueuelen 1000  (Ethernet)
        RX packets 14  bytes 1076 (1.0 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 61  bytes 7360 (7.3 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

veth5-1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.0.21  netmask 255.255.255.252  broadcast 0.0.0.0
        inet6 fe80::3072:fbff:fe84:a7e0  prefixlen 64  scopeid 0x20<link>
        ether 32:72:fb:84:a7:e0  txqueuelen 1000  (Ethernet)
        RX packets 14  bytes 1076 (1.0 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 60  bytes 7270 (7.2 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

veth6-1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.0.25  netmask 255.255.255.252  broadcast 0.0.0.0
        inet6 fe80::2cf5:9dff:fefb:654f  prefixlen 64  scopeid 0x20<link>
        ether 2e:f5:9d:fb:65:4f  txqueuelen 1000  (Ethernet)
        RX packets 14  bytes 1076 (1.0 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 60  bytes 7270 (7.2 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

veth7-1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.0.29  netmask 255.255.255.252  broadcast 0.0.0.0
        inet6 fe80::38cc:d8ff:fe46:366a  prefixlen 64  scopeid 0x20<link>
        ether 3a:cc:d8:46:36:6a  txqueuelen 1000  (Ethernet)
        RX packets 14  bytes 1076 (1.0 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 60  bytes 7270 (7.2 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

veth8-1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.0.33  netmask 255.255.255.252  broadcast 0.0.0.0
        inet6 fe80::3494:6eff:fea8:931d  prefixlen 64  scopeid 0x20<link>
        ether 36:94:6e:a8:93:1d  txqueuelen 1000  (Ethernet)
        RX packets 14  bytes 1076 (1.0 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 60  bytes 7270 (7.2 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

veth9-1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.0.37  netmask 255.255.255.252  broadcast 0.0.0.0
        inet6 fe80::f8c3:5bff:fe8f:9410  prefixlen 64  scopeid 0x20<link>
        ether fa:c3:5b:8f:94:10  txqueuelen 1000  (Ethernet)
        RX packets 14  bytes 1076 (1.0 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 60  bytes 7270 (7.2 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ifconfig -a (Master VM):

enp0s3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.2.15  netmask 255.255.255.0  broadcast 10.0.2.255
        inet6 fe80::9bbc:268f:5cad:ba73  prefixlen 64  scopeid 0x20<link>
        ether 08:00:27:8a:97:2a  txqueuelen 1000  (Ethernet)
        RX packets 729420  bytes 1041691890 (1.0 GB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 161699  bytes 10474864 (10.4 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 206981  bytes 32255875 (32.2 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 206981  bytes 32255875 (32.2 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

sudo lsof output: many lines show a process listening on localhost:6443.

~/.kube/config (Master VM)

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUMvakNDQWVhZ0F3SUJBZ0lCQURBTkJna3Foa2lHOXcwQkFRc0ZBREFWTVJNd0VRWURWUVFERXdwcmRXSmwKY201bGRHVnpNQjRYRFRJME1ESXdNVEUwTURrMU9Wb1hEVE0wTURFeU9URTBNRGsxT1Zvd0ZURVRNQkVHQTFVRQpBeE1LYTNWaVpYSnVaWFJsY3pDQ0FTSXdEUVlKS29aSWh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUJBSks0CmpLTlVTSEIveWtjRGYvZmJWVWZ2Q1I3Sk01Qm12MVY4SmdGdGpMUENucEEzY0FJcmRIMUlpcmhHZkR2a1VyQ0IKdjZGL3pLS0FGaWZubzllSG8vbE1NakxPZkE5Z1ZLYjVDZkZRbzIrNXZtM2Exc0xmZHlDMDBvbkdVRmxZNFhrRgp3RFNCSW0vUk8vZ1NHZGcwaGUweUFQalg3Y2x5S216Wm92M0lYaGZlbEFFWU1iOTJtOFcyU2RVUnJtNXk1K2d1CitnMXB5dGxuNjVOZmprMm1xU0plMlJMWFZwVTdMSHpnTEdtSS9QV0hwM3c4enlES08xUHJrZmlKaHNtb0lqM0sKc2c5MFU2RGhDYVZ4bGlLdXJZVmYzVlllNkk5MnY3QS9UQkV6UGlzZ0dNYlZoaXB3T05BRU5yL001YURETGNMbQpwTEU2RWxxNzNuNytJa1FQVk1FQ0F3RUFBYU5aTUZjd0RnWURWUjBQQVFIL0JBUURBZ0trTUE4R0ExVWRFd0VCCi93UUZNQU1CQWY4d0hRWURWUjBPQkJZRUZNVTVYbitONEtzRStjTXVETlRnNlplQWdKOXRNQlVHQTFVZEVRUU8KTUF5Q0NtdDFZbVZ5Ym1WMFpYTXdEUVlKS29aSWh2Y05BUUVMQlFBRGdnRUJBRUV5bkVTckFNcUYyWHpYd1JobgpVbVFKWUVLRVZBOE96Vm1vdUlrcmowbTMrSmhZdGlUZVVJd1JVMDdEVzh4Ui9IQSsrTUlkN0FsMW5VN2tlN29NCkJsMGxQZUxXekcwN1NYNW5ZVDg2WjlubVoyNWEzTDFsaDFiTFN6YW9sanRsTnJ2ZFBNRkZGZmN3RW9QWERUbm0KZUFpaU9RTW1HbXRyZkcvT09hOVZ0NDEyOUo3NEJQMm0vL0lZeHdBeWkxaEpRWGJBQWorS3NPK1FpWmhtSHFrZQpxRFoyQmhUZnF2bHZjR29pNGJhRCtESnVodHZhU2VDcUtDek5Td0NCSEhjT0Z5RWw3Q1dPcWh4czgvVkxVYTNXCjVjVW13YTRsVWJkREFJY3FoWlZoWmhvVmk3UTdGbDVTZEl4Y1dnL0lqemhEVXBTeE8zdUxvM280a0RUaTVpUUcKTDIwPQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
    server: https://10.0.2.15:6443
  name: kubernetes
contexts:
- context:
    cluster: kubernetes
    user: kubernetes-admin
  name: kubernetes-admin@kubernetes
current-context: kubernetes-admin@kubernetes
kind: Config
preferences: {}
users:
- name: kubernetes-admin
  user:
    client-certificate-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURJVENDQWdtZ0F3SUJBZ0lJS3ZGRVZ6UHRISVV3RFFZSktvWklodmNOQVFFTEJRQXdGVEVUTUJFR0ExVUUKQXhNS2EzVmlaWEp1WlhSbGN6QWVGdzB5TkRBeU1ERXhOREE1TlRsYUZ3MHlOVEF4TXpFeE5ERXdNREphTURReApGekFWQmdOVkJBb1REbk41YzNSbGJUcHRZWE4wWlhKek1Sa3dGd1lEVlFRREV4QnJkV0psY201bGRHVnpMV0ZrCmJXbHVNSUlCSWpBTkJna3Foa2lHOXcwQkFRRUZBQU9DQVE4QU1JSUJDZ0tDQVFFQXgzTE5XcE5PY3dHd3BWODYKWVgzNlNIZUJoTWZZYTgyT25GdWpjdFQ3emJRM2c3Nk55RWtBUlZKeFl1VkVORTRMblZ0clc1bzdoQUxWbW8wNQpFSlRybnk3UFVuZ0Y1Wm9JZmpkdmJJaHk5WlY2VWg0ODByKzVvVlFDVDEzdCtRSW8yWWt1c3VQSWZnK01JRnJQCjNibldRQUhaeG5PMnFacVJES2dIaXBJU1RkRzNIeWFYUlJmTi9LdFhxUk02S3h4c0Y3b2paUWp6T2Vwb09LT0oKNUZqNCt5TzVnVjkweGE3V0hHSUN6YVhtZzdnajZ4UkNTUmRjbDJUeEtxak0xRlJkSEE3L0VvcmNpZU9PQWVBRgpPTW4xZE13R3FQQ2gyYUI5V2xKcnF4SWI3anI5NStsWFlTTzYrTi9ZczM0OXJyS2pPM3Jja2U3dFl0TTZWeW1xCjR2L2hCd0lEQVFBQm8xWXdWREFPQmdOVkhROEJBZjhFQkFNQ0JhQXdFd1lEVlIwbEJBd3dDZ1lJS3dZQkJRVUgKQXdJd0RBWURWUjBUQVFIL0JBSXdBREFmQmdOVkhTTUVHREFXZ0JURk9WNS9qZUNyQlBuRExnelU0T21YZ0lDZgpiVEFOQmdrcWhraUc5dzBCQVFzRkFBT0NBUUVBTU0ydTduVlFaTFZVWTl4QXZ6WG9aSHJVZmRNL1N0L2w3RkxVCnc3aENXN3JpZ1R5L0ovTloyNC9VRjVqMGo2WWM0cTEyRWpYb0gydEZkczN0MFlmdUNrRWU2TVF4MXliNG82M2QKVTNVUDgvVE9BREx5UEJEcERXK1Q2YkhGSDc4TTdYSHl6SU40SXVNY3dOTkhmWXl5R2RmYWZub0RqVEFYZnBJcgpxdENYKzYwQUN1T3AyOUZ5Ui81MElPSmRKSnRrM1gra0NlbFc5V3ZObThqMGZkTkxVbFBvaldTYzF3NERZeHJWCmJHTjBuL1hRTVJPQ2NrRGFmdzRUT1ptZ3FTVGNyV2lTMkhsdHYrK1BYRkxOSW5XNktvbFpMbjIxRTFBdGtGRkQKejJxc3dQY0NnNWsxL2F0b0hWUlgrdjhFVHNJSUU0ZnF2NW5BVXVYWXNyc25nbXdiaEE9PQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
    client-key-data: LS0tLS1CRUdJTiBSU0EgUFJJVkFURSBLRVktLS0tLQpNSUlFcEFJQkFBS0NBUUVBeDNMTldwTk9jd0d3cFY4NllYMzZTSGVCaE1mWWE4Mk9uRnVqY3RUN3piUTNnNzZOCnlFa0FSVkp4WXVWRU5FNExuVnRyVzVvN2hBTFZtbzA1RUpUcm55N1BVbmdGNVpvSWZqZHZiSWh5OVpWNlVoNDgKMHIrNW9WUUNUMTN0K1FJbzJZa3VzdVBJZmcrTUlGclAzYm5XUUFIWnhuTzJxWnFSREtnSGlwSVNUZEczSHlhWApSUmZOL0t0WHFSTTZLeHhzRjdvalpRanpPZXBvT0tPSjVGajQreU81Z1Y5MHhhN1dIR0lDemFYbWc3Z2o2eFJDClNSZGNsMlR4S3FqTTFGUmRIQTcvRW9yY2llT09BZUFGT01uMWRNd0dxUENoMmFCOVdsSnJxeEliN2pyOTUrbFgKWVNPNitOL1lzMzQ5cnJLak8zcmNrZTd0WXRNNlZ5bXE0di9oQndJREFRQUJBb0lCQVFDODE3SWdKSUdPMnZiSwpYZFFGSXlhckhwdi9nTWtscVVkeVBFSVNKQjhXc2FBdW1XbmRUV0Y0UVlzaVBEbkwzR21hNEVoU1AwSkN4L3cvCmpaK09WN0tRMGQxekZEbGhIK3NTdHFKRmZSeDc4c0FTcUphbVpPbjZHblRsZU9ZdGN5SUNkcVZFcysvTmpDTDkKTDM3SlRYL1NzdTNqdlFRaXFqclVaUFJlKzlkZzNaZEFrWTVaZGNzQnZZRGxvQm1waCtRTitCWXQveHJvMTNqKwpuRUhJSnNOb3BZeSs2MFlIaWdRbFZSWlZwcHdjcVYvMnpOaE9Ma1JCeXAyVmhDZGliMDZDR21PNEo1QURmczdmCkJ5cmhXSUZJQno2TXRXajJnRkF5TEY1czYyZkZzY0RMYUc5SG5NemRVTE5qSmpnWEt1T2dtajFuNHE2dWdHTmwKWTM3ME9yUUJBb0dCQU5MMDhOWWVPNVVTZy9HckhNdVp5ellLRkZHUmx2bGQ5Yjh3REdUenpNS0pJelhEYkdPTwp3L1JOeENqOWR5Zy9FNXB2R0JZQldDODBwNjJKTURNV2JKcGdybzVIaGxRQ3czekh0L3BwK3JXSFBJbWJHd1lOCmg4Um9VYTVldTA4aXB1SWxWMWZURVQveHJnbnlpcFd4SzNrb3hhOHhWcUpwMGNzc2FTaXhEbmlIQW9HQkFQSUkKemxiUXF3OWhYVVI2djdKZWF2SzdtUWVkWlJtNXIrZDRPU1ZJMVN2TnhHeU42VWNJaTc5bkN3dy8yWVNsS0pHOApGOVdoZWNJei9qUUpHU29DdWFwM3Z5UEVNRkF0S042b0ZoUnhiUGUrcUtrTW9VR2xSeVlVeVpIZ0Y4N1NDb2o3CjRLL0VKbDBtZGI2aHNKZ1RrQkJseWY2dS96VUExTjdIK1U3YmkvT0JBb0dBRlVvdS9BejFDbWhoOUlQR1ZpM2gKT2tUdUpBVkRiVXMwUCtWRGV2UzMxM0lyb1lObGJ1NjdpKzVGTzdYSXpzRCs0M2tPdnpuSGdvd1gyQVdlWGFtSApzRlROaVFKaTVodVpTd0NFNnJyRFdJcWJhMi9CM0d5RkpTYzZCeFQ4WmxJaThYTy9TdGU4UiszR0dLN25tWS9WCnlWWjZET0kzMGhCSDRlOUxkWlhZMWdVQ2dZRUE3M0Fxd05QYUJuTXAwNDhqaVkvQ2ViT0E1bm1kQk9BZjF2dW0KZk80YWhTVWhCc3MxVmlKc0xjUUF0L09LZXFEeEM0dHFnTnNvR3lsWWQ1M3dtUkR0SUdrcVhIVy8zZkZ2RnladQpBWGRjZDVMVVE3ak01cVpkUnAwVjlBd2ZRV21sSm5NWGlvcWY4Vk1VOUt2OGlkWUFsVmc5aG9rVXpCaXdmbHlTCmxLSzVSd0VDZ1lCRXVUWTFzSnVEZzlIVGRPZ2w5QnhudDgzYVF2WjY2SDVaeVpBVVFqOEJTdkJhQnFWU2xFbkwKaER5d3pjWkk0YmNSS1NDeDZqemc0aHE1UHdHNm9oTWIxWThodjU2akhickp4VVpsaDFIVk44cjZxUzBHTGZkYgpxZ1pPNWE3elZ0eGNBS2RWTkxCV1NRVzlkTFFWRXE2Q0g0dW1rbUQvR1lqZ0hEayt0NGZqZXc9PQotLS0tLUVORCBSU0EgUFJJVkFURSBLRVktLS0tLQo=

/etc/kubernetes/admin.conf (Master VM):

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUMvak>
    server: https://10.0.2.15:6443
  name: kubernetes
contexts:
- context:
    cluster: kubernetes
    user: kubernetes-admin
  name: kubernetes-admin@kubernetes
current-context: kubernetes-admin@kubernetes
kind: Config
preferences: {}
users:
- name: kubernetes-admin
  user:
    client-certificate-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURJVENDQ>
    client-key-data: LS0tLS1CRUdJTiBSU0EgUFJJVkFURSBLRVktLS0tLQpNSUlFcEFJQkFBS>

Worker VM: the .kube folder is not present, and /etc/kubernetes/admin.conf is empty.

leokondrashov commented 9 months ago

Hi, @ymc101. The data provided looks fine. The only reason I can think of is that a firewall is in place and/or ports are blocked. Can you check if port 6443 is whitelisted?
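
For reference, a quick way to verify this from the two VMs (a hedged sketch using standard tools; substitute your master's actual IP):

# From the worker VM: check whether the master's API server port is reachable
nc -vz 10.0.2.15 6443

# On the master VM: confirm the API server is listening and inspect the firewall
sudo ss -tlnp | grep 6443
sudo ufw status verbose      # if ufw is in use
sudo iptables -L INPUT -n    # otherwise inspect iptables rules directly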

ymc101 commented 9 months ago

Hi @leokondrashov, the ports were apparently blocked due to the VM configuration. After resolving that, the worker node was able to join the master node, but there was an error configuring MetalLB. Below are some terminal logs:

Worker node when joining:

tee: /tmp/vhive-logs/kubeadm_join.stdout: No such file or directory
tee: /tmp/vhive-logs/kubeadm_join.stderr: No such file or directory
[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...

This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.

Run 'kubectl get nodes' on the control-plane to see this node join the cluster.

I forgot to create the temp log files before running this, but it does not seem to have affected the node joining.

Master node:

[16:53:29] [Warn] All nodes need to be joined in the cluster. Have you joined all nodes? (y/n): y
[17:00:02] [Success] All nodes successfully joined!(user confirmed)
[17:00:02] [Info] Set up master node
[17:00:02] [Info] Installing pod network >>>>> [17:00:06] [Success] 
[17:00:06] [Info] Installing and configuring MetalLB >>>>> [17:03:12] [Error] [exit 1] -> error: timed out waiting for the condition on deployments/controller
[17:03:12] [Error] Failed to install and configure MetalLB!
[17:03:12] [Error] Failed to set up master node!
[17:03:12] [Error] Faild subcommand: create_multinode_cluster!
[17:03:12] [Info] Cleaning up temporary directory >>>>> [17:03:12] [Success] 

I had automatic SSH set up for both the master and the worker node. Do you have an idea what might be causing this error?

leokondrashov commented 9 months ago

It's good that the networking issue was that simple to resolve. We should add it to the troubleshooting guide.

Regarding MetalLB, I have seen that before; I think it's just a sporadic error. First, check the available pods: MetalLB might be there, just too late to report to the script. Either way, try rerunning ./setup_tool setup_master_node firecracker. It might be exacerbated by a slow network, so also check the connection speed to the VMs.
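
As a rough sketch of that check (the kubectl commands are standard; the setup command is the one from this thread):

# See whether the MetalLB controller eventually became Available
kubectl get pods -n metallb-system
kubectl get deploy controller -n metallb-system

# If it looks healthy, rerun the master-node setup
./setup_tool setup_master_node firecracker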

ymc101 commented 9 months ago

The networking issue was caused by a VM setting in VirtualBox, so nothing to do with vHive itself.

Regarding the MetalLB error, I tried running the setup tool command again and it passed the check; however, this time there is an error deploying the Istio operator:

[20:25:23] [Info] Deploying istio operator >>>>> [20:36:14] [Error] [exit 1] -> ! values.global.jwtPolicy is deprecated; use Values.global.jwtPolicy=third-party-jwt. See https://istio.io/latest/docs/ops/best-practices/security/#configure-third-party-service-account-tokens for more information instead

- Processing resources for Istio core.
✔ Istio core installed
- Processing resources for Istiod.
- Processing resources for Istiod. Waiting for Deployment/istio-system/i...
✘ Istiod encountered an error: failed to wait for resource: resources not ready after 5m0s: timed out waiting for the condition

- Processing resources for Ingress gateways.
- Processing resources for Ingress gateways. Waiting for Deployment/isti...
✘ Ingress gateways encountered an error: failed to wait for resource: resources not ready after 5m0s: timed out waiting for the condition
  Deployment/istio-system/cluster-local-gateway (container failed to start: ContainerCreating: )
  Deployment/istio-system/istio-ingressgateway (container failed to start: ContainerCreating: )
- Pruning removed resourcesError: failed to install manifests: errors occurred during operation
[20:36:14] [Error] Failed to deploy istio operator!

@leokondrashov, do you have any idea about this one?

Thanks.

leokondrashov commented 9 months ago

Can you please check the pods in the istio-system namespace (kubectl get pods -n istio-system)? It might be the same problem as with MetalLB: the pods are on the way but too late to fit within the timeout. If they are not ready (they should at least be listed there in a non-ready state), we can try checking the logs of the pods (kubectl logs <pod name> -n istio-system).
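
Collected in one place, the diagnosis could look like this (the pod name is a placeholder):

# List the Istio pods and their states
kubectl get pods -n istio-system

# For a pod stuck in ContainerCreating, the events usually explain why
kubectl describe pod <pod name> -n istio-system

# For pods that started but are not ready, check their logs
kubectl logs <pod name> -n istio-system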

ymc101 commented 9 months ago

I ran into a different MetalLB error this time:

[18:41:53] [Error] [exit 1] -> Error from server (InternalError): error when creating "/home/vboxuser/vhive/configs/metallb/metallb-ipaddresspool.yaml": Internal error occurred: failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io": failed to call webhook: Post "https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s": context deadline exceeded
[18:41:53] [Error] Failed to install and configure MetalLB!
[18:41:53] [Error] Failed to set up master node!
[18:41:53] [Error] Faild subcommand: create_multinode_cluster!

When I run the ./setup_tool setup_master_node firecracker script and it fails, I realised that running it again gives an index out of range error. So what I do is run the cleanup script ./scripts/github_runner/clean_cri_runner.sh, then run all the sudo screen commands (containerd, firecracker-containerd, and vhive on the worker, and containerd on the master), and then run the ./setup_tool setup_master_node firecracker script again. Do you know if I am missing any steps or doing something wrong?

leokondrashov commented 8 months ago

I'm not very confident in using the clean cri script for a multi-node setup. It's better to start from clean nodes.

Let's figure out the problems we face. Please run from the start and document the errors that you encounter. For Istio and MetalLB failures, please provide the output of kubectl get pods -n <>-system (substituting istio or metallb).

Most of our problems are timeouts from resources not being ready. Can you check the networking speed of the VMs?

ymc101 commented 8 months ago

Using speedtest-cli, this is the networking speed from one of my VMs:

Testing download speed................................................................................
Download: 601.03 Mbit/s
Testing upload speed......................................................................................................
Upload: 572.86 Mbit/s

Is this sufficient with respect to the timeouts in the master node setup script?

Additionally, is there a script or a method to reset a node after a setup failure, or after a node has been used? It is quite time-consuming to tear down the VM and set up a new one each time due to the initial OS setup. So far I have tried using the single-node clean CRI script, as I could not find any other cleanup script in the quickstart guide.
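
For reference, a generic kubeadm-level teardown (not vHive-specific, so treat it as an untested sketch) looks roughly like this:

# Remove the node's Kubernetes state so it can join (or init) again
sudo kubeadm reset -f
sudo rm -rf /etc/cni/net.d $HOME/.kube/config

# Flush iptables rules left behind by kube-proxy and the CNI plugin
sudo iptables -F && sudo iptables -t nat -F && sudo iptables -t mangle -F && sudo iptables -X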

leokondrashov commented 8 months ago

That seems to be a fair speed for the setup.

You are using VirtualBox, right? Can you create a snapshot of the VM right after boot? That should speed up the process.
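
If it helps, snapshots can also be scripted from the host with VBoxManage (the VM name below is a placeholder):

# Take a snapshot of the freshly booted VM
VBoxManage snapshot "vhive-master" take "clean-boot"

# Roll back to it after a failed setup attempt (the VM must be powered off first)
VBoxManage controlvm "vhive-master" poweroff
VBoxManage snapshot "vhive-master" restore "clean-boot"
VBoxManage startvm "vhive-master" --type headless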

ymc101 commented 8 months ago

Yes, I am using VirtualBox. I was previously not aware of this feature; thanks for the suggestion.

ymc101 commented 8 months ago

I tried running it from scratch and got the same MetalLB timeout error, and when I tried to rerun the command I got this index out of range panic:

panic: runtime error: index out of range [1] with length 1

goroutine 1 [running]:
github.com/vhive-serverless/vHive/scripts/cluster.ExtractMasterNodeInfo()
    /home/vboxuser/vhive/scripts/cluster/create_multinode_cluster.go:146 +0x66c
github.com/vhive-serverless/vHive/scripts/cluster.CreateMultinodeCluster({0x7fff55e8a3bd, 0xb})
    /home/vboxuser/vhive/scripts/cluster/create_multinode_cluster.go:50 +0x53
main.main()
    /home/vboxuser/vhive/scripts/setup.go:151 +0xfb7

If MetalLB was ready but not in time for the script, restoring the VM snapshot might produce the same error. Do you have any ideas or suggestions? Right now I can only think of modifying the code on the master node to increase the timeout threshold for the MetalLB and Istio setup, but I'm not sure what is causing this, especially since the download and upload speeds don't seem to be the bottleneck here.
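
One low-effort experiment, assuming the script's MetalLB wait is the standard kubectl deployment wait, is to re-issue it by hand with a longer timeout after the script fails:

# Re-run the wait manually with a much longer timeout
kubectl -n metallb-system wait deploy controller --timeout=600s --for=condition=Available

# If it never becomes Available, the events usually say why (image pulls, webhook, etc.)
kubectl describe deploy controller -n metallb-system
kubectl get events -n metallb-system --sort-by=.lastTimestamp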

leokondrashov commented 8 months ago

I saw the timeout issue previously with network congestion, which is not the case here. However, there might also be the problem of not enough CPU to install it in time. What is the VM size?

The solution with more time would work, although the current limit of 3 minutes should be more than enough. I'm not sure that can be done for Istio (which also experienced timeouts), so a more permanent solution might be to increase the VM size.

ymc101 commented 8 months ago

I allocated 3 CPU cores and 8 GB of RAM to this VM. I'll try again with a bigger VM size and see if the same issue occurs. The host system has 12 cores and 32 GB of RAM, so I can give each VM about 4-5 cores at most and about 12 GB of RAM.

ymc101 commented 8 months ago

The script encountered the same MetalLB error, even with 5 cores and 12 GB of RAM, which is the maximum I can allocate to each VM without exceeding the system's hardware resources. May I know the specs of the nodes you have tested on before?

leokondrashov commented 8 months ago

We commonly use nodes with around 10 cores and 64GB, but your configuration should be enough.

Can you supply the content of the create_multinode_cluster_*.log files in the directory where you ran the setup_tool? Maybe even add --v 5 to increase the verbosity of the failing commands (https://github.com/vhive-serverless/vHive/blob/main/scripts/cluster/setup_master_node.go#L125-L133).

ymc101 commented 8 months ago

Below are the contents of the 2 log files from the run I just did with the verbosity flag:

create_multinode_cluster_common.log:

INFO: 18:21:50 logs.go:88: 
INFO: 18:21:50 logs.go:88: Stdout Log -> /home/vboxuser/vhive/create_multinode_cluster_common.log
INFO: 18:21:50 logs.go:88: Stderr Log -> /home/vboxuser/vhive/create_multinode_cluster_error.log
INFO: 18:21:50 system.go:81: Executing shell command: git rev-parse --show-toplevel
INFO: 18:21:50 system.go:82: Stdout from shell:
/home/vboxuser/vhive
INFO: 18:21:50 logs.go:100: vHive repo Path: /home/vboxuser/vhive
INFO: 18:21:50 logs.go:100: Loading config files from /home/vboxuser/vhive/configs/setup >>>>> 
INFO: 18:21:50 logs.go:88: 
INFO: 18:21:50 logs.go:100: Create multinode cluster
INFO: 18:21:50 logs.go:100: Creating kubelet service >>>>> 
INFO: 18:21:50 system.go:81: Executing shell command: sudo mkdir -p /etc/sysconfig
INFO: 18:21:50 system.go:82: Stdout from shell:

INFO: 18:21:50 system.go:81: Executing shell command: sudo sh -c 'cat <<EOF > /etc/sysconfig/kubelet
KUBELET_EXTRA_ARGS="--container-runtime=remote --v=0 --runtime-request-timeout=15m --container-runtime-endpoint=unix:///run/containerd/containerd.sock"
EOF'
INFO: 18:21:50 system.go:82: Stdout from shell:

INFO: 18:21:51 system.go:81: Executing shell command: sudo systemctl daemon-reload
INFO: 18:21:51 system.go:82: Stdout from shell:

INFO: 18:21:51 logs.go:88: 
INFO: 18:21:51 logs.go:100: Deploying Kubernetes(version 1.25.9) >>>>> 
INFO: 18:21:51 system.go:81: Executing shell command: ip route | awk '{print $(NF)}' | awk '/^10\..*/'
INFO: 18:21:51 system.go:82: Stdout from shell:

INFO: 18:25:03 system.go:81: Executing shell command: sudo kubeadm init --v=0 \
--apiserver-advertise-address= \
--cri-socket /run/containerd/containerd.sock \
--kubernetes-version 1.25.9 \
--pod-network-cidr="192.168.0.0/16" | tee /tmp/vHive_tmp1120224528/masterNodeInfo
INFO: 18:25:03 system.go:82: Stdout from shell:
[init] Using Kubernetes version: v1.25.9
[preflight] Running pre-flight checks
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "ca" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local vhivemaster] and IPs [10.96.0.1 10.100.184.85]
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "front-proxy-ca" certificate and key
[certs] Generating "front-proxy-client" certificate and key
[certs] Generating "etcd/ca" certificate and key
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [localhost vhivemaster] and IPs [10.100.184.85 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [localhost vhivemaster] and IPs [10.100.184.85 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
[certs] Generating "sa" key and public key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "kubelet.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Starting the kubelet
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[kubelet-check] Initial timeout of 40s passed.
[apiclient] All control plane components are healthy after 69.271402 seconds
[upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[kubelet] Creating a ConfigMap "kubelet-config" in namespace kube-system with the configuration for the kubelets in the cluster
[upload-certs] Skipping phase. Please see --upload-certs
[mark-control-plane] Marking the node vhivemaster as control-plane by adding the labels: [node-role.kubernetes.io/control-plane node.kubernetes.io/exclude-from-external-load-balancers]
[mark-control-plane] Marking the node vhivemaster as control-plane by adding the taints [node-role.kubernetes.io/control-plane:NoSchedule]
[bootstrap-token] Using token: wpefqy.ta83yflrwktaqneg
[bootstrap-token] Configuring bootstrap tokens, cluster-info ConfigMap, RBAC Roles
[bootstrap-token] Configured RBAC rules to allow Node Bootstrap tokens to get nodes
[bootstrap-token] Configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[bootstrap-token] Configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[bootstrap-token] Configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[bootstrap-token] Creating the "cluster-info" ConfigMap in the "kube-public" namespace
[kubelet-finalize] Updating "/etc/kubernetes/kubelet.conf" to point to a rotatable kubelet client certificate and key
[addons] Applied essential addon: CoreDNS
[addons] Applied essential addon: kube-proxy

Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

Alternatively, if you are the root user, you can run:

  export KUBECONFIG=/etc/kubernetes/admin.conf

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join 10.100.184.85:6443 --token wpefqy.ta83yflrwktaqneg \
    --discovery-token-ca-cert-hash sha256:37acf38fa718d3019ea68467411f7b455f0c21bfb6a5e9360913cd54e4e139e1 
INFO: 18:25:03 logs.go:88: 
INFO: 18:25:03 logs.go:100: Making kubectl work for non-root user >>>>> 
INFO: 18:25:03 system.go:81: Executing shell command: mkdir -p /home/vboxuser/.kube && sudo cp -i /etc/kubernetes/admin.conf /home/vboxuser/.kube/config && sudo chown $(id -u):$(id -g) /home/vboxuser/.kube/config
INFO: 18:25:03 system.go:82: Stdout from shell:

INFO: 18:25:03 logs.go:88: 
INFO: 18:25:03 logs.go:100: Extracting master node information from logs >>>>> 
INFO: 18:25:03 system.go:81: Executing shell command: sed -n '/.*kubeadm join.*/p' < /tmp/vHive_tmp1120224528/masterNodeInfo | sed -n 's/.*join \(.*\):\(\S*\) --token \(\S*\).*/\1 \2 \3/p'
INFO: 18:25:03 system.go:82: Stdout from shell:
10.100.184.85 6443 wpefqy.ta83yflrwktaqneg
INFO: 18:25:03 system.go:81: Executing shell command: sed -n '/.*sha256:.*/p' < /tmp/vHive_tmp1120224528/masterNodeInfo | sed -n 's/.*\(sha256:\S*\).*/\1/p'
INFO: 18:25:03 system.go:82: Stdout from shell:
sha256:37acf38fa718d3019ea68467411f7b455f0c21bfb6a5e9360913cd54e4e139e1
INFO: 18:25:03 logs.go:88: 
INFO: 18:25:03 logs.go:100: Creating masterKey.yaml with master node information >>>>> 
INFO: 18:25:03 logs.go:88: 
INFO: 18:25:03 logs.go:88: Join cluster from worker nodes with command: sudo kubeadm join 10.100.184.85:6443 --token wpefqy.ta83yflrwktaqneg --discovery-token-ca-cert-hash sha256:37acf38fa718d3019ea68467411f7b455f0c21bfb6a5e9360913cd54e4e139e1
INFO: 18:25:03 logs.go:76: All nodes need to be joined in the cluster. Have you joined all nodes? (y/n): 
INFO: 18:31:33 logs.go:88: All nodes successfully joined!(user confirmed)
INFO: 18:31:33 logs.go:100: Set up master node
INFO: 18:31:33 logs.go:100: Installing pod network >>>>> 
INFO: 18:31:52 system.go:81: Executing shell command: kubectl apply -f /home/vboxuser/vhive/configs/calico/canal.yaml
INFO: 18:31:52 system.go:82: Stdout from shell:
poddisruptionbudget.policy/calico-kube-controllers created
serviceaccount/calico-kube-controllers created
serviceaccount/calico-node created
configmap/calico-config created
customresourcedefinition.apiextensions.k8s.io/bgpconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/bgppeers.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/blockaffinities.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/caliconodestatuses.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/clusterinformations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/felixconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/globalnetworkpolicies.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/globalnetworksets.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/hostendpoints.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamblocks.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamconfigs.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipamhandles.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ippools.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ipreservations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/kubecontrollersconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/networkpolicies.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/networksets.crd.projectcalico.org created
clusterrole.rbac.authorization.k8s.io/calico-kube-controllers created
clusterrole.rbac.authorization.k8s.io/calico-node created
clusterrolebinding.rbac.authorization.k8s.io/calico-kube-controllers created
clusterrolebinding.rbac.authorization.k8s.io/calico-node created
daemonset.apps/calico-node created
deployment.apps/calico-kube-controllers created
INFO: 18:31:52 logs.go:88: 
INFO: 18:31:52 logs.go:100: Installing and configuring MetalLB >>>>> 
INFO: 18:31:54 system.go:81: Executing shell command: kubectl get configmap kube-proxy -n kube-system -o yaml | sed -e "s/strictARP: false/strictARP: true/" | kubectl apply -f - -n kube-system
INFO: 18:31:54 system.go:82: Stdout from shell:
configmap/kube-proxy configured
INFO: 18:32:32 system.go:81: Executing shell command: kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.13.9/config/manifests/metallb-native.yaml
INFO: 18:32:32 system.go:82: Stdout from shell:
namespace/metallb-system created
customresourcedefinition.apiextensions.k8s.io/addresspools.metallb.io created
customresourcedefinition.apiextensions.k8s.io/bfdprofiles.metallb.io created
customresourcedefinition.apiextensions.k8s.io/bgpadvertisements.metallb.io created
customresourcedefinition.apiextensions.k8s.io/bgppeers.metallb.io created
customresourcedefinition.apiextensions.k8s.io/communities.metallb.io created
customresourcedefinition.apiextensions.k8s.io/ipaddresspools.metallb.io created
customresourcedefinition.apiextensions.k8s.io/l2advertisements.metallb.io created
serviceaccount/controller created
serviceaccount/speaker created
role.rbac.authorization.k8s.io/controller created
role.rbac.authorization.k8s.io/pod-lister created
clusterrole.rbac.authorization.k8s.io/metallb-system:controller created
clusterrole.rbac.authorization.k8s.io/metallb-system:speaker created
rolebinding.rbac.authorization.k8s.io/controller created
rolebinding.rbac.authorization.k8s.io/pod-lister created
clusterrolebinding.rbac.authorization.k8s.io/metallb-system:controller created
clusterrolebinding.rbac.authorization.k8s.io/metallb-system:speaker created
secret/webhook-server-cert created
service/webhook-service created
deployment.apps/controller created
daemonset.apps/speaker created
validatingwebhookconfiguration.admissionregistration.k8s.io/metallb-webhook-configuration created
INFO: 18:35:33 system.go:81: Executing shell command: kubectl -n metallb-system wait deploy controller --timeout=180s --for=condition=Available
INFO: 18:35:33 system.go:82: Stdout from shell:

INFO: 18:35:33 logs.go:100: Cleaning up temporary directory >>>>> 
INFO: 18:35:33 logs.go:88: 
ymc101 commented 8 months ago

create_multinode_cluster_error.log:

ERROR: 18:21:50 system.go:85: Executing shell command: git rev-parse --show-toplevel
ERROR: 18:21:50 system.go:86: Stderr from shell:

ERROR: 18:21:50 system.go:85: Executing shell command: sudo mkdir -p /etc/sysconfig
ERROR: 18:21:50 system.go:86: Stderr from shell:

ERROR: 18:21:50 system.go:85: Executing shell command: sudo sh -c 'cat <<EOF > /etc/sysconfig/kubelet
KUBELET_EXTRA_ARGS="--container-runtime=remote --v=0 --runtime-request-timeout=15m --container-runtime-endpoint=unix:///run/containerd/containerd.sock"
EOF'
ERROR: 18:21:50 system.go:86: Stderr from shell:

ERROR: 18:21:51 system.go:85: Executing shell command: sudo systemctl daemon-reload
ERROR: 18:21:51 system.go:86: Stderr from shell:

ERROR: 18:21:51 system.go:85: Executing shell command: ip route | awk '{print $(NF)}' | awk '/^10\..*/'
ERROR: 18:21:51 system.go:86: Stderr from shell:

ERROR: 18:25:03 system.go:85: Executing shell command: sudo kubeadm init --v=0 \
--apiserver-advertise-address= \
--cri-socket /run/containerd/containerd.sock \
--kubernetes-version 1.25.9 \
--pod-network-cidr="192.168.0.0/16" | tee /tmp/vHive_tmp1120224528/masterNodeInfo
ERROR: 18:25:03 system.go:86: Stderr from shell:
W0216 18:21:51.426910   32124 initconfiguration.go:119] Usage of CRI endpoints without URL scheme is deprecated and can cause kubelet errors in the future. Automatically prepending scheme "unix" to the "criSocket" with value "/run/containerd/containerd.sock". Please update your configuration!
ERROR: 18:25:03 system.go:85: Executing shell command: mkdir -p /home/vboxuser/.kube && sudo cp -i /etc/kubernetes/admin.conf /home/vboxuser/.kube/config && sudo chown $(id -u):$(id -g) /home/vboxuser/.kube/config
ERROR: 18:25:03 system.go:86: Stderr from shell:

ERROR: 18:25:03 system.go:85: Executing shell command: sed -n '/.*kubeadm join.*/p' < /tmp/vHive_tmp1120224528/masterNodeInfo | sed -n 's/.*join \(.*\):\(\S*\) --token \(\S*\).*/\1 \2 \3/p'
ERROR: 18:25:03 system.go:86: Stderr from shell:

ERROR: 18:25:03 system.go:85: Executing shell command: sed -n '/.*sha256:.*/p' < /tmp/vHive_tmp1120224528/masterNodeInfo | sed -n 's/.*\(sha256:\S*\).*/\1/p'
ERROR: 18:25:03 system.go:86: Stderr from shell:

ERROR: 18:31:52 system.go:85: Executing shell command: kubectl apply -f /home/vboxuser/vhive/configs/calico/canal.yaml
ERROR: 18:31:52 system.go:86: Stderr from shell:

ERROR: 18:31:54 system.go:85: Executing shell command: kubectl get configmap kube-proxy -n kube-system -o yaml | sed -e "s/strictARP: false/strictARP: true/" | kubectl apply -f - -n kube-system
ERROR: 18:31:54 system.go:86: Stderr from shell:
Warning: resource configmaps/kube-proxy is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by kubectl apply. kubectl apply should only be used on resources created declaratively by either kubectl create --save-config or kubectl apply. The missing annotation will be patched automatically.
ERROR: 18:32:32 system.go:85: Executing shell command: kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.13.9/config/manifests/metallb-native.yaml
ERROR: 18:32:32 system.go:86: Stderr from shell:

ERROR: 18:35:33 system.go:85: Executing shell command: kubectl -n metallb-system wait deploy controller --timeout=180s --for=condition=Available
ERROR: 18:35:33 system.go:86: Stderr from shell:
error: timed out waiting for the condition on deployments/controller
ERROR: 18:35:33 logs.go:64: [exit 1] -> error: timed out waiting for the condition on deployments/controller
ERROR: 18:35:33 logs.go:64: Failed to install and configure MetalLB!
ERROR: 18:35:33 logs.go:64: Failed to set up master node!
ERROR: 18:35:33 logs.go:64: Faild subcommand: create_multinode_cluster!
leokondrashov commented 8 months ago

Can you provide the output of kubectl describe pod controller -n metallb-system after the scheduler actually places the pod? At the end of the output there should be events that might explain why the deployment is delayed.
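
Since the actual pod name carries a ReplicaSet hash, one way to target it (an illustrative alternative, not a required step) is to select it by the MetalLB labels shown later in this thread:

kubectl -n metallb-system get pods -l component=controller                  # find the full pod name
kubectl -n metallb-system describe pod -l app=metallb,component=controller  # or describe it directly by label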

ymc101 commented 8 months ago

Do I run that command on the worker node right after it joins the cluster, and before I answer the prompt on the master node (./setup_tool create_multinode_cluster firecracker) confirming that all nodes have joined the cluster?

leokondrashov commented 8 months ago

After it fails to deploy the MetalLB services.

ymc101 commented 8 months ago

This is the output I got:

Name:             controller-844979dcdc-hhk5d
Namespace:        metallb-system
Priority:         0
Service Account:  controller
Node:             vhiveworker/10.100.183.218
Start Time:       Mon, 19 Feb 2024 22:17:42 +0800
Labels:           app=metallb
                  component=controller
                  pod-template-hash=844979dcdc
Annotations:      cni.projectcalico.org/containerID: 3ab842fcceec22b99646560f9eed7bb655ff0553beb9c23d23749c7db5e99171
                  cni.projectcalico.org/podIP: 192.168.104.66/32
                  cni.projectcalico.org/podIPs: 192.168.104.66/32
                  prometheus.io/port: 7472
                  prometheus.io/scrape: true
Status:           Running
IP:               192.168.104.66
IPs:
  IP:           192.168.104.66
Controlled By:  ReplicaSet/controller-844979dcdc
Containers:
  controller:
    Container ID:  containerd://6a218a54ce4a2c49d81d79f8ffdf6ed76ed471381de56e5001196d9a4b97ebf7
    Image:         quay.io/metallb/controller:v0.13.9
    Image ID:      quay.io/metallb/controller@sha256:c9ffd7215dcf93ff69b474c9bc5889ac69da395c62bd693110ba3b57fcecc28c
    Ports:         7472/TCP, 9443/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      --port=7472
      --log-level=info
    State:          Running
      Started:      Mon, 19 Feb 2024 22:19:55 +0800
    Ready:          False
    Restart Count:  0
    Liveness:       http-get http://:monitoring/metrics delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:monitoring/metrics delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:
      METALLB_ML_SECRET_NAME:  memberlist
      METALLB_DEPLOYMENT:      controller
    Mounts:
      /tmp/k8s-webhook-server/serving-certs from cert (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8mqp5 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  webhook-server-cert
    Optional:    false
  kube-api-access-8mqp5:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                   From               Message
  ----     ------                  ----                  ----               -------
  Warning  FailedScheduling        4m56s                 default-scheduler  0/2 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.
  Normal   Scheduled               3m7s                  default-scheduler  Successfully assigned metallb-system/controller-844979dcdc-hhk5d to vhiveworker
  Warning  FailedCreatePodSandBox  2m39s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "17c4426b244715eef19caaa0a4da5fa0ebad35b4ceabea2cb15c26ebfb5ab0dd": plugin type="calico" failed (add): stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
  Normal   SandboxChanged          2m2s (x4 over 2m39s)  kubelet            Pod sandbox changed, it will be killed and re-created.
  Normal   Pulling                 101s                  kubelet            Pulling image "quay.io/metallb/controller:v0.13.9"
  Normal   Pulled                  58s                   kubelet            Successfully pulled image "quay.io/metallb/controller:v0.13.9" in 22.5950977s (43.359467031s including waiting)
  Warning  Unhealthy               15s (x3 over 35s)     kubelet            Liveness probe failed: Get "http://192.168.104.66:7472/metrics": dial tcp 192.168.104.66:7472: connect: connection refused
  Normal   Killing                 15s                   kubelet            Container controller failed liveness probe, will be restarted
  Normal   Pulled                  12s                   kubelet            Container image "quay.io/metallb/controller:v0.13.9" already present on machine
  Warning  Unhealthy               5s (x5 over 35s)      kubelet            Readiness probe failed: Get "http://192.168.104.66:7472/metrics": dial tcp 192.168.104.66:7472: connect: connection refused
  Normal   Created                 4s (x2 over 54s)      kubelet            Created container controller
  Normal   Started                 3s (x2 over 53s)      kubelet            Started container controller

Does it mention why there is an error with the MetalLB setup? I'm not sure how to interpret this log.

leokondrashov commented 8 months ago

I see several minutes of waiting between the first two events because the worker node was not ready. The other delays are not that big (only the image pull, which took about 40s, and I have no idea how to improve that). Can you also add the output of kubectl describe node and kubectl describe deploy controller -n metallb-system?

I suppose you can try to continue the setup with ./setup_tool setup_master_node firecracker and record similar data for the failed pods: kubectl describe pod cluster-local-gateway -n istio-system and kubectl describe pod istio-ingressgateway -n istio-system, if the output of the Istio deployment step complains about them not being ready.

ymc101 commented 8 months ago

kubectl describe node:

Name:               vhivemaster
Roles:              control-plane
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=vhivemaster
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=
                    node.kubernetes.io/exclude-from-external-load-balancers=
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: unix:///run/containerd/containerd.sock
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 10.100.176.138/20
                    projectcalico.org/IPv4IPIPTunnelAddr: 192.168.148.64
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 20 Feb 2024 19:24:14 +0800
Taints:             node-role.kubernetes.io/control-plane:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  vhivemaster
  AcquireTime:     <unset>
  RenewTime:       Tue, 20 Feb 2024 20:18:37 +0800
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Tue, 20 Feb 2024 19:58:38 +0800   Tue, 20 Feb 2024 19:58:38 +0800   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Tue, 20 Feb 2024 20:18:21 +0800   Tue, 20 Feb 2024 19:24:14 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Tue, 20 Feb 2024 20:18:21 +0800   Tue, 20 Feb 2024 19:24:14 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Tue, 20 Feb 2024 20:18:21 +0800   Tue, 20 Feb 2024 19:24:14 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Tue, 20 Feb 2024 20:18:21 +0800   Tue, 20 Feb 2024 19:57:33 +0800   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  10.100.176.138
  Hostname:    vhivemaster
Capacity:
  cpu:                5
  ephemeral-storage:  102107096Ki
  hugepages-2Mi:      0
  memory:             13192552Ki
  pods:               110
Allocatable:
  cpu:                5
  ephemeral-storage:  94101899518
  hugepages-2Mi:      0
  memory:             13090152Ki
  pods:               110
System Info:
  Machine ID:                 fbeb15dcad234a4e9fa40fff05b39056
  System UUID:                98bcd0f4-c159-7547-a558-f8569d3a9b4c
  Boot ID:                    e657f05f-a8ec-4c4d-8ee5-ea232a912126
  Kernel Version:             5.15.0-94-generic
  OS Image:                   Ubuntu 20.04 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.18
  Kubelet Version:            v1.25.9
  Kube-Proxy Version:         v1.25.9
PodCIDR:                      192.168.0.0/24
PodCIDRs:                     192.168.0.0/24
Non-terminated Pods:          (7 in total)
  Namespace                   Name                                   CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                   ------------  ----------  ---------------  -------------  ---
  kube-system                 calico-node-xzzv7                      250m (5%)     0 (0%)      0 (0%)           0 (0%)         23m
  kube-system                 etcd-vhivemaster                       100m (2%)     0 (0%)      100Mi (0%)       0 (0%)         53m
  kube-system                 kube-apiserver-vhivemaster             250m (5%)     0 (0%)      0 (0%)           0 (0%)         53m
  kube-system                 kube-controller-manager-vhivemaster    200m (4%)     0 (0%)      0 (0%)           0 (0%)         54m
  kube-system                 kube-proxy-28mhk                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         53m
  kube-system                 kube-scheduler-vhivemaster             100m (2%)     0 (0%)      0 (0%)           0 (0%)         53m
  metallb-system              speaker-qm62w                          0 (0%)        0 (0%)      0 (0%)           0 (0%)         21m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                900m (18%)  0 (0%)
  memory             100Mi (0%)  0 (0%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
Events:
  Type     Reason                   Age   From             Message
  ----     ------                   ----  ----             -------
  Normal   Starting                 53m   kube-proxy       
  Normal   Starting                 53m   kubelet          Starting kubelet.
  Warning  InvalidDiskCapacity      53m   kubelet          invalid capacity 0 on image filesystem
  Normal   NodeHasSufficientMemory  53m   kubelet          Node vhivemaster status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    53m   kubelet          Node vhivemaster status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     53m   kubelet          Node vhivemaster status is now: NodeHasSufficientPID
  Normal   NodeAllocatableEnforced  53m   kubelet          Updated Node Allocatable limit across pods
  Normal   RegisteredNode           53m   node-controller  Node vhivemaster event: Registered Node vhivemaster in Controller
  Normal   NodeReady                21m   kubelet          Node vhivemaster status is now: NodeReady

Name:               vhiveworker
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=vhiveworker
                    kubernetes.io/os=linux
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 10.100.183.218/20
                    projectcalico.org/IPv4IPIPTunnelAddr: 192.168.104.64
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 20 Feb 2024 19:54:10 +0800
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  vhiveworker
  AcquireTime:     <unset>
  RenewTime:       Tue, 20 Feb 2024 20:18:39 +0800
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Tue, 20 Feb 2024 19:58:30 +0800   Tue, 20 Feb 2024 19:58:30 +0800   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Tue, 20 Feb 2024 20:17:53 +0800   Tue, 20 Feb 2024 19:54:10 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Tue, 20 Feb 2024 20:17:53 +0800   Tue, 20 Feb 2024 19:54:10 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Tue, 20 Feb 2024 20:17:53 +0800   Tue, 20 Feb 2024 19:54:10 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Tue, 20 Feb 2024 20:17:53 +0800   Tue, 20 Feb 2024 19:57:20 +0800   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  10.100.183.218
  Hostname:    vhiveworker
Capacity:
  cpu:                5
  ephemeral-storage:  102107096Ki
  hugepages-2Mi:      0
  memory:             13125980Ki
  pods:               110
Allocatable:
  cpu:                5
  ephemeral-storage:  94101899518
  hugepages-2Mi:      0
  memory:             13023580Ki
  pods:               110
System Info:
  Machine ID:                 cbca3566c9694b7da50585efbf6f6d3d
  System UUID:                30faaeaf-fd56-0a41-9b2e-da93571da3af
  Boot ID:                    8235a938-4d7c-48a2-823f-c06b12bdf3d9
  Kernel Version:             5.15.0-94-generic
  OS Image:                   Ubuntu 20.04 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.18
  Kubelet Version:            v1.25.9
  Kube-Proxy Version:         v1.25.9
PodCIDR:                      192.168.1.0/24
PodCIDRs:                     192.168.1.0/24
Non-terminated Pods:          (7 in total)
  Namespace                   Name                                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                        ------------  ----------  ---------------  -------------  ---
  kube-system                 calico-kube-controllers-567c56ff98-ppjhv    0 (0%)        0 (0%)      0 (0%)           0 (0%)         23m
  kube-system                 calico-node-b6fls                           250m (5%)     0 (0%)      0 (0%)           0 (0%)         23m
  kube-system                 coredns-565d847f94-c6wdp                    100m (2%)     0 (0%)      70Mi (0%)        170Mi (1%)     53m
  kube-system                 coredns-565d847f94-g6kgg                    100m (2%)     0 (0%)      70Mi (0%)        170Mi (1%)     53m
  kube-system                 kube-proxy-5gm46                            0 (0%)        0 (0%)      0 (0%)           0 (0%)         24m
  metallb-system              controller-844979dcdc-zdrmz                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         22m
  metallb-system              speaker-rx55c                               0 (0%)        0 (0%)      0 (0%)           0 (0%)         21m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                450m (9%)   0 (0%)
  memory             140Mi (1%)  340Mi (2%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
Events:
  Type    Reason                   Age                From             Message
  ----    ------                   ----               ----             -------
  Normal  Starting                 23m                kube-proxy       
  Normal  NodeHasSufficientMemory  24m (x8 over 24m)  kubelet          Node vhiveworker status is now: NodeHasSufficientMemory
  Normal  RegisteredNode           24m                node-controller  Node vhiveworker event: Registered Node vhiveworker in Controller

kubectl describe deploy controller -n metallb-system:

Name:                   controller
Namespace:              metallb-system
CreationTimestamp:      Tue, 20 Feb 2024 19:55:45 +0800
Labels:                 app=metallb
                        component=controller
Annotations:            deployment.kubernetes.io/revision: 1
Selector:               app=metallb,component=controller
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           app=metallb
                    component=controller
  Annotations:      prometheus.io/port: 7472
                    prometheus.io/scrape: true
  Service Account:  controller
  Containers:
   controller:
    Image:       quay.io/metallb/controller:v0.13.9
    Ports:       7472/TCP, 9443/TCP
    Host Ports:  0/TCP, 0/TCP
    Args:
      --port=7472
      --log-level=info
    Liveness:   http-get http://:monitoring/metrics delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:monitoring/metrics delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:
      METALLB_ML_SECRET_NAME:  memberlist
      METALLB_DEPLOYMENT:      controller
    Mounts:
      /tmp/k8s-webhook-server/serving-certs from cert (ro)
  Volumes:
   cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  webhook-server-cert
    Optional:    false
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   controller-844979dcdc (1/1 replicas created)
Events:
  Type    Reason             Age   From                   Message
  ----    ------             ----  ----                   -------
  Normal  ScalingReplicaSet  24m   deployment-controller  Scaled up replica set controller-844979dcdc to 1

May I check what you mean by continuing the setup with ./setup_tool setup_master_node firecracker? If I rerun the setup command after the MetalLB failure I encounter an index-out-of-range error, and restarting the process from a VM snapshot just reproduces the MetalLB setup error, so I don't think I can get data on the Istio deployment yet.

leokondrashov commented 8 months ago

As I remember, this comment shows the result of ./setup_tool setup_master_node firecracker after the MetalLB failure, so it worked previously. Rerunning after that hit the problem you mentioned.

For now, I don't know what's wrong with the node; it just takes more time. So, possibly, the only fix is to increase the timeout on this line to 600s.
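
For reference, a minimal sketch of the same readiness check run by hand with the longer timeout (the setup log above shows the tool currently using --timeout=180s):

kubectl -n metallb-system wait deploy controller --for=condition=Available --timeout=600s  # wait up to 10 minutes instead of 3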

ymc101 commented 8 months ago

OK, I will try increasing the timeout and run it again. For the comment you referenced where it passed the MetalLB setup, I believe I had run the cleanup script for a single-node cluster and then rerun the command. Since you previously mentioned that the cleanup script is not really meant for multi-node clusters, I have stopped trying that method.

ymc101 commented 8 months ago

It seems that neither the MetalLB error nor the Istio error appeared this time. Is this the expected full output on a successful setup?

[21:01:48] [Info] Set up master node
[21:01:48] [Info] Installing pod network >>>>> [21:02:08] [Success] 
[21:02:08] [Info] Installing and configuring MetalLB >>>>> [21:07:53] [Success] 
[21:07:53] [Info] Downloading istio >>>>> [21:07:57] [Success] 
[21:07:57] [Info] Extracting istio >>>>> [21:07:57] [Success] 
[21:07:58] [Info] Deploying istio operator >>>>> [21:11:21] [Success] 
[21:11:21] [Info] Installing Knative Serving component (firecracker mode) >>>>> [21:12:28] [Success] 
[21:12:28] [Info] Installing local cluster registry >>>>> [21:12:43] [Success] 
[21:12:43] [Info] Configuring Magic DNS >>>>> [21:12:52] [Success] 
[21:12:52] [Info] Deploying istio pods >>>>> [21:13:23] [Success] 
[21:13:24] [Info] Installing Knative Eventing component >>>>> [21:15:05] [Success] 
[21:15:06] [Info] Installing a default Channel (messaging) layer >>>>> [21:15:37] [Success] 
[21:15:37] [Info] Installing a Broker layer >>>>> [21:16:09] [Success] 
[21:16:09] [Info] Cleaning up temporary directory >>>>> [21:16:09] [Success] 
Every 2.0s: kubectl get pods --all-namespaces                                                                                                                  vHiveMaster: Wed Feb 21 23:04:57 2024

NAMESPACE          NAME                                       READY   STATUS      RESTARTS       AGE
istio-system       cluster-local-gateway-76bbc4bf78-jk25v     1/1     Running     0              115m
istio-system       istio-ingressgateway-dbcbdd6d5-jpxj5       1/1     Running     0              115m
istio-system       istiod-657b54846b-h4vgb                    1/1     Running     0              116m
knative-eventing   eventing-controller-6697c6d9b6-wh27j       1/1     Running     0              110m
knative-eventing   eventing-webhook-6f9cff4954-78x25          1/1     Running     0              110m
knative-eventing   imc-controller-7848bc9cdb-dqrk9            1/1     Running     0              109m
knative-eventing   imc-dispatcher-6ccc6b7db9-v8zlv            1/1     Running     0              109m
knative-eventing   mt-broker-controller-cd9b99bd5-cmfmz       1/1     Running     0              108m
knative-eventing   mt-broker-filter-cf84c449c-nwg6w           1/1     Running     0              109m
knative-eventing   mt-broker-ingress-58c4fdd87b-lql66         1/1     Running     0              109m
knative-serving    activator-64fd97c6bd-d788p                 1/1     Running     0              113m
knative-serving    autoscaler-78bd654674-cfv2v                1/1     Running     0              113m
knative-serving    controller-67fbfcfc76-w9nmx                1/1     Running     0              112m
knative-serving    default-domain-dx7zj                       0/1     Completed   0              112m
knative-serving    domain-mapping-874f6d4d8-nqnmz             1/1     Running     0              112m
knative-serving    domainmapping-webhook-67f5d487b7-8d5cr     1/1     Running     0              112m
knative-serving    net-istio-controller-7466f95bb6-nhqw4      1/1     Running     0              111m
knative-serving    net-istio-webhook-69946ffc7d-746lj         1/1     Running     0              111m
knative-serving    webhook-9bbf89ffb-f4sjh                    1/1     Running     0              112m
kube-system        calico-kube-controllers-567c56ff98-mhhrg   1/1     Running     0              122m
kube-system        calico-node-b9n62                          1/1     Running     0              122m
kube-system        calico-node-pc65c                          1/1     Running     0              122m
kube-system        coredns-565d847f94-lv4br                   1/1     Running     0              125m
kube-system        coredns-565d847f94-nqxcc                   1/1     Running     0              125m
kube-system        etcd-vhivemaster                           1/1     Running     0              125m
kube-system        kube-apiserver-vhivemaster                 1/1     Running     0              125m
kube-system        kube-controller-manager-vhivemaster        1/1     Running     0              126m
kube-system        kube-proxy-jhh2z                           1/1     Running     0              125m
kube-system        kube-proxy-p5j74                           1/1     Running     0              123m
kube-system        kube-scheduler-vhivemaster                 1/1     Running     0              126m
metallb-system     controller-844979dcdc-m6p4b                1/1     Running     1 (117m ago)   122m
metallb-system     speaker-lc8gd                              1/1     Running     0              120m
metallb-system     speaker-x47pn                              1/1     Running     0              120m
registry           docker-registry-pod-b4nxs                  1/1     Running     0              112m
registry           registry-etc-hosts-update-7kssg            1/1     Running     0              112m
leokondrashov commented 8 months ago

Yes, that is the correct setup result. It seems that these errors are just flaky; the workaround is to increase the MetalLB timeout and hope that Istio installs in time. I suppose we can close the issue then.

ymc101 commented 8 months ago

Thanks for all your help. Before we close the issue, I have one more question about function deployment. According to the recorded tutorial session on YouTube, there is supposed to be a deployer directory in ./examples/ to automate deployment, but the directory seems to be missing. Can I check whether there is a place to do function deployment in vHive, or is it meant to be done from the vSwarm repository instead?

leokondrashov commented 8 months ago

Yes, we have moved them to the vSwarm repository. You can check our quickstart guide; it has the most up-to-date instructions, including examples of how to use these tools.

ymc101 commented 8 months ago

Hi, could I ask a few questions regarding function deployment for my setup?

When running the deployer client, I am getting some error messages:

WARN[0602] Failed to deploy function helloworld-0, /home/vboxuser/vhive/configs/knative_workloads/helloworld.yaml: exit status 1
Warning: Kubernetes default value is insecure, Knative may default this to secure in a future release: spec.template.spec.containers[0].securityContext.allowPrivilegeEscalation, spec.template.spec.containers[0].securityContext.capabilities, spec.template.spec.containers[0].securityContext.runAsNonRoot, spec.template.spec.containers[0].securityContext.seccompProfile
Creating service 'helloworld-0' in namespace 'default':

  2.963s The Route is still working to reflect the latest desired specification.
  5.347s Configuration "helloworld-0" is waiting for a Revision to become ready.
Error: timeout: service 'helloworld-0' not ready after 600 seconds
Run 'kn --help' for usage

INFO[0602] Deployed function helloworld-0               
WARN[0602] Failed to deploy function pyaes-1, /home/vboxuser/vhive/configs/knative_workloads/pyaes.yaml: exit status 1
Warning: Kubernetes default value is insecure, Knative may default this to secure in a future release: spec.template.spec.containers[0].securityContext.allowPrivilegeEscalation, spec.template.spec.containers[0].securityContext.capabilities, spec.template.spec.containers[0].securityContext.runAsNonRoot, spec.template.spec.containers[0].securityContext.seccompProfile
Creating service 'pyaes-1' in namespace 'default':

  1.442s The Route is still working to reflect the latest desired specification.
  4.234s Configuration "pyaes-1" is waiting for a Revision to become ready.
Error: timeout: service 'pyaes-1' not ready after 600 seconds
Run 'kn --help' for usage

INFO[0602] Deployed function pyaes-1                    
WARN[0603] Failed to deploy function pyaes-0, /home/vboxuser/vhive/configs/knative_workloads/pyaes.yaml: exit status 1
Warning: Kubernetes default value is insecure, Knative may default this to secure in a future release: spec.template.spec.containers[0].securityContext.allowPrivilegeEscalation, spec.template.spec.containers[0].securityContext.capabilities, spec.template.spec.containers[0].securityContext.runAsNonRoot, spec.template.spec.containers[0].securityContext.seccompProfile
Creating service 'pyaes-0' in namespace 'default':

  4.206s The Route is still working to reflect the latest desired specification.
  5.117s ...
  5.621s Configuration "pyaes-0" is waiting for a Revision to become ready.
Error: timeout: service 'pyaes-0' not ready after 600 seconds
Run 'kn --help' for usage

INFO[0603] Deployed function pyaes-0                    
WARN[0603] Failed to deploy function rnn-serving-1, /home/vboxuser/vhive/configs/knative_workloads/rnn_serving.yaml: exit status 1
Warning: Kubernetes default value is insecure, Knative may default this to secure in a future release: spec.template.spec.containers[0].securityContext.allowPrivilegeEscalation, spec.template.spec.containers[0].securityContext.capabilities, spec.template.spec.containers[0].securityContext.runAsNonRoot, spec.template.spec.containers[0].securityContext.seccompProfile
Creating service 'rnn-serving-1' in namespace 'default':

  0.751s The Route is still working to reflect the latest desired specification.
  3.778s Configuration "rnn-serving-1" is waiting for a Revision to become ready.
Error: timeout: service 'rnn-serving-1' not ready after 600 seconds
Run 'kn --help' for usage

INFO[0603] Deployed function rnn-serving-1              
WARN[0603] Failed to deploy function rnn-serving-0, /home/vboxuser/vhive/configs/knative_workloads/rnn_serving.yaml: exit status 1
Warning: Kubernetes default value is insecure, Knative may default this to secure in a future release: spec.template.spec.containers[0].securityContext.allowPrivilegeEscalation, spec.template.spec.containers[0].securityContext.capabilities, spec.template.spec.containers[0].securityContext.runAsNonRoot, spec.template.spec.containers[0].securityContext.seccompProfile
Creating service 'rnn-serving-0' in namespace 'default':

  2.567s The Route is still working to reflect the latest desired specification.
  4.244s Configuration "rnn-serving-0" is waiting for a Revision to become ready.
Error: timeout: service 'rnn-serving-0' not ready after 600 seconds
Run 'kn --help' for usage

INFO[0603] Deployed function rnn-serving-0              
WARN[1207] Failed to deploy function rnn-serving-2, /home/vboxuser/vhive/configs/knative_workloads/rnn_serving.yaml: exit status 1
Warning: Kubernetes default value is insecure, Knative may default this to secure in a future release: spec.template.spec.containers[0].securityContext.allowPrivilegeEscalation, spec.template.spec.containers[0].securityContext.capabilities, spec.template.spec.containers[0].securityContext.runAsNonRoot, spec.template.spec.containers[0].securityContext.seccompProfile
Creating service 'rnn-serving-2' in namespace 'default':

  2.081s The Route is still working to reflect the latest desired specification.
  3.126s ...
  5.313s Configuration "rnn-serving-2" is waiting for a Revision to become ready.
Error: timeout: service 'rnn-serving-2' not ready after 600 seconds
Run 'kn --help' for usage

Can those be ignored, or is there an issue getting in the way of the deployment?

When running the invoker client, I got this error regarding Go versioning:

go: go.mod file indicates go 1.21, but maximum version supported by tidy is 1.19

Is there a way to fix this?

Thanks.

leokondrashov commented 8 months ago

The errors are definitely bad; it shouldn't time out. Please send over the description of the pods: kubectl describe pod helloworld-0.

The invoker problem with Go is known; we will bump the Go version in the next release, so it will be fixed. For now, you can reinstall Go: rm -rf /usr/local/go, change the version in scripts/setup/system.json to 1.21.6, and rerun scripts/install_go.sh.
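
A rough sketch of those steps as shell commands (the exact JSON field inside system.json is not shown here, so treat the edit step as illustrative):

sudo rm -rf /usr/local/go                 # remove the old toolchain
# edit scripts/setup/system.json so the Go version reads 1.21.6
./scripts/install_go.sh                   # reinstall Go via the vHive script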

ymc101 commented 8 months ago

I got a pod-not-found error; I tried the describe pod command for the other functions as well, but it returns the same error:

vboxuser@vHiveMaster:~/vswarm$ kubectl describe pod helloworld-0
Error from server (NotFound): pods "helloworld-0" not found
vboxuser@vHiveMaster:~/vswarm$ kubectl describe pod pyaes-0
Error from server (NotFound): pods "pyaes-0" not found
vboxuser@vHiveMaster:~/vswarm$ kubectl describe pod pyaes-1
Error from server (NotFound): pods "pyaes-1" not found
leokondrashov commented 8 months ago

Then describe the deployment (kubectl describe deployment helloworld-0)

ymc101 commented 8 months ago
Name:                   helloworld-0-00001-deployment
Namespace:              default
CreationTimestamp:      Tue, 27 Feb 2024 15:14:38 +0800
Labels:                 app=helloworld-0-00001
                        service.istio.io/canonical-name=helloworld-0
                        service.istio.io/canonical-revision=helloworld-0-00001
                        serving.knative.dev/configuration=helloworld-0
                        serving.knative.dev/configurationGeneration=1
                        serving.knative.dev/configurationUID=36b65317-e523-4ec3-8ea6-8734ebdf4d7b
                        serving.knative.dev/revision=helloworld-0-00001
                        serving.knative.dev/revisionUID=933839c6-a4fd-4bcf-907b-725a455a2503
                        serving.knative.dev/service=helloworld-0
                        serving.knative.dev/serviceUID=c8e131fc-8a06-46a1-8895-b7fd8d9ada06
Annotations:            autoscaling.knative.dev/target: 1
                        deployment.kubernetes.io/revision: 1
                        serving.knative.dev/creator: kubernetes-admin
Selector:               serving.knative.dev/revisionUID=933839c6-a4fd-4bcf-907b-725a455a2503
Replicas:               0 desired | 0 updated | 0 total | 0 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  0 max unavailable, 25% max surge
Pod Template:
  Labels:       app=helloworld-0-00001
                service.istio.io/canonical-name=helloworld-0
                service.istio.io/canonical-revision=helloworld-0-00001
                serving.knative.dev/configuration=helloworld-0
                serving.knative.dev/configurationGeneration=1
                serving.knative.dev/configurationUID=36b65317-e523-4ec3-8ea6-8734ebdf4d7b
                serving.knative.dev/revision=helloworld-0-00001
                serving.knative.dev/revisionUID=933839c6-a4fd-4bcf-907b-725a455a2503
                serving.knative.dev/service=helloworld-0
                serving.knative.dev/serviceUID=c8e131fc-8a06-46a1-8895-b7fd8d9ada06
  Annotations:  autoscaling.knative.dev/target: 1
                serving.knative.dev/creator: kubernetes-admin
  Containers:
   user-container:
    Image:      index.docker.io/crccheck/hello-world@sha256:0404ca69b522f8629d7d4e9034a7afe0300b713354e8bf12ec9657581cf59400
    Port:       50051/TCP
    Host Port:  0/TCP
    Environment:
      GUEST_PORT:       50051
      GUEST_IMAGE:      ghcr.io/ease-lab/helloworld:var_workload
      PORT:             50051
      K_REVISION:       helloworld-0-00001
      K_CONFIGURATION:  helloworld-0
      K_SERVICE:        helloworld-0
    Mounts:             <none>
   queue-proxy:
    Image:       ghcr.io/vhive-serverless/queue-39be6f1d08a095bd076a71d288d295b6@sha256:41259c52c99af616fae4e7a44e40c0e90eb8f5593378a4f3de5dbf35ab1df49c
    Ports:       8022/TCP, 9090/TCP, 9091/TCP, 8013/TCP, 8112/TCP
    Host Ports:  0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
    Requests:
      cpu:      25m
    Readiness:  http-get http://:8013/ delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      SERVING_NAMESPACE:                        default
      SERVING_SERVICE:                          helloworld-0
      SERVING_CONFIGURATION:                    helloworld-0
      SERVING_REVISION:                         helloworld-0-00001
      QUEUE_SERVING_PORT:                       8013
      QUEUE_SERVING_TLS_PORT:                   8112
      CONTAINER_CONCURRENCY:                    0
      REVISION_TIMEOUT_SECONDS:                 300
      REVISION_RESPONSE_START_TIMEOUT_SECONDS:  0
      REVISION_IDLE_TIMEOUT_SECONDS:            0
      SERVING_POD:                               (v1:metadata.name)
      SERVING_POD_IP:                            (v1:status.podIP)
      SERVING_LOGGING_CONFIG:                   
      SERVING_LOGGING_LEVEL:                    
      SERVING_REQUEST_LOG_TEMPLATE:             {"httpRequest": {"requestMethod": "{{.Request.Method}}", "requestUrl": "{{js .Request.RequestURI}}", "requestSize": "{{.Request.ContentLength}}", "status": {{.Response.Code}}, "responseSize": "{{.Response.Size}}", "userAgent": "{{js .Request.UserAgent}}", "remoteIp": "{{js .Request.RemoteAddr}}", "serverIp": "{{.Revision.PodIP}}", "referer": "{{js .Request.Referer}}", "latency": "{{.Response.Latency}}s", "protocol": "{{.Request.Proto}}"}, "traceId": "{{index .Request.Header "X-B3-Traceid"}}"}
      SERVING_ENABLE_REQUEST_LOG:               false
      SERVING_REQUEST_METRICS_BACKEND:          prometheus
      TRACING_CONFIG_BACKEND:                   none
      TRACING_CONFIG_ZIPKIN_ENDPOINT:           
      TRACING_CONFIG_DEBUG:                     false
      TRACING_CONFIG_SAMPLE_RATE:               0.1
      USER_PORT:                                50051
      SYSTEM_NAMESPACE:                         knative-serving
      METRICS_DOMAIN:                           knative.dev/internal/serving
      SERVING_READINESS_PROBE:                  {"tcpSocket":{"port":50051,"host":"127.0.0.1"},"successThreshold":1}
      ENABLE_PROFILING:                         false
      SERVING_ENABLE_PROBE_REQUEST_LOG:         false
      METRICS_COLLECTOR_ADDRESS:                
      CONCURRENCY_STATE_ENDPOINT:               
      CONCURRENCY_STATE_TOKEN_PATH:             /var/run/secrets/tokens/state-token
      HOST_IP:                                   (v1:status.hostIP)
      ENABLE_HTTP2_AUTO_DETECTION:              false
      ROOT_CA:                                  
    Mounts:                                     <none>
  Volumes:                                      <none>
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   helloworld-0-00001-deployment-85b6cd4698 (0/0 replicas created)
Events:
  Type    Reason             Age   From                   Message
  ----    ------             ----  ----                   -------
  Normal  ScalingReplicaSet  55m   deployment-controller  Scaled up replica set helloworld-0-00001-deployment-85b6cd4698 to 1
  Normal  ScalingReplicaSet  45m   deployment-controller  Scaled down replica set helloworld-0-00001-deployment-85b6cd4698 to 0 from 1
leokondrashov commented 8 months ago

Weird. It says that the deployment was scaled up and then back down. What about the revisions? The original error was about a Revision not becoming ready.

ymc101 commented 8 months ago

Sorry, what do you mean by revisions?

leokondrashov commented 8 months ago

kubectl get revisions and kubectl describe revision <name>
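
For example, using the revision name that appears in the output below:

kubectl get revisions                            # list Knative revisions in the default namespace
kubectl describe revision helloworld-0-00001     # inspect the revision backing helloworld-0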

ymc101 commented 8 months ago
Name:         helloworld-0-00001
Namespace:    default
Labels:       serving.knative.dev/configuration=helloworld-0
              serving.knative.dev/configurationGeneration=1
              serving.knative.dev/configurationUID=36b65317-e523-4ec3-8ea6-8734ebdf4d7b
              serving.knative.dev/routingState=active
              serving.knative.dev/service=helloworld-0
              serving.knative.dev/serviceUID=c8e131fc-8a06-46a1-8895-b7fd8d9ada06
Annotations:  autoscaling.knative.dev/target: 1
              serving.knative.dev/creator: kubernetes-admin
              serving.knative.dev/routes: helloworld-0
              serving.knative.dev/routingStateModified: 2024-02-27T07:14:33Z
API Version:  serving.knative.dev/v1
Kind:         Revision
Metadata:
  Creation Timestamp:  2024-02-27T07:14:33Z
  Generation:          1
  Managed Fields:
    API Version:  serving.knative.dev/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:autoscaling.knative.dev/target:
          f:serving.knative.dev/creator:
          f:serving.knative.dev/routes:
          f:serving.knative.dev/routingStateModified:
        f:labels:
          .:
          f:serving.knative.dev/configuration:
          f:serving.knative.dev/configurationGeneration:
          f:serving.knative.dev/configurationUID:
          f:serving.knative.dev/routingState:
          f:serving.knative.dev/service:
          f:serving.knative.dev/serviceUID:
        f:ownerReferences:
          .:
          k:{"uid":"36b65317-e523-4ec3-8ea6-8734ebdf4d7b"}:
      f:spec:
        .:
        f:containerConcurrency:
        f:containers:
        f:enableServiceLinks:
        f:timeoutSeconds:
    Manager:      controller
    Operation:    Update
    Time:         2024-02-27T07:14:33Z
    API Version:  serving.knative.dev/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:actualReplicas:
        f:conditions:
        f:containerStatuses:
        f:observedGeneration:
    Manager:      controller
    Operation:    Update
    Subresource:  status
    Time:         2024-02-27T07:25:29Z
  Owner References:
    API Version:           serving.knative.dev/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Configuration
    Name:                  helloworld-0
    UID:                   36b65317-e523-4ec3-8ea6-8734ebdf4d7b
  Resource Version:        24730
  UID:                     933839c6-a4fd-4bcf-907b-725a455a2503
Spec:
  Container Concurrency:  0
  Containers:
    Env:
      Name:   GUEST_PORT
      Value:  50051
      Name:   GUEST_IMAGE
      Value:  ghcr.io/ease-lab/helloworld:var_workload
    Image:    crccheck/hello-world:latest
    Name:     user-container
    Ports:
      Container Port:  50051
      Name:            h2c
      Protocol:        TCP
    Readiness Probe:
      Success Threshold:  1
      Tcp Socket:
        Port:  0
    Resources:
  Enable Service Links:  false
  Timeout Seconds:       300
Status:
  Actual Replicas:  0
  Conditions:
    Last Transition Time:  2024-02-27T07:25:29Z
    Message:               The target is not receiving traffic.
    Reason:                NoTraffic
    Severity:              Info
    Status:                False
    Type:                  Active
    Last Transition Time:  2024-02-27T07:14:40Z
    Reason:                Deploying
    Status:                Unknown
    Type:                  ContainerHealthy
    Last Transition Time:  2024-02-27T07:24:50Z
    Message:               Failed to get/pull image: failed to prepare extraction snapshot "extract-305755493-hrFD sha256:5216338b40a7b96416b8b9858974bbe4acc3096ee60acbc4dfb1ee02aecceb10": context deadline exceeded
    Reason:                CreateContainerError
    Status:                False
    Type:                  Ready
    Last Transition Time:  2024-02-27T07:24:50Z
    Message:               Failed to get/pull image: failed to prepare extraction snapshot "extract-305755493-hrFD sha256:5216338b40a7b96416b8b9858974bbe4acc3096ee60acbc4dfb1ee02aecceb10": context deadline exceeded
    Reason:                CreateContainerError
    Status:                False
    Type:                  ResourcesAvailable
  Container Statuses:
    Image Digest:       index.docker.io/crccheck/hello-world@sha256:0404ca69b522f8629d7d4e9034a7afe0300b713354e8bf12ec9657581cf59400
    Name:               user-container
  Observed Generation:  1
Events:                 <none>
leokondrashov commented 8 months ago

I've never seen such errors: "Failed to get/pull image: failed to prepare extraction snapshot". Please open a separate issue and attach the Firecracker logs from the worker nodes. It seems that this is a Firecracker issue now.

ymc101 commented 8 months ago

Is there a command or file location I can use to access the Firecracker logs on the worker node?