vhive-serverless / vHive

vHive: Open-source framework for serverless experimentation
MIT License

Firecracker deployment is broken #781

Closed: leokondrashov closed this issue 1 year ago

leokondrashov commented 1 year ago

Describe the bug: The firecracker-containerd deployment option is not working.

This shows up as Calico pods stuck in a non-ready state (0/1 ready) with failing readiness probes. Some other pods are also not ready: the default-domain jobs keep erroring out, and the registry-etc-hosts-update pod is in a constant Init:CrashLoopBackOff.

To Reproduce: Follow the quickstart guide with the firecracker deployment on a 2-node CloudLab cluster (xl170 or d430).

Commands:

# both nodes
git clone --depth=1 https://github.com/vhive-serverless/vhive.git
cd vhive
mkdir -p /tmp/vhive-logs
./scripts/cloudlab/setup_node.sh > >(tee -a /tmp/vhive-logs/setup_node.stdout) 2> >(tee -a /tmp/vhive-logs/setup_node.stderr >&2)

# for worker
./scripts/cluster/setup_worker_kubelet.sh > >(tee -a /tmp/vhive-logs/setup_worker_kubelet.stdout) 2> >(tee -a /tmp/vhive-logs/setup_worker_kubelet.stderr >&2)
sudo screen -dmS containerd bash -c "containerd > >(tee -a /tmp/vhive-logs/containerd.stdout) 2> >(tee -a /tmp/vhive-logs/containerd.stderr >&2)"
sudo PATH=$PATH screen -dmS firecracker bash -c "/usr/local/bin/firecracker-containerd --config /etc/firecracker-containerd/config.toml > >(tee -a /tmp/vhive-logs/firecracker.stdout) 2> >(tee -a /tmp/vhive-logs/firecracker.stderr >&2)"
source /etc/profile && go build
sudo screen -dmS vhive bash -c "./vhive > >(tee -a /tmp/vhive-logs/vhive.stdout) 2> >(tee -a /tmp/vhive-logs/vhive.stderr >&2)"

# for master
sudo screen -dmS containerd bash -c "containerd > >(tee -a /tmp/vhive-logs/containerd.stdout) 2> >(tee -a /tmp/vhive-logs/containerd.stderr >&2)"
./scripts/cluster/create_multinode_cluster.sh > >(tee -a /tmp/vhive-logs/create_multinode_cluster.stdout) 2> >(tee -a /tmp/vhive-logs/create_multinode_cluster.stderr >&2)

# Join the cluster from the worker, then answer 'y' on the master node (rough sketch of the join step below)
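For completeness, the join step is just the kubeadm join command that create_multinode_cluster.sh prints for the worker; the address, token, and hash below are placeholders, not the real values:

# on the worker, run the printed join command with sudo (placeholder values)
sudo kubeadm join <master-ip>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>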

Expected behavior: A working setup with all pods ready.

Logs

$ kubectl get pods -A
NAMESPACE          NAME                                                             READY   STATUS                  RESTARTS        AGE
istio-system       cluster-local-gateway-fffb9f589-9t279                            0/1     Running                 0               14m
istio-system       istio-ingressgateway-778db64bb6-l9bsx                            0/1     Running                 0               14m
istio-system       istiod-85bf857c79-wgjgt                                          1/1     Running                 0               14m
knative-eventing   eventing-controller-6b5b744bfd-hnjbl                             1/1     Running                 0               8m48s
knative-eventing   eventing-webhook-75cdd7c68-5vfxf                                 1/1     Running                 0               8m48s
knative-eventing   imc-controller-565df566f8-s2hjb                                  1/1     Running                 0               8m46s
knative-eventing   imc-dispatcher-5bf6c7d945-2msnj                                  1/1     Running                 0               8m46s
knative-eventing   mt-broker-controller-575d4c9f77-hzc4r                            1/1     Running                 0               8m43s
knative-eventing   mt-broker-filter-746ddf5785-wqqgm                                1/1     Running                 0               8m43s
knative-eventing   mt-broker-ingress-7bff548b5b-tcx77                               1/1     Running                 0               8m43s
knative-serving    activator-64fd97c6bd-wpxvp                                       0/1     Running                 2 (2m58s ago)   8m55s
knative-serving    autoscaler-78bd654674-tmfzj                                      1/1     Running                 0               8m55s
knative-serving    controller-67fbfcfc76-tdmqx                                      1/1     Running                 0               8m55s
knative-serving    default-domain-49wns                                             0/1     Error                   0               6m7s
knative-serving    default-domain-6zb52                                             0/1     Error                   0               7m4s
knative-serving    default-domain-882d2                                             0/1     Error                   0               8m51s
knative-serving    default-domain-8mr9f                                             0/1     Error                   0               6m50s
knative-serving    default-domain-9bd4f                                             0/1     Error                   0               8m14s
knative-serving    default-domain-bntmd                                             0/1     Error                   0               6m21s
knative-serving    default-domain-fnjnr                                             0/1     Error                   0               8m
knative-serving    default-domain-hqm7s                                             0/1     Error                   0               6m35s
knative-serving    default-domain-pqmnh                                             0/1     Error                   0               7m46s
knative-serving    default-domain-vjsxz                                             0/1     Error                   0               7m18s
knative-serving    default-domain-vq9xt                                             0/1     Error                   0               7m32s
knative-serving    domain-mapping-874f6d4d8-s98gg                                   1/1     Running                 0               8m55s
knative-serving    domainmapping-webhook-67f5d487b7-plr6f                           1/1     Running                 0               8m55s
knative-serving    net-istio-controller-777b6b4d89-qkvp9                            1/1     Running                 0               8m50s
knative-serving    net-istio-webhook-78665d59fd-4ndns                               1/1     Running                 0               8m50s
knative-serving    webhook-9bbf89ffb-mzq4b                                          1/1     Running                 0               8m55s
kube-system        calico-kube-controllers-567c56ff98-kjbjj                         1/1     Running                 0               15m
kube-system        calico-node-fd2vw                                                0/1     Running                 0               15m
kube-system        calico-node-knxsq                                                0/1     Running                 0               15m
kube-system        coredns-565d847f94-5qbbn                                         1/1     Running                 0               15m
kube-system        coredns-565d847f94-bw5hw                                         1/1     Running                 0               15m
kube-system        etcd-node-0.vhive-test.ntu-cloud.emulab.net                      1/1     Running                 0               15m
kube-system        kube-apiserver-node-0.vhive-test.ntu-cloud.emulab.net            1/1     Running                 0               15m
kube-system        kube-controller-manager-node-0.vhive-test.ntu-cloud.emulab.net   1/1     Running                 0               15m
kube-system        kube-proxy-b82b4                                                 1/1     Running                 0               15m
kube-system        kube-proxy-bf5x2                                                 1/1     Running                 0               15m
kube-system        kube-scheduler-node-0.vhive-test.ntu-cloud.emulab.net            1/1     Running                 0               15m
metallb-system     controller-844979dcdc-2xvz2                                      1/1     Running                 0               15m
metallb-system     speaker-nskhz                                                    1/1     Running                 0               15m
metallb-system     speaker-w859x                                                    1/1     Running                 0               15m
registry           docker-registry-pod-hfszr                                        1/1     Running                 0               8m52s
registry           registry-etc-hosts-update-r4tjs                                  0/1     Init:CrashLoopBackOff   5 (91s ago)     8m51s
$ kubectl describe pod calico-node-fd2vw -n kube-system | tail -40
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age   From               Message
  ----     ------     ----  ----               -------
  Normal   Scheduled  19m   default-scheduler  Successfully assigned kube-system/calico-node-fd2vw to node-1.vhive-test.ntu-cloud.emulab.net
  Normal   Pulling    19m   kubelet            Pulling image "docker.io/calico/cni:v3.25.1"
  Normal   Pulled     19m   kubelet            Successfully pulled image "docker.io/calico/cni:v3.25.1" in 6.609745176s
  Normal   Created    19m   kubelet            Created container upgrade-ipam
  Normal   Started    19m   kubelet            Started container upgrade-ipam
  Normal   Pulled     19m   kubelet            Container image "docker.io/calico/cni:v3.25.1" already present on machine
  Normal   Created    19m   kubelet            Created container install-cni
  Normal   Started    19m   kubelet            Started container install-cni
  Normal   Pulling    19m   kubelet            Pulling image "docker.io/calico/node:v3.25.1"
  Normal   Pulled     19m   kubelet            Successfully pulled image "docker.io/calico/node:v3.25.1" in 8.338422707s
  Normal   Created    19m   kubelet            Created container mount-bpffs
  Normal   Started    19m   kubelet            Started container mount-bpffs
  Normal   Pulled     19m   kubelet            Container image "docker.io/calico/node:v3.25.1" already present on machine
  Normal   Created    19m   kubelet            Created container calico-node
  Normal   Started    19m   kubelet            Started container calico-node
  Warning  Unhealthy  19m   kubelet            Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/bird/bird.ctl: connect: no such file or directory
  Warning  Unhealthy  19m   kubelet            Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused
  Warning  Unhealthy  19m   kubelet            Readiness probe failed: 2023-08-01 06:55:59.478 [INFO][363] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.0.1.1
  Warning  Unhealthy  18m  kubelet  Readiness probe failed: 2023-08-01 06:56:09.489 [INFO][431] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.0.1.1
  Warning  Unhealthy  18m  kubelet  Readiness probe failed: 2023-08-01 06:56:19.522 [INFO][491] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.0.1.1
  Warning  Unhealthy  18m  kubelet  Readiness probe failed: 2023-08-01 06:56:29.495 [INFO][541] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.0.1.1
  Warning  Unhealthy  18m  kubelet  Readiness probe failed: 2023-08-01 06:56:39.527 [INFO][609] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.0.1.1
  Warning  Unhealthy  18m  kubelet  Readiness probe failed: 2023-08-01 06:56:49.503 [INFO][656] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.0.1.1
  Warning  Unhealthy  18m  kubelet  Readiness probe failed: 2023-08-01 06:56:59.506 [INFO][746] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.0.1.1
  Warning  Unhealthy  4m32s (x93 over 17m)  kubelet  (combined from similar events): Readiness probe failed: 2023-08-01 07:10:29.486 [INFO][6011] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.0.1.1
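For reference, if calicoctl is available on the nodes, the interface Calico auto-detected and the BGP session state can be checked with something like:

$ ip -br addr                  # interfaces and addresses on the node
$ sudo calicoctl node status   # BGP peers and whether the sessions are Established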

create_multinode_cluster.stderr:

W0801 00:54:16.997669   30229 initconfiguration.go:119] Usage of CRI endpoints without URL scheme is deprecated and can cause kubelet errors in the future. Automatically prepending scheme "unix" to the "criSocket" with value "/run/containerd/containerd.sock". Please update your configuration!
I0801 00:54:17.209749   30229 version.go:256] remote version is much newer: v1.27.4; falling back to: stable-1.25
All nodes need to be joined in the cluster. Have you joined all nodes? (y/n): All nodes need to be joined in the cluster. Have you joined all nodes? (y/n): Warning: resource configmaps/kube-proxy is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by kubectl apply. kubectl apply should only be used on resources created declaratively by either kubectl create --save-config or kubectl apply. The missing annotation will be patched automatically.
Error from server (InternalError): error when creating "/users/lkondras/vhive/configs/metallb/metallb-ipaddresspool.yaml": Internal error occurred: failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io": failed to call webhook: Post "https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s": context deadline exceeded
Error from server (InternalError): error when creating "/users/lkondras/vhive/configs/metallb/metallb-l2advertisement.yaml": Internal error occurred: failed calling webhook "l2advertisementvalidationwebhook.metallb.io": failed to call webhook: Post "https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-l2advertisement?timeout=10s": context deadline exceeded
! values.global.jwtPolicy is deprecated; use Values.global.jwtPolicy=third-party-jwt. See https://istio.io/latest/docs/ops/best-practices/security/#configure-third-party-service-account-tokens for more information instead

- Processing resources for Istio core.
✔ Istio core installed
- Processing resources for Istiod.
- Processing resources for Istiod. Waiting for Deployment/istio-system/istiod
✔ Istiod installed
- Processing resources for Ingress gateways.
- Processing resources for Ingress gateways. Waiting for Deployment/istio-system/cluster-local-gateway, Deployment/istio-system/istio-ingressgateway
✘ Ingress gateways encountered an error: failed to wait for resource: resources not ready after 5m0s: timed out waiting for the condition
  Deployment/istio-system/cluster-local-gateway (containers with unready status: [istio-proxy])
  Deployment/istio-system/istio-ingressgateway (containers with unready status: [istio-proxy])
- Pruning removed resources
Error: failed to install manifests: errors occurred during operation

Notes: The setup works for the pure containerd deployment (stock-only passed to the setup scripts); rough invocation below.
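If I remember the quickstart correctly, the working stock-only run differs only in the sandbox argument passed to the scripts (and the firecracker-containerd/vhive daemons are not started):

# both nodes
./scripts/cloudlab/setup_node.sh stock-only
# master
./scripts/cluster/create_multinode_cluster.sh stock-only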

leokondrashov commented 1 year ago

For xl170 nodes, the problem was fixed by running sed -i '4548i\ - name: IP_AUTODETECTION_METHOD\n value: "interface=ens1f1"' configs/calico/canal.yaml before running ./scripts/cluster/create_multinode_cluster.sh on the master node. So the problem is in Calico's choice of network interface.
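For clarity, the sed above just injects the following environment variable into the calico-node container's env in canal.yaml (the line number and indentation are specific to that file version, and the interface name is specific to xl170 nodes; Calico also supports other auto-detection methods, e.g. can-reach=<destination IP>, instead of a fixed interface name):

            # hypothetical excerpt of the calico-node env section after the patch
            - name: IP_AUTODETECTION_METHOD
              value: "interface=ens1f1"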