vhive-serverless / vHive

vHive: Open-source framework for serverless experimentation

required key GUEST_ADDR missing value #694

Closed: iamhieupc closed this issue 7 months ago

iamhieupc commented 1 year ago

When I deploy a service with the following YAML file, I get the "required key GUEST_ADDR missing value" error:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: aes-python
  namespace: default
spec:
  template:
    spec:
      containers:
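For reference, a minimal sketch of how such a manifest is typically applied and checked; the file name aes-python.yaml is an assumption:

# Apply the (hypothetical) manifest and check the resulting Knative service and its pods
kubectl apply -f aes-python.yaml
kubectl get ksvc aes-python -n default
kubectl get pods -n default -l serving.knative.dev/service=aes-python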

ustiugov commented 1 year ago

@hustchihieu143 please provide all relevant logs, info about your platform, and steps to reproduce
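For anyone hitting the same error, a sketch of commands that gather the kind of information requested here (log paths follow the vHive quickstart and are assumptions about the reporter's setup; the service name is illustrative):

# Platform and cluster state
uname -a
kubectl get pods --all-namespaces -o wide
# Knative service status and events for the failing function
kubectl describe ksvc aes-python -n default
# vHive and containerd logs, if started as in the quickstart
tail -n 100 /tmp/vhive-logs/vhive.stderr /tmp/vhive-logs/containerd.stderr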

leokondrashov commented 1 year ago

@hustchihieu143 Is the issue still relevant?

ustiugov commented 1 year ago

please re-open if still relevant.

yihan1998 commented 7 months ago

I encountered the same issue when trying to deploy on CloudLab servers. The commands and output of my setup process are as follows:

Host platform

Two c220g2 nodes on CloudLab with Intel x86 CPUs

vHive Setup Process

Commands executed on both the master and worker nodes

# Clone vHive repo
git clone --depth=1 https://github.com/vhive-serverless/vhive.git
# Create a directory for vHive logs
mkdir -p /tmp/vhive-logs/
# Enter the cloned vHive repo
cd vHive
# Install Go via scripts
./scripts/install_go.sh; source /etc/profile
# Get the setup scripts
pushd scripts && go build -o setup_tool && popd && mv scripts/setup_tool .
# Run the standard node setup script (with firecracker)
./setup_tool setup_node firecracker

Set up worker nodes

# Run the script that sets up the kubelet (with firecracker)
./setup_tool setup_worker_kubelet firecracker
# Start containerd
sudo screen -dmS containerd bash -c "containerd > >(tee -a /tmp/vhive-logs/containerd.stdout) 2> >(tee -a /tmp/vhive-logs/containerd.stderr >&2)"
# Start firecracker-containerd
sudo PATH=$PATH screen -dmS firecracker bash -c "/usr/local/bin/firecracker-containerd --config /etc/firecracker-containerd/config.toml > >(tee -a /tmp/vhive-logs/firecracker.stdout) 2> >(tee -a /tmp/vhive-logs/firecracker.stderr >&2)"
# Build vHive host orchestrator
source /etc/profile && go build
# Start vHive in a background terminal named vhive
sudo screen -dmS vhive bash -c "./vhive > >(tee -a /tmp/vhive-logs/vhive.stdout) 2> >(tee -a /tmp/vhive-logs/vhive.stderr >&2)"
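A few sanity checks after starting these sessions (a sketch; the session names and log paths match the commands above):

# The three background sessions should be listed
sudo screen -ls
# The logs should be non-empty and free of fatal errors
tail -n 20 /tmp/vhive-logs/containerd.stderr
tail -n 20 /tmp/vhive-logs/firecracker.stderr
tail -n 20 /tmp/vhive-logs/vhive.stdout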

Configure master node

# Start containerd
sudo screen -dmS containerd bash -c "containerd > >(tee -a /tmp/vhive-logs/containerd.stdout) 2> >(tee -a /tmp/vhive-logs/containerd.stderr >&2)"
# Run the script that creates the multinode cluster (with firecracker)
./setup_tool create_multinode_cluster firecracker

Configure worker nodes

# Use the masterKey.yaml to connect worker nodes to the master node
sudo kubeadm join <IP>:<PORT> --token <Token> --discovery-token-ca-cert-hash <Token Hash> > >(tee -a /tmp/vhive-logs/kubeadm_join.stdout) 2> >(tee -a /tmp/vhive-logs/kubeadm_join.stderr >&2)

Output on worker nodes after connection:

This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.

Run 'kubectl get nodes' on the control-plane to see this node join the cluster.

Finalise Master Node

  1. Answer with y to the prompt in the terminal
  2. Wait until all pods show as Running or Completed (see the check below)
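A simple way to watch for that state (a sketch, assuming kubectl is configured on the master node by the setup script):

# Watch all pods until everything is Running or Completed
watch kubectl get pods --all-namespaces
# Or list only the pods that are not yet ready (the header line always remains)
kubectl get pods --all-namespaces | grep -vE 'Running|Completed'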

Deploying and Invoking Functions in vHive

Deploy functions on master node

# Clone vSwarm repo
git clone --depth=1 --branch=v1.0 https://github.com/vhive-serverless/vSwarm ../vswarm
# Enter vSwarm repo
cd ../vswarm
# Change tools/deployer/functions.json to only deploy helloworld
# Deploy helloworld
source /etc/profile && pushd ./tools/deployer && go build && popd && ./tools/deployer/deployer -funcPath ~/vhive/configs/knative_workloads/
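A quick check that the deployment succeeded (a sketch; the service name helloworld-0 is an assumption based on the deployer's naming):

# The deployed function should appear as a Ready Knative service
kubectl get ksvc -n default
# Its revision pods carry the Knative service label
kubectl get pods -n default -l serving.knative.dev/service=helloworld-0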

Phenomenon

The status of the helloworld pod loops between Error and CrashLoopBackOff.
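Commands along these lines usually show the crash loop and which container is failing (a sketch; the pod name is illustrative, taken from the containerd log further down):

kubectl get pods -n default -o wide
kubectl describe pod helloworld-0-00001-deployment-77665cb678-5wphk -n default
# Logs of the previous (crashed) queue-proxy container
kubectl logs helloworld-0-00001-deployment-77665cb678-5wphk -c queue-proxy -n default --previous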

Reported error in log:

[error screenshot]

I also tried adding a GUEST_ADDR field with value 10.0.1.1 (the same as the master IP address in masterKey.yaml) to configs/knative_workloads/helloworld.yaml, but the error persists. Is there anything I missed in the configuration?

ustiugov commented 7 months ago

@JooyoungPark73 please check the following flow when troubleshooting.

  1. the vHive process uses the networking module to create a virtual tap device and assign an IP address to it.
  2. the vHive process sets the GUEST_ADDR env variable inside the pod of the created instance
  3. the QP container retrieves the env var’s value from its shell environment and passes it to the CreateVM call of the cri/ctriface functions

The right way to debug is to look at the vHive logs (debug level) to check the first two steps. To check the third step, one needs to look at the QP container's logs using kubectl; a sketch follows below.
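As a sketch of what that could look like on the worker node (the grep patterns are assumptions about the log wording; paths follow the quickstart, and the pod name is a placeholder):

# Steps 1-2: look for tap-device creation and the GUEST_ADDR value in the vHive logs
grep -iE 'tap|guest_addr' /tmp/vhive-logs/vhive.stdout /tmp/vhive-logs/vhive.stderr
# Step 3: logs of the queue-proxy (QP) container of the failing pod
kubectl logs <pod-name> -c queue-proxy -n default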

@lrq619 I wonder why we don't encounter this problem when running the CRI tests. please discuss with JooYoung

lrq619 commented 7 months ago

> @JooyoungPark73 please check the following flow when troubleshooting.
>
>   1. the vHive process uses the networking module to create a virtual tap device and assign an IP address to it.
>   2. the vHive process sets the GUEST_ADDR env variable inside the pod of the created instance
>   3. the QP container retrieves the env var's value from its shell environment and passes it to the CreateVM call of the cri/ctriface functions
>
> The right way to debug is to look at the vHive logs (debug level) to check the first two steps. To check the third step, one needs to look at the QP container's logs using kubectl.
>
> @lrq619 I wonder why we don't encounter this problem when running the CRI tests. please discuss with JooYoung

I discussed with JooYoung; I think the difference might be caused by the fact that our CRI tests run on a single-node cluster while the issue here is on a 2-node cluster.
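If relevant, a quick way to confirm on which node the failing pod lands (a sketch):

# The NODE column shows whether the pod was scheduled on the worker
kubectl get pods -n default -o wide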

JooyoungPark73 commented 7 months ago

containerd log

time="2024-01-26T02:32:27.590352081-07:00" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:helloworld-0-00001-deployment-77665cb678-5wphk,Uid:438db12e-dd02-4c51-aa2c-5db15259fab1,Namespace:default,Attempt:0,}"
2024-01-26 02:32:27.662 [INFO][206026] plugin.go 327: Calico CNI found existing endpoint: &{{WorkloadEndpoint projectcalico.org/v3} {node--1.issue--694.ntu--cloud.emulab.net-k8s-helloworld--0--00001--deployment--77665cb678--5wphk-eth0 helloworld-0-00001-deployment-77665cb678- default  438db12e-dd02-4c51-aa2c-5db15259fab1 92722 0 2024-01-26 02:32:27 -0700 MST <nil> <nil> map[app:helloworld-0-00001 pod-template-hash:77665cb678 projectcalico.org/namespace:default projectcalico.org/orchestrator:k8s projectcalico.org/serviceaccount:default service.istio.io/canonical-name:helloworld-0 service.istio.io/canonical-revision:helloworld-0-00001 serving.knative.dev/configuration:helloworld-0 serving.knative.dev/configurationGeneration:1 serving.knative.dev/configurationUID:abd26e43-5d72-4397-add2-7b00450639b1 serving.knative.dev/revision:helloworld-0-00001 serving.knative.dev/revisionUID:0f8f90c4-6ae8-4e7e-8c3f-1188e54aa91e serving.knative.dev/service:helloworld-0 serving.knative.dev/serviceUID:0cfd9405-13f5-4607-8038-cf8db48df7ff] map[] [] [] []} {k8s  node-1.issue-694.ntu-cloud.emulab.net  helloworld-0-00001-deployment-77665cb678-5wphk eth0 default [] []   [kns.default ksa.default.default] cali327f1fc2001  [{user-port TCP 50051 0 } {http-queueadm TCP 8022 0 } {http-autometric TCP 9090 0 } {http-usermetric TCP 9091 0 } {queue-port TCP 8013 0 } {https-port TCP 8112 0 }] []}} ContainerID="e097406be795cf36ab808b9fcc1c7631bd864dcfe86a3ae2efca6d187742474e" Namespace="default" Pod="helloworld-0-00001-deployment-77665cb678-5wphk" WorkloadEndpoint="node--1.issue--694.ntu--cloud.emulab.net-k8s-helloworld--0--00001--deployment--77665cb678--5wphk-"
2024-01-26 02:32:27.662 [INFO][206026] k8s.go 74: Extracted identifiers for CmdAddK8s ContainerID="e097406be795cf36ab808b9fcc1c7631bd864dcfe86a3ae2efca6d187742474e" Namespace="default" Pod="helloworld-0-00001-deployment-77665cb678-5wphk" WorkloadEndpoint="node--1.issue--694.ntu--cloud.emulab.net-k8s-helloworld--0--00001--deployment--77665cb678--5wphk-eth0"
2024-01-26 02:32:27.713 [INFO][206055] ipam_plugin.go 229: Calico CNI IPAM request count IPv4=1 IPv6=0 ContainerID="e097406be795cf36ab808b9fcc1c7631bd864dcfe86a3ae2efca6d187742474e" HandleID="k8s-pod-network.e097406be795cf36ab808b9fcc1c7631bd864dcfe86a3ae2efca6d187742474e" Workload="node--1.issue--694.ntu--cloud.emulab.net-k8s-helloworld--0--00001--deployment--77665cb678--5wphk-eth0"
2024-01-26 02:32:27.731 [INFO][206055] ipam_plugin.go 269: Auto assigning IP ContainerID="e097406be795cf36ab808b9fcc1c7631bd864dcfe86a3ae2efca6d187742474e" HandleID="k8s-pod-network.e097406be795cf36ab808b9fcc1c7631bd864dcfe86a3ae2efca6d187742474e" Workload="node--1.issue--694.ntu--cloud.emulab.net-k8s-helloworld--0--00001--deployment--77665cb678--5wphk-eth0" assignArgs=ipam.AutoAssignArgs{Num4:1, Num6:0, HandleID:(*string)(0xc000024550), Attrs:map[string]string{"namespace":"default", "node":"node-1.issue-694.ntu-cloud.emulab.net", "pod":"helloworld-0-00001-deployment-77665cb678-5wphk", "timestamp":"2024-01-26 09:32:27.713382481 +0000 UTC"}, Hostname:"node-1.issue-694.ntu-cloud.emulab.net", IPv4Pools:[]net.IPNet{}, IPv6Pools:[]net.IPNet{}, MaxBlocksPerHost:0, HostReservedAttrIPv4s:(*ipam.HostReservedAttr)(nil), HostReservedAttrIPv6s:(*ipam.HostReservedAttr)(nil), IntendedUse:"Workload"}
time="2024-01-26T02:32:27-07:00" level=info msg="About to acquire host-wide IPAM lock." source="ipam_plugin.go:357"
time="2024-01-26T02:32:27-07:00" level=info msg="Acquired host-wide IPAM lock." source="ipam_plugin.go:372"
2024-01-26 02:32:27.731 [INFO][206055] ipam.go 107: Auto-assign 1 ipv4, 0 ipv6 addrs for host 'node-1.issue-694.ntu-cloud.emulab.net'
2024-01-26 02:32:27.734 [INFO][206055] ipam.go 660: Looking up existing affinities for host handle="k8s-pod-network.e097406be795cf36ab808b9fcc1c7631bd864dcfe86a3ae2efca6d187742474e" host="node-1.issue-694.ntu-cloud.emulab.net"
2024-01-26 02:32:27.741 [INFO][206055] ipam.go 372: Looking up existing affinities for host host="node-1.issue-694.ntu-cloud.emulab.net"
2024-01-26 02:32:27.748 [INFO][206055] ipam.go 489: Trying affinity for 192.168.222.64/26 host="node-1.issue-694.ntu-cloud.emulab.net"
2024-01-26 02:32:27.751 [INFO][206055] ipam.go 155: Attempting to load block cidr=192.168.222.64/26 host="node-1.issue-694.ntu-cloud.emulab.net"
2024-01-26 02:32:27.757 [INFO][206055] ipam.go 232: Affinity is confirmed and block has been loaded cidr=192.168.222.64/26 host="node-1.issue-694.ntu-cloud.emulab.net"
2024-01-26 02:32:27.757 [INFO][206055] ipam.go 1180: Attempting to assign 1 addresses from block block=192.168.222.64/26 handle="k8s-pod-network.e097406be795cf36ab808b9fcc1c7631bd864dcfe86a3ae2efca6d187742474e" host="node-1.issue-694.ntu-cloud.emulab.net"
2024-01-26 02:32:27.760 [INFO][206055] ipam.go 1682: Creating new handle: k8s-pod-network.e097406be795cf36ab808b9fcc1c7631bd864dcfe86a3ae2efca6d187742474e
2024-01-26 02:32:27.766 [INFO][206055] ipam.go 1203: Writing block in order to claim IPs block=192.168.222.64/26 handle="k8s-pod-network.e097406be795cf36ab808b9fcc1c7631bd864dcfe86a3ae2efca6d187742474e" host="node-1.issue-694.ntu-cloud.emulab.net"
2024-01-26 02:32:27.780 [INFO][206055] ipam.go 1216: Successfully claimed IPs: [192.168.222.101/26] block=192.168.222.64/26 handle="k8s-pod-network.e097406be795cf36ab808b9fcc1c7631bd864dcfe86a3ae2efca6d187742474e" host="node-1.issue-694.ntu-cloud.emulab.net"
2024-01-26 02:32:27.781 [INFO][206055] ipam.go 847: Auto-assigned 1 out of 1 IPv4s: [192.168.222.101/26] handle="k8s-pod-network.e097406be795cf36ab808b9fcc1c7631bd864dcfe86a3ae2efca6d187742474e" host="node-1.issue-694.ntu-cloud.emulab.net"
time="2024-01-26T02:32:27-07:00" level=info msg="Released host-wide IPAM lock." source="ipam_plugin.go:378"
2024-01-26 02:32:27.781 [INFO][206055] ipam_plugin.go 287: Calico CNI IPAM assigned addresses IPv4=[192.168.222.101/26] IPv6=[] ContainerID="e097406be795cf36ab808b9fcc1c7631bd864dcfe86a3ae2efca6d187742474e" HandleID="k8s-pod-network.e097406be795cf36ab808b9fcc1c7631bd864dcfe86a3ae2efca6d187742474e" Workload="node--1.issue--694.ntu--cloud.emulab.net-k8s-helloworld--0--00001--deployment--77665cb678--5wphk-eth0"
2024-01-26 02:32:27.785 [INFO][206026] k8s.go 383: Populated endpoint ContainerID="e097406be795cf36ab808b9fcc1c7631bd864dcfe86a3ae2efca6d187742474e" Namespace="default" Pod="helloworld-0-00001-deployment-77665cb678-5wphk" WorkloadEndpoint="node--1.issue--694.ntu--cloud.emulab.net-k8s-helloworld--0--00001--deployment--77665cb678--5wphk-eth0" endpoint=&v3.WorkloadEndpoint{TypeMeta:v1.TypeMeta{Kind:"WorkloadEndpoint", APIVersion:"projectcalico.org/v3"}, ObjectMeta:v1.ObjectMeta{Name:"node--1.issue--694.ntu--cloud.emulab.net-k8s-helloworld--0--00001--deployment--77665cb678--5wphk-eth0", GenerateName:"helloworld-0-00001-deployment-77665cb678-", Namespace:"default", SelfLink:"", UID:"438db12e-dd02-4c51-aa2c-5db15259fab1", ResourceVersion:"92722", Generation:0, CreationTimestamp:time.Date(2024, time.January, 26, 2, 32, 27, 0, time.Local), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"app":"helloworld-0-00001", "pod-template-hash":"77665cb678", "projectcalico.org/namespace":"default", "projectcalico.org/orchestrator":"k8s", "projectcalico.org/serviceaccount":"default", "service.istio.io/canonical-name":"helloworld-0", "service.istio.io/canonical-revision":"helloworld-0-00001", "serving.knative.dev/configuration":"helloworld-0", "serving.knative.dev/configurationGeneration":"1", "serving.knative.dev/configurationUID":"abd26e43-5d72-4397-add2-7b00450639b1", "serving.knative.dev/revision":"helloworld-0-00001", "serving.knative.dev/revisionUID":"0f8f90c4-6ae8-4e7e-8c3f-1188e54aa91e", "serving.knative.dev/service":"helloworld-0", "serving.knative.dev/serviceUID":"0cfd9405-13f5-4607-8038-cf8db48df7ff"}, Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:v3.WorkloadEndpointSpec{Orchestrator:"k8s", Workload:"", Node:"node-1.issue-694.ntu-cloud.emulab.net", ContainerID:"", Pod:"helloworld-0-00001-deployment-77665cb678-5wphk", Endpoint:"eth0", ServiceAccountName:"default", IPNetworks:[]string{"192.168.222.101/32"}, IPNATs:[]v3.IPNAT(nil), IPv4Gateway:"", IPv6Gateway:"", Profiles:[]string{"kns.default", "ksa.default.default"}, InterfaceName:"cali327f1fc2001", MAC:"", Ports:[]v3.WorkloadEndpointPort{v3.WorkloadEndpointPort{Name:"user-port", Protocol:numorstring.Protocol{Type:1, NumVal:0x0, StrVal:"TCP"}, Port:0xc383, HostPort:0x0, HostIP:""}, v3.WorkloadEndpointPort{Name:"http-queueadm", Protocol:numorstring.Protocol{Type:1, NumVal:0x0, StrVal:"TCP"}, Port:0x1f56, HostPort:0x0, HostIP:""}, v3.WorkloadEndpointPort{Name:"http-autometric", Protocol:numorstring.Protocol{Type:1, NumVal:0x0, StrVal:"TCP"}, Port:0x2382, HostPort:0x0, HostIP:""}, v3.WorkloadEndpointPort{Name:"http-usermetric", Protocol:numorstring.Protocol{Type:1, NumVal:0x0, StrVal:"TCP"}, Port:0x2383, HostPort:0x0, HostIP:""}, v3.WorkloadEndpointPort{Name:"queue-port", Protocol:numorstring.Protocol{Type:1, NumVal:0x0, StrVal:"TCP"}, Port:0x1f4d, HostPort:0x0, HostIP:""}, v3.WorkloadEndpointPort{Name:"https-port", Protocol:numorstring.Protocol{Type:1, NumVal:0x0, StrVal:"TCP"}, Port:0x1fb0, HostPort:0x0, HostIP:""}}, AllowSpoofedSourcePrefixes:[]string(nil)}}
2024-01-26 02:32:27.785 [INFO][206026] k8s.go 384: Calico CNI using IPs: [192.168.222.101/32] ContainerID="e097406be795cf36ab808b9fcc1c7631bd864dcfe86a3ae2efca6d187742474e" Namespace="default" Pod="helloworld-0-00001-deployment-77665cb678-5wphk" WorkloadEndpoint="node--1.issue--694.ntu--cloud.emulab.net-k8s-helloworld--0--00001--deployment--77665cb678--5wphk-eth0"
2024-01-26 02:32:27.785 [INFO][206026] dataplane_linux.go 68: Setting the host side veth name to cali327f1fc2001 ContainerID="e097406be795cf36ab808b9fcc1c7631bd864dcfe86a3ae2efca6d187742474e" Namespace="default" Pod="helloworld-0-00001-deployment-77665cb678-5wphk" WorkloadEndpoint="node--1.issue--694.ntu--cloud.emulab.net-k8s-helloworld--0--00001--deployment--77665cb678--5wphk-eth0"
2024-01-26 02:32:27.788 [INFO][206026] dataplane_linux.go 473: Disabling IPv4 forwarding ContainerID="e097406be795cf36ab808b9fcc1c7631bd864dcfe86a3ae2efca6d187742474e" Namespace="default" Pod="helloworld-0-00001-deployment-77665cb678-5wphk" WorkloadEndpoint="node--1.issue--694.ntu--cloud.emulab.net-k8s-helloworld--0--00001--deployment--77665cb678--5wphk-eth0"

2024-01-26 02:32:27.833 [INFO][206026] k8s.go 411: Added Mac, interface name, and active container ID to endpoint ContainerID="e097406be795cf36ab808b9fcc1c7631bd864dcfe86a3ae2efca6d187742474e" Namespace="default" Pod="helloworld-0-00001-deployment-77665cb678-5wphk" WorkloadEndpoint="node--1.issue--694.ntu--cloud.emulab.net-k8s-helloworld--0--00001--deployment--77665cb678--5wphk-eth0" endpoint=&v3.WorkloadEndpoint{TypeMeta:v1.TypeMeta{Kind:"WorkloadEndpoint", APIVersion:"projectcalico.org/v3"}, ObjectMeta:v1.ObjectMeta{Name:"node--1.issue--694.ntu--cloud.emulab.net-k8s-helloworld--0--00001--deployment--77665cb678--5wphk-eth0", GenerateName:"helloworld-0-00001-deployment-77665cb678-", Namespace:"default", SelfLink:"", UID:"438db12e-dd02-4c51-aa2c-5db15259fab1", ResourceVersion:"92722", Generation:0, CreationTimestamp:time.Date(2024, time.January, 26, 2, 32, 27, 0, time.Local), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"app":"helloworld-0-00001", "pod-template-hash":"77665cb678", "projectcalico.org/namespace":"default", "projectcalico.org/orchestrator":"k8s", "projectcalico.org/serviceaccount":"default", "service.istio.io/canonical-name":"helloworld-0", "service.istio.io/canonical-revision":"helloworld-0-00001", "serving.knative.dev/configuration":"helloworld-0", "serving.knative.dev/configurationGeneration":"1", "serving.knative.dev/configurationUID":"abd26e43-5d72-4397-add2-7b00450639b1", "serving.knative.dev/revision":"helloworld-0-00001", "serving.knative.dev/revisionUID":"0f8f90c4-6ae8-4e7e-8c3f-1188e54aa91e", "serving.knative.dev/service":"helloworld-0", "serving.knative.dev/serviceUID":"0cfd9405-13f5-4607-8038-cf8db48df7ff"}, Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:v3.WorkloadEndpointSpec{Orchestrator:"k8s", Workload:"", Node:"node-1.issue-694.ntu-cloud.emulab.net", ContainerID:"e097406be795cf36ab808b9fcc1c7631bd864dcfe86a3ae2efca6d187742474e", Pod:"helloworld-0-00001-deployment-77665cb678-5wphk", Endpoint:"eth0", ServiceAccountName:"default", IPNetworks:[]string{"192.168.222.101/32"}, IPNATs:[]v3.IPNAT(nil), IPv4Gateway:"", IPv6Gateway:"", Profiles:[]string{"kns.default", "ksa.default.default"}, InterfaceName:"cali327f1fc2001", MAC:"ae:08:39:ad:b4:e0", Ports:[]v3.WorkloadEndpointPort{v3.WorkloadEndpointPort{Name:"user-port", Protocol:numorstring.Protocol{Type:1, NumVal:0x0, StrVal:"TCP"}, Port:0xc383, HostPort:0x0, HostIP:""}, v3.WorkloadEndpointPort{Name:"http-queueadm", Protocol:numorstring.Protocol{Type:1, NumVal:0x0, StrVal:"TCP"}, Port:0x1f56, HostPort:0x0, HostIP:""}, v3.WorkloadEndpointPort{Name:"http-autometric", Protocol:numorstring.Protocol{Type:1, NumVal:0x0, StrVal:"TCP"}, Port:0x2382, HostPort:0x0, HostIP:""}, v3.WorkloadEndpointPort{Name:"http-usermetric", Protocol:numorstring.Protocol{Type:1, NumVal:0x0, StrVal:"TCP"}, Port:0x2383, HostPort:0x0, HostIP:""}, v3.WorkloadEndpointPort{Name:"queue-port", Protocol:numorstring.Protocol{Type:1, NumVal:0x0, StrVal:"TCP"}, Port:0x1f4d, HostPort:0x0, HostIP:""}, v3.WorkloadEndpointPort{Name:"https-port", Protocol:numorstring.Protocol{Type:1, NumVal:0x0, StrVal:"TCP"}, Port:0x1fb0, HostPort:0x0, HostIP:""}}, AllowSpoofedSourcePrefixes:[]string(nil)}}
2024-01-26 02:32:27.850 [INFO][206026] k8s.go 489: Wrote updated endpoint to datastore ContainerID="e097406be795cf36ab808b9fcc1c7631bd864dcfe86a3ae2efca6d187742474e" Namespace="default" Pod="helloworld-0-00001-deployment-77665cb678-5wphk" WorkloadEndpoint="node--1.issue--694.ntu--cloud.emulab.net-k8s-helloworld--0--00001--deployment--77665cb678--5wphk-eth0"
time="2024-01-26T02:32:27.873474937-07:00" level=info msg="loading plugin \"io.containerd.event.v1.publisher\"..." runtime=io.containerd.runc.v2 type=io.containerd.event.v1
time="2024-01-26T02:32:27.873638893-07:00" level=info msg="loading plugin \"io.containerd.internal.v1.shutdown\"..." runtime=io.containerd.runc.v2 type=io.containerd.internal.v1
time="2024-01-26T02:32:27.873673567-07:00" level=info msg="loading plugin \"io.containerd.ttrpc.v1.task\"..." runtime=io.containerd.runc.v2 type=io.containerd.ttrpc.v1
time="2024-01-26T02:32:27.874027820-07:00" level=info msg="starting signal loop" namespace=k8s.io path=/run/containerd/io.containerd.runtime.v2.task/k8s.io/e097406be795cf36ab808b9fcc1c7631bd864dcfe86a3ae2efca6d187742474e pid=206102 runtime=io.containerd.runc.v2
time="2024-01-26T02:32:27.980242536-07:00" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:helloworld-0-00001-deployment-77665cb678-5wphk,Uid:438db12e-dd02-4c51-aa2c-5db15259fab1,Namespace:default,Attempt:0,} returns sandbox id \"e097406be795cf36ab808b9fcc1c7631bd864dcfe86a3ae2efca6d187742474e\""
time="2024-01-26T02:32:27.984153292-07:00" level=info msg="CreateContainer within sandbox \"e097406be795cf36ab808b9fcc1c7631bd864dcfe86a3ae2efca6d187742474e\" for container &ContainerMetadata{Name:user-container,Attempt:0,}"
time="2024-01-26T02:32:27.994925994-07:00" level=info msg="CreateContainer within sandbox \"e097406be795cf36ab808b9fcc1c7631bd864dcfe86a3ae2efca6d187742474e\" for &ContainerMetadata{Name:user-container,Attempt:0,} returns container id \"49f3d6b7249fbe056887cf7a9698f70c27340cd7fef4b5686d76fe69fac93f31\""
time="2024-01-26T02:32:27.995505272-07:00" level=info msg="StartContainer for \"49f3d6b7249fbe056887cf7a9698f70c27340cd7fef4b5686d76fe69fac93f31\""
time="2024-01-26T02:32:28.091302057-07:00" level=info msg="StartContainer for \"49f3d6b7249fbe056887cf7a9698f70c27340cd7fef4b5686d76fe69fac93f31\" returns successfully"
time="2024-01-26T02:32:28.095419461-07:00" level=info msg="CreateContainer within sandbox \"e097406be795cf36ab808b9fcc1c7631bd864dcfe86a3ae2efca6d187742474e\" for container &ContainerMetadata{Name:queue-proxy,Attempt:0,}"
time="2024-01-26T02:32:28.105626233-07:00" level=info msg="CreateContainer within sandbox \"e097406be795cf36ab808b9fcc1c7631bd864dcfe86a3ae2efca6d187742474e\" for &ContainerMetadata{Name:queue-proxy,Attempt:0,} returns container id \"e16be82b8c93bc0fcac6d948e78d78e63ed851709b0c7b9ca77029a3ac2d21e2\""
time="2024-01-26T02:32:28.106137072-07:00" level=info msg="StartContainer for \"e16be82b8c93bc0fcac6d948e78d78e63ed851709b0c7b9ca77029a3ac2d21e2\""
time="2024-01-26T02:32:28.206664344-07:00" level=info msg="StartContainer for \"e16be82b8c93bc0fcac6d948e78d78e63ed851709b0c7b9ca77029a3ac2d21e2\" returns successfully"
time="2024-01-26T02:32:28.258634771-07:00" level=info msg="shim disconnected" id=e16be82b8c93bc0fcac6d948e78d78e63ed851709b0c7b9ca77029a3ac2d21e2
time="2024-01-26T02:32:28.258716686-07:00" level=warning msg="cleaning up after shim disconnected" id=e16be82b8c93bc0fcac6d948e78d78e63ed851709b0c7b9ca77029a3ac2d21e2 namespace=k8s.io
time="2024-01-26T02:32:28.258738468-07:00" level=info msg="cleaning up dead shim"
time="2024-01-26T02:32:28.274822512-07:00" level=warning msg="cleanup warnings time=\"2024-01-26T02:32:28-07:00\" level=info msg=\"starting signal loop\" namespace=k8s.io pid=206230 runtime=io.containerd.runc.v2\n"
time="2024-01-26T02:32:28.848673392-07:00" level=info msg="CreateContainer within sandbox \"e097406be795cf36ab808b9fcc1c7631bd864dcfe86a3ae2efca6d187742474e\" for container &ContainerMetadata{Name:queue-proxy,Attempt:1,}"
time="2024-01-26T02:32:28.857456924-07:00" level=info msg="CreateContainer within sandbox \"e097406be795cf36ab808b9fcc1c7631bd864dcfe86a3ae2efca6d187742474e\" for &ContainerMetadata{Name:queue-proxy,Attempt:1,} returns container id \"2df92d746d6399be2146aae7aa995d3d9888910ee2fa9261bca7ac3608e4378d\""
time="2024-01-26T02:32:28.858022175-07:00" level=info msg="StartContainer for \"2df92d746d6399be2146aae7aa995d3d9888910ee2fa9261bca7ac3608e4378d\""
time="2024-01-26T02:32:28.946321148-07:00" level=info msg="StartContainer for \"2df92d746d6399be2146aae7aa995d3d9888910ee2fa9261bca7ac3608e4378d\" returns successfully"
time="2024-01-26T02:32:28.986701354-07:00" level=info msg="shim disconnected" id=2df92d746d6399be2146aae7aa995d3d9888910ee2fa9261bca7ac3608e4378d
time="2024-01-26T02:32:28.986783309-07:00" level=warning msg="cleaning up after shim disconnected" id=2df92d746d6399be2146aae7aa995d3d9888910ee2fa9261bca7ac3608e4378d namespace=k8s.io
time="2024-01-26T02:32:28.986804597-07:00" level=info msg="cleaning up dead shim"
time="2024-01-26T02:32:29.002706430-07:00" level=warning msg="cleanup warnings time=\"2024-01-26T02:32:28-07:00\" level=info msg=\"starting signal loop\" namespace=k8s.io pid=206324 runtime=io.containerd.runc.v2\n"
time="2024-01-26T02:32:29.853389244-07:00" level=info msg="RemoveContainer for \"e16be82b8c93bc0fcac6d948e78d78e63ed851709b0c7b9ca77029a3ac2d21e2\""
time="2024-01-26T02:32:29.857671513-07:00" level=info msg="RemoveContainer for \"e16be82b8c93bc0fcac6d948e78d78e63ed851709b0c7b9ca77029a3ac2d21e2\" returns successfully"

There are no logs for vHive or Firecracker.
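In that case it may be worth confirming that the background sessions are still alive and that the log files are being written at all (a sketch; paths as in the setup above):

sudo screen -ls
ls -l /tmp/vhive-logs/
wc -l /tmp/vhive-logs/vhive.* /tmp/vhive-logs/firecracker.*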

JooyoungPark73 commented 7 months ago

I think I found step 1:

ipam.go 1216: Successfully claimed IPs: [192.168.222.101/26] ...
ipam.go 847: Auto-assigned 1 out of 1 IPv4s: [192.168.222.101/26] ...
...
k8s.go 383: Populated endpoint ContainerID= ...
k8s.go 411: Added Mac, interface name, and active container ID to endpoint ContainerID= ...

But I did not find step 2, and then the log goes:

level=info msg="StartContainer for [Proxy Container] returns successfully"
level=info msg="shim disconnected" id=[Proxy Container]
level=warning msg="cleaning up after shim disconnected" id=[Proxy Container] namespace=k8s.io

I changed container names for readability.
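One possible way to check step 2 directly on the worker node is to inspect the environment that containerd hands to the queue-proxy container (a sketch; assumes crictl is installed and configured for containerd, and the container ID is a placeholder):

sudo crictl ps -a | grep queue-proxy
sudo crictl inspect <container-id> | grep -i guest_addr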

lrq619 commented 7 months ago

After executing:

sudo mv /var/lib/firecracker-containerd/runtime/vmlinux-5.10.186 /var/lib/firecracker-containerd/runtime/hello-vmlinux.bin
sudo chmod +x /var/lib/firecracker-containerd/runtime/hello-vmlinux.bin
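Presumably one can confirm the rename took effect with a quick listing (a sketch):

ls -l /var/lib/firecracker-containerd/runtime/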

The workflow on a single-node cluster works properly; below are the logs for vHive:

INFO[2024-01-26T06:29:43.765891652-06:00] HEARTBEAT: number of active VMs: 0
INFO[2024-01-26T06:29:43.910730096-06:00] FuncPool heartbeat: ==== Stats by cold functions ====
fID, #started, #served
===================================
INFO[2024-01-26T06:30:43.765670957-06:00] HEARTBEAT: number of active VMs: 0
INFO[2024-01-26T06:30:43.911008195-06:00] FuncPool heartbeat: ==== Stats by cold functions ====
fID, #started, #served
===================================
INFO[2024-01-26T06:31:43.766150454-06:00] HEARTBEAT: number of active VMs: 0
INFO[2024-01-26T06:31:43.910439769-06:00] FuncPool heartbeat: ==== Stats by cold functions ====
fID, #started, #served
===================================
INFO[2024-01-26T06:32:43.765935983-06:00] HEARTBEAT: number of active VMs: 0
INFO[2024-01-26T06:32:43.911296042-06:00] FuncPool heartbeat: ==== Stats by cold functions ====
fID, #started, #served
===================================
INFO[2024-01-26T06:33:43.765688754-06:00] HEARTBEAT: number of active VMs: 0
INFO[2024-01-26T06:33:43.910918952-06:00] FuncPool heartbeat: ==== Stats by cold functions ====
fID, #started, #served
===================================
INFO[2024-01-26T06:34:43.765625713-06:00] HEARTBEAT: number of active VMs: 0
INFO[2024-01-26T06:34:43.910639986-06:00] FuncPool heartbeat: ==== Stats by cold functions ====
fID, #started, #served
===================================

JooyoungPark73 commented 7 months ago

@CuriousGeorgiy Can you take a look at this issue, and give some comments? Maybe some pointers?

CuriousGeorgiy commented 7 months ago

@JooyoungPark73 as far as I remember, we did test the multinode setup when I was working on #816.

Regarding https://github.com/vhive-serverless/vHive/issues/694#issuecomment-1912005945: I have also had problems with the basic kernel binary (hello-vmlinux) when creating a fresh default-rootfs image for SSH experiments, so it's definitely worth updating.

AFAIC, GUEST_ADDR is set here: https://github.com/vhive-serverless/vHive/blob/7e82cb132bca2b61b8bd57e2466c604cc6df6078/cri/firecracker/service.go#L163-L168

Not sure what could go wrong, but I would add a log message at that point to make sure all the environment variables are present, and trace the root cause.
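If one does add a temporary log line around the linked code, rebuilding and restarting vHive as in the setup flow above should pick it up (a sketch; the repository path is an assumption):

cd ~/vhive && go build
# Restart the vHive screen session started during setup
sudo screen -S vhive -X quit
sudo screen -dmS vhive bash -c "./vhive > >(tee -a /tmp/vhive-logs/vhive.stdout) 2> >(tee -a /tmp/vhive-logs/vhive.stderr >&2)"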

Other than that, I haven't really worked with or debugged the k8s setup much (since I almost always had setup issues on CloudLab), so I can't really help you out here.

ustiugov commented 7 months ago

@JooyoungPark73 we need to make this fix a priority now