vhive-serverless / vHive

vHive: Open-source framework for serverless experimentation
MIT License

Google Compute Engine setup issues #158

Closed amohoste closed 3 years ago

amohoste commented 3 years ago

I tried to deploy vHive on Google Compute Engine instead of bare metal but ran into some errors which are most likely network related. The setup consists of one master node and two worker nodes, though I ran into the same issues with only one worker node. It uses 3 Ubuntu 18.04 VMs with nested virtualisation enabled on 10.240.0.0/24. Each node was allocated 2 vCPUs, 8 GB RAM and a 50 GB SSD, and the CPU platform was set to Haswell or later to support nested virtualisation. Internally in the subnet all TCP, UDP, ICMP and IPIP traffic was allowed, and externally the following ports were opened: tcp:22, tcp:6443, icmp, udp:8472, udp:4789.

Logs

master_containerd.log

worker0_containerd.log worker0_firecracker.log worker0_vhive.log

Setup

For completeness, here are the instructions I used to set up the VMs. The nodes were created in europe-west3-c.

1. Image creation

Create an image for nested virtualisation.

gcloud compute disks create nested-kvm-ub18-disk --image-project ubuntu-os-cloud --image-family ubuntu-1804-lts
gcloud compute images create nested-kvm-ub18-image \
  --source-disk nested-kvm-ub18-disk --source-disk-zone $(gcloud config get-value compute/zone) \
  --licenses "https://www.googleapis.com/compute/v1/projects/vm-options/global/licenses/enable-vmx"

2. Network setup

2.1 Virtual Private Cloud Network

Create the vhive-vpc custom Virtual Private Cloud (VPC) network to host the Kubernetes cluster:

gcloud compute networks create vhive-vpc --subnet-mode custom

Provision a subnet with a large enough IP range to fit all nodes in the cluster.

gcloud compute networks subnets create kubernetes-nodes \
  --network vhive-vpc \
  --range 10.240.0.0/24

2.2 Firewall rules

Create a firewall rule that allows internal communication across TCP, UDP, ICMP and IP in IP (used for the Calico overlay):

gcloud compute firewall-rules create vhive-vpc-allow-internal \
  --allow tcp,udp,icmp,ipip \
  --network vhive-vpc \
  --source-ranges 10.240.0.0/24

Create a firewall rule that allows external SSH (tcp:22), the Kubernetes API server (tcp:6443), ICMP and VXLAN (udp:8472, udp:4789).

gcloud compute firewall-rules create vhive-vpc-allow-external \
  --allow tcp:22,tcp:6443,icmp,udp:8472,udp:4789 \
  --network vhive-vpc \
  --source-ranges 0.0.0.0/0

3. Compute Instances

3.1 Master node

gcloud compute instances create controller \
    --async \
    --boot-disk-size 50GB \
    --boot-disk-type pd-ssd \
    --can-ip-forward \
    --image nested-kvm-ub18-image \
    --machine-type n1-standard-2 \
    --private-network-ip 10.240.0.11 \
    --scopes compute-rw,storage-ro,service-management,service-control,logging-write,monitoring \
    --subnet kubernetes-nodes \
    --min-cpu-platform "Intel Haswell" \
    --tags vhive,controller

3.2 Worker nodes

for i in 0 1; do
  gcloud compute instances create worker-${i} \
    --async \
    --boot-disk-size 50GB \
    --boot-disk-type pd-ssd \
    --can-ip-forward \
    --image nested-kvm-ub18-image \
    --machine-type n1-standard-2 \
    --private-network-ip 10.240.0.2${i} \
    --scopes compute-rw,storage-ro,service-management,service-control,logging-write,monitoring \
    --subnet kubernetes-nodes \
    --min-cpu-platform "Intel Haswell" \
    --tags vhive,worker
done

4. Node configuration

4.1 VMX setup

SSH into each node and check that nested virtualisation is enabled by running the following command. A nonzero count confirms that VMX is exposed to the VM.

grep -cw vmx /proc/cpuinfo

Then grant the current user read/write access to /dev/kvm by running the following commands. They should print "OK" if KVM access was set up successfully.

sudo setfacl -m u:${USER}:rw /dev/kvm
[ -r /dev/kvm ] && [ -w /dev/kvm ] && echo "OK" || echo "FAIL"

4.2 vHive setup

vHive can now be set up by following the vHive quick start guide.

ustiugov commented 3 years ago

@amohoste thanks for reporting this issue. From the logs, I can see only a failure to pull the Docker image for the rnn_serving function. This may happen when storage is somewhat slow, because that image is rather big.

We have previously tested running vHive in a single VM (but not in a public cloud) so it should work. I suggest you start by deploying a single-node vHive cluster in a VM. Here are the instructions:

# Setup
mkdir -p ~/logs
./scripts/cloudlab/setup_node.sh
sudo containerd > ~/logs/ctrd.log &
sudo PATH=$PATH /usr/local/bin/firecracker-containerd --config /etc/firecracker-containerd/config.toml  > ~/logs/fc-ctrd.log &
source /etc/profile && go build && sudo ./vhive  > ~/logs/vhive.log &
./scripts/cluster/create_one_node_cluster.sh
# Edit examples/deployer/functions.json and leave only one instance of the helloworld function there
# Then deploy the function that would create urls.txt
go run examples/deployer/client.go 
# Test (it invokes the URL that was previously written to urls.txt)
go run examples/invoker/client.go

If this setup works, we can continue troubleshooting with the multi-node setting. If something doesn't work, please also provide the output of the commands above (along with the logs).

amohoste commented 3 years ago

I tried running only the hello world example as instructed. On the first invocation, everything seems to work fine and the csv file gets populated with the latencies:

$ go run examples/invoker/client.go
INFO[0000] Reading the URLs from the file: urls.txt     
INFO[0005] Issued / completed requests : 4, 4           
INFO[0005] Real / target RPS : 0.80 / 1                 
INFO[0005] Benchmark finished!                          
INFO[0005] The measured latencies are saved in rps0.80_lat.csv.

$ cat rps0.80_lat.csv 
160844
13122
6713
15014

However, on the second invocation none of the requests complete anymore, resulting in an empty CSV file.

$ go run examples/invoker/client.go
INFO[0000] Reading the URLs from the file: urls.txt     
INFO[0005] Issued / completed requests : 4, 0           
INFO[0005] Real / target RPS : 0.00 / 1                 
INFO[0005] Benchmark finished!                          
INFO[0005] The measured latencies are saved in rps0.00_lat.csv.

$ cat rps0.00_lat.csv 

fc-ctrd.log ctrd.log master_containerd.log

ustiugov commented 3 years ago

@amohoste I think that the issue you observe is not really a problem. The invoker runs for only 5 seconds by default, so the instances probably fail to reply by the time the invoker finishes. This is because, by default, instances are booted from scratch (not from a snapshot), which takes roughly 3 seconds on average. Also by default, the deployer client deploys functions with autoscaling enabled, meaning that by the time you invoke a function for the second time, all of its instances may already be down and need to be booted from scratch again.

I recommend running the invoker for a longer time (e.g., 20 seconds) by setting the corresponding runtime argument.
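For example, using the -time flag:

go run examples/invoker/client.go -time 20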

ustiugov commented 3 years ago

@amohoste Please let us know if the multi-node setup works.

amohoste commented 3 years ago

Setting a longer invoker time indeed resolves the issue for a single node. Thanks for the help. For the multi-node setup, everything seems to work fine when running only the helloworld example:

$ go run examples/deployer/client.go 
go: downloading github.com/sirupsen/logrus v1.8.0
go: downloading golang.org/x/sys v0.0.0-20201201145000-ef89a241ccb3
INFO[0025] Deployed functionhelloworld-0                
INFO[0025] Deployment finished          

$ date
Fri Mar  5 12:55:46 UTC 2021

$ go run examples/invoker/client.go
go: downloading google.golang.org/grpc v1.33.1
go: downloading github.com/golang/protobuf v1.3.5
go: downloading google.golang.org/genproto v0.0.0-20200117163144-32f20d992d24
go: downloading golang.org/x/net v0.0.0-20200707034311-ab3426394381
go: downloading golang.org/x/text v0.3.2
INFO[0000] Reading the URLs from the file: urls.txt     
INFO[0005] Issued / completed requests : 4, 4           
INFO[0005] Real / target RPS : 0.80 / 1                 
INFO[0005] Benchmark finished!                          
INFO[0005] The measured latencies are saved in rps0.80_lat.csv. 

However, when running the default functions.json, which also contains rnn-serving and pyaes, I encountered some errors. There do indeed seem to be failures related to pulling the rnn_serving Docker image, but on invocation pyaes and helloworld also produce error messages:

$ date
Fri Mar  5 12:58:44 UTC 2021

$ go run examples/deployer/client.go 
INFO[0000] Deployed functionhelloworld-0                
INFO[0090] Deployed functionpyaes-1                     
INFO[0094] Deployed functionpyaes-0                     
WARN[0124] Failed to deploy function rnn-serving-1, configs/knative_workloads/rnn_serving.yaml: exit status 1
Creating service 'rnn-serving-1' in namespace 'default':

  0.111s The Route is still working to reflect the latest desired specification.
  0.297s Configuration "rnn-serving-1" is waiting for a Revision to become ready.
121.832s Revision "rnn-serving-1-00001" failed with message: Container failed with: .
121.846s Configuration "rnn-serving-1" does not have any ready Revision.
Error: RevisionFailed: Revision "rnn-serving-1-00001" failed with message: Container failed with: .
Run 'kn --help' for usage

INFO[0124] Deployed functionrnn-serving-1               
WARN[0125] Failed to deploy function rnn-serving-0, configs/knative_workloads/rnn_serving.yaml: exit status 1
Creating service 'rnn-serving-0' in namespace 'default':

  0.252s The Route is still working to reflect the latest desired specification.
  0.293s ...
  0.375s Configuration "rnn-serving-0" is waiting for a Revision to become ready.
122.799s Revision "rnn-serving-0-00001" failed with message: Container failed with: .
122.813s Configuration "rnn-serving-0" does not have any ready Revision.
Error: RevisionFailed: Revision "rnn-serving-0-00001" failed with message: Container failed with: .
Run 'kn --help' for usage

INFO[0125] Deployed functionrnn-serving-0               
WARN[0126] Failed to deploy function rnn-serving-2, configs/knative_workloads/rnn_serving.yaml: exit status 1
Creating service 'rnn-serving-2' in namespace 'default':

  0.166s The Route is still working to reflect the latest desired specification.
  0.211s Configuration "rnn-serving-2" is waiting for a Revision to become ready.
123.594s Revision "rnn-serving-2-00001" failed with message: Container failed with: .
123.637s Configuration "rnn-serving-2" does not have any ready Revision.
Error: RevisionFailed: Revision "rnn-serving-2-00001" failed with message: Container failed with: .
Run 'kn --help' for usage

INFO[0126] Deployed functionrnn-serving-2               
INFO[0126] Deployment finished

$ go run examples/invoker/client.go -time 60
INFO[0000] Reading the URLs from the file: urls.txt     
WARN[0006] Failed to invoke rnn-serving-2.default.192.168.1.240.xip.io:80, err=rpc error: code = Unimplemented desc =  
WARN[0012] Failed to invoke rnn-serving-2.default.192.168.1.240.xip.io:80, err=rpc error: code = Unimplemented desc =  
WARN[0018] Failed to invoke rnn-serving-2.default.192.168.1.240.xip.io:80, err=rpc error: code = Unimplemented desc =  
WARN[0024] Failed to invoke rnn-serving-2.default.192.168.1.240.xip.io:80, err=rpc error: code = Unimplemented desc =  
WARN[0030] Failed to invoke rnn-serving-2.default.192.168.1.240.xip.io:80, err=rpc error: code = Unimplemented desc =  
WARN[0031] Failed to invoke helloworld-0.default.192.168.1.240.xip.io:80, err=rpc error: code = DeadlineExceeded desc = context deadline exceeded 
WARN[0032] Failed to invoke pyaes-0.default.192.168.1.240.xip.io:80, err=rpc error: code = DeadlineExceeded desc = context deadline exceeded 
WARN[0033] Failed to invoke pyaes-1.default.192.168.1.240.xip.io:80, err=rpc error: code = DeadlineExceeded desc = context deadline exceeded 
WARN[0034] Failed to invoke rnn-serving-0.default.192.168.1.240.xip.io:80, err=rpc error: code = DeadlineExceeded desc = context deadline exceeded 
WARN[0035] Failed to invoke rnn-serving-1.default.192.168.1.240.xip.io:80, err=rpc error: code = DeadlineExceeded desc = context deadline exceeded 
WARN[0036] Failed to invoke rnn-serving-2.default.192.168.1.240.xip.io:80, err=rpc error: code = Unimplemented desc =  
WARN[0037] Failed to invoke helloworld-0.default.192.168.1.240.xip.io:80, err=rpc error: code = DeadlineExceeded desc = context deadline exceeded 
WARN[0038] Failed to invoke pyaes-0.default.192.168.1.240.xip.io:80, err=rpc error: code = DeadlineExceeded desc = context deadline exceeded 
WARN[0039] Failed to invoke pyaes-1.default.192.168.1.240.xip.io:80, err=rpc error: code = DeadlineExceeded desc = context deadline exceeded 
WARN[0040] Failed to invoke rnn-serving-0.default.192.168.1.240.xip.io:80, err=rpc error: code = DeadlineExceeded desc = context deadline exceeded 
WARN[0041] Failed to invoke rnn-serving-1.default.192.168.1.240.xip.io:80, err=rpc error: code = DeadlineExceeded desc = context deadline exceeded 
WARN[0042] Failed to invoke rnn-serving-2.default.192.168.1.240.xip.io:80, err=rpc error: code = Unimplemented desc =  
WARN[0043] Failed to invoke helloworld-0.default.192.168.1.240.xip.io:80, err=rpc error: code = DeadlineExceeded desc = context deadline exceeded 
WARN[0044] Failed to invoke pyaes-0.default.192.168.1.240.xip.io:80, err=rpc error: code = DeadlineExceeded desc = context deadline exceeded 
WARN[0045] Failed to invoke pyaes-1.default.192.168.1.240.xip.io:80, err=rpc error: code = DeadlineExceeded desc = context deadline exceeded 
WARN[0046] Failed to invoke rnn-serving-0.default.192.168.1.240.xip.io:80, err=rpc error: code = DeadlineExceeded desc = context deadline exceeded 
WARN[0047] Failed to invoke rnn-serving-1.default.192.168.1.240.xip.io:80, err=rpc error: code = DeadlineExceeded desc = context deadline exceeded 
WARN[0048] Failed to invoke rnn-serving-2.default.192.168.1.240.xip.io:80, err=rpc error: code = Unimplemented desc =  
WARN[0049] Failed to invoke helloworld-0.default.192.168.1.240.xip.io:80, err=rpc error: code = DeadlineExceeded desc = context deadline exceeded 
WARN[0050] Failed to invoke pyaes-0.default.192.168.1.240.xip.io:80, err=rpc error: code = DeadlineExceeded desc = context deadline exceeded 
WARN[0051] Failed to invoke pyaes-1.default.192.168.1.240.xip.io:80, err=rpc error: code = DeadlineExceeded desc = context deadline exceeded 
WARN[0052] Failed to invoke rnn-serving-0.default.192.168.1.240.xip.io:80, err=rpc error: code = DeadlineExceeded desc = context deadline exceeded 
WARN[0053] Failed to invoke rnn-serving-1.default.192.168.1.240.xip.io:80, err=rpc error: code = DeadlineExceeded desc = context deadline exceeded 
WARN[0054] Failed to invoke rnn-serving-2.default.192.168.1.240.xip.io:80, err=rpc error: code = Unimplemented desc =  
WARN[0055] Failed to invoke helloworld-0.default.192.168.1.240.xip.io:80, err=rpc error: code = Unavailable desc = upstream request timeout 
WARN[0056] Failed to invoke pyaes-0.default.192.168.1.240.xip.io:80, err=rpc error: code = DeadlineExceeded desc = context deadline exceeded 
WARN[0057] Failed to invoke pyaes-1.default.192.168.1.240.xip.io:80, err=rpc error: code = DeadlineExceeded desc = context deadline exceeded 
WARN[0058] Failed to invoke rnn-serving-0.default.192.168.1.240.xip.io:80, err=rpc error: code = DeadlineExceeded desc = context deadline exceeded 
WARN[0059] Failed to invoke rnn-serving-1.default.192.168.1.240.xip.io:80, err=rpc error: code = DeadlineExceeded desc = context deadline exceeded 
INFO[0060] Issued / completed requests : 59, 34         
INFO[0060] Real / target RPS : 0.57 / 1                 
INFO[0060] Benchmark finished!                          
INFO[0060] The measured latencies are saved in rps0.57_lat.csv. 

master-ctrd.log worker0-ctrd.log worker0-vhive.log worker0-fc-ctrd.log

ustiugov commented 3 years ago

@amohoste it seems that our timeouts are too short. Could you please specify the bandwidth of the network and the storage type mounted as the VMs' root (/) partition?

Could you please also specify the target deployment for your needs and, if you can, the target workloads?

amohoste commented 3 years ago

I was using 2 n1-standard-4 nodes (4 vCPUs, 15 GB memory) with SSD persistent disks for the measurements in my previous post.

According to the documentation, the max egress bandwidth for these machines is 10 Gbps. I also ran some iperf measurements over TCP for 60 seconds between the two nodes. Over the internal IPs I measured 7.77 Gbps with one thread and 9.72 Gbps with 4 threads; over the external IPs I obtained 4.82 Gbps with one thread and 6.67 Gbps with 4 threads.
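For reference, the runs were roughly of the following form (the exact flags are approximate; 10.240.0.21 is the other node's internal IP from the subnet above):

iperf -c 10.240.0.21 -t 60        # internal IP, single stream
iperf -c 10.240.0.21 -t 60 -P 4   # internal IP, 4 parallel streams
# the external-traffic numbers were obtained the same way against the nodes' external IPs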

I am considering using vHive to implement and evaluate different serverless function scheduling strategies as part of my master's thesis. Consequently, I will mainly be running benchmarks that should be representative of real serverless workloads. As for the target deployment, I am planning to use a Google Cloud setup with a fair number of nodes for the benchmarks.

ustiugov commented 3 years ago

Ok, thanks for the details.

I think that for the setup you described, pulling function images from Docker Hub at each worker node becomes a clear bottleneck, resulting in timeouts. What would make sense is to pre-pull the Docker images of the functions and stateful services that you intend to use into a cluster-local registry. This is not vHive-specific; any open-source FaaS would have the same problem.

Would you be willing to contribute this feature if we provide all the necessary guidance? Basically, you would need to deploy a docker-registry service, and then we can add a runtime argument to the ctriface/ module to use the local registry instead of the default one.
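To give an idea of what I mean, here is a minimal sketch (the deployment name, port and image references are illustrative placeholders, not an existing vHive component):

# Run a cluster-local registry backed by the standard registry:2 image
kubectl create deployment docker-registry --image=registry:2
kubectl expose deployment docker-registry --port=5000 --target-port=5000 --type=NodePort

# Pre-pull each function image once and push it into the local registry
# (<registry-addr> and <function-image> are placeholders)
docker pull <function-image>
docker tag <function-image> <registry-addr>:5000/<function-image>
docker push <registry-addr>:5000/<function-image>

The functions would then be deployed referencing <registry-addr>:5000/<function-image> instead of the Docker Hub image.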

Please let me know what you think.

amohoste commented 3 years ago

Sure! I could look into that. I am currently evaluating the suitability of vHive versus other open-source platforms for my use case, but I'm leaning towards vHive. Maybe we could discuss this, along with the feature you described above, on Slack or some other platform?

ustiugov commented 3 years ago

@amohoste sure, please message me on Slack and let's discuss

ustiugov commented 3 years ago

Closing as the infrastructure works on GCP.

amohoste commented 3 years ago

I am again running into issues where services fail to deploy. I think it might be related to the following error: cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config
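For reference, generic ways to inspect this (not vHive-specific) would be checking the CNI config directory and the networking pods:

ls /etc/cni/net.d
kubectl get pods -n kube-system   # check the status of the Calico pods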

ahoste@controller:~/vhive$ go run examples/deployer/client.go
go: downloading github.com/sirupsen/logrus v1.8.0
go: downloading golang.org/x/sys v0.0.0-20201201145000-ef89a241ccb3
WARN[0123] Failed to deploy function helloworld-0, configs/knative_workloads/helloworld.yaml: exit status 1
Creating service 'helloworld-0' in namespace 'default':

  0.063s The Route is still working to reflect the latest desired specification.
  0.080s ...
  0.108s Configuration "helloworld-0" is waiting for a Revision to become ready.
121.762s Revision "helloworld-0-00001" failed with message: .
121.779s Configuration "helloworld-0" does not have any ready Revision.
Error: RevisionFailed: Revision "helloworld-0-00001" failed with message: .
Run 'kn --help' for usage 

worker-0_fc-ctrd.log worker-0_vhive.log worker-0_ctrd.log controller_ctrd.log

ustiugov commented 3 years ago

@amohoste no, the CNI error message is not a real issue, just an artifact of how we start a cluster.

As of now, deployment is not a critical issue because it is not what vHive users benchmark (at least not yet). For the moment, re-deploying should suffice.

You may want to raise a separate issue for this problem and attach the logs. However, it is not going to be fixed soon because it is of rather low priority and we do not see an easy fix at the moment. Finally, in the future, please re-open an issue when you add a comment to a closed one; otherwise, it may fall through the cracks.

ustiugov commented 3 years ago

There was a problem when upgrading the dependencies' binaries that was not caught by the CI. It is already fixed. Please start with a clean setup.

ustiugov commented 3 years ago

Also, we decided to re-iterate on the existing sporadic failures ASAP. Hopefully, we'll fix this sporadic failure too.

amohoste commented 3 years ago

@ustiugov Thanks for the heads-up. Unfortunately, it still seems to be failing. Redeploying multiple times didn't help on the multi-node setup. I also tried the single-node setup with the fixed binaries but ran into the same issue. If everything is working on your end, I can only imagine this being something related to Google Cloud. I am not able to re-open the issue, however.

ahoste@controller:~/vhive$ go run examples/deployer/client.go 
WARN[0124] Failed to deploy function helloworld-0, configs/knative_workloads/helloworld.yaml: exit status 1
Creating service 'helloworld-0' in namespace 'default':

  0.096s The Route is still working to reflect the latest desired specification.
  0.153s ...
  0.176s Configuration "helloworld-0" is waiting for a Revision to become ready.
121.839s Revision "helloworld-0-00001" failed with message: .
121.957s Configuration "helloworld-0" does not have any ready Revision.
Error: RevisionFailed: Revision "helloworld-0-00001" failed with message: .
Run 'kn --help' for usage

INFO[0124] Deployed functionhelloworld-0                
INFO[0124] Deployment finished

controller_ctrd.log controller_fc-ctrd.log controller_vhive.log

ustiugov commented 3 years ago

@amohoste interesting, is your repo up to date with the latest commits from yesterday? We are going to check ASAP.

amohoste commented 3 years ago

Correct. I'll try testing with a commit from 15 days ago to double-check whether this is due to Google Cloud or a newly introduced issue.

ustiugov commented 3 years ago

try March 5th commit: 5f69089a6872a03bc88f9c2b2e7a4584a6d8f834
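For example, with a plain git checkout inside the vhive clone:

git checkout 5f69089a6872a03bc88f9c2b2e7a4584a6d8f834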

amohoste commented 3 years ago

Quick update, I am still investigating what is wrong. The issue also seems to appear on the March 5th commit.

amohoste commented 3 years ago

I haven't been able to resolve the issue. On the first deployment, I get the usual timeout:

ahoste@controller:~/vhive$ go run examples/deployer/client.go 
WARN[0124] Failed to deploy function helloworld-0, configs/knative_workloads/helloworld.yaml: exit status 1
Creating service 'helloworld-0' in namespace 'default':

  0.063s The Route is still working to reflect the latest desired specification.
  0.085s Configuration "helloworld-0" is waiting for a Revision to become ready.
  0.137s ...
122.155s Revision "helloworld-0-00001" failed with message: .
122.170s Configuration "helloworld-0" does not have any ready Revision.
Error: RevisionFailed: Revision "helloworld-0-00001" failed with message: .
Run 'kn --help' for usage

INFO[0124] Deployed functionhelloworld-0                
INFO[0124] Deployment finished                          

Upon invoking the deployer a second time, the deployment finishes almost instantly:

ahoste@controller:~/vhive$ go run examples/deployer/client.go 
INFO[0000] Deployed functionhelloworld-0                
INFO[0000] Deployment finished     

However, invoking the helloworld function does not seem to work:

ahoste@controller:~/vhive$ go run examples/invoker/client.go 
INFO[0000] Reading the URLs from the file: urls.txt     
WARN[0001] Failed to invoke helloworld-0.default.192.168.1.240.xip.io:80, err=rpc error: code = Unimplemented desc =  
WARN[0002] Failed to invoke helloworld-0.default.192.168.1.240.xip.io:80, err=rpc error: code = Unimplemented desc =  
WARN[0003] Failed to invoke helloworld-0.default.192.168.1.240.xip.io:80, err=rpc error: code = Unimplemented desc =  
WARN[0004] Failed to invoke helloworld-0.default.192.168.1.240.xip.io:80, err=rpc error: code = Unimplemented desc =  
INFO[0005] Issued / completed requests : 4, 4           
INFO[0005] Real / target RPS : 0.80 / 1                 
INFO[0005] Benchmark finished!                          
INFO[0005] The measured latencies are saved in rps0.80_lat.csv. 

Only the containerd logs contain error messages which seem to be network related: controller_ctrd.log controller_fc-ctrd.log controller_vhive.log