vhive-serverless / vHive

vHive: Open-source framework for serverless experimentation
MIT License
290 stars 90 forks source link

deployer fails with while created firecracker microVM #423

Closed prateeksahu closed 3 years ago

prateeksahu commented 3 years ago

Describe the bug Example functions deployment gives CreateContainerError in a single node cluster deployment.

Setup Single node vHive deployment on VM (x86 machine). Host OS: Ubuntu 20.04, Guest OS: Ubuntu 18.04 Linux: 4.15

Expected behavior Deployer should be able to deploy the functions

Logs

$ kubectl get po                                                                                                                                                                                                                                            [68/68]
NAME                                              READY   STATUS                 RESTARTS   AGE
helloworld-0-00001-deployment-77b5f864df-5h99k    0/2     CreateContainerError   12         2m41s
pyaes-0-00001-deployment-fcdb7f47d-x44xc          0/2     CreateContainerError   10         2m41s
pyaes-1-00001-deployment-5fbc7bbc4d-xcbtx         0/2     CreateContainerError   12         2m41s
rnn-serving-0-00001-deployment-68b4c6cdcc-x5h2m   0/2     CreateContainerError   9          2m41s
rnn-serving-1-00001-deployment-7bc96dfcd4-bgkt8   0/2     CreateContainerError   11         2m41s
$ kubectl desc pods helloworld-0-00001-deployment-77b5f864df-5h99k
...
Events:         
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  3m8s                  default-scheduler  Successfully assigned default/helloworld-0-00001-deployment-77b5f864df-5h99k to prateek
  Normal   Pulling    3m7s                  kubelet            Pulling image "index.docker.io/crccheck/hello-world@sha256:6169c93b00dc3b9cfb755ed8acd3c139b8bb7eac339f67ea94cfec95d5743755"
  Normal   Pulled     3m4s                  kubelet            Successfully pulled image "index.docker.io/crccheck/hello-world@sha256:6169c93b00dc3b9cfb755ed8acd3c139b8bb7eac339f67ea94cfec95d5743755" in 2.899066175s
  Normal   Pulling    2m40s                 kubelet            Pulling image "docker.io/vhiveease/queue-39be6f1d08a095bd076a71d288d295b6@sha256:b1d52fca42963022e9545092ef95f27e911de5d0625fae1e5881a7f59fca444a"
  Normal   Pulled     2m36s                 kubelet            Successfully pulled image "docker.io/vhiveease/queue-39be6f1d08a095bd076a71d288d295b6@sha256:b1d52fca42963022e9545092ef95f27e911de5d0625fae1e5881a7f59fca444a" in 4.174366777s
  Warning  Failed     117s (x5 over 2m36s)  kubelet            Error: VM config for pod does not exist
  Normal   Pulled     116s (x5 over 2m36s)  kubelet            Container image "index.docker.io/crccheck/hello-world@sha256:6169c93b00dc3b9cfb755ed8acd3c139b8bb7eac339f67ea94cfec95d5743755" already present on machine
  Warning  Failed     115s (x6 over 2m40s)  kubelet            Error: failed to create the microVM in firecracker-containerd: rpc error: code = Unknown desc = failed to create VM: failed to start the VM: Put "http://localhost/actions": EOF
  Normal   Pulled     115s (x5 over 2m34s)  kubelet            Container image "docker.io/vhiveease/queue-39be6f1d08a095bd076a71d288d295b6@sha256:b1d52fca42963022e9545092ef95f27e911de5d0625fae1e5881a7f59fca444a" already present on machine

firecracker-containerd logs:

...
DEBU[2021-10-18T17:45:26.620963465Z] create VM request: VMID:"9" MachineCfg:<MemSizeMib:256 VcpuCount:1 > KernelArgs:"ro noapic reboot=k panic=1 pci=off nomodules systemd.log_color=false systemd.unit=firecracker.target init=/sbin/overlay-init tsc=reliable quiet 8250.nr_uarts=0 ipv$
.disable=1" NetworkInterfaces:<StaticConfig:<MacAddress:"02:FC:00:00:00:08" HostDevName:"9_tap" IPConfig:<PrimaryAddr:"190.128.0.10/10" GatewayAddr:"190.128.0.1" Nameservers:"10.96.0.10" Nameservers:"8.8.8.8" > > > TimeoutSeconds:100                                                 
DEBU[2021-10-18T17:45:26.620994917Z] using namespace: firecracker-containerd                                                                                                                                                                                                              
DEBU[2021-10-18T17:45:26.621303774Z] starting containerd-shim-aws-firecracker      vmID=9                                                                                                                                                                                                 
ERRO[2021-10-18T17:45:26.638479426Z] Starting instance: Put "http://localhost/actions": EOF  runtime=aws.firecracker                                                                                                                                                                      
WARN[2021-10-18T17:45:26.638486565Z] firecracker exited: signal: aborted (core dumped)  runtime=aws.firecracker                                                                                                                                                                           
ERRO[2021-10-18T17:45:26.638572713Z] failed to create VM                           error="failed to start the VM: Put \"http://localhost/actions\": EOF" runtime=aws.firecracker vmID=10                                                                                                  
ERRO[2021-10-18T17:45:26.638795905Z] shim CreateVM returned error                  error="rpc error: code = Unknown desc = failed to create VM: failed to start the VM: Put \"http://localhost/actions\": EOF"                                                                            
DEBU[2021-10-18T17:45:26.644602646Z] shim has been terminated                      error="signal: killed" vmID=10                                                                                                                                                                         
INFO[2021-10-18T17:45:26.654016473Z] starting signal loop                          namespace=firecracker-containerd path=/var/lib/firecracker-containerd/shim-base/firecracker-containerd/9 pid=32274                                                                                     
INFO[2021-10-18T17:45:26.654525430Z] creating new VM                               runtime=aws.firecracker vmID=9                                                                                                                                                                         
INFO[2021-10-18T17:45:26.654694938Z] Called startVMM(), setting up a VMM on firecracker.sock  runtime=aws.firecracker                                                                                                                                                                     
INFO[2021-10-18T17:45:26.666352624Z] refreshMachineConfiguration: [GET /machine-config][200] getMachineConfigurationOK  &{CPUTemplate:T2 HtEnabled:0xc000481b43 MemSizeMib:0xc000481b38 TrackDirtyPages:false VcpuCount:0xc000481b30}  runtime=aws.firecracker                            
INFO[2021-10-18T17:45:26.666503154Z] PutGuestBootSource: [PUT /boot-source][204] putGuestBootSourceNoContent   runtime=aws.firecracker                                                                                                                                                    
INFO[2021-10-18T17:45:26.666533556Z] Attaching drive /var/lib/firecracker-containerd/runtime/default-rootfs.img, slot root_drive, root true.  runtime=aws.firecracker                                                                                                                     
INFO[2021-10-18T17:45:26.666748476Z] Attached drive /var/lib/firecracker-containerd/runtime/default-rootfs.img: [PUT /drives/{drive_id}][204] putGuestDriveByIdNoContent   runtime=aws.firecracker                                                                                        
INFO[2021-10-18T17:45:26.666785875Z] Attaching drive /var/lib/firecracker-containerd/shim-base/firecracker-containerd/9/ctrstub0, slot MN2HE43UOVRDA, root false.  runtime=aws.firecracker                                                                                                
INFO[2021-10-18T17:45:26.667005922Z] Attached drive /var/lib/firecracker-containerd/shim-base/firecracker-containerd/9/ctrstub0: [PUT /drives/{drive_id}][204] putGuestDriveByIdNoContent   runtime=aws.firecracker                                                                       
INFO[2021-10-18T17:45:26.667020186Z] Attaching NIC 9_tap (hwaddr 02:FC:00:00:00:08) at index 1  runtime=aws.firecracker                                                                                                                                                                   
DEBU[2021-10-18T17:45:26.789335193Z] schedule snapshotter cleanup                  snapshotter=devmapper                                                                                                                                                                                  
DEBU[2021-10-18T17:45:26.789434564Z] walk                                                                                                                                                                                                                                                 
DEBU[2021-10-18T17:45:26.789681117Z] remove                                        key="firecracker-containerd/54/extract-635610719-h-EA sha256:d7dccd214b2b808d39d11264689977e780b9e10662398cdae4fdc734fd008cdb"                                                                         
WARN[2021-10-18T17:45:26.812059208Z] firecracker exited: signal: aborted (core dumped)  runtime=aws.firecracker                                                                                                                                                                           
ERRO[2021-10-18T17:45:26.812087438Z] Starting instance: Put "http://localhost/actions": EOF  runtime=aws.firecracker                                                                                                                                                                      
ERRO[2021-10-18T17:45:26.812194602Z] failed to create VM                           error="failed to start the VM: Put \"http://localhost/actions\": EOF" runtime=aws.firecracker vmID=9                                                                                                   
ERRO[2021-10-18T17:45:26.812407075Z] shim CreateVM returned error                  error="rpc error: code = Unknown desc = failed to create VM: failed to start the VM: Put \"http://localhost/actions\": EOF"                                                                            
DEBU[2021-10-18T17:45:26.820182471Z] shim has been terminated                      error="signal: killed" vmID=9                                                                                                                                                                          
DEBU[2021-10-18T17:45:26.965491708Z] removed snapshot                              key="firecracker-containerd/54/extract-635610719-h-EA sha256:d7dccd214b2b808d39d11264689977e780b9e10662398cdae4fdc734fd008cdb" snapshotter=devmapper                                                   
DEBU[2021-10-18T17:45:26.965572457Z] snapshot garbage collected                    d=176.163343ms snapshotter=devmapper                                                                                                                                                                   
DEBU[2021-10-18T17:45:26.966344608Z] garbage collected                             d=12.146462ms                              
...

Above logs repeat a lot. vhive logs:

===================================                                                                                                                                                                                                                                                       
INFO[2021-10-18T17:44:51.816810535Z] HEARTBEAT: number of active VMs: 5                                                                                                                                                                                                                   
INFO[2021-10-18T17:44:51.817017950Z] FuncPool heartbeat: ==== Stats by cold functions ====                                                                                                                                                                                                
fID, #started, #served                                                                                                                                                                                                                                                                    
===================================                                                                                                                                                                                                                                                       
ERRO[2021-10-18T17:45:08.003519485Z] coordinator failed to start VM                error="failed to create the microVM in firecracker-containerd: rpc error: code = Unknown desc = failed to create VM: failed to start the VM: Put \"http://localhost/actions\": EOF" image="ghcr.io/ease
-lab/pyaes:var_workload" vmID=5                                                                                                                                                                                                                                                           
ERRO[2021-10-18T17:45:08.003820798Z] failed to start VM                            error="failed to create the microVM in firecracker-containerd: rpc error: code = Unknown desc = failed to create VM: failed to start the VM: Put \"http://localhost/actions\": EOF"
ERRO[2021-10-18T17:45:08.037097468Z] coordinator failed to start VM                error="failed to create the microVM in firecracker-containerd: rpc error: code = Unknown desc = failed to create VM: failed to start the VM: Put \"http://localhost/actions\": EOF" image="ghcr.io/ease
-lab/helloworld:var_workload" vmID=4                                                                                                  
ERRO[2021-10-18T17:45:08.037150168Z] failed to start VM                            error="failed to create the microVM in firecracker-containerd: rpc error: code = Unknown desc = failed to create VM: failed to start the VM: Put \"http://localhost/actions\": EOF"                    
ERRO[2021-10-18T17:45:08.052972530Z] coordinator failed to start VM                error="failed to create the microVM in firecracker-containerd: rpc error: code = Unknown desc = failed to create VM: failed to start the VM: Put \"http://localhost/actions\": EOF" image="ghcr.io/ease
-lab/pyaes:var_workload" vmID=1                                                                                                                                                                                                                                                           
ERRO[2021-10-18T17:45:08.053129152Z] failed to start VM                            error="failed to create the microVM in firecracker-containerd: rpc error: code = Unknown desc = failed to create VM: failed to start the VM: Put \"http://localhost/actions\": EOF"                    
ERRO[2021-10-18T17:45:11.522611578Z] VM config for pod b4b8d340564bd28b392d83c00454f67f365a6b3b5cc4e2876a42c463382b1e7b does not exist                                                                                                                                  
ERRO[2021-10-18T17:45:11.522692715Z]                                               error="VM config for pod does not exist"                                                                                                                                                               
ERRO[2021-10-18T17:45:12.220972481Z] VM config for pod 84934e0510998426e2217f1844f9c81199c7e92da6f9231c6fbd5c85ce09abd3 does not exist                                                                                                                                                    
ERRO[2021-10-18T17:45:12.221020488Z]                                               error="VM config for pod does not exist"                                                                                                                                                               
ERRO[2021-10-18T17:45:12.738592902Z] VM config for pod 34cd26f8c6b05540cda6b46266ca177ac2765f0e4da5abea18aaa861d3cd6c65 does not exist                                                                                                                                                    
ERRO[2021-10-18T17:45:12.738617608Z]                                               error="VM config for pod does not exist"                                                                                                                                           
ERRO[2021-10-18T17:45:14.370229027Z] coordinator failed to start VM                error="failed to create the microVM in firecracker-containerd: rpc error: code = Unknown desc = failed to create VM: failed to start the VM: Put \"http://localhost/actions\": EOF" image="ghcr.io/ease
-lab/pyaes:var_workload" vmID=6                                                                                                                                                                                                                                                           
ERRO[2021-10-18T17:45:14.370296948Z] failed to start VM                            error="failed to create the microVM in firecracker-containerd: rpc error: code = Unknown desc = failed to create VM: failed to start the VM: Put \"http://localhost/actions\": EOF"                    
ERRO[2021-10-18T17:45:14.378997549Z] VM config for pod b4b8d340564bd28b392d83c00454f67f365a6b3b5cc4e2876a42c463382b1e7b does not exist                                                                                                                                                    
ERRO[2021-10-18T17:45:14.379076368Z]                                               error="VM config for pod does not exist"                                                                                                                                                               
ERRO[2021-10-18T17:45:14.391006115Z] coordinator failed to start VM                error="failed to create the microVM in firecracker-containerd: rpc error: code = Unknown desc = failed to create VM: failed to start the VM: Put \"http://localhost/actions\": EOF" image="ghcr.io/ease
-lab/helloworld:var_workload" vmID=7                                                                                                                                                                                                                                                      
ERRO[2021-10-18T17:45:14.391069956Z] failed to start VM                            error="failed to create the microVM in firecracker-containerd: rpc error: code = Unknown desc = failed to create VM: failed to start the VM: Put \"http://localhost/actions\": EOF"                    
ERRO[2021-10-18T17:45:14.400535559Z] VM config for pod 84934e0510998426e2217f1844f9c81199c7e92da6f9231c6fbd5c85ce09abd3 does not exist                                                                                                                                                    
ERRO[2021-10-18T17:45:14.400624291Z]                                               error="VM config for pod does not exist"                                                                                                                                                               
ERRO[2021-10-18T17:45:14.408919050Z] coordinator failed to start VM                error="failed to create the microVM in firecracker-containerd: rpc error: code = Unknown desc = failed to create VM: failed to start the VM: Put \"http://localhost/actions\": EOF" image="ghcr.io/ease
-lab/pyaes:var_workload" vmID=8                                                                                                                                                                                                                                                           
ERRO[2021-10-18T17:45:14.408977445Z] failed to start VM                            error="failed to create the microVM in firecracker-containerd: rpc error: code = Unknown desc = failed to create VM: failed to start the VM: Put \"http://localhost/actions\": EOF"                    
ERRO[2021-10-18T17:45:14.411870357Z] VM config for pod 34cd26f8c6b05540cda6b46266ca177ac2765f0e4da5abea18aaa861d3cd6c65 does not exist                                                                                                                                                    
ERRO[2021-10-18T17:45:14.411891457Z]                                               error="VM config for pod does not exist"           

Similar logs repeat for vhive too.

I understand that VM is currently not supported, but I would like to understand what is the cause for this. Running bare-metal on 20.04 also causes the setup script to break sometimes unexpectedly hence I gave VM a try.

ustiugov commented 3 years ago

hi @prateeksahu, we currently support only Ubuntu 18 for the host OS (albeit I hear that Ubuntu 20 should be fine but we haven't officially moved to it just yet).

Deploying in a VM should not be an issue, as long as you configure the VM to support KVM (i.e., expose VT-x to the VM).

Could you please try to reproduce the problem on a Ubuntu 18 host OS and check that the host VM has VT-x/KVM enabled?

Also, can you please list the exact commands that one can run to reproduce the problem?

prateeksahu commented 3 years ago

Hey @ustiugov Thanks, I found out the i had not enabled the VT-x on my VM. That seems to have solved the issue. We can close this bug. One query though, upon clean exit of the vhive program, any restart fails with port 3334/3333 as being busy. Any clue into this?

ustiugov commented 3 years ago

@prateeksahu glad to hear!

vHive executable listens on these ports but should have been freed after a graceful shut down... Nevertheless, you can run the following script for a complete cleanup: ./scripts/github_runner/clean_cri_runner.sh firecracker