vhive-serverless / vHive

vHive: Open-source framework for serverless experimentation
MIT License

fix github runner gvisor failure and add gocache #977

Closed by JooyoungPark73 2 months ago

JooyoungPark73 commented 2 months ago

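The gVisor runner fails consistently, for two reasons.

First, a wrong endpoint format is behind the whole error. As the log below shows, the endpoint deprecation warning ends up in the command output, each word of it is treated as a pod sandbox ID, and so the containers are not cleaned up properly: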

W0424 11:02:04.475265  259894 cleanupnode.go:99] [reset] Failed to remove containers: [failed to stop running pod I0424: output: I0424 11:01:52.572194  260177 util_unix.go:103] "Using this endpoint is deprecated, please consider using full URL format" endpoint="/etc/vhive-cri/vhive-cri.sock" URL="unix:///etc/vhive-cri/vhive-cri.sock"
E0424 11:01:52.579298  260177 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find sandbox \"I0424\": not found" podSandboxID="I0424"
time="2024-04-24T11:01:52Z" level=fatal msg="stopping the pod sandbox \"I0424\": rpc error: code = NotFound desc = an error occurred when try to find sandbox \"I0424\": not found"
: exit status 1, failed to stop running pod 11:01:52.417320: output: I0424 11:01:52.707750  260250 util_unix.go:103] "Using this endpoint is deprecated, please consider using full URL format" endpoint="/etc/vhive-cri/vhive-cri.sock" URL="unix:///etc/vhive-cri/vhive-cri.sock"
E0424 11:01:52.711341  260250 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find sandbox \"11:01:52.417320\": not found" podSandboxID="11:01:52.417320"
time="2024-04-24T11:01:52Z" level=fatal msg="stopping the pod sandbox \"11:01:52.417320\": rpc error: code = NotFound desc = an error occurred when try to find sandbox \"11:01:52.417320\": not found"
[reset] Deleting contents of directories: [/etc/kubernetes/manifests /var/lib/kubelet /etc/kubernetes/pki]
: exit status 1, failed to stop running pod 260012: output: I0424 11:01:52.833727  260320 util_unix.go:103] "Using this endpoint is deprecated, please consider using full URL format" endpoint="/etc/vhive-cri/vhive-cri.sock" URL="unix:///etc/vhive-cri/vhive-cri.sock"
E0424 11:01:52.837811  260320 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find sandbox \"260012\": not found" podSandboxID="260012"
time="2024-04-24T11:01:52Z" level=fatal msg="stopping the pod sandbox \"260012\": rpc error: code = NotFound desc = an error occurred when try to find sandbox \"260012\": not found"
: exit status 1, failed to stop running pod util_unix.go:103]: output: I0424 11:01:52.942331  260371 util_unix.go:103] "Using this endpoint is deprecated, please consider using full URL format" endpoint="/etc/vhive-cri/vhive-cri.sock" URL="unix:///etc/vhive-cri/vhive-cri.sock"
E0424 11:01:52.946834  260371 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find sandbox \"util_unix.go:103]\": not found" podSandboxID="util_unix.go:103]"
time="2024-04-24T11:01:52Z" level=fatal msg="stopping the pod sandbox \"util_unix.go:103]\": rpc error: code = NotFound desc = an error occurred when try to find sandbox \"util_unix.go:103]\": not found"
: exit status 1, failed to stop running pod "Using: output: I0424 11:01:53.049928  260431 util_unix.go:103] "Using this endpoint is deprecated, please consider using full URL format" endpoint="/etc/vhive-cri/vhive-cri.sock" URL="unix:///etc/vhive-cri/vhive-cri.sock"
E0424 11:01:53.055111  260431 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find sandbox \"\\\"Using\": not found" podSandboxID="\"Using"
time="2024-04-24T11:01:53Z" level=fatal msg="stopping the pod sandbox \"\\\"Using\": rpc error: code = NotFound desc = an error occurred when try to find sandbox \"\\\"Using\": not found"
: exit status 1, failed to stop running pod this: output: I0424 11:01:53.188970  260495 util_unix.go:103] "Using this endpoint is deprecated, please consider using full URL format" endpoint="/etc/vhive-cri/vhive-cri.sock" URL="unix:///etc/vhive-cri/vhive-cri.sock"
E0424 11:01:53.192391  260495 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find sandbox \"this\": not found" podSandboxID="this"
time="2024-04-24T11:01:53Z" level=fatal msg="stopping the pod sandbox \"this\": rpc error: code = NotFound desc = an error occurred when try to find sandbox \"this\": not found"
: exit status 1, failed to stop running pod endpoint: output: I0424 11:01:53.299874  260564 util_unix.go:103] "Using this endpoint is deprecated, please consider using full URL format" endpoint="/etc/vhive-cri/vhive-cri.sock" URL="unix:///etc/vhive-cri/vhive-cri.sock"
E0424 11:01:53.303466  260564 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find sandbox \"endpoint\": not found" podSandboxID="endpoint"
time="2024-04-24T11:01:53Z" level=fatal msg="stopping the pod sandbox \"endpoint\": rpc error: code = NotFound desc = an error occurred when try to find sandbox \"endpoint\": not found"
: exit status 1, failed to stop running pod is: output: I0424 11:01:53.405669  260629 util_unix.go:103] "Using this endpoint is deprecated, please consider using full URL format" endpoint="/etc/vhive-cri/vhive-cri.sock" URL="unix:///etc/vhive-cri/vhive-cri.sock"
E0424 11:01:53.410281  260629 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find sandbox \"is\": not found" podSandboxID="is"
time="2024-04-24T11:01:53Z" level=fatal msg="stopping the pod sandbox \"is\": rpc error: code = NotFound desc = an error occurred when try to find sandbox \"is\": not found"
: exit status 1, failed to stop running pod deprecated,: output: I0424 11:01:53.513228  260677 util_unix.go:103] "Using this endpoint is deprecated, please consider using full URL format" endpoint="/etc/vhive-cri/vhive-cri.sock" URL="unix:///etc/vhive-cri/vhive-cri.sock"
E0424 11:01:53.516442  260677 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find sandbox \"deprecated,\": not found" podSandboxID="deprecated,"
time="2024-04-24T11:01:53Z" level=fatal msg="stopping the pod sandbox \"deprecated,\": rpc error: code = NotFound desc = an error occurred when try to find sandbox \"deprecated,\": not found"
: exit status 1, failed to stop running pod please: output: I0424 11:01:53.624314  260748 util_unix.go:103] "Using this endpoint is deprecated, please consider using full URL format" endpoint="/etc/vhive-cri/vhive-cri.sock" URL="unix:///etc/vhive-cri/vhive-cri.sock"
E0424 11:01:53.628372  260748 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find sandbox \"please\": not found" podSandboxID="please"
time="2024-04-24T11:01:53Z" level=fatal msg="stopping the pod sandbox \"please\": rpc error: code = NotFound desc = an error occurred when try to find sandbox \"please\": not found"
: exit status 1, failed to stop running pod consider: output: I0424 11:01:53.731128  260819 util_unix.go:103] "Using this endpoint is deprecated, please consider using full URL format" endpoint="/etc/vhive-cri/vhive-cri.sock" URL="unix:///etc/vhive-cri/vhive-cri.sock"
E0424 11:01:53.735064  260819 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find sandbox \"consider\": not found" podSandboxID="consider"
time="2024-04-24T11:01:53Z" level=fatal msg="stopping the pod sandbox \"consider\": rpc error: code = NotFound desc = an error occurred when try to find sandbox \"consider\": not found"
: exit status 1, failed to stop running pod using: output: I0424 11:01:53.832924  260872 util_unix.go:103] "Using this endpoint is deprecated, please consider using full URL format" endpoint="/etc/vhive-cri/vhive-cri.sock" URL="unix:///etc/vhive-cri/vhive-cri.sock"
E0424 11:01:53.836874  260872 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find sandbox \"using\": not found" podSandboxID="using"
time="2024-04-24T11:01:53Z" level=fatal msg="stopping the pod sandbox \"using\": rpc error: code = NotFound desc = an error occurred when try to find sandbox \"using\": not found"
: exit status 1, failed to stop running pod full: output: I0424 11:01:53.927486  260933 util_unix.go:103] "Using this endpoint is deprecated, please consider using full URL format" endpoint="/etc/vhive-cri/vhive-cri.sock" URL="unix:///etc/vhive-cri/vhive-cri.sock"
E0424 11:01:53.931793  260933 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find sandbox \"full\": not found" podSandboxID="full"
time="2024-04-24T11:01:53Z" level=fatal msg="stopping the pod sandbox \"full\": rpc error: code = NotFound desc = an error occurred when try to find sandbox \"full\": not found"
: exit status 1, failed to stop running pod URL: output: I0424 11:01:54.036985  261003 util_unix.go:103] "Using this endpoint is deprecated, please consider using full URL format" endpoint="/etc/vhive-cri/vhive-cri.sock" URL="unix:///etc/vhive-cri/vhive-cri.sock"
E0424 11:01:54.040244  261003 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find sandbox \"URL\": not found" podSandboxID="URL"
time="2024-04-24T11:01:54Z" level=fatal msg="stopping the pod sandbox \"URL\": rpc error: code = NotFound desc = an error occurred when try to find sandbox \"URL\": not found"
: exit status 1, failed to stop running pod format": output: I0424 11:01:54.143087  261054 util_unix.go:103] "Using this endpoint is deprecated, please consider using full URL format" endpoint="/etc/vhive-cri/vhive-cri.sock" URL="unix:///etc/vhive-cri/vhive-cri.sock"
E0424 11:01:54.149392  261054 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find sandbox \"format\\\"\": not found" podSandboxID="format\""
time="2024-04-24T11:01:54Z" level=fatal msg="stopping the pod sandbox \"format\\\"\": rpc error: code = NotFound desc = an error occurred when try to find sandbox \"format\\\"\": not found"
: exit status 1, failed to stop running pod endpoint="/etc/vhive-cri/vhive-cri.sock": output: I0424 11:01:54.270325  261170 util_unix.go:103] "Using this endpoint is deprecated, please consider using full URL format" endpoint="/etc/vhive-cri/vhive-cri.sock" URL="unix:///etc/vhive-cri/vhive-cri.sock"
E0424 11:01:54.273350  261170 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find sandbox \"endpoint=\\\"/etc/vhive-cri/vhive-cri.sock\\\"\": not found" podSandboxID="endpoint=\"/etc/vhive-cri/vhive-cri.sock\""
time="2024-04-24T11:01:54Z" level=fatal msg="stopping the pod sandbox \"endpoint=\\\"/etc/vhive-cri/vhive-cri.sock\\\"\": rpc error: code = NotFound desc = an error occurred when try to find sandbox \"endpoint=\\\"/etc/vhive-cri/vhive-cri.sock\\\"\": not found"
: exit status 1, failed to stop running pod URL="unix:///etc/vhive-cri/vhive-cri.sock": output: I0424 11:01:54.370869  261262 util_unix.go:103] "Using this endpoint is deprecated, please consider using full URL format" endpoint="/etc/vhive-cri/vhive-cri.sock" URL="unix:///etc/vhive-cri/vhive-cri.sock"
E0424 11:01:54.374940  261262 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find sandbox \"URL=\\\"unix:///etc/vhive-cri/vhive-cri.sock\\\"\": not found" podSandboxID="URL=\"unix:///etc/vhive-cri/vhive-cri.sock\""
time="2024-04-24T11:01:54Z" level=fatal msg="stopping the pod sandbox \"URL=\\\"unix:///etc/vhive-cri/vhive-cri.sock\\\"\": rpc error: code = NotFound desc = an error occurred when try to find sandbox \"URL=\\\"unix:///etc/vhive-cri/vhive-cri.sock\\\"\": not found"
: exit status 1]
[reset] Deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/super-admin.conf /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf]

Second, the file system cleanup is not working properly: `rm` fails with "Device or resource busy" on the container rootfs directories.

Cleaning /run/gvisor-containerd/gvisor-containerd.sock /run/gvisor-containerd/gvisor-containerd.sock.ttrpc /run/gvisor-containerd/io.containerd.runtime.v1.linux /run/gvisor-containerd/io.containerd.runtime.v2.task
rm: cannot remove '/run/gvisor-containerd/io.containerd.runtime.v2.task/default/16/rootfs': Device or resource busy
rm: cannot remove '/run/gvisor-containerd/io.containerd.runtime.v2.task/default/15/rootfs': Device or resource busy
rm: cannot remove '/run/gvisor-containerd/io.containerd.runtime.v2.task/default/14/rootfs': Device or resource busy
rm: cannot remove '/run/gvisor-containerd/io.containerd.runtime.v2.task/default/13/rootfs': Device or resource busy
rm: cannot remove '/run/gvisor-containerd/io.containerd.runtime.v2.task/default/12/rootfs': Device or resource busy
rm: cannot remove '/run/gvisor-containerd/io.containerd.runtime.v2.task/default/11/rootfs': Device or resource busy
rm: cannot remove '/run/gvisor-containerd/io.containerd.runtime.v2.task/default/10/rootfs': Device or resource busy
rm: cannot remove '/run/gvisor-containerd/io.containerd.runtime.v2.task/default/9/rootfs': Device or resource busy
rm: cannot remove '/run/gvisor-containerd/io.containerd.runtime.v2.task/default/8/rootfs': Device or resource busy
Cleaning /var/lib/gvisor-containerd/containerd

I also added Go caching to the GitHub Actions workflows (a minor change). In addition, the GitHub runner now checks the `go.mod` file and gets the Go version from it automatically, except in the build test.
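
To illustrate what taking the Go version from `go.mod` amounts to, here is a small Go sketch that extracts the `go` directive from the contents of a module file. The function name `goVersion`, the module path `example.com/demo`, and the version `1.21` are made up for the example; the workflow itself presumably relies on existing tooling rather than on code like this.

```go
package main

import (
	"fmt"
	"strings"
)

// goVersion returns the version named by the "go" directive in the given
// go.mod contents, and false if no such directive is present.
func goVersion(modFile string) (string, bool) {
	for _, line := range strings.Split(modFile, "\n") {
		fields := strings.Fields(line)
		// The directive has the form "go <version>", optionally
		// followed by a trailing comment.
		if len(fields) >= 2 && fields[0] == "go" {
			return fields[1], true
		}
	}
	return "", false
}

func main() {
	// Hypothetical go.mod contents, for illustration only.
	mod := "module example.com/demo\n\ngo 1.21\n"
	if v, ok := goVersion(mod); ok {
		fmt.Println("Go version from go.mod:", v)
	}
}
```

Reading the version from `go.mod` keeps the CI toolchain in step with the module without a separate version pin in each workflow.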