ros2 / ci

ROS 2 CI Infrastructure
http://ci.ros2.org/
Apache License 2.0

`error waiting for container: unexpected EOF` failing builds #702

Closed claraberendsen closed 1 year ago

claraberendsen commented 1 year ago

Description

We have been frequently encountering the error `error waiting for container: unexpected EOF` on ci.ros2.org. It seems to be an issue with memory availability for the container.

Example build

https://ci.ros2.org/view/nightly/job/nightly_linux_debug/2664/console

Statistics on the issue for the first two weeks of May:

- nightly_linux_debug: 20% of the time
- nightly_linux-aarch64_release: 13.33% of the time
- nightly_linux-aarch64_repeated: 13.33% of the time
- nightly_linux_release: 7.14% of the time

Crola1702 commented 1 year ago

Container EOF error statistics

#### 15 days

|job_name|last_fail|first_fail|build_count|failure_count|failure_percentage|
|--|--|--|--|--|--|
|nightly_linux_debug|2023-05-19|2023-05-10|16|5|31.25|
|nightly_linux_foxy_coverage|2023-05-18|2023-05-07|16|2|12.5|
|nightly_linux-rhel_repeated|2023-05-18|2023-05-06|21|5|23.81|
|nightly_linux-rhel_release|2023-05-18|2023-05-12|15|4|26.67|
|nightly_linux-rhel_debug|2023-05-18|2023-05-06|15|3|20.0|
|Humble cyclonedds|2023-05-18|2023-05-09|6|2|33.33|
|Humble connext|2023-05-18|2023-05-06|5|2|40.0|
|Foxy connext|2023-05-14|2023-05-14|5|1|20.0|
|Rolling cyclonedds|2023-05-13|2023-05-13|17|1|5.88|
|Rolling connext|2023-05-13|2023-05-13|16|1|6.25|
|nightly_linux_release|2023-05-12|2023-05-12|24|1|4.17|
|nightly_linux-aarch64_release|2023-05-11|2023-05-10|15|2|13.33|
|nightly_linux-aarch64_repeated|2023-05-10|2023-05-10|15|1|6.67|
#### 5 days

|job_name|last_fail|first_fail|build_count|failure_count|failure_percentage|
|--|--|--|--|--|--|
|nightly_linux_debug|2023-05-19|2023-05-18|6|2|33.33|
|nightly_linux_foxy_coverage|2023-05-18|2023-05-18|6|1|16.67|
|nightly_linux-rhel_repeated|2023-05-18|2023-05-17|6|2|33.33|
|nightly_linux-rhel_release|2023-05-18|2023-05-16|5|3|60.0|
|nightly_linux-rhel_debug|2023-05-18|2023-05-18|5|1|20.0|
|Humble cyclonedds|2023-05-18|2023-05-18|2|1|50.0|
|Humble connext|2023-05-18|2023-05-18|2|1|50.0|
|Foxy connext|2023-05-14|2023-05-14|2|1|50.0|
claraberendsen commented 1 year ago

Today two aarch64 nightlies failed with this issue: nightly_linux-aarch64_debug #2384 and nightly_linux-aarch64_repeated #2371. Since the agent didn't get destroyed afterwards, the logs were preserved and a detailed investigation was possible.

Investigation

1. Rule out the OOM killer stopping the Docker container. The command `journalctl -k` was run to verify whether the OS's OOM killer had caused the daemon or the Docker process to die. No indication of OOM issues was found in the log.
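
For reference, this check can be run on an agent roughly as follows (the grep pattern is illustrative; kernel OOM messages typically contain "Out of memory" or "oom-kill"):

```
# Scan the kernel ring buffer for OOM killer activity.
# An empty result suggests the OOM killer did not terminate
# dockerd or the container's processes.
journalctl -k | grep -i -E "out of memory|oom-kill"
```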

2. Check the docker.service logs for a better indication of what was happening. Run the command `journalctl -u docker.service`. In the Docker logs we could see that systemd was causing the Docker daemon to restart. See the following lines:

`Jun 06 06:44:13 ip-10-0-1-221 dockerd[948]: time="2023-06-06T06:44:13.277386638Z" level=info msg="Processing signal 'terminated'"`
`Jun 06 06:44:13 ip-10-0-1-221 dockerd[948]: time="2023-06-06T06:44:13.781644906Z" level=info msg="Daemon shutdown complete"`

And then this line:

`Jun 06 06:44:55 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:55.802888332Z" level=info msg="Waiting for containerd to be ready to restart event processing" module=libcontainerd namespace=moby`

Similar logs can be found on the other agent where the nightly ran. Note that on each agent the timing of these entries is close to when the error happens in the builds, which is a good indication that the errors are caused by the restart of the Docker service. The full excerpt:

```
Jun 06 06:44:13 ip-10-0-1-221 dockerd[948]: time="2023-06-06T06:44:13.277386638Z" level=info msg="Processing signal 'terminated'"
Jun 06 06:44:13 ip-10-0-1-221 dockerd[948]: time="2023-06-06T06:44:13.430955206Z" level=info msg="ignoring event" container=88acb4d47b1e53cb20be3fa2c2c3c43472b51f9fe05b9d81a4e7a4aacf46ad8a module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jun 06 06:44:13 ip-10-0-1-221 systemd[1]: run-docker-netns-d1893231c907.mount: Succeeded.
Jun 06 06:44:13 ip-10-0-1-221 systemd[1]: var-lib-docker-overlay2-1c135b999632ae831e47372d4458c53f1bf556a3062789f60394c15a953ed313-merged.mount: Succeeded.
Jun 06 06:44:13 ip-10-0-1-221 dockerd[948]: time="2023-06-06T06:44:13.781644906Z" level=info msg="Daemon shutdown complete"
Jun 06 06:44:13 ip-10-0-1-221 systemd[1]: docker.service: Succeeded.
Jun 06 06:44:13 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:13.834629993Z" level=info msg="Starting up"
Jun 06 06:44:13 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:13.835519557Z" level=info msg="detected 127.0.0.53 nameserver, assuming systemd-resolved, so using resolv.conf: /run/systemd/resolve/resolv.conf"
Jun 06 06:44:13 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:13.839813954Z" level=info msg="parsed scheme: \"unix\"" module=grpc
Jun 06 06:44:13 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:13.839843632Z" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
Jun 06 06:44:13 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:13.839870011Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock 0 }] }" module=grpc
Jun 06 06:44:13 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:13.839881745Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Jun 06 06:44:13 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:13.843368324Z" level=info msg="parsed scheme: \"unix\"" module=grpc
Jun 06 06:44:13 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:13.843398814Z" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
Jun 06 06:44:13 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:13.843419565Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock 0 }] }" module=grpc
Jun 06 06:44:13 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:13.843431356Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.083075793Z" level=warning msg="Your kernel does not support CPU realtime scheduler"
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.083108146Z" level=warning msg="Your kernel does not support cgroup blkio weight"
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.083115251Z" level=warning msg="Your kernel does not support cgroup blkio weight_device"
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.083282143Z" level=info msg="Loading containers: start."
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.220468635Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.551039037Z" level=info msg="Loading containers: done."
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.588748768Z" level=info msg="Docker daemon" commit="20.10.21-0ubuntu1~20.04.2" graphdriver(s)=overlay2 version=20.10.21
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.589147012Z" level=info msg="Daemon has completed initialization"
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.609178145Z" level=info msg="API listen on /run/docker.sock"
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.976374499Z" level=info msg="Processing signal 'terminated'"
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.977224089Z" level=info msg="stopping event stream following graceful shutdown" error="" module=libcontainerd namespace=moby
Jun 06 06:44:14 ip-10-0-1-221 dockerd[498829]: time="2023-06-06T06:44:14.977467411Z" level=info msg="Daemon shutdown complete"
Jun 06 06:44:14 ip-10-0-1-221 systemd[1]: docker.service: Succeeded.
Jun 06 06:44:14 ip-10-0-1-221 systemd[1]: docker.socket: Succeeded.
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.285510408Z" level=info msg="Starting up"
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.286470569Z" level=info msg="detected 127.0.0.53 nameserver, assuming systemd-resolved, so using resolv.conf: /run/systemd/resolve/resolv.conf"
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.287077320Z" level=info msg="parsed scheme: \"unix\"" module=grpc
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.287100057Z" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.287121070Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock 0 }] }" module=grpc
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.287131712Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.288953180Z" level=info msg="parsed scheme: \"unix\"" module=grpc
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.288979888Z" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.288998989Z" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock 0 }] }" module=grpc
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.289011009Z" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.362675376Z" level=warning msg="Your kernel does not support CPU realtime scheduler"
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.362707343Z" level=warning msg="Your kernel does not support cgroup blkio weight"
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.362715433Z" level=warning msg="Your kernel does not support cgroup blkio weight_device"
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.362883835Z" level=info msg="Loading containers: start."
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.495188699Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.535984330Z" level=info msg="Loading containers: done."
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.555383161Z" level=info msg="Docker daemon" commit="20.10.21-0ubuntu1~20.04.2" graphdriver(s)=overlay2 version=20.10.21
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.555446997Z" level=info msg="Daemon has completed initialization"
Jun 06 06:44:15 ip-10-0-1-221 systemd[1]: var-lib-docker-overlay2-check\x2doverlayfs\x2dsupport223685642-merged.mount: Succeeded.
Jun 06 06:44:15 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:15.580350166Z" level=info msg="API listen on /run/docker.sock"
Jun 06 06:44:55 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:55.802841899Z" level=error msg="Failed to get event" error="rpc error: code = Unavailable desc = transport is closing" module=libcontainerd namespace=moby
Jun 06 06:44:55 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:55.802888332Z" level=info msg="Waiting for containerd to be ready to restart event processing" module=libcontainerd namespace=moby
Jun 06 06:44:55 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:55.802922719Z" level=info msg="blockingPicker: the picked transport is not ready, loop back to repick" module=grpc
Jun 06 06:44:55 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:55.802963933Z" level=error msg="Failed to get event" error="rpc error: code = Unavailable desc = transport is closing" module=libcontainerd namespace=plugins.moby
Jun 06 06:44:55 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:55.802999568Z" level=info msg="Waiting for containerd to be ready to restart event processing" module=libcontainerd namespace=plugins.moby
```
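
To correlate a daemon restart with a given build failure, `journalctl` can be scoped to the failure window; the timestamps below are the ones from this incident and would be adjusted per build:

```
# Show docker.service activity around the build's failure time.
# A "Processing signal 'terminated'" / "Starting up" pair inside
# the window points to a daemon restart rather than an OOM kill.
journalctl -u docker.service --since "2023-06-06 06:40:00" --until "2023-06-06 06:50:00"
```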

3. Check the journalctl logs in general. Executing `journalctl` around the time of the failures, we can see the following a couple of minutes before:

Jun 06 06:25:01 ip-10-0-1-221 CRON[428020]: (root) CMD (test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ))
Jun 06 06:25:01 ip-10-0-1-221 CRON[428018]: pam_unix(cron:session): session closed for user root
Jun 06 06:34:19 ip-10-0-1-221 systemd[1]: Starting Daily apt upgrade and clean activities...
Jun 06 06:35:56 ip-10-0-1-221 dbus-daemon[592]: [system] Activating via systemd: service name='org.freedesktop.PackageKit' unit='packagekit.service' requested by ':1.28' (uid=0 pid=444913 comm="/usr/bin/gdbus call --system --dest org.freedeskto" label="unconfined")

This triggers a restart of most daemons by systemd. The "Daily apt upgrade" entry indicates that an unattended-upgrade is being run.
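
One way to confirm this schedule on an Ubuntu agent, using standard systemd and apt tooling (paths may differ on other distributions):

```
# When are the daily apt jobs scheduled to fire on this machine?
systemctl list-timers apt-daily.timer apt-daily-upgrade.timer
# Is unattended-upgrades enabled? A value of "1" means yes.
cat /etc/apt/apt.conf.d/20auto-upgrades
```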

Conclusion

It would seem that the latest issues with `error waiting for container: unexpected EOF` are related to the timing of the unattended-upgrades and not to an OOM or CPU issue as initially considered. Even though we cannot rule out that an OOM issue may be causing some of them, there is a periodicity to when jobs fail: mostly nightlies, and at around the same time of day. As an example, nightly_linux-aarch64_debug failed on two different agents but at similar times of day in its last two runs, #2383 and #2384.

Further investigation / Unanswered questions

claraberendsen commented 1 year ago

Today another nightly failed with this error: nightly_linux-rhel_debug. In the logs, we see the time of failure is:

06:52:51 time="2023-06-07T06:52:52Z" level=error msg="error waiting for container: unexpected EOF"

If we SSH into the instance, we see that the unattended upgrade was run during that time as well:

Jun 07 06:40:39 ip-10-0-1-235 systemd[1]: Starting Daily apt upgrade and clean activities...
Jun 07 06:53:35 ip-10-0-1-235 systemd[1]: Finished Daily apt upgrade and clean activities.
claraberendsen commented 1 year ago

Replication

A test job was run on an agent, and the Docker service was restarted with `sudo systemctl restart docker`.
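
For reference, a minimal sketch of this reproduction outside Jenkins, assuming the daemon does not have `live-restore` enabled (with it, containers survive a daemon restart):

```
# Start a long-running container in the background, then restart
# the Docker daemon out from under it. Without live-restore, the
# container is killed and the waiting client exits with:
#   error waiting for container: unexpected EOF
docker run --rm ubuntu:20.04 sleep 600 &
sudo systemctl restart docker
wait
```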

claraberendsen commented 1 year ago

aarch64 builds have been green since the weekend.

Blast545 commented 1 year ago

Waiting for: https://github.com/osrf/chef-osrf/pull/204
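
For illustration only (the linked Chef PR may implement the fix differently), the usual manual remedy on an Ubuntu agent is to stop the daily apt timers so unattended upgrades can no longer restart the Docker daemon mid-build:

```
# Disable the systemd timers that drive the daily apt upgrade.
# Illustrative sketch; the production fix is applied via chef-osrf.
sudo systemctl disable --now apt-daily.timer apt-daily-upgrade.timer
```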

ros-discourse commented 1 year ago

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/ros-infrastructure-updates-from-june-2023/32258/1

wep21 commented 1 year ago

I have encountered a similar issue that says `error waiting for container: unexpected EOF`:

https://build.ros2.org/job/Hbin_uJ64__aws_sdk_cpp_vendor__ubuntu_jammy_amd64__binary/16/display/redirect
https://build.ros2.org/job/Rbin_ujv8_uJv8__aws_sdk_cpp_vendor__ubuntu_jammy_arm64__binary/11/display/redirect

claraberendsen commented 1 year ago

@wep21 thanks for reporting! The PR with the fix has gotten lost in the weeds of other things, but given that the error is appearing again, I have re-prioritized it and added it to my backlog for this week. This, however, is most likely unrelated to the general `unexpected EOF` error.

Crola1702 commented 1 year ago

Fixing this issue in https://github.com/osrf/chef-osrf/pull/212

Crola1702 commented 1 year ago

This is not completely closed, as we are still missing the new images for build.ros2.org.

Crola1702 commented 1 year ago

Finally fixed by: https://github.com/osrf/osrf-terraform/pull/147