Closed — claraberendsen closed this issue 1 year ago
Today two aarch64 nightlies failed with this issue: nightly_linux-aarch64_debug #2384 and nightly_linux-aarch64_repeated #2371. Since the agent wasn't destroyed afterwards, the logs were preserved and a detailed investigation was possible.
1. Rule out the OOM killer stopping the docker container.
The command journalctl -k
was run to check whether the OS's OOM killer had terminated the docker daemon or the docker process. No indication of OOM issues was found in the logs.
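This OOM check can be scripted. A minimal sketch, assuming the usual kernel-log phrasing; the sample line below is fabricated for illustration and is not from the actual agent logs:

```shell
# On the agent, the real check would be:
#   journalctl -k | grep -icE 'out of memory|oom-kill|killed process'
# A non-zero count would point to OOM-killer activity. Here the same filter
# is applied to a fabricated sample line so the expected hit is visible.
sample='Jun 06 06:40:00 ip-10-0-1-221 kernel: Out of memory: Killed process 1234 (dockerd)'
echo "$sample" | grep -icE 'out of memory|oom-kill|killed process'   # prints 1
```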
2. Check the docker.service logs for a better indication of what was happening.
Run the command journalctl -u docker.service.
In the docker logs we could see that systemd
was causing the docker daemon to restart. See the following lines:
Jun 06 06:44:13 ip-10-0-1-221 dockerd[948]: time="2023-06-06T06:44:13.277386638Z" level=info msg="Processing signal 'terminated'"
Jun 06 06:44:13 ip-10-0-1-221 dockerd[948]: time="2023-06-06T06:44:13.781644906Z" level=info msg="Daemon shutdown complete"
And then, shortly after, this line:
Jun 06 06:44:55 ip-10-0-1-221 dockerd[499016]: time="2023-06-06T06:44:55.802888332Z" level=info msg="Waiting for containerd to be ready to restart event processing" module=libcontainerd namespace=moby
Similar logs can be found on the other agent where the nightly was run.
Note that the timing of these log entries is close to when the error appears in the builds on each agent. This is a good indication that the errors are caused by the restart of the docker service.
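A docker-side mitigation worth noting (my own assumption, not something the linked PRs are confirmed to implement) is docker's live-restore option, which keeps containers running while dockerd restarts and should avoid exactly this kind of mid-build container loss. A sketch, writing to a temp path instead of the real config location:

```shell
# Hypothetical daemon.json enabling live-restore; written to /tmp for illustration.
# On a real host this would be /etc/docker/daemon.json, followed by a daemon reload.
conf=/tmp/daemon.json
cat <<'EOF' > "$conf"
{
  "live-restore": true
}
EOF
# Sanity-check that the file is valid JSON.
python3 -m json.tool "$conf" > /dev/null && echo "valid daemon.json"
```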
3. Check the journalctl logs in general.
Running journalctl around the time of the failures, we can see the following a couple of minutes before:
Jun 06 06:25:01 ip-10-0-1-221 CRON[428020]: (root) CMD (test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ))
Jun 06 06:25:01 ip-10-0-1-221 CRON[428018]: pam_unix(cron:session): session closed for user root
Jun 06 06:34:19 ip-10-0-1-221 systemd[1]: Starting Daily apt upgrade and clean activities...
Jun 06 06:35:56 ip-10-0-1-221 dbus-daemon[592]: [system] Activating via systemd: service name='org.freedesktop.PackageKit' unit='packagekit.service' requested by ':1.28' (uid=0 pid=444913 comm="/usr/bin/gdbus call --system --dest org.freedeskto" label="unconfined")
which triggers a restart of most daemons by systemd. This "Daily apt upgrade"
log entry indicates that an unattended-upgrade
run is in progress.
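The time correlation above can be done directly with journalctl's own filters (e.g. `journalctl --since "2023-06-06 06:30" --until "2023-06-06 06:50"`). The sketch below reproduces the same windowing with awk over sample lines adapted from the logs in this thread, plus one filler entry:

```shell
# Sample log lines; the first three are adapted from the issue, the last is filler.
cat <<'EOF' > /tmp/agent.log
Jun 06 06:25:01 ip-10-0-1-221 CRON[428020]: (root) CMD (run-parts /etc/cron.daily)
Jun 06 06:34:19 ip-10-0-1-221 systemd[1]: Starting Daily apt upgrade and clean activities...
Jun 06 06:44:13 ip-10-0-1-221 dockerd[948]: msg="Processing signal 'terminated'"
Jun 06 07:10:00 ip-10-0-1-221 systemd[1]: (some later, unrelated entry)
EOF
# Keep only entries whose timestamp (field 3) falls inside the failure window.
# This prints the apt-upgrade start and the dockerd termination, side by side.
awk '$3 >= "06:30:00" && $3 <= "06:50:00"' /tmp/agent.log
```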
It would seem that the latest issues with error waiting for container: unexpected EOF
are related to the timing of the unattended-upgrades
runs and not to an OOM or CPU issue as initially considered.
Even though we cannot rule out that an OOM issue may be causing some of them, there is a periodicity to the failures: they affect mostly nightlies and occur at around the same time of day.
As an example, nightly_linux-aarch64_debug
has failed on two different agents but at similar times of day in its last two runs, #2383 and #2384.
This again points to unattended-upgrades
being configured on the agents. Today another nightly failed with this error: nightly_linux-rhel_debug. In the logs we see the time of failure is ...
06:52:51 time="2023-06-07T06:52:52Z" level=error msg="error waiting for container: unexpected EOF"
If we ssh into the instance, we can see that the unattended upgrade ran during that window as well:
Jun 07 06:40:39 ip-10-0-1-235 systemd[1]: Starting Daily apt upgrade and clean activities...
Jun 07 06:53:35 ip-10-0-1-235 systemd[1]: Finished Daily apt upgrade and clean activities.
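One common way to stop the daily unattended upgrade from firing is APT's periodic configuration. This is a sketch of a possible mitigation, not necessarily what the actual fix in chef-osrf does; it writes to /tmp for illustration:

```shell
# Hypothetical mitigation: turn off the periodic unattended upgrade via APT config.
# On a real agent this file would be /etc/apt/apt.conf.d/20auto-upgrades.
conf=/tmp/20auto-upgrades
cat <<'EOF' > "$conf"
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "0";
EOF
grep -c 'Unattended-Upgrade "0"' "$conf"   # prints 1
```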
aarch64 builds have been green since the weekend.
Waiting for: https://github.com/osrf/chef-osrf/pull/204
This issue has been mentioned on ROS Discourse. There might be relevant details there:
https://discourse.ros.org/t/ros-infrastructure-updates-from-june-2023/32258/1
I have encountered a similar issue that says error waiting for container: unexpected EOF. https://build.ros2.org/job/Hbin_uJ64__aws_sdk_cpp_vendor__ubuntu_jammy_amd64__binary/16/display/redirect https://build.ros2.org/job/Rbin_ujv8_uJv8__aws_sdk_cpp_vendor__ubuntu_jammy_arm64__binary/11/display/redirect
@wep21 thanks for reporting! The PR to fix this got lost among other work, but given that the error is appearing again I have re-prioritized it and added it to my backlog for this week. This, however, is most likely unrelated to the general Unexpected EOF error.
Fixing this issue in https://github.com/osrf/chef-osrf/pull/212
This is not completely closed, as we are still missing the new images for build.ros2.org.
Finally fixed by: https://github.com/osrf/osrf-terraform/pull/147
Description
We have been encountering the error
error waiting for container: unexpected EOF
frequently on ci.ros2.org. It seems to be an issue with memory availability for the container. Example build:
https://ci.ros2.org/view/nightly/job/nightly_linux_debug/2664/console
Statistics on the issue for the first two weeks of May:
nightly_linux_debug: 20% of runs
nightly_linux-aarch64_release: 13.33% of runs
nightly_linux-aarch64_repeated: 13.33% of runs
nightly_linux_release: 7.14% of runs