osrf / multiarch-docker-image-generation

[wip] Maybe qemu 4.2 will fix armhf and arm64 Focal Nbin job failures #35

Closed sloretz closed 4 years ago

sloretz commented 4 years ago

This is a work-in-progress PR I'm exploring as a solution to the Noetic Focal armhf and arm64 job failures that started 2 days ago. I'll push just osrf/ubuntu_armhf:focal and see if it fixes the failures for Noetic.

I'm trying a newer version of qemu, though I have little reason to think it will help. I notice the latest version of qemu has some changes in do_semop() compared to 3.1. Maybe something on the build agents changed, like the kernel version, and a newer qemu version is able to handle this new case?

Random notes about the failure

I don't have a better place to put these at the moment. Maybe I'll move this to a ticket somewhere else when I have an opinion about where that ticket should live.

Server: Docker Engine - Community
 Engine:
  Version:          19.03.2
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.8
  Git commit:       6a30dfc
  Built:            Thu Aug 29 05:26:54 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.6
  GitCommit:        894b81a4b802e4eb2a91d1ce216b8817763c29fb
 runc:
  Version:          1.0.0-rc8
  GitCommit:        425e105d5a03fabd737a126ad93d62a9eeede87f
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683


* @clalancette says the path to investigate is `fakeroot`-> container `glibc` -> `qemu` -> host `glibc` -> `kernel`
* Successful job had container `glibc` version `2.30-0ubuntu3`
* Failing job had container `glibc` version `2.31-0ubuntu6`
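
Since the two jobs differ in container `glibc` version, it may help to print the C library version from inside each container. The following is only a sketch, not something from the jobs themselves: the function name `container_libc_version` is made up for illustration, and it assumes Python is available in the container image.

```python
import os
import platform


def container_libc_version():
    """Return the C library version string seen by this process, or None.

    Hypothetical helper for illustration; not part of any ROS build tooling.
    """
    try:
        # CS_GNU_LIBC_VERSION is a glibc-specific confstr name; on glibc
        # systems it yields a string of the form "glibc X.Y".
        return os.confstr("CS_GNU_LIBC_VERSION")
    except (ValueError, OSError):
        # Not glibc (or name unsupported): fall back to platform's best guess.
        name, version = platform.libc_ver()
        return f"{name} {version}".strip() or None


if __name__ == "__main__":
    print(container_libc_version())
```

Note that this reports the upstream version only, not the full Ubuntu package version such as `2.31-0ubuntu6`.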

clalancette commented 4 years ago

  • ENOSYS is not one of the documented error codes for semop()

I poked at the glibc sources just a bit this morning. It looks like there are two ways glibc can return ENOSYS from semop: either if it thinks the kernel isn't Linux, or if the kernel itself returns ENOSYS. I couldn't find any other ways this could happen.
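
To see which errno actually comes back from semop() in a given environment, one option is a tiny probe that makes the call directly, without fakeroot in the way. This is a rough sketch under assumptions: it is not from the failing jobs, `probe_semop` is a made-up name, the `IPC_*` constants are the Linux header values, and it assumes Python with ctypes is available inside the container. Run inside the emulated container, the call should still go through container `glibc`, `qemu`, and the host kernel.

```python
import ctypes
import errno

# Linux values from <sys/ipc.h>; assumed, adjust for other platforms.
IPC_PRIVATE = 0
IPC_CREAT = 0o1000
IPC_RMID = 0


class Sembuf(ctypes.Structure):
    # Mirrors struct sembuf from <sys/sem.h>.
    _fields_ = [
        ("sem_num", ctypes.c_ushort),
        ("sem_op", ctypes.c_short),
        ("sem_flg", ctypes.c_short),
    ]


def probe_semop():
    """Create a private semaphore, call semop() once, and report the result.

    Returns (step, errno_value), where step names the call that was reached
    and errno_value is 0 on success. Hypothetical helper for illustration.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    libc.semop.argtypes = [ctypes.c_int, ctypes.POINTER(Sembuf), ctypes.c_size_t]
    libc.semop.restype = ctypes.c_int

    semid = libc.semget(IPC_PRIVATE, 1, IPC_CREAT | 0o600)
    if semid == -1:
        return ("semget", ctypes.get_errno())
    try:
        op = Sembuf(0, 1, 0)  # increment semaphore 0; this never blocks
        rc = libc.semop(semid, ctypes.byref(op), 1)
        return ("semop", ctypes.get_errno() if rc == -1 else 0)
    finally:
        # Always remove the semaphore set so nothing is leaked on the host.
        libc.semctl(semid, 0, IPC_RMID)


if __name__ == "__main__":
    step, err = probe_semop()
    print(f"{step}: {errno.errorcode.get(err, str(err)) if err else 'OK'}")
```

If this prints `semop: ENOSYS` in the failing container but `semop: OK` in the working one, that would point away from fakeroot itself and toward the layers below it.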

  • qemu seems to translate syscalls, including semop(), though I wasn't able to find a spot where it would return ENOSYS in the stable-3.1 branch.

Not directly, but these are again translated through glibc and then through to the kernel. Also, I know this may not help directly, but in the master branch of qemu, do_semop can return ENOSYS if semtimedop isn't recognized as a valid syscall.

You can also use the perf tool to try and get an idea if these syscalls are hitting the kernel and returning ENOSYS there. Running something like:

    perf trace -e semop

will tell you about all of the semop syscalls that are happening on the system. If you know the PID of the process you want to trace, you can also add `-p <pid>`, which restricts the trace to just that PID.

I'm not sure if any of these ramblings are helpful, but just some other ideas for you.

sloretz commented 4 years ago

FWIW, upgrading qemu to 4.2 did not fix the issue: http://build.ros.org/view/Nbin_ufhf_uFhf/job/Nbin_ufhf_uFhf__rosbash__ubuntu_focal_armhf__binary/9/

sloretz commented 4 years ago

Closing since it did not fix the issue. Notes moved to #36