ros-infrastructure / buildfarm_deployment


'virtual memory exhausted: Operation not permitted' on 32-bit Trusty (ROS Indigo) #204

Closed nuclearsandwich closed 5 years ago

nuclearsandwich commented 6 years ago

Since the deployment of the 4.15 kernel and Docker 18.03, we've been seeing failures containing the line "virtual memory exhausted: Operation not permitted" on a significant number of builds.

Google is, obnoxiously, returning very poor search results in both Groups and Gmail for this issue:
https://groups.google.com/forum/#!searchin/ros-buildfarm-indigo/%22virtual$20memory$20exhausted%22%7Csort:date

In my personal inbox I see 167 occurrences since May 17, when the changes were deployed. Some packages, like crossing_detector, succeeded on a later run, but quite a few others have yet to run without incurring the error.

A non-exhaustive list of packages currently failing with this error

The first thing to do is reproduce this on a test farm, then try to reproduce it with just the kernel change or just the Docker change, to isolate which one exhibits it.

tfoote commented 6 years ago

I dug into http://build.ros.org/job/Ibin_uT32__rail_grasp_collection__ubuntu_trusty_i386__binary/ since it's relatively quick to reproduce. Running it locally I cannot reproduce the memory exhaustion, and the maximum memory usage is about 5.5% of my 15 GB of RAM, which is less than 1 GB, so I don't know of any limitation at that level. Here are some plots of my system's memory usage just before, during, and after the critical object at around 50% of the build.

[Screenshots of system memory usage captured on 2018-05-25 at 14:34:21, 14:34:28, 14:34:36, and 14:34:48]

I reproduced the build locally with:

mkdir /tmp/release_job
generate_release_script.py https://raw.githubusercontent.com/ros-infrastructure/ros_buildfarm_config/production/index.yaml indigo default rail_grasp_collection ubuntu trusty amd64 > /tmp/release_job/release_job_indigo_roscpp.sh
cd /tmp/release_job
sh release_job_indigo_roscpp.sh

Maybe this could be tried on a build executor to get the right kernel etc.
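
If someone picks that up, it's probably worth recording the kernel and Docker versions on the machine first so we know which combination a given attempt actually exercised; a minimal check might be something like:

# confirm which kernel/Docker combination the reproduction is running under
uname -r                                        # 4.15.x on the affected executors, 4.4.x elsewhere
docker version --format '{{.Server.Version}}'   # 18.03/18.05 after the upgrade, 17.05 before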

nuclearsandwich commented 6 years ago

I had a couple of false starts, but I have confirmed that this issue is related to the kernel version bump rather than the Docker version bump, which means that the attempted cure for https://github.com/ros-infrastructure/ros_buildfarm/issues/535 is worse than the original issue.

I've still been reproducing with a run of the full release script and haven't yet dug into the exact cause. It's not raw memory usage, as running stress --vm 2 --vm-bytes 3.99G works just fine and stress --vm 2 --vm-bytes 4G fails with a different error message. So it's got to be something particular to how or what is being allocated by the build process. I wonder if there are glibc interface changes between the Trusty libc and the 4.15 kernel?
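
One more thing worth checking is whether any explicit cap is even visible inside the container; a quick sketch, assuming the host uses the cgroup v1 memory controller and the container isn't started with custom ulimits:

# inside the failing build container: look for any explicit memory caps
ulimit -v                                          # virtual memory limit for the shell (expect "unlimited")
ulimit -a                                          # all resource limits
cat /sys/fs/cgroup/memory/memory.limit_in_bytes    # cgroup v1 memory limit, if that controller is mounted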

nuclearsandwich commented 6 years ago

I've been trying to whittle this down to a minimal case. I've been using the rail_grasp_collection package as a test sample because it had a short successful build time according to Jenkins. The failure comes when make is running:

/usr/bin/i686-linux-gnu-g++ -DROSCONSOLE_BACKEND_LOG4CXX -DROS_PACKAGE_NAME=\"rail_grasp_collection\" -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -DNDEBUG -D_FORTIFY_SOURCE=2 -I/tmp/binarydeb/ros-indigo-rail-grasp-collection-1.1.9/include -I/opt/ros/indigo/include -I/usr/include/eigen3 -o CMakeFiles/rail_grasp_collection.dir/src/GraspCollector.cpp.o -c /tmp/binarydeb/ros-indigo-rail-grasp-collection-1.1.9/src/GraspCollector.cpp

But interestingly, running that line on its own doesn't seem to cause issues.

If I strace -f the make process I eventually get the following mmap failure, which leads to the build error:

[pid  1511] mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 EPERM (Operation not permitted)

But if I compare the possible EPERM reasons from mmap(2):

EPERM: The prot argument asks for PROT_EXEC but the mapped area belongs to a file on a filesystem that was mounted no-exec.

EPERM: The operation was prevented by a file seal; see fcntl(2).

Neither fits, as there is no file backing the mapping. I tried just invoking mmap a handful of times in a separate container; at one point I was able to reproduce the issue, but now I just get Cannot allocate memory after ~4k allocations, since I'm not munmapping anything.
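
For reference, the ad-hoc probe was along these lines (a rough sketch rather than the exact program: it just repeats the same anonymous 8 KiB mapping the failing mmap2 call was making, never unmaps, and prints the errno once the kernel refuses):

# run inside the 32-bit Trusty container
cat > /tmp/mmap_probe.c <<'EOF'
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    unsigned long count = 0;
    for (;;) {
        /* same protections and flags as the failing mmap2 call seen in strace */
        void *p = mmap(NULL, 8192, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            printf("mmap failed after %lu mappings: %s\n", count, strerror(errno));
            return 1;
        }
        count++;   /* deliberately never munmap, mirroring the by-hand test */
    }
}
EOF
gcc /tmp/mmap_probe.c -o /tmp/mmap_probe && /tmp/mmap_probe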

nuclearsandwich commented 6 years ago

I've got strace output from a successful run via the shell and from a failed run via a minimal Makefile (which contains the command invocation hard-coded, with no other targets or variables).

There's not much I can identify except that the successful run has more munmap calls immediately after its mmap2 calls than the run via make does, but nothing is definitively identifiable.
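
For anyone who wants to repeat the comparison, the setup was roughly as follows; the file names here are arbitrary and the bracketed placeholder stands in for the full i686-linux-gnu-g++ invocation quoted earlier:

# capture one trace of the compiler run directly from the shell and one run via the minimal Makefile
strace -f -o /tmp/trace_shell.log sh -c '<full i686-linux-gnu-g++ invocation from above>'
strace -f -o /tmp/trace_make.log make -f /tmp/Makefile.minimal
# compare how mmap2 and munmap calls pair up in the two traces
grep -c 'mmap2('  /tmp/trace_shell.log /tmp/trace_make.log
grep -c 'munmap(' /tmp/trace_shell.log /tmp/trace_make.log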

gavanderhoorn commented 6 years ago

Would it make sense to open a ticket on the Docker tracker?

Lots more eyes over there.

nuclearsandwich commented 6 years ago

> Would it make sense to open a ticket on the Docker tracker?

I don't really see this as a Docker issue. It happens with both Docker 17.05 and 18.03/18.05; it's changing the kernel that causes grief. If anything I might post on the Docker forums and/or reply on https://www.reddit.com/r/docker/comments/8l539q/docker_virtual_memory_running_out/

I'd also really like to have a minimal example that I can share but I suppose in the context of Docker just publishing an image that exhibits the problem is sufficient even if it's heavyweight.
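
If it comes to that, publishing such an image should just be a matter of snapshotting the container that exhibits the failure and pushing it somewhere public; a sketch, where the container ID, repository name, and tag are all placeholders:

# snapshot the failing build container and push it to a public registry
docker commit <failing-container-id> someuser/indigo-trusty-i386-vmem-repro:latest
docker push someuser/indigo-trusty-i386-vmem-repro:latest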

gavanderhoorn commented 6 years ago

Oh, I didn't see a kernel revert mentioned, so I assumed that was not affecting anything.

> I'd also really like to have a minimal example that I can share but I suppose in the context of Docker just publishing an image that exhibits the problem is sufficient even if it's heavyweight.

True.

nuclearsandwich commented 6 years ago

> Oh, I didn't see a kernel revert mentioned, so I assumed that was not affecting anything.

My apologies. I had a writeup that ended up getting deleted when I found a flaw in my methods.

The issue doesn't occur with the default Xenial kernel. I brought up a machine using our agent AMI, but partitioned off from the main buildfarm network, downgraded its kernel back to the default linux-aws 4.4 kernel, and the issue was resolved.

I also tried downgrading Docker back to the previously used 17.05 version, but that had no effect on the issue (it still occurred with the 4.15 kernel and did not occur with 4.4).

However, the performance issues that the 4.4 kernel's Spectre and Meltdown mitigations cause in Trusty containers were the reason we rolled out the 4.15 kernel in the first place. So if we cannot resolve this issue (and some issues using libkmod in Kinetic builds), we'll have to explore other options for mitigating or living with the performance hit.
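
In case anyone wants to repeat the test, the downgrade on a Xenial agent goes roughly like this (a sketch; take the exact 4.15 package names from the dpkg listing rather than from this comment):

# sketch: put a Xenial test agent back on the stock 4.4 linux-aws kernel
uname -r                          # 4.15.x before the downgrade, 4.4.x afterwards
sudo apt-get update
sudo apt-get install linux-aws    # meta-package tracking the 4.4-series AWS kernel on Xenial
dpkg -l | grep linux-image        # list the installed kernel images
sudo apt-get purge <4.15 image and headers packages>   # placeholder; fill in from the listing above
sudo reboot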

gavanderhoorn commented 6 years ago

> However, the performance issues that the 4.4 kernel's Spectre and Meltdown mitigations cause in Trusty containers were the reason we rolled out the 4.15 kernel in the first place. So if we cannot resolve this issue (and some issues using libkmod in Kinetic builds), we'll have to explore other options for mitigating or living with the performance hit.

Or disable the mitigations?

Not much private info going around on the farm?

Or are there opportunities for leaking data that I'm not aware of?

Provided it's really those mitigations that are causing this, of course.

nuclearsandwich commented 6 years ago

> Provided it's really those mitigations that are causing this, of course.

Disabling the mitigations is another way to resolve https://github.com/ros-infrastructure/ros_buildfarm/issues/535. I haven't tested whether 4.4 with mitigations disabled exhibits this same issue, but that's a valid test to perform.
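
For that test, the mitigations can be switched off with kernel boot parameters rather than by swapping kernels; a sketch of the GRUB change (these are the x86 parameter names I know of for the 4.4/4.15 kernels and should be double-checked against the kernel docs):

# sketch: disable the Meltdown/Spectre mitigations for a test boot
# in /etc/default/grub, extend the kernel command line, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="... pti=off nospectre_v2"
sudo update-grub && sudo reboot
# afterwards, check what the running kernel reports:
grep . /sys/devices/system/cpu/vulnerabilities/*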

I am reluctant to run without mitigations on the public farm, primarily because I lack the expertise to be confident that we would not be opening any significant new attack vectors in doing so.

gavanderhoorn commented 6 years ago

I'm not an expert either, so this is just speculation, but perhaps registering a repository for dev jobs with a malicious package that exploits either vulnerability to retrieve the Jenkins admin password?

Seems far-fetched, but that is basically the end-of-the-world scenario that started this whole mess.

tfoote commented 6 years ago

Yeah, it looks like maybe finding ways to live with the performance hits on Trusty will be better.

The kernel upgrade is also causing regressions in the realsense driver and downstream packages: https://github.com/intel-ros/realsense/issues/388

nuclearsandwich commented 5 years ago

We did end up reverting the kernel change which precipitated this.