ros-infrastructure / buildfarm_deployment


Increasing available memory for release builds [Pinocchio] #232

Open wxmerkt opened 4 years ago

wxmerkt commented 4 years ago

Moving this discussion from email to a GitHub issue.

Background: Some release builds of Pinocchio have recently started failing with "virtual memory exhausted" errors (cf. e.g. here). This is due to the template-heavy nature of the project (https://github.com/stack-of-tasks/pinocchio/issues/1074). We have taken steps to decrease the memory required to compile (https://github.com/stack-of-tasks/pinocchio/pull/1079, https://github.com/stack-of-tasks/pinocchio/pull/1077). However, the builds are still failing due to memory exhaustion on 32-bit platforms (https://github.com/stack-of-tasks/pinocchio/issues/1096).

Current situation: On Ubuntu 18.04 with an i7-9850H CPU @ 2.60GHz (64-bit) and 16 GB of memory, compiling Pinocchio (commit: https://github.com/stack-of-tasks/pinocchio/commit/8303d3bdf1997da7a661e349c5e8e4ede5ea9382) with a single job, this is my peak usage as measured by /usr/bin/time -v catkin build -j1 (suggesting a 4.11 GB peak):

Command being timed: "catkin build -j1"
    User time (seconds): 1951.53
    System time (seconds): 108.42
    Percent of CPU this job got: 100%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 34:13.09
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 4112528
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 1276
    Minor (reclaiming a frame) page faults: 67073817
    Voluntary context switches: 277496
    Involuntary context switches: 23993
    Swaps: 0
    File system inputs: 917816
    File system outputs: 22088336
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

We would like to get input on what we can do to alleviate this issue so that we can continue to release via the ROS buildfarm (we do not need pull request testing).

@tfoote indicated that the current limit per VM is 8 GB, but that it may decrease to 2 GB in the future. I assume the buildfarm runs make with a single job (-j1)?

  1. Would blacklisting 32-bit builds be a first step in resolving the current bottleneck?
  2. Is there a possibility of getting more memory during binary-release building only?

cc: @tfoote @jcarpent

tfoote commented 4 years ago

Certainly blacklisting 32-bit builds will stop the current failure. I'll note that the 4.11 GB peak for the build will truly be a blocker for 32-bit systems, as it exceeds the maximum addressable memory in 32-bit space (2^32 bytes = 4 GiB).

We don't have the ability to add more memory for a specific job. And in general, compilation units of this size are a problem for many users. I would recommend splitting your system into more, smaller compilation units. If a system has enough resources it can compile them in parallel, and on smaller systems they can be compiled one at a time. With the current large build, by contrast, even -j1 leaves no way for someone to build it on a smaller platform.

We had similar problems with pcl early in its development. It was regularly causing crashes on developer machines due to going OOM. It too is highly templated; however, by reviewing the include orders we were able to significantly reduce the memory usage and keep it from overwhelming systems.

We run 8 GB VMs, but they run 4 jobs at a time, so our primary specification is 2 GB per job. We don't currently enforce it, but I would request that you try to respect that. If, for example, your releases for 2 different platforms ended up on the same executor at the same time, it would likely run out of memory, as both platforms typically peak at the same time. And there are potentially 2 other jobs as well as our system overhead. We've seen this sort of simultaneous peaking actually take our executors offline in the past: https://github.com/ros-infrastructure/ros_buildfarm/issues/265

azeey commented 3 years ago

Our amd64 build agents have been going offline due to memory exhaustion when building ros-eloquent-pinocchio and ros-foxy-pinocchio, which were added in https://github.com/ros/rosdistro/pull/26391 and https://github.com/ros/rosdistro/pull/26391 respectively. @wxmerkt, have you made any improvements to the memory consumption? If not, do you mind reverting the PRs until the memory consumption is below 2 GB?

wxmerkt commented 3 years ago

I was hoping that, similar to ROS 1, the builds would eventually finish one by one after a series of failures. This does not seem to be the case :-(. I am okay with reverting the Eloquent/Foxy releases.

@azeey, what's the current memory limit for the amd64 buildfarm for ROS 2? I saw that some builds hung in the tests, but I thought I had explicitly disabled tests via a patch as in ROS 1 (since we know they exhaust memory). Is there another patch we could use to disable tests?

Regarding reducing memory requirements: When speaking to @jcarpent, I seem to recall that reducing the memory required to compile the Python bindings (expose-aba-derivatives etc.) was not straightforward, or perhaps not possible. I don't have any insights here, sorry.

cc: @Rascof, the ROS2 releases for Pinocchio are likely to be reverted. Could you perhaps look into the possibility of reducing memory requirements? The alternative would be to release without Python bindings, but that'd be quite limiting.
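
For reference, the general pattern that splitting the bindings further would follow is to keep each binding translation unit small: one expose function per .cpp file, with the module init only dispatching to them. Judging by the expose-aba-derivatives naming, Pinocchio's bindings are already organized roughly this way, so the question is more whether the heaviest units can be split further still. A rough Boost.Python sketch with purely illustrative names (not Pinocchio's actual code):

    // ---- expose_aba_derivatives.cpp (hypothetical) ----
    #include <boost/python.hpp>

    namespace { double dummyABADerivative(double v) { return 2.0 * v; } }  // stand-in for the real, heavy bindings

    void exposeABADerivatives() {
      boost::python::def("dummy_aba_derivative", &dummyABADerivative);
    }

    // ---- expose_rnea.cpp (hypothetical) ----
    #include <boost/python.hpp>

    namespace { double dummyRNEA(double v) { return v + 1.0; } }  // stand-in

    void exposeRNEA() {
      boost::python::def("dummy_rnea", &dummyRNEA);
    }

    // ---- module.cpp: the module init only calls the expose functions ----
    #include <boost/python.hpp>

    void exposeABADerivatives();
    void exposeRNEA();

    BOOST_PYTHON_MODULE(example_bindings) {
      exposeABADerivatives();
      exposeRNEA();
    }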

azeey commented 3 years ago

Thanks for creating the PRs.

what's the current memory limit for the amd64 buildfarm for ROS2?

The build agents have 8GB, but as @tfoote mentioned, each agent runs 4 jobs in parallel, so a limit of 2GB per job is recommended.

Is there another patch we could use to disable tests?

Unfortunately, I'm not familiar enough with the release process to answer that question.

clalancette commented 3 years ago

@wxmerkt What I've found in the past is that having a lot of code in a single compilation unit tends to be the culprit for excessive memory usage. Thus, the solution usually revolves around splitting the code up into multiple compilation units, compiling them serially (i.e. make -j1), and linking them together at the end. I'm not sure if that will work in this particular case, but that's the avenue I would explore.
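
For what it's worth, here is a minimal sketch of that kind of split for template-heavy code (hypothetical names, not Pinocchio's actual structure): the public header declares the template and suppresses implicit instantiation with extern template, the definition lives in an implementation header, and each explicit instantiation definition gets its own small .cpp file, so no single compiler invocation has to materialize everything at once.

    // ---- compute.hpp (public header, hypothetical names) ----
    #pragma once
    #include <vector>

    template <typename Scalar>
    Scalar computeDynamics(const std::vector<Scalar>& q);

    // Suppress implicit instantiation in every includer; each definition is
    // compiled once, in its own small translation unit below.
    extern template double computeDynamics<double>(const std::vector<double>&);
    extern template float  computeDynamics<float>(const std::vector<float>&);

    // ---- compute_impl.hpp (included only by the instantiation TUs) ----
    template <typename Scalar>
    Scalar computeDynamics(const std::vector<Scalar>& q) {
      Scalar sum{};                        // stand-in for the real, heavy algorithm
      for (const Scalar& v : q) sum += v;
      return sum;
    }

    // ---- compute_double.cpp: includes compute.hpp and compute_impl.hpp ----
    template double computeDynamics<double>(const std::vector<double>&);

    // ---- compute_float.cpp: includes compute.hpp and compute_impl.hpp ----
    template float computeDynamics<float>(const std::vector<float>&);

Each of those .cpp files compiles with a much smaller peak than one monolithic file instantiating everything, the instantiations can be built serially or in parallel depending on available memory, and the resulting objects are linked into the library as usual.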

Rascof commented 3 years ago

cc: @Rascof, the ROS2 releases for Pinocchio are likely to be reverted. Could you perhaps look into the possibility of reducing memory requirements? The alternative would be to release without Python bindings, but that'd be quite limiting.

I will be looking for a solution, but I don't know if I can be of much help since I don't know the Pinocchio package in detail. I asked whether it would work without the Python bindings; it would, but it would not be very useful. I can wait to release the packages that depend on Pinocchio for Foxy and Eloquent.