ros-navigation / navigation2

ROS 2 Navigation Framework and System
https://nav2.org/

CI build times out #3189

Closed ruffsl closed 1 year ago

ruffsl commented 1 year ago

Looks like the current CI build is timing out again. Even with previous countermeasures in place, those mitigations are no longer sufficient to keep the overall build time under the maximum one-hour limit when building from scratch, or when caching is intentionally busted due to dependency and system package upgrades. For example:

https://app.circleci.com/pipelines/github/ros-planning/navigation2/7903/workflows/d5454092-e664-417b-ab70-b1c05d5c42e7/jobs/27157

The release_build job linked above demonstrates a context deadline exceeded while attempting to finish building the final workspace package. When caching is used, including ccache and colcon-cache, building the nav2 workspace normally has no such issue. However, when caching is deliberately busted to ensure clean, hygienic builds upon upstream dependency updates, it seems the current codebase exceeds the deadline limit on CircleCI's free tier jobs.

For reference, resource utilization for the medium Docker resource class is almost fully maxed out in terms of core count; the exceptions are the bottlenecks when building the first few and the last packages in the workspace, such as nav2_util (?) and nav2_system_tests.

[Screenshot: CPU and RAM utilization plot for the medium resource class build]

Possible Solutions

Upgrade resource class for CI workers

By default, the resource class currently used for CI Docker workers is medium, with 2 vCPUs and 4 GB of RAM. Aside from the occasional spike once or twice a month when the caches are busted, this hasn't been much of an issue since caching was added:

[Screenshot: CI resource usage over time, with occasional spikes when caches are busted]

Given the current issue, one solution would be to simply upgrade the resource class for workers to something more powerful, such as large, with 4 vCPUs and 8 GB of RAM:

https://circleci.com/docs/configuration-reference#resourceclass

[Screenshot: CircleCI resource class table]
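In config terms, the change would be a one-line edit per job, roughly like this (a sketch; the actual job definitions live in .circleci/config.yml):

```yaml
jobs:
  release_build:
    # docker executor settings unchanged
    resource_class: large  # was medium (2 vCPUs / 4 GB RAM); large has 4 vCPUs / 8 GB
```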

See here for such an example:

However, in practice, switching to a larger resource class is not enough, given the bottlenecks described above. Despite the principle of diminishing returns, I am also surprised at how unaffected the build time is; perhaps the windows of time in which make jobs exceed the former limit of 2 are countered by the greater degree of context switching between make processes? Note: at the 47 min mark, the step in the CPU plot remains the same as before with the medium resource class.

https://app.circleci.com/pipelines/github/ros-planning/navigation2/7900/workflows/2812b4a8-00d6-43e3-ba91-86fd378a5a21/jobs/27151

[Screenshot: CPU and RAM utilization plot for the large resource class build]

In addition, unsetting all makeflags and config options limiting the parallel job count still results in memory exhaustion:

https://app.circleci.com/pipelines/github/ros-planning/navigation2/7901/workflows/a7d01a5e-804c-4250-9069-cd637afe2cfd/jobs/27168

[Screenshot: RAM utilization plot showing memory exhaustion during the build]

Optimize codebase to improve build time

Alternatively, we may want to consider optimizing the nav2 codebase to improve build time. One example could be to split larger packages, such as nav2_system_tests, into multiple smaller ones. This greater granularity could benefit from better cache retention, and for jobs building from scratch, if the pieces are parallelizable, it could also help avoid the largest bottleneck in the build/test pipeline.

Perhaps we could also ask the MoveIt2 maintainers, who manage packages of similar scale and complexity, how they are keeping their own build time in check. For example:

cc @SteveMacenski @tylerjw

SteveMacenski commented 1 year ago

I'm not seeing timeouts, but I am seeing a ton of PRs failing on access denied when trying to access the cached workspace steps. Might that be related?

The packages I see with large build times are (I just ran a clean build locally to get some current numbers):

Start to end, it's about 33 minutes on my computer. Do you happen to know a good method, other than commenting out CMake targets and building different sections of packages, to see where the build time is coming from? I have not had to optimize build times at this granularity before.

Something easy to try is reducing the optimization level in the build flags for CI; that should help. We don't need -O3 for CI, for instance; even the lowest level of optimization would be fine.

I assume many small libraries are better than a single large one. Though, in the behavior tree package, each node is its own library, and in the costmap package the layers are all grouped together, so maybe that's not the core of the issue. But it's also worth a try.

There are also more invasive things we can do, like trying to reduce the interdependency of components, moving more logic into the headers, and adding forward declarations -- but I suspect the real issue for some of this is the sheer number of things being linked as required dependencies, regardless of what the code does. Note that most packages that exceed 1 minute usually involve some kind of pluginlib plugin definitions (and in many cases, templating), with the notable exception of the system tests.

Could we perhaps move all of the algorithm package builds into a new CI job that builds those plus the system tests as a separate operation in the matrix, so the flow looks like build_framework -> build_algorithms -> test? From these builds, I think moving the plugins into a separate build job would save ~6-10 minutes without even considering the system tests themselves. Or, if the system tests are the only issue, we could make that its own build stage.

Does having them in separate jobs as part of the same build matrix help?

ruffsl commented 1 year ago

> I'm not seeing timeouts, but I am seeing a ton of PRs failing on access denied when trying to access the cached workspace steps. Might that be related?

It could be that caches are being evicted from CI due to age or storage limits. For PRs with infrequent commits or spans of inactivity, this could be the case. Yet that shouldn't be a problem for the build job steps, where a missing cache simply results in falling back to rebuilding the workspace. For test job steps, it is a more serious issue, as test jobs read the caches that the build job steps write immediately prior in the CI workflow. Are the test jobs failing to access the cache?
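For context, the pattern in question looks roughly like this (a sketch of the save/restore flow; the cache key and path here are illustrative, not nav2's actual config):

```yaml
# in the build job: write the workspace cache after building
- save_cache:
    key: workspace-{{ .Branch }}-{{ .Revision }}
    paths:
      - /opt/overlay_ws
# in the test job: read the cache the build job wrote just prior
- restore_cache:
    keys:
      - workspace-{{ .Branch }}-{{ .Revision }}
```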

> Do you happen to know a good method, other than commenting out CMake targets and building different sections of packages, to see where the build time is coming from? I have not had to optimize build times at this granularity before.

I think you could check out Clang Build Analyzer, as also used in the MoveIt2 PR linked above. In addition, you may want to limit the number of parallel workers and make jobs to 1 to get a baseline measurement agnostic of core count or background load on your workstation (it might take a lot longer though), e.g. by setting these colcon/make options to 1:
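A minimal sketch of those settings, assuming a colcon defaults.yaml is in use (colcon reads $COLCON_DEFAULTS_FILE, or $COLCON_HOME/defaults.yaml):

```yaml
# defaults.yaml -- serialize the build to get a per-package baseline
build:
  parallel-workers: 1  # build one package at a time
# and in the shell, limit make to a single job as well:
#   export MAKEFLAGS="-j1"
```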

> Something easy to try is reducing the optimization level in the build flags for CI; that should help. We don't need -O3 for CI, for instance; even the lowest level of optimization would be fine.

True, I can try that on another experimental PR to gauge its effect.
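For instance (a sketch, not nav2's actual CI settings), the Release optimization level could be dialed down through the CMake arguments colcon passes through:

```yaml
# defaults.yaml -- lower the optimization level for CI builds only
build:
  cmake-args:
    - "-DCMAKE_BUILD_TYPE=Release"
    - "-DCMAKE_CXX_FLAGS_RELEASE=-O1 -DNDEBUG"  # CMake's default for Release is -O3 -DNDEBUG
```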

> I assume many small libraries are better than a single large one. Though, in the behavior tree package, each node is its own library, and in the costmap package the layers are all grouped together, so maybe that's not the core of the issue. But it's also worth a try.

It could help with cache retention, given that a change to one library would be less likely to bust the cache of others, but I'm not sure it would help much when building the workspace from scratch without any caches to begin with.

> There are also more invasive things we can do, like trying to reduce the interdependency of components, moving more logic into the headers, and adding forward declarations -- but I suspect the real issue for some of this is the sheer number of things being linked as required dependencies, regardless of what the code does. Note that most packages that exceed 1 minute usually involve some kind of pluginlib plugin definitions (and in many cases, templating), with the notable exception of the system tests.

We have tweaked the linker before, which really helped cut down on build times; I'm not sure what else we could try there. But to your point about headers, the MoveIt2 PR saw some improvements when becoming more selective about which headers it included from rclcpp, though that wasn't too big a gain on its own.

> Could we perhaps move all of the algorithm package builds into a new CI job that builds those plus the system tests as a separate operation in the matrix, so the flow looks like build_framework -> build_algorithms -> test? From these builds, I think moving the plugins into a separate build job would save ~6-10 minutes without even considering the system tests themselves. Or, if the system tests are the only issue, we could make that its own build stage.

If you'd like to elaborate on that a little more, I could try it. Splitting the workspace build into multiple jobs, even in the same linear workflow, could keep any one job from hitting the context deadline. What packages would fall into which categories?

> Does having them in separate jobs as part of the same build matrix help?

We can easily parallelize test jobs across a matrix, as we already do with our RMW vendor tests, or by upping the optional parallelism for job workers. Doing so for build jobs wouldn't be as easy, given the DAG of interdependencies across nav2 packages. If the DAG were disjoint, or had two branching subtrees, we could try to split across that separation. Think of the question as: if we could build nav2 across multiple workspaces, what would the DAG for those workspaces look like?

https://github.com/ros-planning/navigation2/blob/34ecee20a1913a71bc345b0438a0081b73159458/.circleci/config.yml#L470
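For reference, the RMW test matrix pattern looks roughly like this in CircleCI 2.1 syntax (the parameter name and values here are illustrative, not nav2's actual config):

```yaml
workflows:
  test:
    jobs:
      - test:
          matrix:
            parameters:
              rmw: ["rmw_fastrtps_cpp", "rmw_cyclonedds_cpp"]
```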

ruffsl commented 1 year ago

I should have tried this earlier, but we can also try increasing the context deadline (timeout):

https://support.circleci.com/hc/en-us/articles/360045268074-Build-Fails-with-Too-long-with-no-output-exceeded-10m0s-context-deadline-exceeded-

I'm not sure what the upper limit here is, but it would still be nice to keep the nav2 build time under an hour.
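A minimal sketch of what raising it could look like on a long-running step (the step shown is hypothetical; no_output_timeout defaults to 10m):

```yaml
- run:
    name: Build Workspace
    command: colcon build --symlink-install
    no_output_timeout: 20m  # allow 20 minutes of silence before the step is killed
```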

SteveMacenski commented 1 year ago

> If you'd like to elaborate on that a little more, I could try it. Splitting the workspace build into multiple jobs, even in the same linear workflow, could keep any one job from hitting the context deadline. What packages would fall into which categories?

I think, to start, separating the nav2_system_tests build into its own build job would be a good move; that takes 15 minutes in CI by itself. It would also be sensible to have a parallel test job that runs only nav2_system_tests, since that alone takes 39min 33s. I think the rest of CI is pretty quick (maybe 5-10 minutes), and it would be nice to get rapid results for the non-system tests, for things like failed unit tests and linters, without the delay from the system tests.

So what I'd imagine is build_nav -> build_system_tests -> (in parallel) test_nav + test_system_tests. That should bring each job to well below 1 hour.
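A rough sketch of that flow as a CircleCI workflow (job names follow the proposal above; the requires wiring is an assumption):

```yaml
workflows:
  build_and_test:
    jobs:
      - build_nav
      - build_system_tests:
          requires: [build_nav]
      - test_nav:  # runs in parallel with test_system_tests
          requires: [build_system_tests]
      - test_system_tests:
          requires: [build_system_tests]
```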

Example of the access denied in build: https://app.circleci.com/pipelines/github/ros-planning/navigation2/7915/workflows/99237b45-1749-4cb9-8cac-7b77fdcfae6c/jobs/27184. I usually only see it in build, not test. I'm not sure (?) if this relates to the timeout issues.

> I should have tried this earlier, but we can also try increasing the context deadline (timeout):

:+1: :+1: I agree, under an hour would be ideal, but I'll take small steps :smile: