ros-infrastructure / ros_buildfarm

ROS buildfarm based on Docker
Apache License 2.0

Abort release builds if already re-queued #864

Closed. cottsay closed this 3 years ago

cottsay commented 3 years ago

There seems to be a race between the triggering of downstream jobs and the execution of jobs already in the queue. What I've observed is that if a build is in the queue waiting only for a single dependency to finish building, it starts executing immediately when that dependency finishes. The dependency's build, however, also triggers another build of that same job, which is immediately re-added to the queue.

One theory is that there is some delay between when a new build is "triggered" and when that build is actually enqueued. So the sequence may be:

  1. A is building, while B is waiting for A
  2. A finishes building and requests a build of downstream project B
  3. B is dequeued and begins execution
  4. A's build request for B is processed and B is enqueued again

An alternative strategy to avoid this strange behavior might be to enact some sort of quiet period AFTER a build triggers new downstream builds but before the build is actually regarded as 'complete'.
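
The mitigation in this change's title (aborting a build when the same job has already been re-queued) could in principle be a check against the Jenkins queue at the start of the build. The following is only a hypothetical sketch, not necessarily the actual change in this PR; it is written as an "Execute system Groovy script" build step and assumes the build and listener bindings that step provides:

// Hypothetical sketch: abort this build if another build of the same job is
// already waiting in the queue, since that queued build will redo the work.
import hudson.model.Job
import hudson.model.Result
import jenkins.model.Jenkins

def jobName = build.project.fullName
def alreadyQueued = Jenkins.instance.queue.items.any { item ->
  item.task instanceof Job && ((Job) item.task).fullName == jobName
}

if (alreadyQueued) {
  listener.logger.println("A build of ${jobName} is already queued; aborting this one")
  // Interrupt the executor so the rest of the build is skipped; the exact
  // abort mechanism may differ in the real job configuration.
  build.executor.interrupt(Result.ABORTED)
}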

cottsay commented 3 years ago

I have an example of this on the staging farm. If we look through build logs, we may be able to find more occurrences of this double-build pattern. This build of ros_workspace states that the event that triggered it was the completion of its source job: https://build.test.ros2.org/job/Rbin_rhel_el864__ros_workspace__rhel_8_x86_64__binary/1/. It began executing at the exact time that its only binary package dependency finished. This build of ros_workspace states that the event that triggered it was the completion of that single binary dependency: https://build.test.ros2.org/job/Rbin_rhel_el864__ros_workspace__rhel_8_x86_64__binary/2/

The two builds happened less than 1:30 apart. This rapid re-build pattern is something we could search for using the Jenkins script console, to find out how often this is happening and how effective this mitigation would be.
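
As a rough illustration of what such a search might look like, here is a hypothetical script-console sketch (not something that was actually run; the 5-minute window and the job-name filter are arbitrary choices) that flags successful builds of the same job that started unusually close together:

// For each matching job, report pairs of successful builds whose start times
// are suspiciously close together (the rapid re-build pattern above).
threshold = 5 * 60 * 1000  // milliseconds

Jenkins.instance.items.findAll { job ->
  job.name =~ '^Rbin_'
}.each { job ->
  successes = job.builds.findAll { Run run ->
    run.result == Result.SUCCESS
  }.sort { Run run -> run.startTimeInMillis }

  for (i = 1; i < successes.size(); i++) {
    delta = successes[i].startTimeInMillis - successes[i - 1].startTimeInMillis
    if (delta < threshold) {
      println job.name + ': builds #' + successes[i - 1].number + ' and #' +
          successes[i].number + ' started ' + delta.intdiv(1000) + 's apart'
    }
  }
}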

tfoote commented 3 years ago

I've seen this double-building pattern a lot with Kinetic releases of the relatively long-running and highly interlinked jsk packages. Ticket #475 has more information, including what we believe to be the upstream Jenkins bug, which is still unresolved a decade later. This workaround sounds like as good a solution as we can get. It might be worth putting a comment into the snippet referencing the ticket and linking to the issues, so that if it's ever resolved, people will know what the snippet is working around and when it can be removed.

cottsay commented 3 years ago

This is actually happening far more often than I originally thought.

I just deployed brand-new jobs for RHEL on build.ros2.org, which yielded a great opportunity to see how often we're re-building successful jobs. I wrote the following Groovy to count the number of jobs that have a single success, and the number of jobs that have more than one success:

// Count, per RHEL binary job, how many builds succeeded. Any job with more
// than one success was rebuilt unnecessarily. Run from the Jenkins script
// console, which imports hudson.model.* by default.
once = 0
more = 0

Jenkins.instance.items.findAll { job ->
  job.name =~ '^Rbin_rhel_'
}.each { job ->
  successes = job.builds.findAll { Run run ->
    run.result == Result.SUCCESS
  }

  if (successes.size() == 1) {
    once += 1
  } else if (successes.size() > 1) {
    println successes.size() + ' ' + job.name
    more += 1
  }
}

println "One success: " + once
println "More than one success: " + more

The initial build isn't done yet, but there are already enough completed builds to yield meaningful statistics:

2 Rbin_rhel_el864__ament_cmake__rhel_8_x86_64__binary
2 Rbin_rhel_el864__ament_cmake_export_definitions__rhel_8_x86_64__binary
2 Rbin_rhel_el864__ament_cmake_export_dependencies__rhel_8_x86_64__binary
2 Rbin_rhel_el864__ament_cmake_export_include_directories__rhel_8_x86_64__binary
2 Rbin_rhel_el864__ament_cmake_export_libraries__rhel_8_x86_64__binary
2 Rbin_rhel_el864__ament_cmake_export_link_flags__rhel_8_x86_64__binary
2 Rbin_rhel_el864__ament_cmake_include_directories__rhel_8_x86_64__binary
2 Rbin_rhel_el864__ament_cmake_libraries__rhel_8_x86_64__binary
2 Rbin_rhel_el864__ament_cmake_nose__rhel_8_x86_64__binary
2 Rbin_rhel_el864__ament_cmake_pytest__rhel_8_x86_64__binary
2 Rbin_rhel_el864__ament_cmake_python__rhel_8_x86_64__binary
2 Rbin_rhel_el864__ament_cmake_ros__rhel_8_x86_64__binary
2 Rbin_rhel_el864__ament_cmake_target_dependencies__rhel_8_x86_64__binary
2 Rbin_rhel_el864__ament_cmake_version__rhel_8_x86_64__binary
2 Rbin_rhel_el864__ament_cppcheck__rhel_8_x86_64__binary
2 Rbin_rhel_el864__ament_index_cpp__rhel_8_x86_64__binary
2 Rbin_rhel_el864__ament_lint_auto__rhel_8_x86_64__binary
2 Rbin_rhel_el864__ament_nodl__rhel_8_x86_64__binary
2 Rbin_rhel_el864__ament_pycodestyle__rhel_8_x86_64__binary
2 Rbin_rhel_el864__fastrtps_cmake_module__rhel_8_x86_64__binary
2 Rbin_rhel_el864__mimick_vendor__rhel_8_x86_64__binary
2 Rbin_rhel_el864__nodl_python__rhel_8_x86_64__binary
2 Rbin_rhel_el864__performance_test_fixture__rhel_8_x86_64__binary
2 Rbin_rhel_el864__rcpputils__rhel_8_x86_64__binary
2 Rbin_rhel_el864__rcutils__rhel_8_x86_64__binary
2 Rbin_rhel_el864__ros_environment__rhel_8_x86_64__binary
2 Rbin_rhel_el864__ros_workspace__rhel_8_x86_64__binary
2 Rbin_rhel_el864__rosidl_cli__rhel_8_x86_64__binary
2 Rbin_rhel_el864__rosidl_cmake__rhel_8_x86_64__binary
2 Rbin_rhel_el864__rosidl_runtime_c__rhel_8_x86_64__binary
2 Rbin_rhel_el864__rpyutils__rhel_8_x86_64__binary
One success: 114
More than one success: 31

So that's 114 + (2 * 31) = 176 total successes, and 31 of them were unnecessary rebuilds. Out of 114 + 31 = 145 initial successes, we actually ran 176 successful builds. 176 / 145 ≈ 1.21, so that's ~21% more builds than we needed. That's quite a lot! I'm inclined to say that this change should be reviewed as an immediate mitigation, even if we do plan to pursue a solution in Jenkins itself.
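
For convenience, the overhead could also be printed at the end of the counting script above. A small sketch, assuming (as in the output above) that every job with more than one success was built exactly twice:

total = once + 2 * more        // 114 + 2 * 31 = 176 successful builds
initial = once + more          // 114 + 31 = 145 jobs that needed building
extra = total - initial        // 31 unnecessary rebuilds
println 'Overhead: ' + (100 * extra).intdiv(initial) + '%'   // ~21%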

cottsay commented 3 years ago

It might be worth putting a comment into the snippet referencing the ticket

Done in 7a28b17, thanks @tfoote

Another observation: this seems to happen most often with bottleneck packages when there is free executor space. This is supported by the abundance of ament packages in the list above. I also just observed this behavior on a majority of the message packages, which brought the excess up to ~24%.