ros2 / ci

ROS 2 CI Infrastructure
http://ci.ros2.org/
Apache License 2.0

Add recovery strategy for `error waiting for container: unexpected EOF` #703

Closed: claraberendsen closed this 1 year ago

claraberendsen commented 1 year ago

Description

This PR adds the ability to recover the agent by restarting it when a specific error appears in the log. The case handled here is `error waiting for container: unexpected EOF`.
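For illustration only, the detection half of this strategy could be sketched in Python as below. This is not the PR's implementation (the actual change relies on Jenkins plugins added in a separate PR); the function names and the `restart_agent` action label are hypothetical.

```python
import re

# Error string taken from this PR's title.
ERROR_PATTERN = re.compile(r"error waiting for container: unexpected EOF")


def needs_agent_restart(log_text: str) -> bool:
    """Return True if the build log contains the known error string."""
    return ERROR_PATTERN.search(log_text) is not None


def choose_action(log_text: str) -> str:
    """Map a build log to a hypothetical recovery action name."""
    # "restart_agent" and "none" are made-up labels for this sketch.
    return "restart_agent" if needs_agent_restart(log_text) else "none"
```

A log without the error string maps to `"none"`, so jobs that fail for other reasons would not trigger a restart in this sketch.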

claraberendsen commented 1 year ago

@nuclearsandwich I have a draft of the process here. I need some insight on how to test this... I have a separate PR that I think should go in before this one; it adds the plugins this needs to work. For now I created a test job in the create_jobs.py file to check that the configuration is correct. However, I'm not certain how to build and run it, since I don't know when and where this file is run. I would appreciate your insight.

nuclearsandwich commented 1 year ago

> I have a separate PR that I think should go in before this one; it adds the plugins this needs to work.

Can you link or reference that here? As discussed in the Infrastructure meeting, we're blocked on Jenkins plugin upgrades generally, but that's an important item to address in the near future.

> I need some insight on how to test this...

  1. Get this draft ready for a test deployment from the current branch
  2. Deploy this PR scoped just to test_ci_linux
  3. Manually reconfigure test_ci_linux to echo the expected error string
  4. Run test_ci_linux and validate the behavior
  5. Re-deploy to remove the manual reconfiguration and run an (expected) successful job to confirm that nothing breaks when the job doesn't exhibit the failure.

I'm in favor of just hand-hacking the test_ci_linux job to display the error string and exit, rather than adding a separate test job just for this. The jobs get re-configured on every deploy, and I don't think there's currently enough justification to create a job for those kinds of tests.
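As a rough sketch of what "display the error string and exit" could look like, here is a small Python stand-in for the hand-hacked build step. It is an assumption for illustration, not the actual test_ci_linux configuration; `simulate_failure` is a hypothetical name.

```python
import sys

# Error string taken from this PR's title.
ERROR_STRING = "error waiting for container: unexpected EOF"


def simulate_failure(stream=None) -> int:
    """Write the expected error string and return a non-zero exit code."""
    if stream is None:
        stream = sys.stdout
    print(ERROR_STRING, file=stream)
    return 1
```

A build step could call `sys.exit(simulate_failure())` so that the job log contains the string and the job fails, which is the condition the recovery strategy would need to see.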

> For now I created a test job in the create_jobs.py file to check that the configuration is correct. However, I'm not certain how to build and run it, since I don't know when and where this file is run. I would appreciate your insight.

This script is generally run by ROS 2 devs locally after merging a PR and making a change. Running it requires Jenkins admin configuration and a configured Python environment. The README has details and links to further information.

claraberendsen commented 1 year ago

Closing this since we have tracked down the actual error and this strategy is not the appropriate solution.