Implement Retry for Agents Starting HttpServer
The most common reason observed so far for an agent failing to connect to the controller and requiring a restart (terminating that instance and spinning up a new one) is the agent's inability to start its HttpServer. The cause has been pinpointed: another service is already running on that port, so the agent fails to start the server and throws an error, which causes it to terminate. This change retries the call to start the HttpServer instead. The agent now comes up after an average delay of 40-60 seconds, compared to 4+ minutes for terminating and restarting it. The issue is rare in any single job run, but its likelihood increases with the number of agents started at once (it is most likely to appear when running 100+ agents at once). With this change, the job runs with the same number of agents it began with, simply waiting for the failing agent to connect instead of terminating it.
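For reviewers who want a picture of the retry pattern described above, here is a minimal sketch. It is not the code in this PR: the `StartRetry` class, the `StartAction` interface, and the `maxAttempts` and `delayMillis` parameters are hypothetical names chosen for illustration, and it assumes the port conflict surfaces as an `IOException` (for example a `java.net.BindException`).

```java
import java.io.IOException;

// Hypothetical helper illustrating the retry idea; not the actual agent code.
final class StartRetry {

    /** Hypothetical stand-in for "the call that starts the HttpServer". */
    interface StartAction<T> {
        T start() throws IOException;
    }

    /**
     * Calls startAction until it succeeds or maxAttempts is reached,
     * sleeping delayMillis between failed attempts.
     */
    static <T> T startWithRetry(StartAction<T> startAction, int maxAttempts, long delayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return startAction.start();
            } catch (IOException e) {
                // Assumption: a busy port is reported as an IOException
                // (java.net.BindException is a subclass of IOException).
                lastFailure = e;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(delayMillis);
                    } catch (InterruptedException interrupted) {
                        // Restore the interrupt flag and stop retrying.
                        Thread.currentThread().interrupt();
                        throw new IllegalStateException("Interrupted while waiting to retry", interrupted);
                    }
                }
            }
        }
        // All attempts failed; surface the last failure to the caller.
        throw new IllegalStateException("Server did not start after " + maxAttempts + " attempts", lastFailure);
    }
}
```

A caller would wrap its own start call, for example `StartRetry.startWithRetry(() -> startServer(port), 5, 10_000L)`, where `startServer` is a placeholder for whatever method starts the server. Bounding the number of attempts keeps an agent from waiting indefinitely if the port never frees up.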
Please make sure these check boxes are checked before submitting
[ ] Squashed Commits
[ ] All Tests Passed - mvn clean test -P default
PR review process
Requires one +1 from a reviewer
Repository owners will merge your PR once it is approved.