As part of this RCA Action Item, a small temporary fix in the agent retry logic has been implemented and tested in QA-Tank.
Agents in pending status will no longer be relaunched by the agent retry logic. These agents have already reported back to the controller and are waiting for its start command. They will no longer be terminated if the time from agent initialization to receiving the start command exceeds three minutes. Instead, only agents that are still stuck in starting status and have not reported back to the controller will be relaunched.
This is a short-term solution that has been validated in QA-Tank and will eventually be overhauled as part of updating the agent retry logic. This change also adds jobId values to the existing AgentWatchdog logs and adds two new logs to the controller (JobManager) to track agent behavior between registering agents and making the start call.
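For reviewers, the relaunch rule described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this change: the class name, the `AgentStatus` enum, `START_TIMEOUT`, and `shouldRelaunch` are all hypothetical, and it assumes the three-minute window is measured from agent launch and now applies only to agents in starting status.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of the relaunch decision; names are illustrative, not from the codebase.
public class AgentRetrySketch {

    // Assumed set of statuses; only STARTING and PENDING matter for this rule.
    enum AgentStatus { STARTING, PENDING, RUNNING }

    // Assumption: the three-minute window from the description, measured from launch.
    static final Duration START_TIMEOUT = Duration.ofMinutes(3);

    // Returns true only for agents that never reported back (STARTING)
    // and have been waiting longer than the timeout.
    static boolean shouldRelaunch(AgentStatus status, Instant launchedAt, Instant now) {
        if (status != AgentStatus.STARTING) {
            // PENDING agents have already reported to the controller and are
            // waiting for its start command, so they are left alone.
            return false;
        }
        return Duration.between(launchedAt, now).compareTo(START_TIMEOUT) > 0;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        Instant fiveMinutesAgo = now.minus(Duration.ofMinutes(5));
        System.out.println(shouldRelaunch(AgentStatus.PENDING, fiveMinutesAgo, now));  // false
        System.out.println(shouldRelaunch(AgentStatus.STARTING, fiveMinutesAgo, now)); // true
    }
}
```

In this sketch the status check comes first so that a slow start call from the controller can never cause a pending agent to be terminated, regardless of elapsed time.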
Short-term fix for agents stuck in pending
Reference: Investigating Delay in Agent Start-Up
Example of the fix relaunching an agent stuck in starting to start the job:
Please make sure these check boxes are checked before submitting
mvn clean test -P default
PR review process