intuit / Tank

Tank is a downloadable application that can be used to load test websites
Eclipse Public License 1.0

[SRE-27144] Short-term fix for agents stuck in pending #243

Closed Zakaria-Kofiro closed 1 year ago

Zakaria-Kofiro commented 1 year ago

Short-term fix for agents stuck in pending

As part of an RCA action item, a small temporary fix to the agent retry logic has been implemented and tested in QA-Tank.

The agent retry logic will no longer relaunch agents in pending status. These agents have already reported back to the controller and are waiting in pending status for its start command, so they will no longer be terminated when the time from agent initialization to receiving the start command exceeds three minutes. Only agents that are still stuck in starting status and have not reported back to the controller will be relaunched.
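The relaunch rule described above can be sketched roughly as follows. This is an illustrative Python sketch, not Tank's actual retry code: `AgentStatus`, `Agent`, `agents_to_relaunch`, and `START_TIMEOUT_SECONDS` are hypothetical names, and measuring the three-minute window from launch time is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class AgentStatus(Enum):
    STARTING = "starting"  # launched, has not yet reported back to the controller
    PENDING = "pending"    # reported back, waiting for the controller's start command
    RUNNING = "running"    # start command received


# Assumption: the three-minute window from the description, measured from launch.
START_TIMEOUT_SECONDS = 180


@dataclass
class Agent:
    instance_id: str
    status: AgentStatus
    seconds_since_launch: float


def agents_to_relaunch(agents, timeout=START_TIMEOUT_SECONDS):
    """Return only agents that are still in STARTING after the timeout.

    PENDING agents are never selected, however long they have waited,
    because they have already reported back to the controller.
    """
    return [
        agent
        for agent in agents
        if agent.status is AgentStatus.STARTING
        and agent.seconds_since_launch > timeout
    ]
```

For example, given one agent stuck in starting for 200 seconds, one in pending for 400 seconds, and one in starting for 60 seconds, only the first would be selected for relaunch.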

This is a short-term solution that has been validated in QA-Tank and will eventually be replaced when the agent retry logic is overhauled. This change also adds jobId values to the existing AgentWatchdog logs and adds two new logs to the controller (JobManager) to track agent behavior between registering agents and making the start call.
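As a rough illustration of the kind of logging described above, the sketch below tags each message with the job ID and logs once after agents are registered and once after the start call is made. It is a hypothetical Python sketch: the function `register_and_start`, the logger name, and the message wording are all assumptions and do not reproduce the actual JobManager or AgentWatchdog log lines.

```python
import logging
import time

# Hypothetical logger name; not the one used by Tank.
logger = logging.getLogger("job_manager_sketch")


def register_and_start(job_id, agent_ids, send_start, clock=time.monotonic):
    """Log around the gap between agent registration and the start call.

    `send_start` is a caller-supplied callable standing in for the
    controller's start command; `clock` is injectable for testing.
    Returns the elapsed seconds between the two log points.
    """
    registered_at = clock()
    logger.info(
        "jobId=%s registered %d agent(s): %s",
        job_id, len(agent_ids), ", ".join(agent_ids),
    )

    send_start(job_id)

    elapsed = clock() - registered_at
    logger.info(
        "jobId=%s start command sent %.1fs after registration",
        job_id, elapsed,
    )
    return elapsed
```

Prefixing every message with the job ID makes it possible to filter the logs for a single job and measure how long its agents waited between registration and the start command.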

Reference: Investigating Delay in Agent Start-Up

Example of the fix relaunching an agent stuck in starting to start the job:

[Image: Better Relaunch Example]
