microsoft / azure-pipelines-extensions

Collection of all RM and deployment extensions
http://www.visualstudio.com/explore/release-management-vs
MIT License

[BUG]: Ansible task - Persistent Node.exe response waiting during transient network issues between Agent machine and Ansible machine. #1229

Open rikat-ms opened 1 week ago

Extension name

Ansible

Extension version

0.230.2

Issue Description

We run Ansible on a remote machine and access it from the Ansible task in Azure Pipelines with the "ansibleInterface: 'remoteMachine'" setting. In this scenario, node.exe starts, establishes an SSH connection to the remote machine, and executes commands over that connection.

However, if a transient network disruption occurs between the agent machine and the Ansible machine while node.exe is waiting for Ansible to report that the command has completed, node.exe keeps waiting indefinitely. The job is eventually cancelled when it reaches the job's timeout limit.

This issue needs to be addressed to prevent unnecessary job cancellations and to improve the robustness of the system against transient network issues.
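
One possible way to bound the wait is an SSH-level keepalive: send a probe at a fixed interval and treat the connection as dead after several consecutive probes go unanswered. The sketch below shows only the counting logic. `KeepaliveTracker`, `probeSent`, and `replyReceived` are hypothetical names for illustration and are not taken from the task's source. If the task's SSH client is the ssh2 package (an assumption, not verified here), its `keepaliveInterval` and `keepaliveCountMax` connection options provide comparable behaviour without custom code.

```typescript
// Hypothetical helper: counts consecutive unanswered keepalive probes.
// None of these names come from the Ansible task's source.
class KeepaliveTracker {
  private missed = 0;

  constructor(private readonly maxMissed: number) {}

  // Call each time a probe is sent. Returns true once more than
  // `maxMissed` probes in a row have gone unanswered.
  probeSent(): boolean {
    this.missed += 1;
    return this.missed > this.maxMissed;
  }

  // Call whenever the remote side answers; resets the counter.
  replyReceived(): void {
    this.missed = 0;
  }
}

// Simulated timeline: the first probe is answered, then the network drops.
const tracker = new KeepaliveTracker(3);
const timeline: boolean[] = [];
timeline.push(tracker.probeSent()); // probe 1
tracker.replyReceived();            // probe 1 answered
for (let i = 0; i < 4; i++) {
  timeline.push(tracker.probeSent()); // probes 2 to 5, all unanswered
}
console.log(timeline);
```

In a real client, `probeSent` would be driven by a timer, and a `true` result would close the connection and fail the task with an explicit error instead of letting it wait for the job timeout.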

Environment type (Please select at least one environment where you face this issue)

Azure DevOps Server type

dev.azure.com (formerly visualstudio.com)

Azure DevOps Server Version (if applicable)

N/A

Operating system

Windows 10 for the agent / Linux for Ansible

Relevant log output

// Ansible's log
On the remote machine, the commands completed after about 39 minutes (the log line below reports an elapsed time of 00:39:14). (Roughly translated from Japanese to English)
----------
[2024-06-11 15:44:38] [2024-06-11 15:44:38] Job "xxxxx"."xxxxx" completed successfully, Tuesday June 11th 15:44:36 2024, elapsed 0 00:39:14.
----------

// Pipeline log
We can see that the job was just canceled without any errors.
----------
2024-06-11T06:06:28.4122622Z ##[section]Starting: Run ******
2024-06-11T06:06:28.4289370Z ==============================================================================
2024-06-11T06:06:28.4290181Z Task         : Ansible
2024-06-11T06:06:28.4290734Z Description  : This task executes an Ansible playbook using a specified inventory via command line interface
2024-06-11T06:06:28.4291266Z Version      : 0.230.2
2024-06-11T06:06:28.4291712Z Author       : Microsoft Corporation
2024-06-11T06:06:28.4292277Z Help         : [More Information](https://go.microsoft.com/fwlink/?linkid=853835)
2024-06-11T06:06:28.4292835Z ==============================================================================
2024-06-11T06:06:29.2645926Z Trying to setup SSH connection to ***@10.***.***.4:22
.....

2024-06-11T06:06:30.6515686Z
2024-06-11T07:06:22.7701158Z ##[error]The operation was canceled.
----------

// Agent's Worker.log
We can see that node.exe was started and then killed, together with its child processes, because of the job cancellation request.
----------
[2024-06-11 06:06:28Z INFO ProcessInvokerWrapper] Starting process:
[2024-06-11 06:06:28Z INFO ProcessInvokerWrapper]  File name: 'D:\agent\selfagent01\externals\node\bin\node.exe'
[2024-06-11 06:06:28Z INFO ProcessInvokerWrapper]  Arguments: '"D:\agent\selfagent01\_work\_tasks\Ansible_6f650d20-9c5d-4cce-ad66-e68742ceadf5\0.230.2\main.js"'
.....

[2024-06-11 06:06:28Z INFO ProcessInvokerWrapper] Process started with process id 17228, waiting for process exit.
.....

[2024-06-11 06:06:44Z INFO JobServerQueue] Stop aggressive process web console line queue.
[2024-06-11 07:06:22Z INFO Worker] Cancellation/Shutdown message received.
[2024-06-11 07:06:22Z INFO ExpressionManager] Evaluating: SucceededNode()
[2024-06-11 07:06:22Z INFO ExpressionManager] Result: False
[2024-06-11 07:06:22Z INFO StepsRunner] Cancel current running step.
[2024-06-11 07:06:22Z INFO ProcessInvokerWrapper] Sending CTRL_C to process 17228.
[2024-06-11 07:06:22Z INFO ProcessInvokerWrapper] Successfully sent CTRL_C to process 17228.
[2024-06-11 07:06:22Z INFO ProcessInvokerWrapper] Waiting for process exit or 7.5 seconds after CTRL_C signal fired.
[2024-06-11 07:06:22Z INFO ProcessInvokerWrapper] Ignore Ctrl+C to current process.
[2024-06-11 07:06:22Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished.
[2024-06-11 07:06:22Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished.
[2024-06-11 07:06:22Z INFO ProcessInvokerWrapper] Kill entire process tree since both cancel and terminate signal has been ignored by the target process.
[2024-06-11 07:06:22Z INFO ProcessInvokerWrapper] Exited process 17228 with exit code -1073741510
[2024-06-11 07:06:22Z INFO ProcessInvokerWrapper] Finished process 17228 with exit code -1073741510, and elapsed time 00:59:53.7050483.
----------
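
The exit code -1073741510 in the Worker.log above is the signed 32-bit form of the Windows status code 0xC000013A (STATUS_CONTROL_C_EXIT), which is consistent with the agent terminating node.exe after sending CTRL_C. A minimal sketch of the conversion follows; the helper name `toNtStatusHex` is made up for illustration.

```typescript
// Hypothetical helper: reinterpret a signed 32-bit process exit code
// as an unsigned value and format it as 8 hex digits.
function toNtStatusHex(exitCode: number): string {
  // `>>> 0` converts the number to an unsigned 32-bit integer.
  const unsigned = exitCode >>> 0;
  const digits = unsigned.toString(16).toUpperCase();
  // Left-pad to 8 digits without relying on newer string helpers.
  return "0x" + ("00000000" + digits).slice(-8);
}

const hex = toNtStatusHex(-1073741510);
console.log(hex); // prints "0xC000013A"
```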

Full task logs with system.debug enabled

N/A

Repro steps

1) Configure a self-hosted agent on your machine.
2) Install Ansible on a different machine from the agent.
3) Create a pipeline that uses the self-hosted agent and includes an Ansible task like the one below. The issue is easier to reproduce if the playbook takes a long time to complete.
    - task: Ansible@0
      inputs:
        ansibleInterface: 'remoteMachine'
        connectionOverSsh: 'connectionToAnsible'
        playbookSourceRemoteMachine: 'ansibleMachine'
        playbookPathAnsibleMachineOnRemoteMachine: *****.yml
        inventoriesRemoteMachine: 'file'
        inventoryFileSourceRemoteMachine: 'ansibleMachine'
        inventoryFileAnsibleMachineOnRemoteMachine: *****.txt
        args: --extra-vars "*****"
        failOnStdErr: false
      displayName: *****

4) Run the pipeline.
5) While the Ansible task is running, interrupt the network connection between the two machines.