microsoft / azure-pipelines-tasks

Tasks for Azure Pipelines
https://aka.ms/tfbuild
MIT License

Service Fabric Powershell - random Exception from HRESULT: 0x80071C57 on connect to cluster #9351

Open radekgala opened 5 years ago

radekgala commented 5 years ago

Environment

Azure DevOps
account: krolldiscovery team: CoreServices build: Orion.Services-1.1_Maint Prereq Check

Agent

Various agents on version 2.142.1. Most of them run Windows 2012 R2 or Windows 2016.

Issue Description

We are running the Service Fabric PowerShell task (version 1.0.19) and intermittently fail to connect to the SF cluster. The failure rate is about 10%. Re-deploying without any changes fixes the problem.

Task logs

tasklog_6.zip

Error logs

2019-01-17T12:39:21.1634047Z Imported cluster client certificate with thumbprint '7796B9B50E603A76FAB50D1312B95EA27E8CB764'.
2019-01-17T12:39:25.1634559Z ##[debug]System.Fabric.FabricException: An error occurred during this operation. Please check the trace logs for more details. ---> System.Runtime.InteropServices.COMException: Exception from HRESULT: 0x80071C57
2019-01-17T12:39:25.1634559Z ##[debug] at System.Fabric.Interop.NativeClient.IFabricClientSettings2.SetSecurityCredentials(IntPtr credentials)
2019-01-17T12:39:25.1634559Z ##[debug] at System.Fabric.FabricClient.SetSecurityCredentialsInternal(SecurityCredentials credentials)
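Since a plain re-deploy clears the failure, one way to mask it in the meantime is to retry the connect step automatically. The following is a minimal sketch of that idea, not part of the task: `retry` and `TransientConnectError` are hypothetical names, and the operation passed in stands for whatever performs the cluster connection.

```python
import time


class TransientConnectError(Exception):
    """Hypothetical stand-in for the FabricException seen in the log."""


def retry(operation, attempts=3, base_delay=1.0, retry_on=(Exception,)):
    """Call operation() until it succeeds or the attempts run out.

    Sleeps base_delay * 2**n seconds after the n-th failure (n starts at 0)
    and re-raises the last error once every attempt has failed.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            time.sleep(base_delay * (2 ** attempt))
```

A retry only hides the symptom. It does not explain why the connection fails in the first place.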

tejasd1990 commented 5 years ago

Hi @radekgala, I suppose you are using your own agents and not Hosted, is that correct? Also, if so, are multiple builds/releases executing the task on the machine in parallel? Thanks

radekgala commented 5 years ago

All agents are our own. We are not using hosted agents because our SF clusters are not public. Usually a physical host has 4 VSTS agents installed, and it is possible that several releases to different environments run on the same physical host in parallel. Do you think it makes sense to leave only 1 active VSTS agent per physical machine? I can run that experiment.

tejasd1990 commented 5 years ago

Yes. We could not reproduce the issue, so it would be helpful to know whether you hit the same issue with 1 agent/machine.

radekgala commented 5 years ago

I ran the experiment and it helped: 3 releases, each to about 50 environments. The error disappeared with 1 agent/machine.

Are you going to fix it? We are not able to limit ourselves to 1 agent/machine. We have too many releases.

tejasd1990 commented 5 years ago

Hi, we were able to repro the issue by running the task in parallel on multiple agents. It looks like a race condition while connecting to the same cluster.

I can think of a couple of approaches to solve this. One is to add a checkbox to the task inputs. When this option is checked, the task won't connect to the cluster (connecting also adds the service endpoint certificate to the certificate store of the user the agent runs as). The task also won't clean up, i.e. it won't remove the certificate from the store when it finishes. It would then be up to the task user to install the certificate and connect to the cluster before any task executions, and to remove the certificate after all task executions are done. Are you fine with this approach? We can discuss this offline if you want. Please let us know your thoughts. Thanks.
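To illustrate the kind of race described above, here is a minimal sketch of serializing a critical section across processes on one machine with a lock file. It is an assumption for illustration only, not how the task behaves or will be fixed: `machine_lock` and the lock path are hypothetical, and the body of the `with` block stands for the certificate import, connect, and cleanup steps.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def machine_lock(path, timeout=60.0, poll=0.1):
    """Hold an exclusive lock represented by a lock file at `path`.

    os.O_CREAT | os.O_EXCL makes file creation atomic, so only one process
    can create the file; the others poll until it is removed or they time out.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {path}")
            time.sleep(poll)
    try:
        yield
    finally:
        # Release the lock so the next waiting process can proceed.
        os.close(fd)
        os.remove(path)
```

With this in place, each agent process would wrap its connect step in `with machine_lock(some_shared_path): ...`, so only one runs at a time. A process that crashes while holding the lock leaves the file behind, so a real implementation would also need stale-lock handling.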

radekgala commented 5 years ago

I will accept any solution that fixes my problem :) For now we have set a property on one VSTS agent per host and added a demand to the pipelines so that releases run on only one VSTS agent per host.

bishal-pdMSFT commented 5 years ago

@radekgala, since you already have a workaround, we are planning to take this as an enhancement in our next wave, which starts in April. To rephrase what @tejasd1990 said earlier, here is how the fix would look to support multiple parallel deployments:

  1. The agent machine should have the SF client certificate pre-installed.
  2. The task will have an option that directs it not to install the certificate. It will get the cluster endpoint and certificate thumbprint from the service endpoint, and then search for the certificate in the cert store.
  3. After deployment, the task will not clean up the certificate.

Note: The user will have to take care of rotating the certificate on the machine (if needed).

leantk commented 4 years ago

This issue is stale because it has been open for a year with no activity. Remove the stale label or comment on the issue, otherwise this will be closed in 5 days.