microsoft / ga4gh-tes

C# implementation of the GA4GH TES API; provides distributed batch task execution on Microsoft Azure
MIT License

`helm` reliability improvements #738

Closed by BMurri 2 months ago

BMurri commented 2 months ago

Describe the bug

`helm` is proving less reliable than we had hoped. We need to improve timeouts, retries, or other mechanisms to make our deployments more reliable.

Steps to Reproduce

Deploy either TES or CoA, or update an existing deployment. A fair portion of deployments fail with errors like the one under Additional context.

Expected behavior

Anything that could succeed by retrying is retried some arbitrary number of times. A way to determine that success will not be possible, so that retrying can stop early, would be a nice-to-have.
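To make the expected behavior concrete, here is a minimal sketch of a bounded retry loop with exponential backoff and an early exit for failures that retrying cannot fix. It is written in Python for brevity and is not the deployer's actual code. The function name `run_with_retries`, its parameters, and the default attempt count and delay are all hypothetical.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_with_retries(
    cmd: Sequence[str],
    max_attempts: int = 5,          # assumed default, not a project setting
    base_delay: float = 2.0,        # assumed default, in seconds
    is_transient: Callable[[str], bool] = lambda output: True,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
    sleep: Callable[[float], None] = time.sleep,
) -> subprocess.CompletedProcess:
    """Run cmd, retrying failures that look transient, with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        result = run(list(cmd), capture_output=True, text=True)
        if result.returncode == 0:
            return result
        output = (result.stdout or "") + (result.stderr or "")
        # Give up when attempts are exhausted or the failure is judged permanent.
        if attempt == max_attempts or not is_transient(output):
            raise RuntimeError(
                f"{cmd[0]} failed after {attempt} attempt(s): {output.strip()}"
            )
        # Back off: base_delay, 2 * base_delay, 4 * base_delay, ...
        sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

The `is_transient` hook is where the nice-to-have would plug in: returning `False` for an error that cannot succeed stops the loop immediately instead of burning the remaining attempts. The `run` and `sleep` parameters exist only so the loop can be exercised without invoking `helm` or actually waiting.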

Additional context

2024-06-19T09:54:20.0640334Z HELM: Release "tesonazure" does not exist. Installing it now.
2024-06-19T09:54:30.4899473Z HELM: Error: 1 error occurred:
2024-06-19T09:54:30.4902335Z HELM:  * Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.tes.svc:443/validate?timeout=10s": context deadline exceeded
2024-06-19T09:54:30.4902788Z HELM: 
2024-06-19T09:54:30.4902955Z HELM: 
2024-06-19T09:54:30.4921424Z HELM: 
2024-06-19T09:54:30.4928399Z HELM: 
2024-06-19T09:54:30.4938635Z 
2024-06-19T09:54:30.4939020Z Exception: HELM ExitCode = 1
2024-06-19T09:54:30.4949153Z    at TesDeployer.KubernetesManager.ExecProcessAsync(String binaryFullPath, String tag, String command, CancellationToken cancellationToken, String workingDirectory, Boolean throwOnNonZeroExitCode)
2024-06-19T09:54:30.4950887Z    at TesDeployer.KubernetesManager.DeployHelmChartToClusterAsync(IKubernetes kubernetesClient)
2024-06-19T09:54:30.4951587Z    at TesDeployer.Deployer.PerformHelmDeploymentAsync(IResourceGroup resourceGroup, IEnumerable`1 manualPrecommands, Func`2 asyncTask)
2024-06-19T09:54:30.4952174Z    at TesDeployer.Deployer.DeployAsync()
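
The failure in this log looks like one a retry could fix: the cert-manager webhook did not answer within its 10s timeout. As a rough sketch (in Python for brevity, not the deployer's code), captured `helm` output could be checked for markers like these before deciding whether to retry. The marker list holds only strings taken from the log above and is certainly incomplete. The names `TRANSIENT_MARKERS` and `looks_transient` are hypothetical.

```python
# Substrings copied from the log above. Treating them as "retryable" is an
# assumption for illustration, not a verified list of transient helm errors.
TRANSIENT_MARKERS = (
    "failed calling webhook",
    "context deadline exceeded",
)


def looks_transient(helm_output: str) -> bool:
    """Return True if the output contains a marker we assume to be retryable."""
    lowered = helm_output.lower()
    return any(marker in lowered for marker in TRANSIENT_MARKERS)


if __name__ == "__main__":
    sample = (
        'HELM:  * Internal error occurred: failed calling webhook '
        '"webhook.cert-manager.io": failed to call webhook: Post '
        '"https://cert-manager-webhook.tes.svc:443/validate?timeout=10s": '
        "context deadline exceeded"
    )
    print(looks_transient(sample))
```

Output that matches none of the markers is treated as non-retryable here. Whether unknown errors should default to retry or to fail fast is a design choice this issue would need to settle.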