microsoft / ga4gh-tes

C# implementation of the GA4GH TES API; provides distributed batch task execution on Microsoft Azure
MIT License

Upgrades may leave the AAD pod identity chart in place if the cluster was temporarily busy #736

Open BMurri opened 2 months ago

BMurri commented 2 months ago

Describe the bug During upgrades (for both CoA and TES deployments), the following message may be observed, after which the feature (the AAD pod identity chart) is left in place forever; no later upgrade attempts to discover or remove it:
HELM: Error: Kubernetes cluster unreachable: Get "http://localhost:8080/version": dial tcp [::1]:8080: connect: connection refused

Steps to Reproduce

  1. Deploy any ga4gh-tes release v5.2.1 or earlier or CoA release v5.2.0 or earlier.
  2. Upgrade to any release later than v5.2.1 using the command-line option --DebugLogging true
  3. If the above message appeared, or if the DebugLogging option was not used and the feature is still configured, run the deployer in upgrade mode again (same or any newer release).
  4. Note that the feature remains in place.

Expected behavior If helm is unable to connect to the cluster, the deployer should retry (some arbitrary but bounded number of times). Other messages should continue to be ignored as before. Because multiple releases have shipped with this issue, the deployer should either test for the continued presence of the feature and retry its removal, or always attempt removal on upgrade.
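
The retry behavior requested above could look roughly like the sketch below. It is written in Python only for brevity (the deployer itself is C#), and it is not the project's implementation. The names `remove_with_retry` and `run_removal` and the attempt and delay values are hypothetical. The only detail taken from this report is the "Kubernetes cluster unreachable" text in the helm error. The helm invocation itself is deliberately left to the caller.

```python
import time
from typing import Callable, Tuple

# Substring taken from the helm error quoted in this issue.
UNREACHABLE_MARKER = "Kubernetes cluster unreachable"


def remove_with_retry(
    run_removal: Callable[[], Tuple[int, str]],  # hypothetical: returns (exit code, stderr)
    max_attempts: int = 5,        # assumed value, "some arbitrary number of times"
    delay_seconds: float = 2.0,   # assumed base delay between attempts
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run the removal step, retrying only when helm cannot reach the cluster.

    Returns True once the command ran against a reachable cluster (whether it
    succeeded or failed with some other, ignorable message), and False if the
    cluster stayed unreachable for every attempt.
    """
    for attempt in range(1, max_attempts + 1):
        exit_code, stderr = run_removal()
        if exit_code != 0 and UNREACHABLE_MARKER in stderr:
            # Transient connectivity failure: wait a little longer each time.
            if attempt < max_attempts:
                sleep(delay_seconds * attempt)
            continue
        # Success, or an error unrelated to connectivity: ignored as before.
        return True
    return False
```

A `False` result is the case that currently goes unnoticed. A caller could record it so that the next upgrade attempts the removal again, or simply attempt removal on every upgrade.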