microsoft / AzureTRE

An accelerator to help organizations build Trusted Research Environments on Azure.
https://microsoft.github.io/AzureTRE
MIT License

CI/CD pipeline fails semi-randomly, but fails every time #3906

Open TonyWildish-BH opened 3 months ago

TonyWildish-BH commented 3 months ago

Describe the bug The CI/CD pipeline will not run to successful completion in my environment. I've triggered it manually, >25 times, with a fresh configuration every time, and not one run has completed all the steps successfully.

Sometimes it fails during the actual deployment; other times it deploys but fails during the E2E tests. Failure modes vary, but are normally some variation on 'connection failed' or a timeout.

This is in our fork, with no code changes with respect to the upstream repository, and nothing other than the config file changed between runs.

I'm sure this is not normal behaviour, but I have no idea how to address it. Everything is happening between GitHub and Azure, with no on-prem resources. Even the gh CLI is being run from an Azure VM, so it's hard to see how any network issues could be contributing to this.

I've attached a zip (failed.zip) of the failed logs, in case that helps.

Steps to reproduce

  1. Populate a config.yaml file with unique values for the TRE ID, the management resource group, storage account, and so on.
  2. Run make auth to update with new app roles.
  3. Update the secrets/environment variables in the GitHub CI/CD environment.
  4. Trigger the workflow through the gh CLI (see the sketch after this list).
  5. Wait for the workflow to end, then harvest the log files of the failed jobs.
  6. Go to step 1.
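
For reference, a minimal sketch of steps 4-5 with the gh CLI; the workflow file name deploy_tre.yml is an assumption based on the upstream repo layout, and the run ID handling is illustrative:

# Trigger the deployment workflow (workflow file name is an assumption).
gh workflow run deploy_tre.yml --ref main

# Find the latest run of that workflow and wait for it to finish.
RUN_ID=$(gh run list --workflow deploy_tre.yml --limit 1 --json databaseId --jq '.[0].databaseId')
gh run watch "$RUN_ID"

# Harvest the logs of the failed jobs only.
gh run view "$RUN_ID" --log-failed > failed.log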

Azure TRE release version (e.g. v0.14.0 or main): main, as of April 10th.

Deployed Azure TRE components - click the (i) in the UI: n/a

tim-allen-ck commented 3 months ago

Hi @TonyWildish-BH, let me look through the log files and get back to you.

SvenAelterman commented 3 months ago

@TonyWildish-BH I wonder if you've ever tried just re-running the pipeline with the same values/secrets?

TRE is a complex deployment and sometimes things "happen" on the Azure side that cause it to fail.

TonyWildish-BH commented 3 months ago

Hi @SvenAelterman. Yes, I've done that a few times, and it didn't go through either. In fact, that's why I tried systematically banging away at it, to make sure I had a clean start every time.

As I mentioned, the errors are quite often of the sort where a retry might help, but I don't expect a pipeline to fail so frequently with that sort of error, so I'm wondering what's going on.

marrobi commented 3 months ago

@TonyWildish-BH, when you get to step 6, can I suggest you go back to step 4 instead, to see if you get a consistent error that we can then troubleshoot? Don't start at the beginning again each time.

As @SvenAelterman says, there are some things that will fail from time to time, but a rerun of the pipeline usually resolves them.
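
For what it's worth, a rerun of only the failed jobs of an existing run can be triggered with the gh CLI, for example (the run ID below is a placeholder):

gh run list --limit 5              # find the ID of the failed run
gh run rerun 1234567890 --failed   # rerun only the failed jobs of that run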

TonyWildish-BH commented 3 months ago

@marrobi, please see my previous comment: I've tried that a few times, and it's never gone all the way through. I'll try again, just for good measure.

However, even if it did work on a retry, that's missing the point. A CI/CD pipeline that doesn't run reliably is broken, and not fit for purpose. I'm trying to determine if the failure is because of something on our side, or if it's because Azure is fundamentally unreliable, or whatever else could be the cause.

I don't really see how it can be on our side, since everything's happening between GitHub and Azure, but I'm open to that possibility. However, both you and @SvenAelterman seem to be telling me that Azure is unreliable, which I hope is not the case.

SvenAelterman commented 3 months ago

I don't mean to give that impression at all. The deployment of Azure TRE is complex, with a lot of dependencies and moving parts. Perhaps the Terraform or pipeline code could be improved to handle those better, etc.

It's extraordinary, I am sure, to have so many consecutive failures (and in different places, no less). However, once the initial deployment is done, subsequent runs of the pipeline are much simpler and much less prone to experiencing issues.

Just curious, have you tried the manual deployment process?

PS: The automated end-to-end testing performed for pull requests relies on those same pipelines (IIRC), so they're used all the time.

TonyWildish-BH commented 3 months ago

Here's my first set of CI/CD retry attempts, and this is a hard fail I've seen before when running manually. The first pass hits an unexpected error creating the database locks; subsequent passes fail because the locks are there but have not been imported into the Terraform state:

Attempt #1:

│ Error: Provider produced inconsistent result after apply
│ 
│ When applying changes to azurerm_management_lock.mongo[0], provider "provider[\"registry.terraform.io/hashicorp/azurerm\"]" produced an unexpected new value: Root resource was present, but now absent.
│ 
│ This is a bug in the provider, which should be reported in the provider's own issue tracker.
╵
╷
│ Error: Provider produced inconsistent result after apply
│ 
│ When applying changes to azurerm_management_lock.tre_db[0], provider "provider[\"registry.terraform.io/hashicorp/azurerm\"]" produced an unexpected new value: Root resource was present, but now absent.
│ 
│ This is a bug in the provider, which should be reported in the provider's own issue tracker.
╵
Script done.
Terraform Error
make: *** [Makefile:110: deploy-core] Error 1

Attempt #2:

azurerm_management_lock.tre_db[0]: Creating...
azurerm_management_lock.mongo[0]: Creating...
╷
│ Error: A resource with the ID "/subscriptions/87ad76be-f07f-4c25-b344-9a37c52d9c66/resourceGroups/rg-***/providers/Microsoft.DocumentDB/databaseAccounts/cosmos-mongo-***/mongodbDatabases/porter/providers/Microsoft.Authorization/locks/mongo-lock" already exists - to be managed via Terraform this resource needs to be imported into the State. Please see the resource documentation for "azurerm_management_lock" for more information.
│ 
│   with azurerm_management_lock.mongo[0],
│   on cosmos_mongo.tf line 49, in resource "azurerm_management_lock" "mongo":
│   49: resource "azurerm_management_lock" "mongo" ***
│ 
╵
╷
│ Error: A resource with the ID "/subscriptions/87ad76be-f07f-4c25-b344-9a37c52d9c66/resourceGroups/rg-***/providers/Microsoft.DocumentDB/databaseAccounts/cosmos-***/sqlDatabases/AzureTRE/providers/Microsoft.Authorization/locks/tre-db-lock" already exists - to be managed via Terraform this resource needs to be imported into the State. Please see the resource documentation for "azurerm_management_lock" for more information.
│ 
│   with azurerm_management_lock.tre_db[0],
│   on statestore.tf line 49, in resource "azurerm_management_lock" "tre_db":
│   49: resource "azurerm_management_lock" "tre_db" ***
│ 
╵
Script done.
Terraform Error
make: *** [Makefile:110: deploy-core] Error 1

Attempt #3:

azurerm_management_lock.mongo[0]: Creating...
azurerm_management_lock.tre_db[0]: Creating...
╷
│ Error: A resource with the ID "/subscriptions/87ad76be-f07f-4c25-b344-9a37c52d9c66/resourceGroups/rg-***/providers/Microsoft.DocumentDB/databaseAccounts/cosmos-mongo-***/mongodbDatabases/porter/providers/Microsoft.Authorization/locks/mongo-lock" already exists - to be managed via Terraform this resource needs to be imported into the State. Please see the resource documentation for "azurerm_management_lock" for more information.
│ 
│   with azurerm_management_lock.mongo[0],
│   on cosmos_mongo.tf line 49, in resource "azurerm_management_lock" "mongo":
│   49: resource "azurerm_management_lock" "mongo" ***
│ 
╵
╷
│ Error: A resource with the ID "/subscriptions/87ad76be-f07f-4c25-b344-9a37c52d9c66/resourceGroups/rg-***/providers/Microsoft.DocumentDB/databaseAccounts/cosmos-***/sqlDatabases/AzureTRE/providers/Microsoft.Authorization/locks/tre-db-lock" already exists - to be managed via Terraform this resource needs to be imported into the State. Please see the resource documentation for "azurerm_management_lock" for more information.
│ 
│   with azurerm_management_lock.tre_db[0],
│   on statestore.tf line 49, in resource "azurerm_management_lock" "tre_db":
│   49: resource "azurerm_management_lock" "tre_db" ***
│ 
╵
Script done.
Terraform Error
make: *** [Makefile:110: deploy-core] Error 1
Error: Process completed with exit code 2.

I'll clean up and try again to see what happens if I can get past this, which I often do.

tim-allen-ck commented 3 months ago

I've seen the Mongo lock error before. Try removing the lock in Azure and then rerunning the pipeline to let Terraform create the lock.
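
For example, a rough sketch of both recovery options, assuming the lock resource IDs from the errors above (subscription, resource group and account names are placeholders) and that the commands are run against the same Terraform working directory and backend config the pipeline uses:

# Option A: delete the orphaned locks so the next pipeline run can recreate them.
az lock delete --ids "/subscriptions/<sub-id>/resourceGroups/rg-<TRE_ID>/providers/Microsoft.DocumentDB/databaseAccounts/cosmos-mongo-<TRE_ID>/mongodbDatabases/porter/providers/Microsoft.Authorization/locks/mongo-lock"
az lock delete --ids "/subscriptions/<sub-id>/resourceGroups/rg-<TRE_ID>/providers/Microsoft.DocumentDB/databaseAccounts/cosmos-<TRE_ID>/sqlDatabases/AzureTRE/providers/Microsoft.Authorization/locks/tre-db-lock"

# Option B: import the existing locks into the Terraform state instead.
terraform import 'azurerm_management_lock.mongo[0]' "/subscriptions/<sub-id>/resourceGroups/rg-<TRE_ID>/providers/Microsoft.DocumentDB/databaseAccounts/cosmos-mongo-<TRE_ID>/mongodbDatabases/porter/providers/Microsoft.Authorization/locks/mongo-lock"
terraform import 'azurerm_management_lock.tre_db[0]' "/subscriptions/<sub-id>/resourceGroups/rg-<TRE_ID>/providers/Microsoft.DocumentDB/databaseAccounts/cosmos-<TRE_ID>/sqlDatabases/AzureTRE/providers/Microsoft.Authorization/locks/tre-db-lock"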

tim-allen-ck commented 2 months ago

@TonyWildish-BH are you still having this issue?

TonyWildish-BH commented 2 months ago

Hi @tim-allen-ck. I've abandoned use of the pipeline with no successful resolution; these errors make it unusable for us. We'll have to find some other solution when we come to using the TRE in production.

tim-allen-ck commented 2 months ago

> Hi @tim-allen-ck. I've abandoned use of the pipeline with no successful resolution; these errors make it unusable for us. We'll have to find some other solution when we come to using the TRE in production.

Hi @TonyWildish-BH, we recommend using the deployment repo; this will avoid unnecessary errors with the E2E tests.
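
For reference, and assuming the deployment repo referred to is microsoft/AzureTRE-Deployment (a template repository that wraps the upstream code), a new project could be started from it with the gh CLI; the target repo name below is a placeholder:

# Create a private repo from the deployment template and clone it locally.
gh repo create my-org/my-tre-deployment --template microsoft/AzureTRE-Deployment --private --clone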

TonyWildish-BH commented 2 months ago

It's not just about the E2E tests; the hard fail above happens well before the TRE is fully deployed. If the deployment repo uses the same CI/CD pipeline, that's not going to help.