hashicorp / terraform-provider-azurerm

Terraform provider for Azure Resource Manager
https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs
Mozilla Public License 2.0

azurerm_container_app - Cannot deploy container with ingress enabled #20435

Closed btbaetwork closed 6 months ago

btbaetwork commented 1 year ago

Terraform Version

1.3.8

AzureRM Provider Version

3.43.0

Affected Resource(s)/Data Source(s)

azurerm_container_app

Terraform Configuration Files

# this is just the example script taken from https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/container_app with the "ingress"-block added
terraform {
  required_providers {
  }
}
provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "example" {
  name     = "example-resources"
  location = "West Europe"
}

resource "azurerm_log_analytics_workspace" "example" {
  name                = "acctest-01"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
  sku                 = "PerGB2018"
  retention_in_days   = 30
}

resource "azurerm_container_app_environment" "example" {
  name                       = "Example-Environment"
  location                   = azurerm_resource_group.example.location
  resource_group_name        = azurerm_resource_group.example.name
  log_analytics_workspace_id = azurerm_log_analytics_workspace.example.id
}

resource "azurerm_container_app" "example" {
  name                         = "example-app"
  container_app_environment_id = azurerm_container_app_environment.example.id
  resource_group_name          = azurerm_resource_group.example.name
  revision_mode                = "Single"

  template {
    container {
      name   = "examplecontainerapp"
      image  = "mcr.microsoft.com/azuredocs/containerapps-helloworld:latest"
      cpu    = 0.25
      memory = "0.5Gi"
    }
  }

  ingress {
    external_enabled = true
    target_port      = 80
    traffic_weight {
      percentage = 100
    }
  }
}

Debug Output/Panic Output

azurerm_container_app.example: Still creating... [9m50s elapsed]
azurerm_container_app.example: Still creating... [10m0s elapsed]
azurerm_container_app.example: Still creating... [10m10s elapsed]
azurerm_container_app.example: Still creating... [10m20s elapsed]
╷
│ Error: creating Container App (Subscription: "7eac1563-570f-4dc5-bcc0-2f057ad0cff0"
│ Resource Group Name: "example-resources"
│ Container App Name: "example-app"): polling after CreateOrUpdate: Code="ContainerAppOperationError" Message="Failed to provision revision for container app 'example-app'. Error details: Operation expired."
│
│   with azurerm_container_app.example,
│   on main.tf line 31, in resource "azurerm_container_app" "example":
│   31: resource "azurerm_container_app" "example" {
│
│ creating Container App (Subscription: "7eac1563-570f-4dc5-bcc0-2f057ad0cff0"
│ Resource Group Name: "example-resources"
│ Container App Name: "example-app"): polling after CreateOrUpdate: Code="ContainerAppOperationError" Message="Failed to provision revision for container app 'example-app'. Error details: Operation expired."

Expected Behaviour

Container App should be successfully provisioned and be publicly reachable using the FQDN

Actual Behaviour

Steps to Reproduce

just apply the config above and wait for it to timeout :-)

Important Factoids

No response

References

No response

xiaxyi commented 1 year ago

Thanks @btbaetwork for raising this issue. I need to find out the minimum timeout for environment creation. Will update once confirmed.

ggeorgovassilis commented 1 year ago

If I may add another data point: container app env is created after about 12min. Container app creation starts immediately after that. I'm creating one container app with the new resource provider:

resource "azurerm_container_app" "containerapp-helloworld" {
  name                         = "containerapp-helloworld-${var.sfx}"
  container_app_environment_id = azurerm_container_app_environment.containerapp-environment.id
  resource_group_name          = azurerm_resource_group.rgcontainers.name
  revision_mode                = "Single"

  template {
    container {
      name   = "simple-hello-world-container"
      image  = "mcr.microsoft.com/azuredocs/containerapps-helloworld:latest"
      cpu    = 0.25
      memory = "0.5Gi"
    }
    min_replicas = 1
    max_replicas = 1
  }
  ingress {
    external_enabled           = true
    allow_insecure_connections = true
    target_port                = 80
    traffic_weight {
      percentage = 100
    }
  }
}

and one the old azapi way:

resource "azapi_resource" "containerapp-apache" {
  type      = "Microsoft.App/containerapps@2022-03-01"
  name      = "containerapp-apache-${var.sfx}"
  parent_id = azurerm_resource_group.rgcontainers.id
  location  = azurerm_resource_group.rgcontainers.location

  body = jsonencode({
    properties = {
      managedEnvironmentId = azurerm_container_app_environment.containerapp-environment.id
      configuration = {
        ingress = {
          external : true,
          allowInsecure : true,
          targetPort : 80
        },

      }
      template = {
        containers = [
          {
            image = "registry.hub.docker.com/library/httpd:2.4"
            name  = "apache-container"
            resources = {
              cpu    = 0.25
              memory = "0.5Gi"
            }
          }
        ]
        scale = {
          minReplicas = 1,
          maxReplicas = 1
        }
      }
    }

  })
  #  depends_on = [azapi_resource.containerapp-environment]
}

The first fails with a timeout (although the resource is created fine), the second succeeds.

jpinsolle-bc commented 1 year ago

Same here, to be more precise, in my case, a creation without the ingress part works. And after that if you update your container app with an ingress it works too (no timeout). Timeout occurs when you send for the first time a definition with an ingress.
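A minimal sketch of the two-step procedure described above, assuming a hypothetical boolean variable `enable_ingress` (not part of the original config) that toggles the ingress block through a `dynamic` block. The other names are taken from the reproduction config at the top of this issue, and the resource group and environment defined there are assumed to exist:

```hcl
# Hypothetical toggle, not part of the original report.
variable "enable_ingress" {
  type    = bool
  default = false
}

resource "azurerm_container_app" "example" {
  name                         = "example-app"
  container_app_environment_id = azurerm_container_app_environment.example.id
  resource_group_name          = azurerm_resource_group.example.name
  revision_mode                = "Single"

  template {
    container {
      name   = "examplecontainerapp"
      image  = "mcr.microsoft.com/azuredocs/containerapps-helloworld:latest"
      cpu    = 0.25
      memory = "0.5Gi"
    }
  }

  # Rendered zero times while enable_ingress is false, once when it is true.
  dynamic "ingress" {
    for_each = var.enable_ingress ? [1] : []
    content {
      external_enabled = true
      target_port      = 80
      traffic_weight {
        percentage = 100
      }
    }
  }
}
```

The first run would be a plain `terraform apply`, the second `terraform apply -var="enable_ingress=true"`, so the ingress arrives as an in-place update rather than as part of the initial create.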

btbaetwork commented 1 year ago

Can confirm what @jpinsolle-betclic wrote. Also these timings may be of interest:

Apart from that, I noticed the following (maybe related, maybe a different error): when the Container App is up and running with ingress activated (following @jpinsolle-betclic's procedure), every subsequent "terraform apply" updates the already existing app even though no changes have been made to the Terraform code:

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  # azurerm_container_app.example will be updated in-place
  ~ resource "azurerm_container_app" "example" {
        id                            = "/subscriptions/7eac1563-570f-4dc5-bcc0-2f057ad0cff0/resourceGroups/example-resources/providers/Microsoft.App/containerApps/example-app"
        name                          = "example-app"
        tags                          = {}
        # (8 unchanged attributes hidden)

      ~ ingress {
            # (5 unchanged attributes hidden)

          ~ traffic_weight {
              ~ percentage      = 0 -> 100
                # (1 unchanged attribute hidden)
            }
          - traffic_weight {
              - latest_revision = false -> null
              - percentage      = 100 -> null
              - revision_suffix = "nqhtr2u" -> null
            }
        }

        # (1 unchanged block hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.

This is not the case when the AzApi-provider is used

@ggeorgovassilis: you wrote that "The first fails with a timeout (although the resource is created fine)" - is there really a running container inside your Container App that is reachable from "outside"? The resource itself looks fine to me as well at first glance, but in reality it doesn't run anything...

ggeorgovassilis commented 1 year ago

is there really a running container inside your Container App

@btbaetwork I spoke hastily and with little knowledge. In fact, the container is created but ingress returns 404.

darrencrossley commented 1 year ago

I can also confirm the same findings: creation in two steps (without ingress, then adding it) is fast, ~20-30s in all cases. Creating with ingress enabled leads to a timeout after around 10-15 minutes and a deployment failure, usually with revision status 'unknown', and the whole Azure Container Apps UI is somewhat unresponsive until the app is deleted. On some occasions I had to delete the app via the CLI because the portal wouldn't load the app state.

This was on a new empty container apps env (created ~1hr previous) and the container apps env had VNET integration via the infrastructure_subnet_id property.

ThijSlim commented 1 year ago

I got the same issue. With revision mode set to Multiple I do see the container deployed in the Azure Portal, but the revision status is stuck on "In Progress" and Terraform times out:

resource "azurerm_container_app" "containerapp-helloworld" {
  revision_mode                = "Multiple"
}

kopaka808 commented 1 year ago

I can observe the same behaviour, thanks to this thread I can at least pinpoint it to the ingress block - I was desperately focusing on the registry block, assuming my ACR is not accessible to pull the image from. Thanks for providing this workaround with the 2-step deployment!

Once I include the ingress block during initial deployment of my container_app resource (as opposed to adding it to an existing container_app resource in a separate run), I run into the same timeout mentioned above (ignoring the settings of the timeout block btw), resulting in a Container App that has no container, is not included in my TF state file and therefore can only be deleted manually from the Azure Portal.
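For reference, a `timeouts` block on this resource would look roughly like the sketch below. The durations are assumed values chosen for illustration, and the rest is copied from the reproduction config at the top of this issue. As noted above, the failure was reported to occur regardless of these settings, so this is not expected to work around the problem:

```hcl
resource "azurerm_container_app" "example" {
  name                         = "example-app"
  container_app_environment_id = azurerm_container_app_environment.example.id
  resource_group_name          = azurerm_resource_group.example.name
  revision_mode                = "Single"

  template {
    container {
      name   = "examplecontainerapp"
      image  = "mcr.microsoft.com/azuredocs/containerapps-helloworld:latest"
      cpu    = 0.25
      memory = "0.5Gi"
    }
  }

  ingress {
    external_enabled = true
    target_port      = 80
    traffic_weight {
      percentage = 100
    }
  }

  # Assumed durations, for illustration only.
  timeouts {
    create = "30m"
    update = "30m"
  }
}
```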

trichling commented 1 year ago

Hi all,

I can confirm the bug with the configuration stated above. However I was experimenting with some more properties on the ingress and I could manage to get it working with this ingress block:

  ingress {
    external_enabled           = true
    target_port                = 80
    traffic_weight {
      latest_revision = true
      percentage      = 100
    }
  }

Before this setup I only provided the required parameters, as in @ggeorgovassilis's configuration above. After adding the latest_revision parameter it suddenly started working again. Creation was also very fast: 4:50 minutes for the environment and just 16 seconds for each container app.
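Applied to the reproduction config from the top of this issue, the change would look roughly like the sketch below. The only difference from the original report is the added `latest_revision = true`; everything else is copied from that config, and the resource group and environment defined there are assumed to exist:

```hcl
resource "azurerm_container_app" "example" {
  name                         = "example-app"
  container_app_environment_id = azurerm_container_app_environment.example.id
  resource_group_name          = azurerm_resource_group.example.name
  revision_mode                = "Single"

  template {
    container {
      name   = "examplecontainerapp"
      image  = "mcr.microsoft.com/azuredocs/containerapps-helloworld:latest"
      cpu    = 0.25
      memory = "0.5Gi"
    }
  }

  ingress {
    external_enabled = true
    target_port      = 80
    traffic_weight {
      latest_revision = true # the reported workaround
      percentage      = 100
    }
  }
}
```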

andmos commented 1 year ago

Can also confirm. Creating container_app with ingress block times out, while creating without and then adding the block works.

ggeorgovassilis commented 1 year ago

@trichling thanks, that works. @andmos does "... and then adding the block" mean two deployments? For me it worked with a single deployment with the simple change Tobias proposed.

andmos commented 1 year ago

@ggeorgovassilis ah did not try with more parameters in the block. Will give it a go, I got it working with a minimal block and two runs of terraform apply.

darrencrossley commented 1 year ago

can also confirm this works with traffic_weight block set to:

traffic_weight {
  percentage = 100
  latest_revision = true
}

stewartbeck commented 1 year ago

I can confirm that I'm seeing this behavior. From the logs I see that when it tries to assign the traffic weight, it's not including the revision hash, so it fails trying to find the revision.

If you remove the ingress it'll succeed in creation. Then you can modify the terraform and put the ingress back and this time it succeeds, but updating the traffic percentage to 100 will still fail.

Here is a screen cap of the logs when it fails: image

stewartbeck commented 1 year ago

Adding a bit more context: Setting LatestRevision = true works and allows it to successfully set the traffic to 100%.

Looking at the code in helpers/container_apps.go, you see:

if !v.LatestRevision {
	traffic.RevisionName = pointer.To(fmt.Sprintf("%s--%s", appName, v.RevisionSuffix))
}

That is why setting latest revision works. It seems the client isn't correctly populating the RevisionSuffix after it gets generated.

franhoey commented 1 year ago

Thank you all, this has moved me on and validated that I'm not going crazy, but after adding latest_revision=false to the traffic_weight, I now get this error:

polling after CreateOrUpdate: Future#WaitForCompletion: context has been cancelled: StatusCode=0 -- Original Error: context deadline exceeded

franhoey commented 1 year ago

Today when I returned to this, it's all working. I'm assuming the "context deadline exceeded" was a temporary issue and that adding latest_revision = true has solved the issue.

ryantk commented 1 year ago

I can confirm setting latest_revision = true fixes the above timeout issue for me.

jdubois commented 1 year ago

I also confirm. If you want a complete working example, you can have a look at this code I just finished: https://github.com/microsoft/NubesGen/blob/099ce8616a1b762ff9d0016fcdb13b7e0037ac47/terraform/modules/container-apps/main.tf#L97

JeremyKeusters commented 1 year ago

Can also confirm that this issue is happening because of the ingress block. Big thanks to @trichling for sharing the fix with latest_revision = true.

simonecoppini commented 1 year ago

wow... I have been fighting with some problem for about 4 days now...

I think there are still different problems with this kind of resource.... I confirm everything that has been said, and I add a new problem that I thought was about VS 2022 but now think is about the Terraform module.

I cannot deploy my project to the container app from VS 2022... I get the error

Failed to push the docker image to your azure container registry for use in your azure container app.

But, the image is correctly deployed to the registry but it is not used in the app container.

In fact I can deploy with no problem the image only to the container registry, and I can deploy with no problem to an app container I created manually... only I cannot deploy to the container app created with terraform.

Anyone has the same problem? How do you solve it?

dhilgarth commented 1 year ago

@simonecoppini This seems unrelated. Please create a new issue for that

sbaia13 commented 1 year ago

Hi, I had the same problem recently. This happens when revision_mode is set to Single. This Terraform resource worked for me:

resource "azurerm_container_app" "example" {
  count                        = length(var.environnements-name)
  name                         = "keycloak-${var.environnements-name[count.index]}"
  container_app_environment_id = azurerm_container_app_environment.example.id
  resource_group_name          = azurerm_resource_group.example.name
  revision_mode                = "Multiple" ### Single restricts the deployment to only the latest revision of the container; Multiple allows splitting traffic between multiple revisions

  template {
    container {
      ## Container definition with environment variables to connect with the Cosmos DB for PostgreSQL
      name   = "keycloak-02"
      image  = "${azurerm_container_registry.example.login_server}/keycloak:latest"
      cpu    = 0.25
      memory = "0.5Gi"
    }
    ## Defining max and min number of containers
    max_replicas = 3
    min_replicas = 1
  }

  ingress {
    ## Defining the ingress with the port used to target the PostgreSQL; revision mode must be set to Multiple
    transport        = "auto"
    target_port      = 8080
    external_enabled = true ## true for testing only
    traffic_weight {
      latest_revision = true ## If true, traffic weight is routed to the new revision (this parameter is required to set an ingress)
      percentage      = 100
      ## Labels can be used to split traffic between multiple revisions for testing purposes
    }
  }

  depends_on = [azurerm_cosmosdb_postgresql_cluster.example, null_resource.import-image]
}

Hope this will help !

mpereira-ae commented 1 year ago

Another confirmation of latest_revision = true in the ingress block making it work, thanks @trichling!

rcskosir commented 11 months ago

Thanks for taking the time to submit this issue. It looks like this has been resolved by the suggestion of setting latest_revision = true in the ingress block. As such, I am going to mark this issue as closed.

stewartbeck commented 11 months ago

This should definitely NOT be closed. Latest_revision should not be required to be true - this is a hack. There is a clear bug in the code where the revision suffix is not getting set correctly.

rcskosir commented 11 months ago

Thank you for the clarification @stewartbeck. I have reopened this issue. I appreciate the quick response.

dvdr00t commented 11 months ago

Wow! Spent the whole day trying to figure out why the Apply took more than 10 mins and then failed out of nowhere. Many thanks @trichling for finding that hack! For me (West Europe), adding latest_revision = true as a workaround did the job. Time for Apply also decreased to ~1 min to create the environment and ~1 min to deploy two different container apps.

Hopefully this gets fixed soon, the documentation marking the parameter as Optional is definitely misleading.

joeizy commented 11 months ago

FWIW - add me to the list of people who had it "working" then added ingress, thought it was fine b/c it worked, later it broke and took extensive time to figure it out. This bug is evasive and a bit gnarly b/c it's not obvious at the time you make the change that it will later break and when it does break, the error message gives no indication or direction on the issue or how to fix.

darren-rose commented 10 months ago

setting latest_revision = true also resolved the issue for me

js-jslog commented 8 months ago

I'll be one more person reporting exactly the same issue, exactly the same results with the workaround and exactly the same thoughts about what a time sink this problem is. The 10 minute wait time for testing of ideas combined with the non-specificity of the error is really painful. Even if the error message were updated to indicate that this workaround exists that would be a massive improvement. Thanks to the team for what they do.

zioproto commented 8 months ago

It is not clear to me why the traffic_weight block is mandatory in the Terraform provider when revision_mode = "Single".

However, when creating a Container App with the portal, inspecting the json object I see:

        "configuration": {
            "secrets": null,
            "activeRevisionsMode": "Single",
            "ingress": {
                "fqdn": "test2.xxxxxxxxx-xxxxxx.eastus.azurecontainerapps.io",
                "external": true,
                "targetPort": 8080,
                "exposedPort": 0,
                "transport": "Auto",
                "traffic": [
                    {
                        "weight": 100,
                        "latestRevision": true
                    }
                ],

It seems the traffic block with latestRevision: true is there by default in the object created by the portal

Cc: @lonegunmanb @jiaweitao001

clemlesne commented 7 months ago

Relates to https://github.com/hashicorp/terraform-provider-azurerm/issues/21022, https://github.com/hashicorp/terraform-provider-azurerm/issues/22432, https://github.com/hashicorp/terraform-provider-azurerm/issues/21242, https://github.com/hashicorp/terraform-provider-azurerm/issues/23289.

klemmchr commented 4 months ago

@clemlesne this issue has a regression or wasn't fixed properly in the first place. When omitting latestRevision in a single ingress block the container app is stuck during creation and the operation will be canceled after 10 minutes.

alexdresko commented 4 months ago

@klemmchr I don't know if it has anything to do with azurerm_container_app. We're using azapi_resource's Microsoft.App/containerApps@2022-03-01 and started noticing a 10-minute timeout recently when creating container apps.

That being said, the JSON we use to create the container app using azapi_resource does not have a latestRevision section. Nothing in Terraform Cloud indicates that the problem is related to latestRevision, but I might try to add that and see if it fixes our problem.
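A rough sketch of what adding that section to an azapi_resource body could look like, reusing the azapi example posted earlier in this thread and the `traffic` entry from the portal-generated JSON quoted above. Whether this resolves the azapi timeout is untested here and is only an assumption; the variable and resource references are the ones from that earlier example:

```hcl
resource "azapi_resource" "containerapp-apache" {
  type      = "Microsoft.App/containerApps@2022-03-01"
  name      = "containerapp-apache-${var.sfx}"
  parent_id = azurerm_resource_group.rgcontainers.id
  location  = azurerm_resource_group.rgcontainers.location

  body = jsonencode({
    properties = {
      managedEnvironmentId = azurerm_container_app_environment.containerapp-environment.id
      configuration = {
        ingress = {
          external      = true
          allowInsecure = true
          targetPort    = 80
          # Mirrors the entry the portal writes by default, per the JSON quoted earlier.
          traffic = [
            {
              weight         = 100
              latestRevision = true
            }
          ]
        }
      }
      template = {
        containers = [
          {
            image = "registry.hub.docker.com/library/httpd:2.4"
            name  = "apache-container"
            resources = {
              cpu    = 0.25
              memory = "0.5Gi"
            }
          }
        ]
        scale = {
          minReplicas = 1
          maxReplicas = 1
        }
      }
    }
  })
}
```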

github-actions[bot] commented 2 months ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.