hashicorp / terraform-provider-google

Terraform Provider for Google Cloud Platform
https://registry.terraform.io/providers/hashicorp/google/latest/docs
Mozilla Public License 2.0
2.35k stars 1.75k forks source link

Serverless NEGs support outlier detection but `google_compute_backend_service` gets error when setting one up #15210

Closed WFrancois closed 4 months ago

WFrancois commented 1 year ago

Community Note

Description

Hi!

Google recently announced a preview version of outlier detection for serverless NEGs.

More information here: https://cloud.google.com/load-balancing/docs/negs/serverless-neg-concepts#outlier-detection

Notably it says:

Outlier detection is an optional configuration that can be enabled on a global backend service that has serverless NEGs attached to it. The outlier detection analysis is only available for a global external Application Load Balancer and not for a classic Application Load Balancer. The outlier detection analysis identifies unhealthy serverless NEGs based on their HTTP response patterns, and reduces the error rate by routing some of the new requests from unhealthy services to healthy services.

Based on the type of server error that is encountered, you can use one of the following outlier detection methods to enable outlier detection:

  • Consecutive 5xx errors. A 5xx series HTTP status code qualifies as an error.
  • Consecutive gateway errors. Only 502, 503, and 504 HTTP status codes qualify as an error.

This feature would make the existing outlier_detection field available for EXTERNAL_MANAGED backend_service. However it shouldn't send field such as enforcing_success_rate.

This would allow to limit downtime on serverless services, something that was previously restricted to non-serverless compute.

New or Affected Resource(s)

Potential Terraform Configuration

The existing synthax used for INTERNAL_SELF_MANAGED can be used for this:

resource "google_compute_backend_service" "default" {
  name        = "backend-service-name"
  description = "Description"

  protocol    = "HTTP"
  port_name   = "http"
  timeout_sec = 30

  load_balancing_scheme = "EXTERNAL_MANAGED"

  outlier_detection {
    consecutive_errors = 2
  }
}

References

b/313874549

melinath commented 1 year ago

It looks like all the fields in the proposed configuration are already available on the resource; what prevents this feature from being used?

WFrancois commented 12 months ago

Hey!

So the issue is that when we set a single value, it's sending too much data.

For example if you take this

resource "google_compute_backend_service" "managed" {
  name        = random_id.backend_service_managed[0].hex
  description = local.client_name

  protocol    = "HTTP"
  port_name   = "http"
  timeout_sec = 30

  load_balancing_scheme = "EXTERNAL_MANAGED"

  outlier_detection {
    consecutive_errors = 5
  }

  backend {
      group = backend.value.id
  }
}

When adding the "outlier_detection", here's the diff it actually generates:

  # google_compute_backend_service.managed will be updated in-place
  ~ resource "google_compute_backend_service" "managed" {
        id                              = "projects/*/global/backendServices/8d731e3e-b068-4d03-b1c8-003627825577--d4687a66"
        name                            = "8d731e3e-b068-4d03-b1c8-003627825577--d4687a66"
        # (17 unchanged attributes hidden)

      + outlier_detection {
          + consecutive_errors                    = 5
          + consecutive_gateway_failure           = 5
          + enforcing_consecutive_errors          = 100
          + enforcing_consecutive_gateway_failure = 0
          + enforcing_success_rate                = 100
          + max_ejection_percent                  = 10
          + success_rate_minimum_hosts            = 5
          + success_rate_request_volume           = 100
          + success_rate_stdev_factor             = 1900
        }

        # (1 unchanged block hidden)
    }

And so it results in the following error:

ā”‚ Error: Error updating BackendService "projects/*/global/backendServices/8d731e3e-b068-4d03-b1c8-003627825577--d4687a66": googleapi: Error 400: Invalid value for field 'resource.outlierDetection': '{  "consecutiveErrors": 5,  "maxEjectionPercent": 10,  "enforcingConsecutiveErrors": 100,  "enforcin...'. Outlier detection contains fields not supported by Serverless NEGs: enforcingSuccessRate., invalid
melinath commented 11 months ago

Thanks for the additional information! It looks like the subfields inside outlier_detection all have default values, which is why they are getting set even though you're not providing a value. Unfortunately, any Terraform fix here would likely require changing at least the default value for enforcing_success_rate, which would probably be a breaking change.

Alternatively, the API could resolve this by adding support for enforcingSuccessRate to Serverless NEGs but I don't know whether that is possible.

WFrancois commented 11 months ago

So I don't know anything about how this works behind the scene, so unsure if this is doable, but could this only send the enforcing_success_rate (and other fields) depending on the load_balancing_scheme?

ie: if it's EXTERNAL_MANAGED, don't send enforcing_success_rate?

melinath commented 11 months ago

That kind of "logic switch" is probably not something we'd accept - it ends up being confusing in the long run, both for maintainers and users. For example, if a user thinks that they can set enforcing_success_rate on an EXTERNAL_MANAGED backend service, they will have no way to know that Terraform is silently ignoring the value in their configuration and may expect different behavior than they experience. (This would especially be an issue if the API later does add support for enforcing_success_rate on EXTERNAL_MANAGED backend services.)

WFrancois commented 11 months ago

Ok yeah makes sense to not accept this, thanks for the explanation!

Any tips on how should we handle that in the meantime if we want to enable this? Because from what I understand, this would mean either wait on the managed backend service to supports the missing attributes, or wait until the next major patch (if the fix is done by then, and if we want to actually implement this fix).

rileygriffith commented 11 months ago

I've also run into this same issue. Has anyone found a workaround for this?

rickihastings commented 10 months ago

I've also encountered this, the only way I've been able to manage these settings in Terraform at all is very hacky but it works as a temporary measure. I've created a new resource using the gcloud module.

module "outlier_detection" {
  source  = "terraform-google-modules/gcloud/google"
  version = "~> 3.0"

  upgrade       = false
  skip_download = true
  platform      = "linux"

  create_cmd_entrypoint = "${path.module}/scripts/enable-outlier-detection.sh"
  create_cmd_body       = "${google_compute_backend_service.default.name} ${var.project}"

  create_cmd_triggers = {
    version = 1 # up this to force the script to rerun
  }
}
#!/bin/bash

set -e

backendService=$1;
project=$2;

gcloud compute backend-services export $backendService --project $project --destination outlier-detection-temp.yaml --global;

cat <<EOT >> outlier-detection-temp.yaml
outlierDetection:
  baseEjectionTime:
    nanos: 0
    seconds: 180
  consecutiveErrors: 5
  consecutiveGatewayFailure: 3
  enforcingConsecutiveErrors: 100
  enforcingConsecutiveGatewayFailure: 100
  interval:
    nanos: 0
    seconds: 1
  maxEjectionPercent: 50
EOT

gcloud compute backend-services import $backendService --project $project --source outlier-detection-temp.yaml --global;
rm outlier-detection-temp.yaml;

You'll also need to add the following to the google_compute_backend_service resource to ensure the changes aren't undone on the next apply.

lifecycle {
  ignore_changes = [outlier_detection]
}

I really hate having to do things like this in Terraform as it often becomes brittle and causes lots of problems so this shouldn't have to be a long term solution.

pawelJas commented 5 months ago

@melinath it seems this this problem is open for almost a year. Removing the default value is the only option to fix the problem (we can't change the API). Should it be tagged as breaking change and removed with the next release?

melinath commented 4 months ago

@c2thorn do you think this would make sense to consider for the upcoming major release?

pawelJas commented 4 months ago

@melinath @c2thorn I am planning to preapre a PR which clears the deafult values next week. Those default values do not add any value from the LB configuration/performance perspective (LB assumes those values if not provided), but they cause problem for LB types which do not support such features. In some sense it is not even a breaking change.

c2thorn commented 4 months ago

@melinath Added it to 6.0.0 and applied breaking-change label

@pawelJas Sounds good, please follow this guide on how to contribute breaking changes to MMv1 if you haven't before. Any changes to default values are considered breaking from Terraform's perspective, even if the resulting backend would effectively not change.

pawelJas commented 4 months ago

The default value for outlier_detection in google_compute_region_backend_service and google_compute_backend_service has been changed removed in version 6.0.0.

c2thorn commented 4 months ago

Closing as this should be fixed as part of the major release next month

github-actions[bot] commented 3 months ago

I'm going to lock this issue because it has been closed for 30 days ā³. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.