gruntwork-io / terragrunt

Terragrunt is a flexible orchestration tool that allows Infrastructure as Code written in OpenTofu/Terraform to scale.
https://terragrunt.gruntwork.io/
MIT License

Feature Flags, Errors and Exclude #3134

Open yhakbar opened 6 months ago

yhakbar commented 6 months ago

Summary

Provide first class support for feature flags as part of Terragrunt HCL configuration.

Allow for dynamic configuration of behavior in select terragrunt.hcl files based on the presence or absence of feature flags that are set via environment variables and CLI flags.

In addition, update how Terragrunt errors and excludes are handled to ensure that unstable configuration can be handled gracefully.

Motivation

Terragrunt is frequently used in monorepo contexts, and it lends itself to this in how it segments IAC state into separate directories. One definition of monorepos is a single codebase with multiple independent, but related, projects. By this definition, Terragrunt is very much an IAC monorepo tool. Multiple units of IAC are defined independently, and as a whole, they represent a repository of IAC.

Feature flags are a common way to manage the complexity of a monorepo. They allow for the gradual rollout of new features, the ability to turn off features that are not ready for production, and the ability to manage the complexity of a large codebase.

This is especially important in the context of Terragrunt, where infrastructure is most safely updated when updated in small, incremental changes. In addition, the ability to control how failure is handled in IAC is extremely important. Preventing full resolution of an apply across multiple Terragrunt units because a known flaky unit is failing is not always something that can be remediated by the use of retries, and it can be expensive to do so. Occasionally, it is better to ignore the failure of a known flaky unit and continue with the rest of applies, assuming that the failure is not critical to the overall success of the apply.

An example of such a failure would be a dependency chain where one service is deployed by Terragrunt and exposes a url output where the service can be accessed, and a second service uses a dependency block to pass that url into its environment variables.

In this example, if the first service fails to deploy, the second service will also fail to deploy. However, if the first service is known to be flaky, and the second service is not dependent on the first service being deployed successfully, it is better to ignore the failure of the first service and continue with the deployment of the second service, leveraging the url output from a previous successful apply.

Reasons that a unit might be marked in this way include:

Proposal

Provide a combination of:

  1. A method to signal Terragrunt that a flag is set.
  2. A method to define alternate behavior in Terragrunt when a flag is set.

Proposed Syntax

TG_FLAG_feature_name="value" terragrunt command
# or
terragrunt --feature "feature_name=value" command

# The new configuration block for feature flags
feature "feature_name" {
  default = false # Optionally default it so that you can opt-in or out.
}

# Exclude configurations allowing for dynamically determining when and how to exclude execution of nodes in the Terragrunt graph
exclude {
    if = feature.feature_name.value # Boolean expression that determines if the node should be excluded.
    actions = ["all"] # Actions to exclude when active. Other options might be ["plan", "apply", "all_except_output"], etc
    exclude_dependencies = feature.feature_name.value # Exclude dependencies of the node as well
}

# Configuration block for handling errors.
errors {
    # Retry configuration block that allows for retrying errors that are known to be intermittent
    # Note that this replaces `retryable_errors`, `retry_max_attempts` and `retry_sleep_interval_sec` fields.
    # Those fields will still be supported for backwards compatibility, but this block will take precedence.
    retry "foo" {
        retryable_errors = [".*Error: foo.*"]
        max_attempts = 3
        sleep_interval_sec = 5
    }

    # Ignore configuration block that allows for ignoring errors that are known to be safe to ignore
    ignore "bar" {
        # Specify a pattern that will be detected in the error for ignores, or just ignore any error
        ignorable_errors = ! feature.feature_name.value ? [] : [
            ".*Error: bar.*", # If STDERR includes "Error: bar", ignore it
            "!.*Error: baz.*" # If STDERR includes "Error: baz", do not ignore it
        ]
        message = "Ignoring error bar" # Add an optional warning message if it fails
        # Key-value map that can be used to emit signals to external systems on failure
        signals = {
            safe_to_revert = true # Signal that the apply is safe to revert on failure
        }

    }
}
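The signalling half of the syntax above can be read as a lookup with a fallback chain. The following is a minimal Python model of that reading, not Terragrunt code. The precedence (CLI `--feature` first, then the `TG_FLAG_` environment variable, then the block's `default`) and the string-to-boolean coercion are assumptions, since the RFC does not specify either, and `resolve_flag` is a hypothetical helper.

```python
import os


def resolve_flag(name, default, cli_features=None, env=None):
    """Resolve a feature flag value (hypothetical helper, assumed precedence)."""
    cli_features = cli_features or {}
    env = os.environ if env is None else env

    # Assumption: an explicit --feature "name=value" wins over everything else.
    if name in cli_features:
        raw = cli_features[name]
    # Assumption: otherwise fall back to the TG_FLAG_<name> environment variable.
    elif f"TG_FLAG_{name}" in env:
        raw = env[f"TG_FLAG_{name}"]
    else:
        # Nothing was signalled, so the block's default applies.
        return default

    # Assumption: coerce to bool only when the default is a bool; keep strings as-is.
    if isinstance(default, bool):
        return raw.strip().lower() == "true"
    return raw


print(resolve_flag("in_progress", False, env={"TG_FLAG_in_progress": "true"}))
print(resolve_flag("feature_name", "A", cli_features={"feature_name": "B"}, env={}))
```

Under these assumptions the first call yields `True` and the second yields `"B"`. Whichever precedence is finally chosen, documenting it explicitly would help users who set both the environment variable and the CLI flag.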

Examples

The syntax is intended to be flexible enough to support a couple different use-cases that are common when using feature flags.

Dynamic Module Example

Mark a terragrunt.hcl file as having a feature that triggers usage of a new module that is not yet stable. In lower environments, this flag is enabled, and in production, it is disabled.

In addition, if the apply fails, it is safe to revert the apply, and a special error message is logged to the console.

feature "use_service_module_v2" {
  default = false
}

errors {
    ignore "v2_errors" {
        ignorable_errors = feature.use_service_module_v2.value ? [".*"] : []
        message = "Service module v2 is not ready for prod yet"
        signals = {
            safe_to_revert = true
        }
    }
}

terraform {
    source = "<some url>?ref=${feature.use_service_module_v2.value ? "v2" : "v1"}"
}

if [[ "$ENVIRONMENT" == "dev" ]]; then
  export TG_FLAG_use_service_module_v2='true'
fi
if ! terragrunt apply -auto-approve; then
    # STDERR -> FEATURE use_service_module_v2 : Service module v2 is not ready for prod yet
    if [[ "$(jq -r '.safe_to_revert' error-signals.json)" == 'true' ]]; then
        git checkout HEAD^
        terragrunt apply -auto-approve
    fi
fi

In this contrived example, the "v2" tag of the module is not currently stable. However, to encourage continuous integration, the platform team has decided to merge in configurations that can use it when a flag is enabled. In the dev environment, the feature flag is enabled, and in the production environment, it is disabled.

When an apply fails, as is expected, a special message is emitted to STDERR to indicate that the failure is tied to a feature flag.

In addition, on error, a special error-signals.json file will be created in the same directory as the terragrunt.hcl file, with a payload that the platform team knows will be useful for handling the error intelligently. In this scenario, the logic the team has agreed upon is that if any terragrunt apply fails, and a safe_to_revert entry is found in the error-signals.json for the corresponding terragrunt.hcl file that was applied, the pipeline reverts to the last commit and re-runs the apply.

The logic here is definitely not what would work for most organizations to achieve a reliable mechanism for reverting a failed apply. It is merely a demonstration of why authors might want signals emitted on failure.
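A consumer of the signals file might look something like the Python sketch below. The file name `error-signals.json` and the `safe_to_revert` key come from the example above; the flat JSON object layout is an assumption (it matches what the shell snippet reads with jq), and the `safe_to_revert` function is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path


def safe_to_revert(unit_dir):
    """Return True only if error-signals.json exists and sets safe_to_revert to true."""
    signals_file = Path(unit_dir) / "error-signals.json"
    if not signals_file.is_file():
        # No signals were emitted, so nothing authorizes a revert.
        return False
    try:
        signals = json.loads(signals_file.read_text())
    except json.JSONDecodeError:
        # Treat an unreadable payload as "not safe" rather than guessing.
        return False
    # Assumption: the payload is a flat JSON object mirroring the signals map.
    return isinstance(signals, dict) and signals.get("safe_to_revert") is True


with tempfile.TemporaryDirectory() as unit_dir:
    print(safe_to_revert(unit_dir))  # no signals file yet
    (Path(unit_dir) / "error-signals.json").write_text('{"safe_to_revert": true}')
    print(safe_to_revert(unit_dir))  # signal present and true
```

Defaulting to "not safe" when the file is missing or malformed keeps the revert an explicit opt-in, which seems in keeping with the cautious tone of the example.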

Unreliable Module Example

Mark a terragrunt.hcl file as being unreliable, and ignore any failures with errors matching Networking Error that might occur when applying it.

# ./unreliable/terragrunt.hcl
feature "unreliable" {
  default = true
}

errors {
    ignore "flaky_network" {
        ignorable_errors = ! feature.unreliable.value ? [] : [".*Networking Error.*"]
        message = "Unreliable module failing due to intermittent network error"
    }
}

# ./reliable/terragrunt.hcl
dependency "unreliable" {
    config_path = "../unreliable"
}

inputs = {
    static_input = dependency.unreliable.outputs.static_output
}

$ tree
.
├── reliable
│   └── terragrunt.hcl
└── unreliable
    └── terragrunt.hcl

$ terragrunt run-all apply

In this example, users are able to mark the terragrunt.hcl file in the unreliable directory as being unreliable, knowing that it predictably produces an error with the message Networking Error that can be safely ignored when re-applied.
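The matching behind ignorable_errors could be modelled roughly as follows. This Python sketch assumes that each entry is a regular expression searched for anywhere in STDERR, and that an entry prefixed with `!` vetoes the ignore regardless of its position in the list; the RFC only shows the two pattern forms, so both points are assumptions, and `should_ignore` is a hypothetical name.

```python
import re


def should_ignore(stderr, ignorable_errors):
    """Decide whether a failure can be ignored (assumed semantics)."""
    negated = [p[1:] for p in ignorable_errors if p.startswith("!")]
    allowed = [p for p in ignorable_errors if not p.startswith("!")]

    # Assumption: a matching "!" pattern vetoes the ignore, wherever it sits in the list.
    if any(re.search(p, stderr) for p in negated):
        return False
    # Otherwise the failure is ignored if any remaining pattern matches.
    return any(re.search(p, stderr) for p in allowed)


patterns = [".*Networking Error.*"]
print(should_ignore("Error: Networking Error: connection reset", patterns))
print(should_ignore("Error: invalid credentials", patterns))
print(should_ignore("Error: Networking Error", []))  # flag off -> empty list
```

With the flag disabled the list is empty, so nothing is ignored and the unit fails normally, which is the opt-out behaviour the example relies on.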

The ability to ignore errors in the unreliable module is handy here, as the reliable module reads a static output from the unreliable module that doesn't change much, and uses it as an input.

Examples of modules that can have this kind of relationship include:

The dependent modules can continue to codify their dependency relationships to get access to inputs like the database hostname, which is frequently required to connect to the database, and the cluster ID can be passed to the pod, so that its placement can be targeted to the cluster.

In both scenarios, users might find it convenient to be able to avoid failing to successfully deploy the dependent modules when predictable, intermittent errors occur in the dependency.

When using feature flags to support this kind of functionality, users can opt out of the feature flag by setting an environment variable like so:

TG_FLAG_unreliable='false'

This allows for platform teams to safely test removal of ignored failures until the feature configuration blocks can be removed (possibly by only disabling the feature in lower environments).

In-progress Module Example

Mark a terragrunt.hcl file as being in-progress, excluding all operations on it until a certain feature is complete. The feature can be manually turned on when developing locally, but is off by default.

feature "in_progress" {
    default = false
}

exclude {
    if = !feature.in_progress.value # Exclude the module unless the flag is explicitly turned on
    actions = ["all"]
    exclude_dependencies = !feature.in_progress.value
}

When developing the module locally, use the following flag to activate the module:

export TG_FLAG_in_progress='true'

This is a simple way to allow incomplete IaC work to be integrated into a code-base without requiring that the code be fully mature before merging it in.

Rapid, frequent and incremental integration is the standard in Continuous Integration, and this provides a mechanism for achieving that for large IaC code bases.

In addition, note the exclude_dependencies field being used here, which allows for skipping the dependencies of the module as well. This is useful when building out multiple modules that are dependent on each other, and you want to skip the entire chain of dependencies while a module is in-progress.
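The decision an exclude block encodes for a single command can be sketched as below. This is a Python model under assumptions: `"all"` and `"all_except_output"` are taken from the comment in the proposed syntax (where they are only listed as possible options), their exact meaning is guessed from the names, and `is_excluded` is a hypothetical helper.

```python
def is_excluded(command, exclude_if, actions):
    """Return True when the unit should be left out of this command (assumed semantics)."""
    if not exclude_if:
        # The `if` expression is false, so the exclude block is inactive.
        return False
    if "all" in actions:
        return True
    # Assumption: "all_except_output" excludes every command other than `output`.
    if "all_except_output" in actions and command != "output":
        return True
    # Otherwise only the explicitly listed actions are excluded.
    return command in actions


print(is_excluded("apply", True, ["all"]))
print(is_excluded("output", True, ["all_except_output"]))
print(is_excluded("plan", True, ["apply"]))
print(is_excluded("apply", False, ["all"]))
```

Keeping `output` available while excluding `apply` is what lets dependents continue to read previously applied outputs from an otherwise excluded unit.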

Technical Details

Some components that will definitely be impacted include:

  1. HCL parsing to parse the feature, exclude, and errors blocks.
  2. Error handling mechanisms expanded. error_hooks and retryable_errors already alter the behavior of a normal Terragrunt execution on failure. The errors block would be another tool that changes how errors are handled in Terragrunt.
  3. Introduction of a new, specially handled environment variable in TG_FLAG_<feature name>.
  4. Introduction of a new CLI flag in --feature (or maybe --terragrunt-feature).
  5. Additional logic to bail out of executing a single terragrunt command when the exclude conditions are met.
  6. Additional logic to remove modules from the list of modules to be executed when executing a terragrunt run-all command when their exclude conditions are met.
    1. In certain circumstances, this might result in errors (like if a module that has never been applied is skipped, but a dependency tries to use the outputs of that module). In those circumstances, failure would be considered user error, as there isn't really a better way of handling that.

Press Release

First Class Feature Flags

Terragrunt now has built-in support for feature flags, allowing behavior of Terragrunt executions to be altered dynamically at runtime.

Feature flags are a staple of modern DevOps best practices, and using them in Terragrunt will allow you to improve the scalability of your IaC code base.

Use feature flags to support the following, and more:

  1. Safely roll out updates to OpenTofu/Terraform modules used in Terragrunt incrementally.
  2. Prevent intermittent errors in specific Terragrunt modules from impacting your entire repository.
  3. Rapidly and continuously integrate incomplete updates to Terragrunt modules without impacting the stability of the whole repository.

Feature flags are available as of [RELEASE]. To learn more about how to use them, click [here](link to feature flag documentation).

Drawbacks

Some drawbacks of this proposal include:

  1. It further complicates the configurations available in a terragrunt.hcl file. Users have already been encountering terragrunt.hcl files that are too long and difficult to maintain. This added complexity might make terragrunt.hcl files even more difficult to reason about.
  2. It introduces additional complexity in reasoning about how Terragrunt executions are going to take place. Having certain errors ignored or actions skipped in a codebase with hundreds of terragrunt.hcl files might be very difficult to reason about.
  3. It might make maintenance of Terragrunt source code more complicated, as behavior has to be altered at three points in a Terragrunt execution: very early on for the exclude logic, during execution of the module when the value of a feature flag controls behavior, and after a failure when the errors logic handles it.

Alternatives

  1. Not doing this at all. Users are able to get some of this functionality simply by reading environment variables with get_env, and adding custom logic to adjust behavior of executions based on the values of the environment variables.
  2. Adding an ignored_errors companion to the retryable_errors that just ignores errors instead of retrying them. Customers have been asking for functionality like this to handle both failures that are not intermittent enough to recover when retried over a short duration, and errors in modules that are computationally or temporally expensive to retry soon after failure.
  3. Provide a special place in Terragrunt documentation to demonstrate how all this functionality is achievable through current tooling like get_env and run_cmd. Provide nice walkthroughs on how to achieve common feature flag patterns with existing tooling in Terragrunt.

These alternatives, while less expensive than undertaking the introduction of net new functionality in Terragrunt, were considered less beneficial, as first class support for feature flags is generally something that makes a good match for Terragrunt, in my opinion.

Option #2 is also not necessarily mutually exclusive. It might be a good idea to pursue that anyways.

Migration Strategy

None

Unresolved Questions

See the section above about the syntax of feature flags.

I also am not sure how expensive this functionality would be to implement and maintain.

Would the community be interested in this functionality, or would they be more interested in any of the alternatives?

References

Proof of Concept Pull Request

N/A

Edits

  1. Typo in the Motivation.
  2. Heading formatting.
  3. Added reference to feedback response.
  4. Rewrote body of proposal to align with feedback response, showing three configuration blocks for feature, skip, and errors. In addition, the proposal now includes some logic for skipping dependencies.
  5. Renamed title of RFC and added more context to the initial preamble.
  6. Renamed skip to exclude, as there is already a skip attribute in HCL.
  7. Renamed skip_dependencies to exclude_dependencies to match the naming convention.

brikis98 commented 6 months ago

This is a terrific write-up. I love the example use cases, press-release preview, the drawbacks to the design, and the alternatives considered. Well done @yhakbar 👏

Some feedback in a somewhat random order:

  1. I really like the feature flag concept. That is, the ability to have a terragrunt.hcl file that, even if not done, is safe to check in because the feature flag is "off by default." Whatever the outcome of this RFC, I'd very much like to see us support this functionality in some way or another.
  2. A key property of feature flags is that you can turn them on or off without having to deploy new changes. The ideal version is that you write your code, check it in, and let your CI/CD pipeline deploy it. Then you can use a web UI to turn the feature flag on or off at any time, without having to re-run any sort of pipeline. Currently, this proposal is designed around CLI flags and env vars, so (a) there's no web UI to turn things on or off and (b) instead, you probably have to re-trigger your whole deployment pipeline via some sort of manual commit. This becomes a whole lot more powerful if you could seamlessly integrate it with actual feature flag tools, including open source ones (e.g., GrowthBook and Flagsmith (https://www.flagsmith.com/)) and SaaS ones (e.g., Split and ConfigCat), so there is a clear story around how you could change the value in the UI, and then have the feature turn on automatically. Any thoughts?
  3. There's a bit of mixing of concerns here. One concern is the feature flag concept, where you can turn individual modules on or off. Another concern is handling intermittent failures and flaky modules. The two concepts are certainly related (e.g., you may want to use a feature flag to turn off a flaky module), but they are not quite the same. I wonder if we'd get a cleaner design if you considered each concern separately, and tried to come up with an ideal solution for each?
  4. Another thing that occurs to me is that you may want to use feature toggles with features within a module, rather than just turning an entire module on or off. Having the feature flag at the module on/off level feels like a side effect of mixing the concern with error handling. If you think about feature flags in isolation, you realize that what you might want is to set various input parameters for a module to different values (e.g., "A" or "B") based on a feature flag, or, perhaps even more likely, set different source URLs based on a feature flag: e.g., in dev, use ref=v2 of a module, but in prod, use ref=v1. Perhaps if you think about the feature flag functionality in isolation, it's actually a function we add (feature_flag(NAME)), which you can use in various conditionals, rather than a block?

    # Example of a feature flag to pick the version of a module to deploy
    terraform {
      source = "github.com/foo//bar?ref=${feature_flag("FOO") == "A" ? "v2" : "v1"}"
    }
    
    # Example of a feature flag that determines error handling
    ignore_errors = feature_flag("ENABLE_MODULE_FOO") == true ? [] : [".*"]

  5. Another key thing to think through is how dependency works if a module is disabled by a feature flag, or hits an error. What output values do the modules that depend on it get?

yhakbar commented 6 months ago

Responding to @brikis98 :

Feature Flag Dynamicity

This design did assume that it would be fully compatible with usage of external web services for feature flag management!

I wanted to focus on the core functionality of how the feature HCL block would be defined and how it would be used during Terragrunt execution for the RFC, but I should have documented this further.

I would guess that the majority of users leveraging the feature flag functionality proposed here would be setting and adjusting environment variables dynamically in their CI/CD pipelines like GitHub Actions, GitLab CI, Jenkins, etc. Prioritizing the ability to toggle feature flags via environment variables and CLI flags was a way to ensure that the feature flag functionality could be used in a wide variety of CI/CD environments, without relying on an external service.

e.g. In the context of a GitHub Actions workflow, configuration like the following would allow for the use_service_module_v2 feature to be toggled on or off without any changes to the codebase using GitHub variables:

env:
    TG_FLAG_use_service_module_v2: ${{ vars.TG_FLAG_use_service_module_v2 }}
run: terragrunt apply -auto-approve

Now, for users who are currently using a feature flag management service, I think the current design does not preclude them from using it. There are two ways that I would expect users to use the feature flag functionality as currently proposed in conjunction with a feature flag management service:

  1. They could still use the service to set the environment variables that are used by Terragrunt to toggle feature flags before invoking terragrunt:

    run: echo "TG_FLAG_use_service_module_v2=$(configcat flag value show use_service_module_v2 --json | jq -r '.value')" >> "$GITHUB_ENV"
    run: terragrunt apply -auto-approve

    I don't really know if that's the right syntax for the configcat CLI, but that's the general idea.

  2. They can reference the feature flag management service directly in the terragrunt.hcl file:

    feature "use_service_module_v2" {
      default = run_cmd("--terragrunt-quiet", "bash", "-c", "configcat flag value show use_service_module_v2 --json | jq -r '.value'")
    }
    
    terraform {
        source = "<some url>?ref=${feature.use_service_module_v2.enabled ? "v2" : "v1"}"
    }

    This would allow users to leverage existing capabilities of Terragrunt to dynamically set feature flag values without reliance on any configuration in a CI/CD pipeline.

I like the idea of seamless integration with feature flag management services that doesn't require leveraging Terragrunt functionality in a manner this sophisticated, however. If this is commonly done within the community, it might be worth it to prioritize a system for integrating with these services directly. Maybe a plugin system that provides nice interfaces for common feature flag management services?

Mixing of Concerns

I agree that there's definitely tension between the feature flag concept, the error suppression concept and the module skip concept.

The error suppression and module skip concepts do end up constricting the feature flag implementation in such a way that it's not as flexible as folks typically want feature flags to be. Tying it to those concepts requires that the feature flag is boolean to allow for the module to be skipped or not, and that the feature flag is used to determine whether or not to suppress errors. As you described, this prevents usage of string or numeric feature flags.

At the same time, I could imagine users wanting to tightly integrate those concepts, as it might only make sense to suppress particular errors within the context of a feature flag being enabled.

What do we think about having three separate configuration blocks for feature flags, error suppression, and module skipping? This would allow for more flexibility in how these concepts are used together, and would allow for more complex feature flag configurations that don't necessarily involve error suppression or module skipping.

So, instead of:

feature "feature_name"{
  default = false # Optionally default it so that you can opt-in or out.
  # Conditions that result in the feature being skipped.
  skip {
    actions = ["all"] # Actions to skip when active. Other options might be ["plan", "apply", "all_except_output"], etc
  }
  # Alter behavior on failure
  failure {
    ignorable_errors = ".*" # Specify a pattern that will be detected in the error for ignores, or just ignore any error
    message = "Flaky feature failing here!" # Add an optional warning message if it fails
    # Key-value map that can be used to emit signals on failure
    signals = {
        safe_to_revert = true # Signal that the apply is safe to revert on failure
    }
  }
}

We could have:

feature "feature_name"{
  default = "A"
}

skip {
    if = feature.feature_name.value == "A"
    actions = ["all"]
}

failure {
    ignorable_errors = feature.feature_name.value == "A" ? [".*"]: []
    message = feature.feature_name.value == "A" ? "Flaky feature failing here!" : "Woah, this feature is supposed to be solid!"
    signals = {
        safe_to_revert = feature.feature_name.value == "A"
    }
}

And folks might just conventionally keep the blocks together within the terragrunt.hcl file.

I worry that this might introduce quite a bit of complexity to the configuration, but it might be worth it for the added flexibility. It would allow for the values of feature flags to take on more complex values, and for the other concepts to be used outside of the context of feature flags.

Feature Flags as Functions

I like the idea of not needing additional configuration blocks for feature flags, and instead using them as functions that can be used in various places in the configuration. I don't know if one would end up being more expensive to maintain than the other, so it might be worth preferring the cheaper option.

There may be advantages to having the feature flag defined via a block, however.

e.g. It might be easier to see all of the feature flags that are available in configuration at a glance:

feature "feature_name"{
  default = "A"
}

terraform {
   source = "github.com/foo//bar?ref=${feature.feature_name.value == "A" ? "v2" : "v1"}"
}

Might be easier to spot than:

terraform {
  source = "github.com/foo//bar?ref=${feature_flag("feature_name") == "A" ? "v2" : "v1"}"
}

That would be especially relevant when searching for feature flags to remove once features are stable. This might even lend itself to a terragrunt feature ls command that could be used to list all of the feature flags that are available in a configuration, terragrunt feature rm to remove a feature flag, terragrunt feature evaluate to evaluate all feature flags with current context, etc (though I don't think those are features that would be necessary for initial implementation).

There is also functionality that could be added to the feature block that would be difficult to add to a function. For example, configuring a default value for a feature flag might be more likely to be consistent when done via a block than when done via a function.

e.g. To keep a default value consistent across all uses of a function, you might have to do something like:

locals {
    do_experiment = feature_flag("DO_EXPERIMENT", false) # Where the second argument is the default value

    value1 = local.do_experiment ? "A" : "B"
    value2 = local.do_experiment ? "C" : "D"

    # Because a different default is used here, it's harder to reason about the value of the feature flag
    value3 = feature_flag("DO_EXPERIMENT", true) ? "E" : "F"
}

Whereas, with a block, it's a lot more explicit:

feature "do_experiment"{
  default = false
}

locals {
    value1 = feature.do_experiment.value ? "A" : "B"
    value2 = feature.do_experiment.value ? "C" : "D"

    # Here we're explicitly negating the value of the feature flag, the default can't vary between uses
    value3 = !feature.do_experiment.value ? "E" : "F"
}

Having a block also allows for more complex feature flag configurations in the future, like the ability to configure a provider for integration with a feature flag management service or to have validations, etc.

dependency Interaction

Dependency interactions may be confusing to users, and that interaction should definitely be documented well.

Feature Flags

If we split up the feature flag, error suppression, and module skip concepts into separate configuration blocks, the dependency block won't have any special interaction with feature flags. Feature flags will just be a way of signaling values to Terragrunt modules, and the dependency block might just experience a difference in the outputs it extracts from the module.

Error Suppression

Error suppression on the other hand will result in special interaction.

On the initial apply of a module, if that module fails, no outputs will be available for the module that depends on it. As long as the module is syntactically correct, the module that depends on it will see the equivalent of an un-applied module. If, for whatever reason, the dependent module is mocking outputs for the dependency block, with a mock_outputs_allowed_terraform_commands including apply, the dependent module will use those mocked outputs. This means that the dependent module may still fail to apply if the outputs present are not sufficient to satisfy the dependent module's configuration.

On applies that follow a successful apply, failure of the dependency module will result in the dependent module using the outputs from the last successful apply (again, assuming the .tf files are syntactically correct for the dependency module).

Hopefully, this behavior is consistent with what users would expect, as we would just be swallowing the exit code of the dependency module and interacting with it as if it had succeeded.
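The two cases described above (first apply versus later applies) can be summarized in a small Python model. It is a sketch of the behaviour as described in this comment, not of Terragrunt's implementation; `outputs_for_dependent` and its parameters are hypothetical names, and returning `None` stands in for "the equivalent of an un-applied module".

```python
def outputs_for_dependent(last_successful_outputs, mock_outputs=None,
                          mock_allowed_commands=(), command="apply"):
    """Pick the outputs a dependent sees when its dependency's failure was ignored."""
    if last_successful_outputs is not None:
        # A previous apply succeeded: the last known outputs are still readable.
        return last_successful_outputs
    if mock_outputs is not None and command in mock_allowed_commands:
        # Never applied, but the dependent allows mocks for this command.
        return mock_outputs
    # Never applied and no usable mocks: equivalent to an un-applied module.
    return None


# Later apply: the dependency failed now, but applied successfully before.
print(outputs_for_dependent({"url": "https://svc.example"}))
# First apply failed, and the dependent allows mocks during apply.
print(outputs_for_dependent(None, {"url": "mock"}, ("plan", "apply")))
# First apply failed, mocks only allowed for plan: nothing to read.
print(outputs_for_dependent(None, {"url": "mock"}, ("plan",)))
```

In the last case the dependent would most likely fail as well, which matches the caveat that suppressed errors do not guarantee that dependents can apply.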

Module Skip

Module skip will also result in special interaction.

The first special interaction is that the skipped module will be pruned from any dependency graph that a dependent module is a part of. A run-all that includes both a skipped module and a module that depends on it will result in only the dependent module being run. This interaction was the motivation behind allowing the skip block to indicate which actions should be skipped. If a user knows that the module should only be skipped for apply actions, they can specify that in the skip block, and still have it pulled into dependency graphs for other actions like plan and output.

Hopefully, that won't be too complicated. One aspect that might be tricky is that a dependency graph that includes a skipped module in the middle might be more complicated to construct than one that doesn't.

e.g.

A -> B -> C

Skip B

A -> C
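Pruning a node from the middle of the graph, as in the example above, amounts to reconnecting its predecessors to its successors. Here is a minimal Python sketch, assuming the graph is stored as an adjacency mapping whose edges follow the arrows in the example; `prune` is a hypothetical helper, and whether Terragrunt should preserve ordering across a skipped module in this way is itself an assumption.

```python
def prune(graph, skipped):
    """Remove skipped nodes, reconnecting their predecessors to their successors."""
    pruned = {node: list(targets) for node, targets in graph.items()}
    for node in skipped:
        targets = pruned.pop(node, [])
        for source, edges in pruned.items():
            if node in edges:
                # Bridge the gap: source -> node -> targets becomes source -> targets.
                bridged = [t for t in edges if t != node]
                for t in targets:
                    if t != source and t not in bridged:
                        bridged.append(t)
                pruned[source] = bridged
    return pruned


graph = {"A": ["B"], "B": ["C"], "C": []}
print(prune(graph, ["B"]))  # A -> C
```

Bridging keeps the relative ordering of the remaining modules intact; simply deleting the node would leave A and C unordered with respect to each other.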

The second special interaction is that the dependent module may not have access to outputs from the dependency module if certain actions are skipped. If the output action is skipped on the dependency module, the dependent module will have to rely on mock outputs, not read any outputs, or fail. If the apply action is skipped, the dependent module will have to rely on the outputs from the last successful apply, not see any outputs because the dependency module never applied or fail.

This kind of dynamic behavior might be difficult to reason about, so it may be important to signal carefully when a module is being skipped with a special yellow message in STDERR.

No New Mocks

An alternative to the behavior described above would be to allow for the use of new mocks when a module is skipped or fails. I would advocate against this, as existing mocks are already a source of confusion for users, and adding new sources of mocks may be a significant footgun that isn't worth having.

The fact that all of these are defined as configuration blocks does leave them open to extension in the future, however. There may be sufficient demand for behavior like emitting mock outputs when a module is skipped or fails that it would be worth adding in the future.

brikis98 commented 6 months ago

Feature Flag Dynamicity

I think the env-var and run_cmd driven approaches you listed both make sense, but the most important use case that's missing, IMO, is "go into the feature flag service, click a button, and now the feature is enabled." That's the bread and butter of using feature toggles, after all: click something, and a feature is enabled or disabled. So I just want to make sure there is some reasonable story around how to set that up. E.g., If the feature toggle service has a webhooks API, maybe you add a webhook that triggers GH Actions to re-run terragrunt apply with the latest feature flag value.
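One possible shape for the webhook idea is a small receiver that translates a flag-change event into a GitHub `repository_dispatch` event, which a workflow can listen for in order to re-run terragrunt apply. The Python sketch below only builds the request and does not send it. The endpoint and body fields follow GitHub's REST API for creating a repository dispatch event, while the event name `feature-flag-changed`, the payload keys, and the `build_dispatch_request` helper are hypothetical.

```python
import json
import urllib.request


def build_dispatch_request(owner, repo, token, flag_name, flag_value):
    """Build (but do not send) a repository_dispatch request for a flag change."""
    body = {
        # Hypothetical event name; a workflow would subscribe to it with
        # `on: repository_dispatch` and `types: [feature-flag-changed]`.
        "event_type": "feature-flag-changed",
        # Hypothetical payload; the workflow could map it to TG_FLAG_<name>.
        "client_payload": {"flag": flag_name, "value": flag_value},
    }
    return urllib.request.Request(
        url=f"https://api.github.com/repos/{owner}/{repo}/dispatches",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


request = build_dispatch_request(
    "example-org", "infrastructure-live", "<token>", "use_service_module_v2", "true"
)
print(request.get_method(), request.full_url)
```

Sending the request (for example with `urllib.request.urlopen`) would be the job of whatever service receives the feature flag tool's webhook; that part depends on the tool and is left out here.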

Note that I'm only looking for guidance here; not first-class features built into TG itself. At least, not at this stage. If this somehow becomes super popular, sure, we can think about native support in plugins or whatever, but for now, I just want to make sure that if we say "TG supports feature flags," that we support its most common use case, which is enabling/disabling features with a click in a UI.

What do we think about having three separate configuration blocks for feature flags, error suppression, and module skipping?

I'm a big +1 on that. I think we'd want to iterate on exactly what the blocks are, but having these as separate entities seems much more powerful, maintainable, understandable, etc.

There may be advantages to having the feature flag defined via a block, however.

Your analysis is convincing. The block approach wins, hands-down, for helping with readability, understanding, and static analysis/commands based off feature toggles.

The second special interaction is that the dependent module may not have access to outputs from the dependency module if certain actions are skipped. If the output action is skipped on the dependency module, the dependent module will have to rely on mock outputs, not read any outputs, or fail. If the apply action is skipped, the dependent module will have to rely on the outputs from the last successful apply, not see any outputs (because the dependency module was never applied), or fail.

This kind of dynamic behavior might be difficult to reason about, so it may be important to signal carefully when a module is being skipped with a special yellow message in STDERR.

I think making it clear what the behavior will be when a module is skipped (or fails and the failure is ignored) is essential. If we use mock outputs or skip or whatever else, we need to make sure it's clear and expected for the user. Maybe even some sort of "use last known in case of skipped or failed dependency" setting, where we use the last known good outputs? Not sure on this, but again, clarity is king here :)

yhakbar commented 5 months ago

To address some feedback that has been brought up regarding this RFC:

"Headless" IAC Updates On Feature Toggle

Some folks have asked about the ability to have feature flag updates triggering infrastructure updates, similar to @brikis98's suggestion above.

The envisioned behavior would be something like updating a feature flag in feature flag management software, then having an event dispatched to drive an infrastructure update without having to manually run another Terragrunt update (I'm calling that a "headless" IAC update, but it will likely be called something different if implemented).

This might be delivered as any of:

While a feature that users would likely appreciate, it is out of scope for this RFC. The primary goal of this RFC is to provide a convenient mechanism for exposing dynamic runtime behavior configuration in Terragrunt, not to provide a way to trigger infrastructure updates based on feature flag changes.

This is something we can revisit at a later date after feature flags are released.

Error Handling

Feedback has also been provided that the mechanism here for handling error suppression may be too simplistic.

For edge nodes in the DAG, it is likely sufficient behavior that errors can be optionally ignored, and for the status code of the entire run-all operation to be set to 0.

For nodes within the middle or start of the DAG, users may want to handle errors in a more nuanced way.

If a node in the middle of the DAG fails, users may want to stop or change the execution of the rest of the DAG. This may be because the failing node is known to be flaky and certain errors in its execution can be safely ignored, but execution of the rest of the DAG should still be stopped if that node fails, as there may be no point in continuing.

As such, while keeping the behavior of error suppression the same (i.e. the rest of the DAG will continue executing if the previous node fails), some additional configuration will be proposed as to how dependency blocks work so that they can opt-in to more nuanced behavior in response to error suppression.

Expose Error Handling Configuration

The proposed failure block will be renamed to an errors block, and will be the future home of all error handling configuration, including retryable_errors, retry_max_attempts and retry_sleep_interval_sec (the existing fields will be preserved for backwards compatibility, but will be overwritten if defined in both places).
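For context, the existing top-level attributes named above look roughly like this in a terragrunt.hcl today. This is only a sketch: the regex and the numeric values are placeholders mirroring the `retry "foo"` example in this comment, not recommendations.

```hcl
# terragrunt.hcl
# Existing top-level retry settings (preserved for backwards compatibility
# under this proposal). Values below are illustrative placeholders.
retryable_errors = [
  ".*Error: foo.*",
]
retry_max_attempts       = 3
retry_sleep_interval_sec = 5
```

Under the proposal, the same three settings would move into a named `retry` block inside `errors`, which is what allows different error patterns to carry different attempt counts and sleep intervals.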

In addition, the proposed skip block will include an additional field skip_dependencies, which will allow users to configure whether or not children of a given node should be skipped.

The dependency block will be updated to include the ability to read status information from the dependency it pulls from, based on these configurations and the outcome of execution in the dependency.

e.g.

# ./parent/terragrunt.hcl
errors {
    # Errors of type foo are retryable, and should be retried up to 3 times with a 5 second sleep interval
    retry "foo" {
        retryable_errors = [".*Error: foo.*"]
        max_attempts = 3
        sleep_interval_sec = 5
    }
    # Errors of type bar are ignorable, and should be ignored
    ignore "bar" {
        ignorable_errors = [".*Error: bar.*"]
        message = "Ignoring error: bar"
    }
    # Errors of type baz are ignorable, and should suppress the rest of the DAG
    ignore "baz" {
        ignorable_errors = [".*Error: baz.*"]
        message = "Ignoring error: baz"
    }
}
# ./child/terragrunt.hcl
dependency "parent" {
  config_path = "../parent"
}

skip {
    # Skip child if any errors are ignored in the parent
    if = dependency.parent.errors.ignored

    # Skip for any `terragrunt` action except `output`. Important, as dependencies will need to extract output from the parent.
    actions = ["all_except_output"]

    # Skip dependencies if errors of type baz are ignored in the parent
    skip_dependencies = dependency.parent.errors.ignore.baz.ignored
}
# ./grandchild/terragrunt.hcl
dependency "child" {
  config_path = "../child"
}

# Does not run if a `baz` error occurred in `parent`, but will if error `bar` was ignored.

The objective here is to provide a more nuanced way to handle errors within the DAG, but to keep behavior relatively predictable.

Trade-offs

One trade-off in this adjustment is that unexpected behavior may occur in the DAG for grandchildren of a node that has suppressed errors. They will have no configuration that indicates that their parent has suppressed errors, but may be skipped if a grandparent has suppressed errors. This is currently the case when a grandparent fails with an error, but we currently emit an exit code of 1, and throw an error in that scenario.

This approach also requires that child dependencies have explicit error handling of ignored errors in parents, which may be very cumbersome for flaky nodes with many dependants. In the scenario that a flaky node has many dependants, it is likely worth making this trade-off, however, as there may be better context for whether a skip is appropriate within a child than in the parent.

This also complicates the API for the dependency block, which already has quite a bit of responsibility, including leveraging mocked values when outputs aren't available in parent nodes.

In addition to complicating the API of the dependency block, it also requires that resolution of the DAG be done in a more nuanced way, as all nodes will need to check if a parent determined that they should be skipped.

denis256 commented 5 months ago

A couple of notes/questions after reading this RFC

I think all introduced blocks should be named, so that we can see which one was triggered, like:

skip "skip1" {
    if = false
    actions = ["all"]
...
}

skip "skip2" {
    if = true
    actions = ["all"]
...
}
...

failure "fail1" {
...
}

errors "parent_errors" {
    retry "foo" {
...
    }
...
}

I'm not sure how cases with multiple "skip" blocks that have contradictory flags will be handled, like:

skip "s1" {
    if = false
    actions = ["all"]
}

skip "s2" {
    if = true
    actions = ["all"]
}

Or should such constructions not be used at all (if blocks aren't named, only one would be allowed)?

Usage of nested setup like:

feature "feature_name"{
  skip {
    ...
  }
  failure {
    ...
  }
}

will not be helpful if users want to define feature flags in the parent file and have different behavior, based on those feature flags, in an unknown number of children.

The part about signals (signals / safe_to_revert) is not quite clear. It looks like a custom "transaction log" that would be generated by Terragrunt and later processed by 3rd party software.

yhakbar commented 5 months ago

@denis256

The idea was that you shouldn't have multiple skip or errors blocks in a terragrunt.hcl file. That should be considered invalid and throw an error, as complications from conflicting configurations would be problematic. The single errors block can have multiple named retry or ignore configurations, which should be enough to ensure that any nuanced error handling can take place.

When skips occur, a warning should be emitted to stderr that points out which terragrunt.hcl file was skipped because it had a skip configuration, which should be enough information to track down which skip was triggered, if it can only have one, right?

Sorry I wasn't clear about this, but I initially started out the RFC with the configurations for skip and failure existing inside the feature block, but upon receiving feedback, I moved them to their own blocks.

Ya, the signals thing might need more thinking through. That's the general idea. How can something outside of Terragrunt, like a CI system, look at a failure that happened and take action (like reverting to a previous configuration or sending a Slack message)?

denis256 commented 2 weeks ago

Prepared a beta release with support for feature flags:

# terragrunt.hcl

feature "run_hook" {
  default = false
}

terraform {
  before_hook "feature_flag" {
    commands = ["apply", "plan", "destroy"]
    execute  = feature.run_hook.value ? ["sh", "-c", "feature_flag_script.sh"] : [ "sh", "-c", "exit", "0" ]
  }
}

Passing feature flags:

terragrunt --feature run_hook=true apply
terragrunt --feature run_hook=true --feature string_flag=dev apply

https://github.com/gruntwork-io/terragrunt/releases/tag/v0.68.8-beta2024110601
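The second command above passes a `string_flag` that the sample terragrunt.hcl does not declare. A minimal sketch of how such a non-boolean flag might be declared and consumed follows; the default value and the `environment` input name are assumptions made for illustration.

```hcl
# terragrunt.hcl
# Sketch only: `string_flag` matches the CLI example above. The default
# value and the `environment` input are hypothetical.
feature "string_flag" {
  default = "test"
}

inputs = {
  # Resolves to "dev" when run with `--feature string_flag=dev`,
  # otherwise falls back to the default declared above.
  environment = feature.string_flag.value
}
```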

denis256 commented 1 week ago

Cut a beta release that supports the exclude block:

# Exclude configurations allowing for dynamically determining when and how to exclude execution of nodes in the Terragrunt graph
exclude {
    if = feature.feature_name.value # Boolean expression that determines if the node should be excluded.
    actions = ["all"] # Actions to exclude when active. Other options might be ["plan", "apply", "all_except_output"], etc
    exclude_dependencies = feature.feature_name.value # Exclude dependencies of the node as well
}

https://github.com/gruntwork-io/terragrunt/releases/tag/v0.68.15-beta2024111501
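Putting the two pieces together, a unit that can be switched off by a flag might look like the sketch below. The flag name `disable_unit` and the chosen actions are assumptions for illustration, not part of the release notes.

```hcl
# terragrunt.hcl
# Hypothetical flag: when true, this unit is excluded from plan and apply.
feature "disable_unit" {
  default = false
}

exclude {
  # Exclude this unit only when the flag is turned on
  if = feature.disable_unit.value

  # Leave `output` untouched so dependents can still read existing outputs
  actions = ["plan", "apply"]

  # Keep the unit's dependencies in the run
  exclude_dependencies = false
}
```

Running `terragrunt --feature disable_unit=true apply` would then be expected to leave this unit out of the apply while the flag's default keeps it in.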

denis256 commented 6 days ago

Released the exclude block in https://github.com/gruntwork-io/terragrunt/releases/tag/v0.68.16

Demo: tg-flags-exclude