Open yhakbar opened 6 months ago
This is a terrific write-up. I love the example use cases, press-release preview, the drawbacks to the design, and the alternatives considered. Well done @yhakbar 👏
Some feedback in a somewhat random order:
`terragrunt.hcl` file that, even if not done, is safe to check in because the feature flag is "off by default." Whatever the outcome of this RFC, I'd very much like to see us support this functionality in some way or another.
Another thing that occurs to me is that you may want to use feature toggles with features within a module, rather than just turning an entire module on or off. Having the feature flag at the module on/off level feels like a side effect of mixing the concern with error handling. If you think about feature flags in isolation, you realize that what you might want is to set various input parameters for a module to different values (e.g., "A" or "B") based on a feature flag, or, perhaps even more likely, set different `source` URLs based on a feature flag: e.g., in dev, use `ref=v2` of a module, but in prod, use `ref=v1`. Perhaps if you think about the feature flag functionality in isolation, it's actually a function we add (`feature_flag(NAME)`), which you can use in various conditionals, rather than a block?
# Example of a feature flag to pick the version of a module to deploy
terraform {
source = "github.com/foo//bar?ref=${feature_flag("FOO") == "A" ? "v2" : "v1"}"
}
# Example of a feature flag that determines error handling
ignore_errors = feature_flag("ENABLE_MODULE_FOO") == true ? [] : [".*"]
`dependency` works if a module is disabled by a feature flag, or hits an error. What output values do the modules that depend on it get?
Responding to @brikis98:
This design did assume that it would be fully compatible with usage of external web services for feature flag management!
I wanted to focus on the core functionality of how the `feature` HCL block would be defined and how it would be used during Terragrunt execution for the RFC, but I should have documented this further.
I would guess that the majority of users leveraging the feature flag functionality proposed here would be setting and adjusting environment variables dynamically in their CI/CD pipelines like GitHub Actions, GitLab CI, Jenkins, etc. Prioritizing the ability to toggle feature flags via environment variables and CLI flags was a way to ensure that the feature flag functionality could be used in a wide variety of CI/CD environments, without relying on an external service.
e.g. In the context of a GitHub Actions workflow, configuration like the following would allow for the `use_service_module_v2` feature to be toggled on or off without any changes to the codebase using GitHub variables:
env:
  TG_FLAG_use_service_module_v2: ${{ vars.TG_FLAG_use_service_module_v2 }}
run: terragrunt apply -auto-approve
Now, for users who are currently using a feature flag management service, I think the current design does not preclude them from using it. There are two ways that I would expect users to use the feature flag functionality as currently proposed in conjunction with a feature flag management service:
They could still use the service to set the environment variables that are used by Terragrunt to toggle feature flags before invoking `terragrunt`:
run: echo "TG_FLAG_use_service_module_v2=$(configcat flag value show use_service_module_v2 --json | jq -r '.value')" >> "$GITHUB_ENV"
run: terragrunt apply -auto-approve
I don't really know if that's the right syntax for the `configcat` CLI, but that's the general idea.
They can reference the feature flag management service directly in the `terragrunt.hcl` file:
feature "use_service_module_v2" {
  default = run_cmd("--terragrunt-quiet", "bash", "-c", "configcat flag value show use_service_module_v2 --json | jq -r '.value'")
}
terraform {
  # When the flag is enabled, use the v2 ref; otherwise stay on v1
  source = "<some url>?ref=${feature.use_service_module_v2.enabled ? "v2" : "v1"}"
}
This would allow users to leverage existing capabilities of Terragrunt to dynamically set feature flag values without reliance on any configuration in a CI/CD pipeline.
I like the idea of seamless integration with feature flag management services that doesn't require leveraging Terragrunt functionality in a manner this sophisticated, however. If this is commonly done within the community, it might be worth it to prioritize a system for integrating with these services directly. Maybe a plugin system that provides nice interfaces for common feature flag management services?
I agree that there's definitely tension between the feature flag concept, the error suppression concept and the module skip concept.
The error suppression and module skip concepts do end up constricting the feature flag implementation in such a way that it's not as flexible as folks typically want feature flags to be. Tying it to those concepts requires that the feature flag is boolean to allow for the module to be skipped or not, and that the feature flag is used to determine whether or not to suppress errors. As you described, this prevents usage of string or numeric feature flags.
At the same time, I could imagine users wanting to tightly integrate those concepts, as it might only make sense to suppress particular errors within the context of a feature flag being enabled.
What do we think about having three separate configuration blocks for feature flags, error suppression, and module skipping? This would allow for more flexibility in how these concepts are used together, and would allow for more complex feature flag configurations that don't necessarily involve error suppression or module skipping.
So, instead of:
feature "feature_name"{
default = false # Optionally default it so that you can opt-in or out.
# Conditions that result in the feature being skipped.
skip {
actions = ["all"] # Actions to skip when active. Other options might be ["plan", "apply", "all_except_output"], etc
}
# Alter behavior on failure
failure {
ignorable_errors = ".*" # Specify a pattern that will be detected in the error for ignores, or just ignore any error
message = "Flaky feature failing here!" # Add an optional warning message if it fails
# Key-value map that can be used to emit signals on failure
signals = {
safe_to_revert = true # Signal that the apply is safe to revert on failure
}
}
}
We could have:
feature "feature_name"{
default = "A"
}
skip {
if = feature.feature_name.value == "A"
actions = ["all"]
}
failure {
ignorable_errors = feature.feature_name.value == "A" ? [".*"]: []
message = feature.feature_name.value == "A" ? "Flaky feature failing here!" : "Woah, this feature is supposed to be solid!"
signals = {
safe_to_revert = feature.feature_name.value == "A"
}
}
And folks might just conventionally keep the blocks together within the `terragrunt.hcl` file.
I worry that this might introduce quite a bit of complexity to the configuration, but it might be worth it for the added flexibility. It would allow for the values of feature flags to take on more complex values, and for the other concepts to be used outside of the context of feature flags.
I like the idea of not needing additional configuration blocks for feature flags, and instead using them as functions that can be used in various places in the configuration. I don't know if one would end up being more expensive to maintain than the other, so it might be worth preferring the cheaper option.
There may be advantages to having the feature flag defined via a block, however.
e.g. It might be easier to see all of the feature flags that are available in configuration at a glance:
feature "feature_name"{
default = "A"
}
terraform {
source = "github.com/foo//bar?ref=${feature.feature_name.value == "A" ? "v2" : "v1"}"
}
Might be easier to spot than:
terraform {
source = "github.com/foo//bar?ref=${feature_flag("feature_name") == "A" ? "v2" : "v1"}"
}
That would be especially relevant when searching for feature flags to remove once features are stable. This might even lend itself to a `terragrunt feature ls` command that could be used to list all of the feature flags that are available in a configuration, `terragrunt feature rm` to remove a feature flag, `terragrunt feature evaluate` to evaluate all feature flags with current context, etc (though I don't think those are features that would be necessary for initial implementation).
There is also functionality that could be added to the feature block that would be difficult to add to a function. For example, configuring a default value for a feature flag might be more likely to be consistent when done via a block than when done via a function.
e.g. To keep a default value consistent across all uses of a function, you might have to do something like:
locals {
do_experiment = feature_flag("DO_EXPERIMENT", false) # Where the second argument is the default value
value1 = local.do_experiment ? "A" : "B"
value2 = local.do_experiment ? "C" : "D"
# Because a different default is used here, it's harder to reason about the value of the feature flag
value3 = feature_flag("DO_EXPERIMENT", true) ? "E" : "F"
}
Whereas, with a block, it's a lot more explicit:
feature "do_experiment"{
default = false
}
locals {
value1 = feature.do_experiment.value ? "A" : "B"
value2 = feature.do_experiment.value ? "C" : "D"
# Here we're explicitly negating the value of the feature flag, the default can't vary between uses
value3 = !feature.do_experiment.value ? "E" : "F"
}
Having a block also allows for more complex feature flag configurations in the future, like the ability to configure a provider for integration with a feature flag management service or to have validations, etc.
`dependency` Interaction
Dependency interactions may be confusing to users, and that interaction should definitely be documented well.
If we split up the feature flag, error suppression, and module skip concepts into separate configuration blocks, the `dependency` block won't have any special interaction with feature flags. Feature flags will just be a way of signaling values to Terragrunt modules, and the `dependency` block might just experience a difference in the `outputs` it extracts from the module.
Error suppression on the other hand will result in special interaction.
On the initial apply of a module, if that module fails, no outputs will be available for the module that depends on it. As long as the module is syntactically correct, the module that depends on it will see the equivalent of an un-applied module. If, for whatever reason, the dependent module is mocking outputs for the dependency block, with a `mock_outputs_allowed_terraform_commands` including `apply`, the dependent module will use those mocked outputs. This means that the dependent module may still fail to apply if the outputs present are not sufficient to satisfy the dependent module's configuration.
On applies that follow a successful apply, failure of the dependency module will result in the dependent module using the outputs from the last successful apply (again, assuming the `.tf` files are syntactically correct for the dependency module).
Hopefully, this behavior is consistent with what users would expect, as we would just be swallowing the exit code of the dependency module and interacting with it as if it had succeeded.
Module skip will also result in special interaction.
The first special interaction is that the skipped module will be pruned from any dependency graph that a dependent module is a part of. A `run-all` that includes both a skipped module and a module that depends on it will result in only the dependent module being run. This interaction was the motivation behind allowing the `skip` block to indicate which actions should be skipped. If a user knows that the module should only be skipped for `apply` actions, they can specify that in the `skip` block, and still have it pulled into dependency graphs for other actions like `plan` and `output`.
Hopefully, that won't be too complicated. One aspect that might be tricky is that a dependency graph that includes a skipped module in the middle might be more complicated to construct than one that doesn't.
e.g.
A -> B -> C
Skip B
A -> C
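The pruning in the example above can be sketched in Python. This is only an illustration under stated assumptions, not how Terragrunt builds its graph: the graph is a plain dict mapping each node to the nodes its arrows point to, and `prune_skipped` is a hypothetical helper name. It reconnects the neighbors of a skipped node so that `A -> B -> C` becomes `A -> C`.

```python
def prune_skipped(graph, skipped):
    """Remove skipped nodes and reconnect their predecessors to their successors.

    graph maps each node to the list of nodes its arrows point to.
    (Hypothetical helper; the data model is an assumption for illustration.)
    """
    def resolve(node, seen):
        # A non-skipped node stands for itself.
        if node not in skipped:
            return [node]
        # Guard against cycles made up of skipped nodes.
        if node in seen:
            return []
        seen = seen | {node}
        result = []
        # A skipped node is replaced by its nearest non-skipped successors.
        for nxt in graph.get(node, []):
            for target in resolve(nxt, seen):
                if target not in result:
                    result.append(target)
        return result

    pruned = {}
    for node, successors in graph.items():
        if node in skipped:
            continue
        new_successors = []
        for nxt in successors:
            for target in resolve(nxt, frozenset()):
                if target not in new_successors:
                    new_successors.append(target)
        pruned[node] = new_successors
    return pruned


graph = {"A": ["B"], "B": ["C"], "C": []}
print(prune_skipped(graph, {"B"}))  # {'A': ['C'], 'C': []}
```

Several skipped nodes in a row are handled by the same recursion, which is where most of the extra complexity mentioned above would come from.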
The second special interaction is that the dependent module may not have access to outputs from the dependency module if certain actions are skipped. If the `output` action is skipped on the dependency module, the dependent module will have to rely on mock outputs, not read any outputs, or fail. If the `apply` action is skipped, the dependent module will have to rely on the outputs from the last successful apply, not see any outputs because the dependency module never applied, or fail.
This kind of dynamic behavior might be difficult to reason about, so it may be important to signal carefully when a module is being skipped with a special yellow message in STDERR.
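To make the fallback options above concrete, here is a rough Python sketch of one possible policy. Everything in it is an assumption for illustration: the function name, the argument shapes, and in particular the choice to fail when no outputs are available (the text above leaves open whether to fail or to read no outputs).

```python
import sys


def resolve_dependency_outputs(skipped_actions, last_applied_outputs=None, mock_outputs=None):
    """Pick the outputs a dependent module would see for a skipped dependency.

    Hypothetical policy: if `output` (or `all`) is skipped, only mocks can be
    used; otherwise fall back to outputs from the last successful apply.
    """
    if "output" in skipped_actions or "all" in skipped_actions:
        source, outputs = "mock outputs", mock_outputs
    else:
        source, outputs = "outputs from the last successful apply", last_applied_outputs

    if skipped_actions:
        # Signal the skip clearly on STDERR, as suggested above.
        print(
            f"WARNING: dependency skipped for {sorted(skipped_actions)}; using {source}",
            file=sys.stderr,
        )

    if outputs is None:
        # Assumed choice: fail rather than silently reading no outputs.
        raise LookupError(f"no {source} available for skipped dependency")
    return outputs


print(resolve_dependency_outputs({"apply"}, last_applied_outputs={"url": "https://example.com"}))
```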
An alternative to the behavior described above would be to allow for the use of new mocks when a module is skipped or fails. I would advocate against this, as existing mocks are already a source of confusion for users, and adding new sources of mocks may be a significant footgun that isn't worth having.
The fact that all of these are defined as configuration blocks does leave them open to extension in the future, however. There may be sufficient demand for behavior like emitting mock outputs when a module is skipped or fails that it would be worth adding in the future.
Feature Flag Dynamicity
I think the env-var and `run_cmd` driven approaches you listed both make sense, but the most important use case that's missing, IMO, is "go into the feature flag service, click a button, and now the feature is enabled." That's the bread and butter of using feature toggles, after all: click something, and a feature is enabled or disabled. So I just want to make sure there is some reasonable story around how to set that up. E.g., If the feature toggle service has a webhooks API, maybe you add a webhook that triggers GH Actions to re-run `terragrunt apply` with the latest feature flag value.
Note that I'm only looking for guidance here; not first-class features built into TG itself. At least, not at this stage. If this somehow becomes super popular, sure, we can think about native support in plugins or whatever, but for now, I just want to make sure that if we say "TG supports feature flags," that we support its most common use case, which is enabling/disabling features with a click in a UI.
> What do we think about having three separate configuration blocks for feature flags, error suppression, and module skipping?
I'm a big +1 on that. I think we'd want to iterate on exactly what the blocks are, but having these as separate entities seems much more powerful, maintainable, understandable, etc.
> There may be advantages to having the feature flag defined via a block, however.
Your analysis is convincing. The block approach wins, hands-down, for helping with readability, understanding, and static analysis/commands based off feature toggles.
> The second special interaction is that the dependent module may not have access to outputs from the dependency module if certain actions are skipped. If the output action is skipped on the dependency module, the dependent module will have to rely on mock outputs, not read any outputs, or fail. If the apply action is skipped, the dependent module will have to rely on the outputs from the last successful apply, not see any outputs because the dependency module never applied or fail.
> This kind of dynamic behavior might be difficult to reason about, so it may be important to signal carefully when a module is being skipped with a special yellow message in STDERR.
I think making it clear what the behavior will be when a module is skipped (or fails and the failure is ignored) is key. If we use mock outputs or skip or whatever else, we need to make sure it's clear and expected for the user. Maybe even some sort of "use last known in case of skipped or failed dependency" setting, where we use the last known good outputs? Not sure on this, but again, clarity is king here :)
To address some feedback that has been brought up regarding this RFC:
Some folks have asked about the ability to have feature flag updates triggering infrastructure updates, similar to @brikis98's suggestion above.
The envisioned behavior would be something like updating a feature flag in feature flag management software, then having an event dispatched to drive an infrastructure update without having to manually run another Terragrunt update (I'm calling that a "headless" IAC update, but it will likely be called something different if implemented).
This might be delivered as any of:
While a feature that users would likely appreciate, it is out of scope for this RFC. The primary goal of this RFC is to provide a convenient mechanism for exposing dynamic runtime behavior configuration in Terragrunt, not to provide a way to trigger infrastructure updates based on feature flag changes.
This is something we can revisit at a later date after feature flags are released.
Feedback has also been provided that the mechanism here for handling error suppression may be too simplistic.
For edge nodes in the DAG, it is likely sufficient behavior that errors can be optionally ignored, and for the status code of the entire `run-all` operation to be set to `0`.
For nodes within the middle or start of the DAG, users may want to handle errors in a more nuanced way.
If a node in the middle of the DAG fails, users may want to stop or change the execution of the rest of the DAG. This may be because the failing node is known to be flaky, and certain errors in its execution can be safely ignored, but execution of the rest of the DAG should be stopped if that node fails, as there may be no point in continuing.
As such, while keeping the behavior of error suppression the same (i.e. the rest of the DAG will continue executing if the previous node fails), some additional configuration will be proposed as to how `dependency` blocks work so that they can opt-in to more nuanced behavior in response to error suppression.
The proposed `failure` block will be renamed to an `errors` block, and will be the future home of all error handling configuration, including `retryable_errors`, `retry_max_attempts` and `retry_sleep_interval_sec` (the existing fields will be preserved for backwards compatibility, but will be overwritten if defined in both places).
In addition, the proposed `skip` block will include an additional field `skip_dependencies`, which will allow users to configure whether or not children of a given node should be skipped.
The `dependency` block will be updated to include the ability to read status information from the dependency it pulls from, based on these configurations and the outcome of execution in the dependency.
e.g.
# ./parent/terragrunt.hcl
errors {
# Errors of type foo are retryable, and should be retried up to 3 times with a 5 second sleep interval
retry "foo" {
retryable_errors = [".*Error: foo.*"]
max_attempts = 3
sleep_interval_sec = 5
}
# Errors of type bar are ignorable, and should be ignored
ignore "bar" {
ignorable_errors = [".*Error: bar.*"]
message = "Ignoring error: bar"
}
# Errors of type baz are ignorable, and should suppress the rest of the DAG
ignore "baz" {
ignorable_errors = [".*Error: baz.*"]
message = "Ignoring error: baz"
}
}
# ./child/terragrunt.hcl
dependency "parent" {
  config_path = "../parent"
}
skip {
  # Skip child if any errors are ignored in the parent
  if = dependency.parent.errors.ignored
  # Skip for any `terragrunt` action except `output`. Important, as dependencies will need to extract output from the parent.
  actions = ["all_except_output"]
  # Skip dependencies if errors are ignored of type baz
  skip_dependencies = dependency.parent.errors.ignore.baz.ignored
}
# ./grandchild/terragrunt.hcl
dependency "child" {
config_path = "../child"
}
# Does not run if a `baz` error occurred in `parent`, but will if error `bar` was ignored.
The objective here is to provide a more nuanced way to handle errors within the DAG, but to keep behavior relatively predictable.
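As a rough illustration of how named `retry` and `ignore` rules like the ones above might be evaluated against command output, here is a small Python sketch. The dict layout, the `classify_error` name, and the choice to check retry rules before ignore rules are all assumptions for illustration; the RFC does not specify an evaluation order.

```python
import re

# Hypothetical in-memory form of the `errors` block from the example above.
ERRORS_CONFIG = {
    "retry": {
        "foo": {
            "retryable_errors": [r".*Error: foo.*"],
            "max_attempts": 3,
            "sleep_interval_sec": 5,
        },
    },
    "ignore": {
        "bar": {"ignorable_errors": [r".*Error: bar.*"], "message": "Ignoring error: bar"},
        "baz": {"ignorable_errors": [r".*Error: baz.*"], "message": "Ignoring error: baz"},
    },
}


def classify_error(output, config=ERRORS_CONFIG):
    """Return (action, rule_name) for the first rule whose pattern matches."""
    # Assumed precedence: retry rules are consulted before ignore rules.
    for name, rule in config["retry"].items():
        if any(re.search(pattern, output) for pattern in rule["retryable_errors"]):
            return ("retry", name)
    for name, rule in config["ignore"].items():
        if any(re.search(pattern, output) for pattern in rule["ignorable_errors"]):
            return ("ignore", name)
    # Nothing matched: the error is treated as a normal failure.
    return ("fail", None)


print(classify_error("Error: foo happened"))  # ('retry', 'foo')
print(classify_error("Error: bar happened"))  # ('ignore', 'bar')
print(classify_error("Error: other"))         # ('fail', None)
```

Returning the rule name alongside the action is what would let a child distinguish a `bar` ignore from a `baz` ignore, as the `skip_dependencies` line in the example does.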
One trade-off in this adjustment is that unexpected behavior may occur in the DAG for grandchildren of a node that has suppressed errors. They will have no configuration that indicates that their parent has suppressed errors, but may be skipped if a grandparent has suppressed errors. This is currently the case when a grandparent fails with an error, but we currently emit an exit code of 1, and throw an error in that scenario.
This approach also requires that child dependencies have explicit error handling of ignored errors in parents, which may be very cumbersome for flaky nodes with many dependants. In the scenario that a flaky node has many dependants, it is likely worth making this trade-off, however, as there may be better context for whether a skip is appropriate within a child than in the parent.
This also complicates the API for the `dependency` block, which already has quite a bit of responsibility, including leveraging mocked values when outputs aren't available in parent nodes.
In addition to complicating the API of the `dependency` block, it also requires that resolution of the DAG be done in a more nuanced way, as all nodes will need to check if a parent determined that they should be skipped.
A couple of notes/questions after reading this RFC
I think all introduced blocks should be named, in this way we can see which one was triggered, like:
skip "skip1" {
if = false
actions = ["all"]
...
}
skip "skip2" {
if = true
actions = ["all"]
...
}
...
failure "fail1" {
...
}
errors "parent_errors" {
retry "foo" {
...
}
...
}
Not sure how cases will be handled when there are multiple "skip" blocks with contradictory flags, like:
skip "s1" {
if = false
actions = ["all"]
}
skip "s2" {
if = true
actions = ["all"]
}
Or such constructions shouldn't be used (if blocks aren't named, only one will be allowed).
Usage of nested setup like:
feature "feature_name"{
skip {
...
}
failure {
...
}
}
will not be helpful if users want to define feature flags in the parent file and have different behavior, based on those feature flags, in an unknown number of children
The part with signals (`signals` / `safe_to_revert`) is not quite clear: it looks like a custom "transaction log" which should be generated by Terragrunt and later processed by 3rd party software.
@denis256
The idea was that you shouldn't have multiple `skip` or `errors` blocks in a `terragrunt.hcl` file. That should be considered invalid and throw an error, as complications from conflicting configurations would be problematic. The single `errors` block can have multiple named `retry` or `ignore` configurations, which should be enough to ensure that any nuanced error handling can take place.
When skips occur, a warning should be emitted to stderr that points out which `terragrunt.hcl` file was skipped because it had a `skip` configuration, which should be enough information to track down which skip was triggered, if it can only have one, right?
Sorry I wasn't clear about this, but I initially started out the RFC with the configurations for `skip` and `failure` existing inside the `feature` block, but upon receiving feedback, I moved them to their own blocks.
Ya, the `signals` thing might need more thinking through. That's the general idea. How can something outside of Terragrunt like a CI system, etc look at a failure that happened and take action (like reverting to previous configuration or sending a slack message, etc)?
Prepared beta release with support of feature flags
# terragrunt.hcl
feature "run_hook" {
default = false
}
terraform {
before_hook "feature_flag" {
commands = ["apply", "plan", "destroy"]
execute = feature.run_hook.value ? ["sh", "-c", "feature_flag_script.sh"] : [ "sh", "-c", "exit", "0" ]
}
}
Passing feature flags:
terragrunt --feature run_hook=true apply
terragrunt --feature run_hook=true --feature string_flag=dev apply
https://github.com/gruntwork-io/terragrunt/releases/tag/v0.68.8-beta2024110601
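For readers wondering how flag values from the different sources might combine, here is a hedged Python sketch. It is not taken from the Terragrunt source: the function names are hypothetical, the `TG_FLAG_` prefix is the one proposed earlier in this thread, and both the precedence (CLI over environment over block default) and the `true`/`false` string coercion are assumptions.

```python
def parse_value(raw):
    """Coerce "true"/"false" to booleans; leave anything else as a string (assumed rule)."""
    lowered = raw.strip().lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    return raw


def collect_feature_flags(argv, environ, defaults):
    """Merge flag values. Assumed precedence: CLI > environment > block default."""
    flags = dict(defaults)

    # Environment variables using the TG_FLAG_<feature name> form from the RFC.
    prefix = "TG_FLAG_"
    for key, raw in environ.items():
        if key.startswith(prefix):
            flags[key[len(prefix):]] = parse_value(raw)

    # Repeated `--feature NAME=VALUE` pairs, as in the commands above.
    args = iter(argv)
    for arg in args:
        if arg == "--feature":
            pair = next(args, None)
            if pair is None or "=" not in pair:
                raise ValueError("--feature expects NAME=VALUE")
            name, raw = pair.split("=", 1)
            flags[name] = parse_value(raw)
    return flags


argv = ["--feature", "run_hook=true", "--feature", "string_flag=dev", "apply"]
print(collect_feature_flags(argv, {"TG_FLAG_run_hook": "false"}, {"run_hook": False}))
# {'run_hook': True, 'string_flag': 'dev'}
```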
Cut a beta release that supports the `exclude` block
# Exclude configurations allowing for dynamically determining when and how to exclude execution of nodes in the Terragrunt graph
exclude {
if = feature.feature_name.value # Boolean expression that determines if the node should be excluded.
actions = ["all"] # Actions to exclude when active. Other options might be ["plan", "apply", "all_except_output"], etc
exclude_dependencies = feature.feature_name.value # Exclude dependencies of the node as well
}
https://github.com/gruntwork-io/terragrunt/releases/tag/v0.68.15-beta2024111501
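The interplay of `if` and `actions` can be sketched as a small predicate. This Python snippet is only an interpretation of the comments in this thread (`all` excludes every command, `all_except_output` excludes everything but `output`, otherwise only the listed commands are excluded); `is_excluded` is a hypothetical name, not a Terragrunt function.

```python
def is_excluded(command, condition, actions):
    """Decide whether `command` should be excluded for a unit.

    condition: the evaluated `if` expression of the exclude block.
    actions:   the list of actions from the exclude block.
    """
    if not condition:
        return False
    if "all" in actions:
        return True
    if "all_except_output" in actions:
        # `output` stays available so dependents can still read outputs.
        return command != "output"
    return command in actions


print(is_excluded("apply", True, ["all"]))                # True
print(is_excluded("output", True, ["all_except_output"]))  # False
print(is_excluded("plan", True, ["apply"]))               # False
```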
The `exclude` block was released in https://github.com/gruntwork-io/terragrunt/releases/tag/v0.68.16
Demo:
Summary
Provide first class support for feature flags as part of Terragrunt HCL configuration.
Allow for dynamic configuration of behavior in select `terragrunt.hcl` files based on the presence or absence of feature flags that are set via environment variables and CLI flags.
In addition, update how Terragrunt errors and excludes are handled to ensure that unstable configuration can be handled gracefully.
Motivation
Terragrunt is frequently used in monorepo contexts, and it lends itself to this in how it segments IAC state into separate directories. One definition of monorepos is a single codebase with multiple independent, but related, projects. By this definition, Terragrunt is very much an IAC monorepo tool. Multiple units of IAC are defined independently, and as a whole, they represent a repository of IAC.
Feature flags are a common way to manage the complexity of a monorepo. They allow for the gradual rollout of new features, the ability to turn off features that are not ready for production, and the ability to manage the complexity of a large codebase.
This is especially important in the context of Terragrunt, where infrastructure is most safely updated when updated in small, incremental changes. In addition, the ability to control how failure is handled in IAC is extremely important. Preventing full resolution of an apply across multiple Terragrunt units because a known flaky unit is failing is not always something that can be remediated by the use of retries, and it can be expensive to do so. Occasionally, it is better to ignore the failure of a known flaky unit and continue with the rest of applies, assuming that the failure is not critical to the overall success of the apply.
An example of such a failure would be a dependency chain where one service is deployed by Terragrunt, and has a `url` output where the service can be accessed, and another service which uses a `dependency` block to pass that `url` into the environment variables of a second service.
In this example, if the first service fails to deploy, the second service will also fail to deploy. However, if the first service is known to be flaky, and the second service is not dependent on the first service being deployed successfully, it is better to ignore the failure of the first service and continue with the deployment of the second service, leveraging the `url` output from a previous successful apply.
Reasons that a unit might be marked in this way include:
Proposal
Provide a combination of:
Proposed Syntax
Examples
The syntax is intended to be flexible enough to support a couple different use-cases that are common when using feature flags.
Dynamic Module Example
Mark a `terragrunt.hcl` file as having a feature that triggers usage of a new module that is not yet stable. In lower environments, this flag is enabled, and in production, it is disabled.
In addition, if the apply fails, it is safe to revert the apply, and a special error message is logged to the console.
In this contrived example, the "v2" tag of the module is not currently stable. However, to encourage continuous integration, the platform team has decided to merge in configurations that can use it when a flag is enabled. In the dev environment, the feature flag is enabled, and in the production environment, it is disabled.
When an apply fails, as is expected, a special message is emitted to STDERR to indicate that the source of failure is due to a failure in a feature flag.
In addition, on error, a special `error-signals.json` file will be created in the same directory as the `terragrunt.hcl` file with a payload that the platform team knows will be useful to handle the error intelligently. In this scenario, the logic that the team has agreed upon is that if any `terragrunt apply` fails, revert to the last commit and re-run the apply, if a `safe_to_revert` entry is found in the `error-signals.json` for the corresponding `terragrunt.hcl` file that was applied.
The logic here is definitely not what would work for most organizations to achieve a reliable mechanism for reverting a failed apply. It is merely a demonstration of why authors might want signals emitted on failure.
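A CI job consuming such a signal might look roughly like the Python sketch below. The payload shape is an assumption (a flat JSON object mirroring the `signals` map, e.g. `{"safe_to_revert": true}`), and `safe_to_revert` as a function name is hypothetical; the RFC only names the `error-signals.json` file and the `safe_to_revert` entry.

```python
import json
from pathlib import Path


def safe_to_revert(unit_dir):
    """Return True only if error-signals.json in unit_dir marks the apply as safe to revert.

    Assumes the file holds a flat JSON object such as {"safe_to_revert": true}.
    """
    signals_path = Path(unit_dir) / "error-signals.json"
    if not signals_path.exists():
        # No signals were emitted, so make no assumption about reverting.
        return False
    try:
        signals = json.loads(signals_path.read_text())
    except json.JSONDecodeError:
        # A malformed file is treated the same as a missing one.
        return False
    return isinstance(signals, dict) and signals.get("safe_to_revert") is True
```

A pipeline could call this after a failed `terragrunt apply` and only then decide whether to check out the previous commit and re-run the apply.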
Unreliable Module Example
Mark a `terragrunt.hcl` file as being unreliable, and ignore any failures with errors matching `Networking Error` that might occur when applying it.
In this example, users are able to mark the `terragrunt.hcl` file in the `unreliable` directory as being unreliable, knowing that it predictably produces an error with the message `Networking Error` that can be safely ignored when re-applied.
The ability to ignore errors in the `unreliable` module is handy here, as the `reliable` module reads a static output from the `unreliable` module that doesn't change much, and uses it as an input.
Examples of modules that can have this kind of relationship include:
The dependent modules can continue to codify their dependency relationships to get access to inputs like the database hostname, which is frequently required to connect to the database, and the cluster ID can be passed to the pod, so that its placement can be targeted to the cluster.
In both scenarios, users might find it convenient to be able to avoid failing to successfully deploy the dependent modules when predictable, intermittent errors occur in the dependency.
When using feature flags to support this kind of functionality, the feature flag can be opted-out, via setting an environment variable like so:
This allows for platform teams to safely test removal of ignored failures until the `feature` configuration blocks can be removed (possibly by only disabling the feature in lower environments).
In-progress Module Example
Mark a `terragrunt.hcl` file as being in-progress, excluding all operations on it until a certain feature is complete. The feature can be manually turned on when developing locally, but is off by default.
When developing the module locally, use the following flag to activate the module:
This is a simple way to allow incomplete IaC work to be integrated into a code-base without requiring that the code be fully mature before merging it in.
Rapid, frequent and incremental integration is the standard in Continuous Integration, and this provides a mechanism for achieving that for large IaC code bases.
In addition, note the `exclude_dependencies` field being used here, which allows for skipping the dependencies of the module as well. This is useful when building out multiple modules that are dependent on each other, and you want to skip the entire chain of dependencies while a module is in-progress.
Technical Details
Some components that will definitely be impacted include:
- `feature` blocks.
- `error_hook`s and `retryable_errors` already alter behavior of a normal Terragrunt execution on failure. This would be another tool that can change how errors are handled in Terragrunt due to the `feature.failure` block.
- `TG_FLAG_<feature name>`.
- `--feature` (or maybe `--terragrunt-feature`).
- `terragrunt command` when the `feature.skip` conditions are met.
- `terragrunt run-all command` when they have `feature.skip` conditions met.
Press Release
First Class Feature Flags
Terragrunt now has built in support for feature flags, allowing behavior of Terragrunt executions to be altered dynamically at runtime.
Feature flags are a staple of modern DevOps best practices, and using them in Terragrunt will allow you to improve the scalability of your IaC code base.
Use feature flags to support the following, and more:
Feature flags are available as of [RELEASE]. To learn more about how to use them, click [here](link to feature flag documentation).
Drawbacks
Some drawbacks of this proposal include:
- `terragrunt.hcl` file. Users have already been encountering `terragrunt.hcl` files that are too long and difficult to maintain. This added complexity might make `terragrunt.hcl` files even more difficult to reason about.
- `terragrunt.hcl` files might be very difficult to reason about.
- `exclude` logic, during execution of the module if the `enabled` status of the feature is used in controlling behavior, and if `failure` logic is used to handle failure.
Alternatives
1. `get_env`, and adding custom logic to adjust behavior of executions based on the values of the environment variables.
2. `ignored_errors` companion to the `retryable_errors` that just ignores errors instead of retrying them. Customers have been asking for functionality like this to support handling both failures that are not intermittent enough to recover from retrying over a short duration, and errors in modules that are computationally or temporally expensive to retry soon after failure.
3. `get_env` and `run_cmd`. Provide nice walkthroughs on how to achieve common feature flag patterns with existing tooling in Terragrunt.
These alternatives, while less expensive than undertaking the introduction of net new functionality in Terragrunt, were considered less beneficial, as first class support for feature flags is generally something that makes a good match for Terragrunt, in my opinion.
Option #2 is also not necessarily mutually exclusive. It might be a good idea to pursue that anyways.
Migration Strategy
None
Unresolved Questions
See the section above about the syntax of feature flags.
I also am not sure how expensive this functionality would be to implement and maintain.
Would the community be interested in this functionality, or would they be more interested in any of the alternatives?
References
Proof of Concept Pull Request
N/A
Edits
- `feature`, `skip`, and `errors`. In addition, the proposal now includes some logic for skipping dependencies.
- `skip` to `exclude`, there is already `skip` attribute in HCL
- `skip_dependencies` to `exclude_dependencies` to match naming convention