hashicorp / terraform

Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.
https://www.terraform.io/

Custom error code for Terraform CLI if the check assertion fails #33174

Open vasylenko opened 1 year ago

vasylenko commented 1 year ago

Terraform Version

Terraform v1.5.0-alpha20230504

Use Cases

With the current implementation of the check block, assertion results do not affect the exit code of the apply command: if the apply succeeds, the user gets exit code 0 regardless of any failing assertions.

It would be useful to have a detailed exit code for this case so that CI pipelines can rely on it when check blocks are used for smoke/functional tests of the created infrastructure.

Example use case:

resource "local_file" "example" {
  filename = "${path.module}/example.txt"
  content  = "foobars"
}

resource "aws_s3_bucket" "example" {
  bucket_prefix = "tf-checks-example"
  force_destroy = true
}

resource "aws_s3_object" "localfile" {
  bucket       = aws_s3_bucket.example.id
  key          = "example.txt"
  source       = local_file.example.filename
  content_type = "text/*"
}

check "verify_s3object_content" {
  data "aws_s3_object" "this" {
    bucket = aws_s3_bucket.example.id
    key    = aws_s3_object.localfile.key
  }
  assert {
    condition     = data.aws_s3_object.this.body == "bananasapples"
    error_message = "S3 object content must be bananasapples"
  }
}

Terraform apply:

➜  check-block  # terraform apply -auto-approve                                                                              
local_file.example: Refreshing state... [id=*****]
aws_s3_bucket.example: Refreshing state... [id=*****]
aws_s3_object.localfile: Refreshing state... [id=example.txt]
data.aws_s3_object.this: Reading...
data.aws_s3_object.this: Read complete after 0s [id=*****/example.txt]

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
 <= read (data resources)

Terraform will perform the following actions:

  # data.aws_s3_object.this will be read during apply
  # (config will be reloaded to verify a check block)
 <= data "aws_s3_object" "this" {
      + body                   = "foobars"
      + bucket                 = "*****"
      + bucket_key_enabled     = false
      + content_length         = 11
      + content_type           = "text/*"
      + etag                   = "7fac576e6ee21e27ce056bb55cbc1db0"
      + id                     = "*****/example.txt"
      + key                    = "example.txt"
      + last_modified          = "Wed, 10 May 2023 09:37:21 UTC"
      + metadata               = {}
      + server_side_encryption = "AES256"
      + storage_class          = "STANDARD"
      + tags                   = {}
    }

Plan: 0 to add, 0 to change, 0 to destroy.
data.aws_s3_object.this: Reading...
data.aws_s3_object.this: Read complete after 1s [id=*****/example.txt]
╷
│ Warning: Check block assertion failed
│ 
│   on main.tf line 33, in check "verify_s3object_content":
│   33:     condition     = data.aws_s3_object.this.body == "bananasapples"
│     ├────────────────
│     │ data.aws_s3_object.this.body is "foobars"
│ 
│ S3 object content must be bananasapples
╵

Exit code:

➜  check-block  # echo $?                                                                                                    
0

Attempted Solutions

none

Proposal

Add a -detailed-exitcode option to the apply command, returning status 2 when check assertions fail, similar to the existing -detailed-exitcode option of the plan command.
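For reference, `terraform plan -detailed-exitcode` returns 0 for success with no changes, 1 for an error, and 2 for success with changes present. A CI step consuming the proposed apply variant might look roughly like this sketch; mapping exit code 2 to "failed check assertions" is the proposal here, not current Terraform behaviour.

```python
def interpret_apply_exit(code: int) -> str:
    """Map a hypothetical `terraform apply -detailed-exitcode` status
    to a CI decision. The 0/1/2 semantics mirror those documented for
    `terraform plan -detailed-exitcode`."""
    if code == 0:
        return "ok"             # apply succeeded, all check assertions passed
    if code == 2:
        return "checks-failed"  # apply succeeded, but a check assertion failed
    return "error"              # apply itself failed

print(interpret_apply_exit(0))  # ok
print(interpret_apply_exit(2))  # checks-failed
print(interpret_apply_exit(1))  # error
```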

References

No response

apparentlymart commented 1 year ago

Hi @vasylenko! Thanks for sharing this use-case.

Part of the design requirements for check blocks is that they are not treated as errors, and instead just as warnings after the apply is complete. This is what differentiates check blocks from precondition and postcondition blocks elsewhere in the language.

For your use-case, would it be sufficient for you to define a postcondition on resource "aws_s3_object" "localfile", so that it'll be considered failed if the content doesn't match what you specified?

resource "aws_s3_object" "localfile" {
  bucket       = aws_s3_bucket.example.id
  key          = "example.txt"
  source       = local_file.example.filename
  content_type = "text/*"

  lifecycle {
    postcondition {
      condition = self.content == "bananasapples"
      error_message = "S3 object content must be bananasapples."
    }
  }
}

Alternatively, you could write it as a precondition checking that the local_file.example object meets the expectation, which would then allow Terraform to detect it before even creating the S3 object, so you would never have an invalid object in your bucket:

resource "aws_s3_object" "localfile" {
  bucket       = aws_s3_bucket.example.id
  key          = "example.txt"
  source       = local_file.example.filename
  content_type = "text/*"

  lifecycle {
    precondition {
      condition = local_file.example.content == "bananasapples"
      error_message = "File content must be bananasapples."
    }
  }
}

In both of these cases terraform apply should fail with a nonzero exit code if the condition isn't met. In the latter case, Terraform may be able to catch the problem during the plan phase, as long as the content of the local file is known during planning, as is the case in these contrived examples.


Another different approach would be to analyze the checks in the resulting state after terraform apply has finished. To do that, you could run terraform show -json to obtain the documented JSON state representation, and then use the checks representation to examine the check results using some custom software written by you.

A simple implementation would be to check just the top-level checks property "status", which summarizes the statuses of all checkable objects into a single string value.

That particular part of the JSON output format is not yet committed as stable, but feedback about folks successfully using it will help establish the confidence to include it in Terraform's compatibility promises in a future version.

We typically use these JSON formats as the integration point for new features because JSON is a far more flexible and extensible format for sharing data than process exit codes. -detailed-exitcode is covered by Terraform's compatibility promises and so will be preserved for the foreseeable future, but it's also possible to achieve the effect of that option by analyzing the output from terraform show -json PLANFILE, so this JSON-based approach is a superset of what we can possibly represent by process exit codes.
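The post-apply analysis described above might look roughly like the following in a CI wrapper. The embedded JSON is an illustrative sample only; the exact shape of the `checks` representation comes from the JSON output format documentation and is not yet committed as stable, so the field names here are assumptions.

```python
import json

# Illustrative sample of the `checks` portion of `terraform show -json`
# output (not real output; field names follow the documented, not-yet-stable
# JSON state representation).
state_json = """
{
  "format_version": "1.0",
  "checks": [
    {
      "address": {"kind": "check", "name": "verify_s3object_content"},
      "status": "fail"
    }
  ]
}
"""

def failed_checks(state: dict) -> list:
    """Return the names of check blocks whose overall status is 'fail'."""
    return [
        c["address"]["name"]
        for c in state.get("checks", [])
        if c.get("status") == "fail"
    ]

state = json.loads(state_json)
failures = failed_checks(state)
if failures:
    # A CI wrapper could exit nonzero here, e.g. `sys.exit(2)`.
    print("Failed checks:", ", ".join(failures))
```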

vasylenko commented 1 year ago

Hi @apparentlymart, and thank you for such a detailed response!

My use case was an example, not a real-life one. I just wanted to illustrate how different resources are created together to compose some logical unit of functionality that should be tested as a whole afterward (so postcondition would not fit, because it is bound to a specific resource). Looks like it was not the best example 😅

But anyway, the point was to find a way to automate post-apply check validation.

I appreciate your explanation of the JSON approach 👍 I did not know that checks are stored in the state file. Actually, I thought (for some reason) that they were not stored intentionally. And it seems that such an approach fits well into CI automation.

One note, though: verifying the checks from the plan file seems less reliable than reading the post-apply state file, because some check assertions might be altered by the apply. I am referring to the last part of your response, where you mentioned terraform show -json PLANFILE.

apparentlymart commented 1 year ago

Indeed, if you check the plan JSON then some of the checks may have an unknown status, if the condition depends on something Terraform cannot predict until the apply step. There isn't any way to avoid that; Terraform can only report what it knows. Anything that Terraform knows enough to generate an error or warning about during the plan phase will also be available in the plan JSON though, since these both use the same source of data.

OJFord commented 1 year ago

Part of the design requirements for check blocks is that they are not treated as errors, and instead just as warnings after the apply is complete. This is what differentiates check blocks from precondition and postcondition blocks elsewhere in the language.

Another thing that differentiates them though is that they're 'global' - not attached to a particular resource. You can make assertions in a check block for which there's no obvious resource to put it in as a lifecycle precondition.

For example:

check "not_default_workspace" {
  assert {
    condition     = terraform.workspace != "default"
    error_message = "Selected workspace must be [...] Do not use the default workspace."
  }
}

A workaround I suppose could be:

resource "null_resource" "not_default_workspace" {
  lifecycle {
    precondition {
      condition     = terraform.workspace != "default"
      error_message = "Selected workspace must be [...] Do not use default workspace."
    }
  }
}

but it seems a bit unfortunate and unidiomatic - seeing that I could easily think 'oh, it should just be a check', overlooking the difference in exit code behaviour.

Another way of saying this I suppose is that there's no way of doing validation on quasi input variables like terraform.workspace.

apparentlymart commented 1 year ago

Indeed, the requirement that a precondition or postcondition be able to prevent evaluation of stuff downstream means that they must be attached to something so that there's a clear idea of what is "downstream".

check blocks cannot actually block anything from happening and so they don't have that requirement. But it also means that they aren't useful for a check like what you showed here -- that the workspace is valid -- because Terraform would still complete all of the other operations implied by the configuration before actually checking that rule.

Attaching a precondition to a null_resource or terraform_data and then making everything else depend on that resource is the best we can do for that use-case right now, but I agree that it's not really a valid solution. That's just not a use-case the current features were designed to address.

My first idea for that particular use-case would be something like whole-module preconditions that must hold or else Terraform will not evaluate anything inside that module at all. Some sense of pre and postconditions for whole modules is something we considered but cut from the initial scope because we wanted to do more research before committing to a design, and this use-case will be a good input into that research, so thanks!

This particular example pokes at one of the specific things we wanted to research more after the initial round of pre/postconditions work. For resources there is both logic written by the provider developer that applies to any use of a resource type, and logic in the calling resource block (the condition blocks) that applies only to a particular resource block. By analogy, an equivalent mechanism for modules would require both a way to specify preconditions inside the module that apply to all uses of it, and preconditions inside the call (the module block in the parent module) that apply only to one particular call to the module.

The details of that are a little trickier than they seem, though. Provider developers get to return errors from the perspective of the calling resource block rather than from the internals of the provider, and we ought to provide a similar facility for module authors, so that their modules can return useful error messages that refer to the caller's mistake directly rather than to some internal detail inside the module (as tends to happen both with the "condition on a null resource" workaround and with slightly misusing check blocks for configuration enforcement rather than the "is the remote system working as expected after apply?" checks that feature is designed for).

The example of enforcing a particular workspace seems like it would be most at home as a precondition inside the root module. It being in the root module rather than a child module does simplify things a little because there is no calling module block to try to attribute the error to, and so maybe there's room for a partial design that initially only deals with that simpler case, as long as there seems to be a plausible path from there to the more general solution later.

OJFord commented 1 year ago

Thank you for the reply, that's all very thorough and perfect-sense-making, as usual.

I only wanted to add that your 'and then [make] everything else depend on it' addition to my proposed workaround made me realise that was missing - for something other than the workspace, which is a bit of a special case - but I think for the broader class, adding the value being conditioned on to triggers should be sufficient.

Regarding your last paragraph, actually something I naïvely tried (I thought the docs slightly vague in a way that made it seem plausible) was a precondition block outside of any resource or lifecycle block. I think that's what you were alluding to anyway, but that would seem to me a neat continuation of syntax - it could benefit root modules as well as children outside of what's validatable (🤨) via variables. Including as syntactic sugar for cases where some local variable is determined entirely by module variables, but the derivation is complex and a module-level condition would be significantly simpler than variable-level validation. In fact I believe it would be required if the condition were dependent on multiple variables? (Sum to less than X, combine to form a valid subnet, etc.)

apparentlymart commented 1 year ago

Yeah, to show something a bit more concrete about what I was describing as "attributing an error to the call", one of the earlier design sketches included something like this:

precondition {
  condition     = var.b == var.a
  error_message = "b must equal a."

  # Possible new addition: names a particular object to "blame" if the
  # condition doesn't hold, so we have a single unambiguous thing
  # to highlight in the error message.
  cause = var.b
}

I think the original sketch I'm thinking of actually had this cause thing illustrated inside a resource-scoped precondition rather than a toplevel, and I'd imagined that it could also be valid to say something like cause = aws_instance.foo if the problem should be attributed to another resource inside the same module, but that ended up being of marginal benefit since you can write a similar thing with a postcondition inside aws_instance.foo anyway.

Using an input variable as a cause does allow something that cannot be expressed today though, particularly if placed in a top-level precondition block as you imagined, since it would be a bit odd to have a "postcondition" for an input variable when input variables don't actually have any side-effects for it to be a postcondition of.

andrewhertog commented 1 year ago

Just to add to this conversation, I have a use case with AWS Aurora where, if you have auto minor version upgrade enabled, AWS will automatically update your db instance and cause drift between Terraform and AWS. This can be disastrous, as AWS will destroy and recreate your database if you try to apply a lower minor version.

We have added a check block to detect a delta in the version, where our input variable is a lower version number than what is in AWS. It would be nice if this caused an error rather than a warning, specifically because warnings are very easy to overlook within a CI pipeline.

apparentlymart commented 1 year ago

Hi @andrewhertog,

The checks mechanism as implemented today is not designed to support use-cases that compare the planned new state to the prior state; I assume you've achieved that by doing something a little risky: fetching the same object with a data block as you're also managing with a resource block.

The current intended way to deal with policy around what changes are acceptable is to save the plan to disk, export it as JSON, and implement the checks in an external program that analyzes the plan JSON. I would suggest using that approach instead with today's Terraform because it is the mechanism that was designed to solve this problem, rather than lightly abusing a feature that was intended to solve a different problem -- verifying that the system is in a consistent state after apply is complete.
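The plan-analysis approach described above could be sketched as follows. The embedded JSON is an illustrative fragment of `terraform show -json PLANFILE` output with a hypothetical resource address; the `resource_changes` and `change.actions` fields follow Terraform's documented plan JSON format, where a destroy-and-recreate shows up as the actions `["delete", "create"]` (or the reverse order for create-before-destroy).

```python
import json

# Illustrative fragment of plan JSON; the resource address is hypothetical.
plan_json = """
{
  "resource_changes": [
    {
      "address": "aws_rds_cluster_instance.example",
      "change": {"actions": ["delete", "create"]}
    }
  ]
}
"""

def replacements(plan: dict) -> list:
    """Return addresses of resources the plan would destroy and recreate."""
    out = []
    for rc in plan.get("resource_changes", []):
        actions = set(rc["change"]["actions"])
        if {"delete", "create"} <= actions:
            out.append(rc["address"])
    return out

plan = json.loads(plan_json)
risky = replacements(plan)
if risky:
    # A policy gate could fail the pipeline here, before apply ever runs.
    print("Plan would replace:", ", ".join(risky))
```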

With all of that said, I don't mean to say that it's not valid to want a Terraform language feature for constraining what proposed changes are valid or invalid, to avoid the need for an external program. I just mean that I think that would want to be a new feature separate from checks, since the use-cases that motivated checks were all about post-apply checking rather than plan checking. I think the testing language coming in Terraform v1.6 will provide some inspiration about what it might look like to describe rules for validating planned changes, since testing that a particular change to input variables doesn't cause knock-on disruptive changes is one of the intended use-cases. Perhaps we will eventually borrow some of the syntax and vocabulary from the testing language to use in the main language too, so that assertions about plan validity will look the same across both main configuration and test cases.

matt-FFFFFF commented 1 year ago

Part of the design requirements for check blocks is that they are not treated as errors, and instead just as warnings after the apply is complete. This is what differentiates check blocks from precondition and postcondition blocks elsewhere in the language.

Hi @apparentlymart

I stumbled across this thread today.

I agree with the design choice here, however I'd love there to be a native method to pull check status without resorting to state pull and jq.

I think there is a parallel with plan, where detected changes are not an error but we can choose to represent them as one. Therefore I think that a -detailed-exitcode for apply could work.

Alternatively, perhaps an entire subcommand could be dedicated to retrieving check status?

Proposal, inspired by output:

$ terraform check -help
Usage: terraform [global options] check [options] [NAME]

  Displays check information from a Terraform state file and prints the status and error message.
  Leave the name empty to show all checks.

Options:
    -failed       List failed checks only
    -json         Machine readable output
apparentlymart commented 1 year ago

Thanks for that suggestion, @matt-FFFFFF!

Funnily enough one of the earlier prototypes that the current checks functionality evolved from did have a CLI command for retrieving the checks, although for that prototype I only implemented a human-oriented form of it: https://github.com/hashicorp/terraform/pull/31268

It does seem reasonable to me to have a specialized command to report on the current check status in the latest state snapshot. I've not personally been working on the details of checks for a while now, so I can't say anything about feasibility or priority at the moment, but it does seem like an idea worth exploring.