Terraform Test: add ability to skip teardown

yohanb commented 11 months ago

Terraform Version

~>1.6.0

Use Cases

When writing integration tests for resources that take a long time to create (e.g., k8s clusters, database clusters), the feedback cycle when your test fails can be excruciatingly long. It would be nice if we could skip the teardown process and continue writing our tests against these resources.

Attempted Solutions

N/A

Proposal

Add a CLI flag to terraform test to skip the teardown/destroy process, and/or add an attribute on the run block to skip its teardown. Examples:

terraform test --skip-teardown

run "integration" {
  skip-teardown = true
  ...
}

References

No response

apparentlymart commented 11 months ago

Hi @yohanb! Thanks for this suggestion.

Are you imagining that Terraform would just totally ignore the objects that were created, and expect you to clean them up manually in the remote system? Do you have any concern that a subsequent run of the test would fail due to trying to create new objects that conflict with objects that were produced on the previous run?

Thanks again!

yohanb commented 11 months ago

Hi @apparentlymart! Thanks for the quick feedback. I was thinking that it could optionally keep the state of what it had created so it doesn't have to recreate the resources every time. In this case, it would be the responsibility of the developer to clean up resources and avoid conflicts.

apparentlymart commented 11 months ago

Thanks for that extra context, @yohanb!

What you've described is something that was possible in the original experimental form of terraform test, where a test case was really just a normal module and the test harness was literally just applying and then destroying each one in turn. In that case, it was possible to skip the test harness while iterating on one particular test -- just using the test module with the normal plan and apply commands, and only using the test harness to automate running the tests in bulk once they are written.

The new .tftest.hcl language in the final design is important for testing more elaborate situations like changes made over time, but in return it makes it a little harder to essentially halt the process partway through and then do normal-ish Terraform workflow things with that interim state before proceeding.

The first thing that comes to my mind thinking about this problem is something like Git's "interactive rebase" mechanism, where you can effectively halt a rebase partway through and then run normal git commands to manipulate things before you proceed, or to bail out and clean up the intermediate state of things and return to normal git usage. I'm sure there are other ways to solve this too, but I mention that only to see if it resonates with you.

In the meantime though, since the original testing design was essentially just a convention for laying out normal Terraform root modules to use for testing, it's still possible to work that way even though terraform test doesn't inherently understand it anymore. You can bridge the two techniques by using the optional ability to specify a separate module for a test step. If you make a directory containing a separate Terraform root module that calls the module you want to test, you can use normal Terraform workflow commands like init, plan, and apply directly in that directory while you are iterating, but then wire that module into one of your .tftest.hcl files so that it will also run through the automated test harness. This more-or-less recreates the model from the experimental terraform test, allowing you to treat the test case as a normal Terraform configuration in situations where that's more convenient.
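As a rough sketch of the workaround described above, the wrapper root module and the test file might look something like the following. The directory name tests/harness, the file names, and the cluster_name variable and output are hypothetical and only there for illustration; the module block inside a run block is the existing mechanism for pointing a test step at a different module.

```hcl
# tests/harness/main.tf (hypothetical wrapper root module)
# While iterating, run `terraform init`, `terraform plan`, and
# `terraform apply` directly in this directory.

module "under_test" {
  # Path back to the module being tested, relative to this directory.
  source = "../.."

  # Hypothetical input variable of the module under test.
  cluster_name = "iteration-sandbox"
}

output "cluster_name" {
  # Assumes the module under test exposes a cluster_name output.
  value = module.under_test.cluster_name
}
```

```hcl
# tests/harness.tftest.hcl (hypothetical test file)
# Runs the same wrapper module through the automated test harness.

run "harness" {
  command = apply

  module {
    # Local sources here are resolved relative to the directory
    # where `terraform test` is run.
    source = "./tests/harness"
  }

  assert {
    condition     = output.cluster_name == "iteration-sandbox"
    error_message = "The wrapper module did not pass the expected cluster name through."
  }
}
```

With this layout, state created while iterating in tests/harness is ordinary workspace state that you manage yourself, while terraform test still creates and destroys its own copy when run from the module root.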

yohanb commented 11 months ago

@apparentlymart thanks, appreciate the feedback. I was thinking more in the sense of iterating on the test cases. For example, you want to assert something on a module which takes 20+ minutes to set up, and then your test fails for whatever reason. The test will automatically tear it down and you have to start over to fix your assertions. I understand the workaround you propose, where you can create a module to indirectly call your root module and iterate that way, but you can effectively do that without the test framework, right?

apparentlymart commented 11 months ago

Yes, I was suggesting to do it without the test harness while you are developing/iterating and to use the test harness only for running all the tests at once after you are finished developing to make sure they all still work under the specific sequence of operations described in the test configuration files.

In making that suggestion I'm thinking that a test configuration file is effectively just automating a sequence of plan and apply commands interspersed with arbitrary checks. In situations where the automation is inconvenient, you can run the same operations manually.
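To make that framing concrete, here is a minimal sketch of a test file read as a sequence of plan and apply operations with checks in between. The file name, the name variable, and the id output are assumptions for illustration, not anything from a real module.

```hcl
# example.tftest.hcl (hypothetical)

variables {
  # Hypothetical input variable of the module under test.
  name = "demo"
}

# Roughly what `terraform plan` would do by hand, followed by a check.
run "preview" {
  command = plan

  assert {
    condition     = var.name == "demo"
    error_message = "Unexpected value for the name variable."
  }
}

# Roughly what `terraform apply` would do by hand, followed by a check.
run "create" {
  command = apply

  assert {
    # Assumes the module under test declares an output named id.
    condition     = output.id != ""
    error_message = "The id output should not be empty after apply."
  }
}
```

Running the same steps manually would mean running terraform plan and terraform apply with the same variable values and inspecting the results yourself; the test harness adds the assertions and the automatic destroy at the end.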

One way I imagine making this more integrated in future is to help automate the setup of such an environment. You might specify a single test configuration file and a single step within that file, and then Terraform would:

1. Execute all of the test steps prior to the chosen one in the same file non-interactively just as would happen in a normal test run.
2. Once the specified step is reached, write the current in-memory-only state to a special new filename under the .terraform directory. The presence of a file at that new location tells Terraform CLI that it should run in a special mode where operations occur against the transient test state instead of whatever normal workspace is selected.
3. At this point Terraform exits and returns you to your shell prompt. Any normal Terraform commands you run, like terraform plan or terraform apply, will work against the transient test state that was created in the previous step. You can iterate as much as you like and do anything to that state that you could normally do to a workspace state.
4. Once you're finished, you run another test-specific command to exit this testing mode. At this point you could have at least the following options for how to proceed:
   - Continue running the remaining test steps non-interactively to completion, including the normal destroy at the end.
   - Run only the currently-active test step and then halt again at the next one still in the transient test state mode, but now with the effects of another step applied to the state.
   - Destroy everything that's currently existing and halt completely. You might do this if you've made such a large change to the test state that the remaining steps no longer make sense to run, for example.
   - Possibly it might make sense to "upgrade" the transient test case into a normal named workspace in the configured backend, if you decide for some reason that the test infrastructure is now "real" infrastructure in some sense. (Maybe now a long-lived development environment?)

Most of what I described above is already possible to do manually by explicitly creating a separate root module to iterate in, so my ideas above are intended to "pave the cowpath" by having the system do those steps itself and to be able to reuse the main module directory (in a special new mode) instead of having to make a new working directory to develop in.

I'm sure there are other ways to do stuff like this, but as I mentioned before this is inspired (at a very high level) by the "git rebase" model, where your work tree temporarily switches into a different mode where you can run various normal git commands to do open-ended, self-directed operations and then either continue rebasing subsequent commits or otherwise bail out and return to the "normal" mode.

fxkk commented 11 months ago

I stumbled across another use case for introducing a skip-teardown attribute. I think there are cases where it makes sense to create resources in run modules that depend on the resources of the main module.

At the moment it is documented that this is not possible and leads to errors in the destroy process.

My use case would be as follows:

First, a vault container should be created with a setup/run module. In the second step, the module to be tested should be applied to create policies and other resources in the vault. After that, a test module with a token should create resources in the vault based on the new policy. This is to verify that the created policies work as planned.

However, since the resources created for the test depend in part on resources created in the main module, the destroy inevitably fails.

Since the created resources were only created in a temporary vault container, the destroy error is actually irrelevant. The resources would not necessarily have to be deleted, or the error could be ignored. I have not yet found a way to do either.
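For readers trying to picture this, the three-step layout might look roughly like the sketch below. All module paths, variable names, and output names are hypothetical; the sketch only shows the structure that leads to the destroy problem and does not work around it.

```hcl
# vault.tftest.hcl (hypothetical)

# Step 1: a setup module starts a temporary vault container.
run "vault" {
  module {
    source = "./tests/vault-setup"
  }
}

# Step 2: the module under test creates policies in that vault.
run "policies" {
  variables {
    # Assumes the setup module exposes an address output.
    vault_address = run.vault.address
  }
}

# Step 3: a test module uses a token bound to the new policy to
# create resources, which verifies that the policy works as planned.
run "verify" {
  module {
    source = "./tests/verify"
  }

  variables {
    vault_address = run.vault.address
    # Assumes the module under test exposes a policy_name output.
    policy_name = run.policies.policy_name
  }
}
```

The resources created in the verify step depend on what the policies step created, which is the dependency that makes the destroy fail.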

This case is a bit different from the one discussed before, but it could be an argument to make the destroy of run blocks optional.

I think that this problem could be generalized to a class of "disposable resources". For example, it would also apply to objects in a k8s cluster that is only created in the main module.

By the way, as quick feedback, I really like the test feature itself. Once you've internalized a few basic ideas, you can build some pretty cool things with it. Thanks for your work.

omarismail commented 7 months ago

Hey @yohanb (and others in this issue), the Terraform team is doing research into this problem, and I'd love to chat to learn more. Please reach out to me at oismail@hashicorp.com and we can schedule a time to chat!

yohanb commented 6 months ago

Hi @omarismail! Thanks for the opportunity. Will do!

brettcurtis commented 5 months ago

I'm coming over from Kitchen-Terraform, and maintaining the state and keeping the resources around until we explicitly request destruction is an absolute must for us. I'd prefer to test using a remote backend like any other Terraform. As the OP said, the feedback loop on complex modules is long enough, let alone a rebuild with each problem.

For example, we have a backend set up for testing, and all module developers can converge (Kitchen-Terraform language) locally when working on a module. This allows other developers to hop in, help with issues, and pick up exactly where the initial developer is struggling. Converge workflows run on pull requests, and a merge to main triggers a destroy. We need tests at a workflow level since we can't force developers to run tests locally. The tests in workflows need to pass before a merge to main.

Another valuable practice for developers is running the test against the main branch before they work on their branch; this way, they can see the impact of their change on existing infrastructure and know how to properly version the module release, for example, if they introduced a breaking change.
