Concurrent updates of tfstate are very slow

freedge commented 1 year ago

Terraform Version

Terraform v1.5.0-dev
on linux_amd64

(main branch commit c06db2aaddad1ef58cdc0eb46166aecfd705b7e9)

Terraform Configuration Files

...terraform config...

Debug Output

2023-04-13T20:55:23.998+0200 [TRACE] statemgr.Filesystem: have already backed up original terraform.tfstate to terraform.tfstate.backup on a previous write
module.addresses[0].panos_address_object.address_objects["N-10.6"]: Still destroying... [id=vsys1:N-10.6, 10s elapsed]
2023-04-13T20:55:25.485+0200 [TRACE] statemgr.Filesystem: state has changed since last snapshot, so incrementing serial to 13709
2023-04-13T20:55:25.485+0200 [TRACE] statemgr.Filesystem: writing snapshot at terraform.tfstate
2023-04-13T20:55:25.964+0200 [TRACE] vertex "module.addresses[0].panos_address_object.address_objects[\"N-10.19\"] (destroy)": visit complete
2023-04-13T20:55:25.964+0200 [TRACE] NodeAbstractResouceInstance.writeResourceInstanceState: removing state object for module.addresses[0].panos_address_object.address_objects["N-10.59-"]
module.addresses[0].panos_address_object.address_objects["N-10.59-"]: Destruction complete after 12s
2023-04-13T20:55:25.983+0200 [TRACE] statemgr.Filesystem: have already backed up original terraform.tfstate to terraform.tfstate.backup on a previous write
2023-04-13T20:55:26.912+0200 [TRACE] statemgr.Filesystem: state has changed since last snapshot, so incrementing serial to 13710
2023-04-13T20:55:26.912+0200 [TRACE] statemgr.Filesystem: writing snapshot at terraform.tfstate

Expected Behavior

The marshalling of the state, which takes 0.5s in my case, should be done once, before the lock is taken.

We experience performance degradation with a large tfstate.

Actual Behavior

Marshalling of the state is done twice, and there is also an extra marshalling of the backup state. All of this happens with the lock taken, which makes terraform apply very slow.
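
To illustrate the expected behavior, here is a minimal sketch of "marshal once, then take the lock". It is not Terraform's code: the `snapshot` and `store` types and the `write` method are hypothetical stand-ins, and it assumes the caller hands over a snapshot that nobody else mutates while it is being encoded.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// snapshot is a hypothetical stand-in for a state snapshot.
type snapshot struct {
	Serial    uint64            `json:"serial"`
	Resources map[string]string `json:"resources"`
}

// store is a hypothetical state store guarded by a mutex.
type store struct {
	mu   sync.Mutex
	data []byte // last encoded snapshot
}

// write encodes the snapshot exactly once, before taking the lock,
// so the critical section only swaps a byte slice.
// Assumption: snap is not modified concurrently during this call.
func (s *store) write(snap *snapshot) error {
	encoded, err := json.Marshal(snap)
	if err != nil {
		return err
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data = encoded
	return nil
}

func main() {
	s := &store{}
	var wg sync.WaitGroup
	// Several writers encode in parallel and only serialize on the swap.
	for i := 1; i <= 4; i++ {
		wg.Add(1)
		go func(serial uint64) {
			defer wg.Done()
			snap := &snapshot{Serial: serial, Resources: map[string]string{"N-10.6": "vsys1"}}
			if err := s.write(snap); err != nil {
				fmt.Println("write failed:", err)
			}
		}(uint64(i))
	}
	wg.Wait()
	fmt.Println("stored bytes:", len(s.data))
}
```

With this shape the expensive encoding runs outside the critical section, so concurrent resource updates only queue up for the cheap swap.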

Steps to Reproduce

Have a 40MB tfstate with 10k resources
Apply a change of 100 resources

However, the behavior in this case (JSON marshalling done 3 times with the lock taken) should be common to every terraform apply.

Additional Context

During my tests I observed that in practice the state has always changed. Running with

diff --git a/internal/states/statemgr/filesystem.go b/internal/states/statemgr/filesystem.go
index bdfc6832b5..27405f7e1c 100644
--- a/internal/states/statemgr/filesystem.go
+++ b/internal/states/statemgr/filesystem.go
@@ -199,7 +199,7 @@ func (s *Filesystem) writeState(state *states.State, meta *SnapshotMeta) error {
        }

        if meta == nil {
-               if s.readFile == nil || !statefile.StatesMarshalEqual(s.file.State, s.readFile.State) {
+               if s.readFile == nil || true {
                        s.file.Serial++
                        log.Printf("[TRACE] statemgr.Filesystem: state has changed since last snapshot, so incrementing serial to %d", s.file.Serial)
                } else {

makes terraform apply complete in 2m15s instead of 5m30s.
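
The patch above simply skips the equality check. If change detection has to be kept, one possible direction (my own sketch, not something Terraform implements) is to remember a digest of the last bytes written and compare it with the digest of the new encoding, so each snapshot is encoded only once. The `changeTracker` type and its `record` method are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// changeTracker is a hypothetical helper that remembers a digest of the
// last encoded snapshot instead of re-encoding the previous state.
type changeTracker struct {
	lastSum [sha256.Size]byte
	hasLast bool
	serial  uint64
}

// record takes an already-encoded snapshot and reports whether it differs
// from the previous one. The serial is incremented only on a change.
func (c *changeTracker) record(encoded []byte) bool {
	sum := sha256.Sum256(encoded)
	if c.hasLast && sum == c.lastSum {
		return false
	}
	c.lastSum = sum
	c.hasLast = true
	c.serial++
	return true
}

func main() {
	tracker := &changeTracker{}
	state := map[string]string{"N-10.6": "vsys1"}

	for i := 0; i < 2; i++ {
		// Encode once per write; the same bytes serve for the
		// comparison and for the file write.
		encoded, err := json.Marshal(state)
		if err != nil {
			fmt.Println("marshal failed:", err)
			return
		}
		fmt.Println("changed:", tracker.record(encoded), "serial:", tracker.serial)
	}
}
```

Hashing 40MB is still work, but it is a single linear pass over bytes that already exist, rather than a second full encoding of the state.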

References

No response

petemounce commented 1 year ago

@freedge How did you capture the flamegraph, please? (I'm doing a little existing-issue hunt before I file a bug report about terraform 1.4.5 being ~2-3x slower than 1.3.7, and having some more data is always nice).

freedge commented 1 year ago

You need to build terraform with pprof enabled, which requires a small modification:

diff --git a/main.go b/main.go
index c807bb5a40..cc18baa0a2 100644
--- a/main.go
+++ b/main.go
@@ -25,6 +25,9 @@ import (
        "github.com/mitchellh/colorstring"

        backendInit "github.com/hashicorp/terraform/internal/backend/init"
+
+       "net/http"
+       _ "net/http/pprof"
 )

 const (
@@ -55,6 +58,10 @@ func init() {
 }

 func main() {
+       go func() {
+               log.Println(http.ListenAndServe("localhost:6059", nil))
+       }()
+
        os.Exit(realMain())
 }
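
Once a binary built with this patch is running, the standard `net/http/pprof` handlers are served under `/debug/pprof/` on the patched port. A CPU profile can then be opened directly with `go tool pprof -http=:8081 'http://localhost:6059/debug/pprof/profile?seconds=30'`, whose web UI includes a flame graph view. As a sketch of the same step in Go, the small client below saves a profile to a file while the apply is running. The helper names (`profileURL`, `fetchProfile`), the output file name and the 30-second window are my own choices for illustration.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// profileURL builds the CPU profile endpoint served by net/http/pprof.
func profileURL(addr string, seconds int) string {
	return fmt.Sprintf("http://%s/debug/pprof/profile?seconds=%d", addr, seconds)
}

// fetchProfile downloads a profile from url and copies it to w.
func fetchProfile(url string, timeout time.Duration, w io.Writer) error {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %s", resp.Status)
	}
	_, err = io.Copy(w, resp.Body)
	return err
}

func main() {
	const seconds = 30 // sampling window; run this during terraform apply

	out, err := os.Create("cpu.pprof") // hypothetical output file name
	if err != nil {
		fmt.Fprintln(os.Stderr, "create failed:", err)
		return
	}
	defer out.Close()

	// The HTTP timeout must be longer than the sampling window.
	url := profileURL("localhost:6059", seconds)
	if err := fetchProfile(url, (seconds+10)*time.Second, out); err != nil {
		fmt.Fprintln(os.Stderr, "fetch failed:", err)
		return
	}
	fmt.Println("profile written to cpu.pprof")
}
```

The saved file can then be inspected with `go tool pprof -http=:8081 cpu.pprof`.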
