hashicorp / terraform

Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.
https://www.terraform.io/

force-unlock fails on kubernetes backend #30147

Open lud0v1c opened 2 years ago

lud0v1c commented 2 years ago

Terraform Version

Terraform v1.1.0
on linux_amd64

Terraform Configuration Files

terraform {
  backend "kubernetes" {
    config_path   = "/home/lud0v1c/.kube/config"
    secret_suffix = "state"
  }
}

data "terraform_remote_state" "this" {
  backend = "kubernetes"
  config = {
    secret_suffix    = "state"
    load_config_file = true
    config_path      = "/home/lud0v1c/.kube/config"
  }
}

Debug Output

https://gist.github.com/lud0v1c/5e655d1a4fae07c69a217665435b56d2

Expected Behavior

There's no state in the backend, so Terraform should create a new one.

Actual Behavior

Every operation fails because the state appears to be locked by another Terraform client, as the debug output shows. This prevents me from doing anything: init, plan, and apply all fail, and force-unlock, state rm, and state pull don't work either.

Steps to Reproduce

  1. terraform init
  2. Now that the k8s backend is configured, run any operation (e.g. a plan) and kill it unsafely (e.g. a double CTRL+C) while it's running.
  3. Delete the state via kubectl delete secret tfstate-default-state.
  4. Try to init again, either with a fresh state (terraform init) or with -migrate-state (see the command sketch below).
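
A rough consolidated sketch of the steps above (the secret name tfstate-default-state and the default namespace follow from secret_suffix = "state" in the config; adjust both if yours differ):

# 1. Initialize the kubernetes backend
terraform init

# 2. Start an operation and kill it mid-flight (e.g. a double CTRL+C), leaving the lock behind
terraform plan

# 3. Delete the state secret (the lock is held in a separate Lease object, so it survives this)
kubectl delete secret tfstate-default-state

# 4. Any subsequent command now fails with "state is already locked"
terraform init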

Additional Context

I set up a k3s cluster some days ago, and yesterday I switched from local state storage to storing the state on the cluster itself. Deployed the backend and terraform_remote_state without any problem. Everything was OK until an operation I was performing got killed due to network issues (an apply on my desktop Windows PC).
Knowing what this does, and since no changes were made, I deleted the tfstate secret in the cluster. I can confirm there are no tfstate secrets in any namespace whatsoever.
Looking this up online, people mentioned that it could be another client/process, but I checked all local processes and even tried initializing on my laptop (with the computer that ran the original failed operation shut down), and that also fails, always with the same error message.
I really can't understand where this state/operation is being fetched from; even rebooting my k8s nodes did nothing!

jbardin commented 2 years ago

Hi @lud0v1c,

Thanks for filing the issue. Even in the absence of a state file, a backend must still enforce a lock to prevent multiple instances of Terraform from writing a new state concurrently. This may be implemented in different ways in different backends due to different storage constraints and locking mechanisms, so the failure modes will be slightly different between each.

Since the process was killed without releasing the lock, and that backend has a persistent locking mechanism, the lock will have to be manually released. You should be able to do that by passing the lock id into the force-unlock command:

terraform force-unlock 76a2bab0-12b1-5b0e-395e-46177a0fe849

If that doesn't work, please let us know and we can mark this as an issue with the k8s backend.

Thanks!

lud0v1c commented 2 years ago

Hey @jbardin, thank you for the reply. As I've stated in the Actual Behavior section, force-unlock somehow doesn't work:

PS C:\Users\lud0v1c\orion-cluster> terraform force-unlock 76a2bab0-12b1-5b0e-395e-46177a0fe849
2021-12-13T15:38:47.907Z [INFO]  Terraform version: 1.1.0
2021-12-13T15:38:47.907Z [INFO]  Go runtime version: go1.17.2
2021-12-13T15:38:47.908Z [INFO]  CLI args: []string{"C:\\ProgramData\\chocolatey\\lib\\terraform\\tools\\terraform.exe", "force-unlock", 
"76a2bab0-12b1-5b0e-395e-46177a0fe849"}
2021-12-13T15:38:47.908Z [TRACE] Stdout is a terminal of width 137
2021-12-13T15:38:47.908Z [TRACE] Stderr is a terminal of width 137
2021-12-13T15:38:47.908Z [TRACE] Stdin is a terminal
2021-12-13T15:38:47.911Z [DEBUG] Attempting to open CLI config file: C:\Users\lud0v1c\AppData\Roaming\terraform.rc
2021-12-13T15:38:47.911Z [DEBUG] File doesn't exist, but doesn't need to. Ignoring.
2021-12-13T15:38:47.911Z [DEBUG] ignoring non-existing provider search directory terraform.d/plugins
2021-12-13T15:38:47.911Z [DEBUG] ignoring non-existing provider search directory C:\Users\lud0v1c\AppData\Roaming\terraform.d\plugins       
2021-12-13T15:38:47.912Z [DEBUG] ignoring non-existing provider search directory C:\Users\lud0v1c\AppData\Roaming\HashiCorp\Terraform\plugins
2021-12-13T15:38:47.913Z [INFO]  CLI command args: []string{"force-unlock", "76a2bab0-12b1-5b0e-395e-46177a0fe849"}
2021-12-13T15:38:47.914Z [TRACE] Meta.Backend: built configuration for "kubernetes" backend with hash value 2627546192
2021-12-13T15:38:47.915Z [TRACE] Preserving existing state lineage "5c443a4e-f465-eea1-9f23-69cedf912e70"
2021-12-13T15:38:47.915Z [TRACE] Preserving existing state lineage "5c443a4e-f465-eea1-9f23-69cedf912e70"
2021-12-13T15:38:47.915Z [TRACE] Meta.Backend: working directory was previously initialized for "kubernetes" backend
2021-12-13T15:38:47.916Z [TRACE] Meta.Backend: using already-initialized, unchanged "kubernetes" backend configuration
2021-12-13T15:38:47.916Z [DEBUG] Using kubeconfig: C:\Users\lud0v1c\.kube\orion
2021-12-13T15:38:47.917Z [INFO]  Successfully initialized config
2021-12-13T15:38:47.918Z [TRACE] Meta.Backend: instantiated backend of type *kubernetes.Backend
2021-12-13T15:38:47.918Z [DEBUG] checking for provisioner in "."
2021-12-13T15:38:47.918Z [DEBUG] checking for provisioner in "C:\\ProgramData\\chocolatey\\lib\\terraform\\tools"
2021-12-13T15:38:47.919Z [TRACE] Meta.Backend: backend *kubernetes.Backend does not support operations, so wrapping it in a local backend
Failed to load state: the state is already locked by another terraform client
Lock Info:
  ID:        76a2bab0-12b1-5b0e-395e-46177a0fe849
  Path:
  Operation: OperationTypeApply
  Who:       ZEUS\lud0v1c@zeus
  Version:   1.1.0
  Created:   2021-12-11 23:03:32.907186 +0000 UTC
  Info:

I've also tried terraform state pull to see if I could get a hint or a more descriptive error message about where the lock is defined/stored, but nothing. The backend k8s cluster also doesn't report anything out of the ordinary (no state there, as I've mentioned).

jbardin commented 2 years ago

Thanks @lud0v1c, that looks like a bug in the kubernetes backend implementation, somehow preventing any access even when only deleting the lock. The kubernetes lock is implemented via a lease, which is separate from the state object. I'm not sure offhand what the required commands are, but there is probably a way to list and delete existing leases from the kubectl command directly.
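
A minimal sketch of what that kubectl workflow might look like, assuming the lock lease lives in the default namespace and is named after the secret_suffix (the exact name shows up in the get output, so adjust the delete accordingly):

# List the Lease objects the kubernetes backend uses for locking
kubectl get leases -n default

# Delete the orphaned lock lease left behind by the killed run
# (lock-tfstate-default-state is an assumption based on secret_suffix = "state")
kubectl delete lease lock-tfstate-default-state -n default

# Terraform should now be able to acquire a fresh lock
terraform init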

lud0v1c commented 2 years ago

@jbardin Thank you for the hint! I've read about using Leases in the k8s backend documentation, but haven't really interacted with them before. After performing a kubectl delete lease lock-tfstate-default-tfstate, I was able to init and continue as normal. This is what I had listed in the default namespace:

NAME                           HOLDER                                 AGE
lock-tfstate-default-tfstate                                          2d
lock-tfstate-default-state     76a2bab0-12b1-5b0e-395e-46177a0fe849   47h

I'm not sure if I should close the issue in case you guys want to investigate more, so I'll leave it at your discretion 😃

jbardin commented 2 years ago

Thanks for the info @lud0v1c! That's helpful if anyone else encounters this. I'll leave the issue open, since the terraform force-unlock command should have been able to complete the same procedure.