terraform-linters / tflint

A Pluggable Terraform Linter
Mozilla Public License 2.0
4.96k stars 357 forks source link

`tflint --langserver` does not clean up child processes and socket files when exiting #1808

Closed wzyboy closed 1 year ago

wzyboy commented 1 year ago

Summary

When used as a language server (e.g. with Neovim as a client), tflint does not clean up properly when exiting. This can be reproducied by running tflint --langserver directly in a terminal and observe its behaviour.

  1. run tflint --langserver in a workspace
  2. wait a few seconds
  3. press ^C to stop the process
  4. repeat steps 1 to 3 for X times
  5. run ps x | grep tflint and there are X copies of tflint --act-as-bundled-plugin and tflint-ruleset-aws processes
  6. run ls /tmp/plugin* and there are X copies of socket files

I found this after using Neovim to edit many Terraform files and the RAM was almost full. I checked the process list and found this out. Further experiment revealed that it's not related to Neovim but tflint --langserver itself.

Command

tflint --langserver

Terraform Configuration

Not related. The issue exists even in an empty directory.

TFLint Configuration

plugin "terraform" {
  enabled = true
  preset  = "recommended" # default: "all"
}

rule "terraform_typed_variables" {
  enabled = false
}

rule "terraform_required_providers" {
  enabled = false
}

rule "terraform_required_version" {
  enabled = false
}

rule "terraform_deprecated_interpolation" {
  enabled = false
}

plugin "aws" {
  enabled = true
  version = "0.24.1"
  source  = "github.com/terraform-linters/tflint-ruleset-aws"
}

Output

# running in an empty directory

$ TFLINT_LOG=debug tflint --langserver
09:43:37 option.go:71: [DEBUG] CLI Options
09:43:37 option.go:72: [DEBUG]   Module: false
09:43:37 option.go:73: [DEBUG]   Force: false
09:43:37 option.go:74: [DEBUG]   Format:
09:43:37 option.go:75: [DEBUG]   Varfiles:
09:43:37 option.go:76: [DEBUG]   Variables:
09:43:37 option.go:77: [DEBUG]   EnableRules:
09:43:37 option.go:78: [DEBUG]   DisableRules:
09:43:37 option.go:79: [DEBUG]   Only:
09:43:37 option.go:80: [DEBUG]   EnablePlugins:
09:43:37 option.go:81: [DEBUG]   IgnoreModules:
09:43:37 langserver.go:26: Starting language server...
09:43:37 config.go:137: [INFO] Load config: .tflint.hcl
09:43:37 config.go:147: [INFO] file not found
09:43:37 config.go:155: [INFO] Load config: /home/zhuoyun/.tflint.hcl
09:43:37 config.go:269: [DEBUG] Config loaded
09:43:37 config.go:270: [DEBUG]   Module: false
09:43:37 config.go:271: [DEBUG]   ModuleSet: false
09:43:37 config.go:272: [DEBUG]   Force: false
09:43:37 config.go:273: [DEBUG]   ForceSet: false
09:43:37 config.go:274: [DEBUG]   DisabledByDefault: false
09:43:37 config.go:275: [DEBUG]   DisabledByDefaultSet: false
09:43:37 config.go:276: [DEBUG]   PluginDir:
09:43:37 config.go:277: [DEBUG]   PluginDirSet: false
09:43:37 config.go:278: [DEBUG]   Format:
09:43:37 config.go:279: [DEBUG]   FormatSet: false
09:43:37 config.go:280: [DEBUG]   Varfiles:
09:43:37 config.go:281: [DEBUG]   Variables:
09:43:37 config.go:282: [DEBUG]   Only:
09:43:37 config.go:283: [DEBUG]   IgnoreModules:
09:43:37 config.go:287: [DEBUG]   Rules:
09:43:37 config.go:289: [DEBUG]     terraform_required_providers: false
09:43:37 config.go:289: [DEBUG]     terraform_required_version: false
09:43:37 config.go:289: [DEBUG]     terraform_deprecated_interpolation: false
09:43:37 config.go:289: [DEBUG]     terraform_typed_variables: false
09:43:37 config.go:291: [DEBUG]   Plugins:
09:43:37 config.go:293: [DEBUG]     aws: enabled=true, version=0.24.1, source=github.com/terraform-linters/tflint-ruleset-aws
09:43:37 config.go:293: [DEBUG]     terraform: enabled=true, version=, source=
09:43:37 discovery.go:33: [INFO] Plugin `terraform` is not installed, but the bundled plugin is available.
09:43:37 discovery.go:54: [INFO] Plugin `terraform` found
09:43:37 [DEBUG] go-plugin@v1.4.10/client.go:604: starting plugin: path=/home/zhuoyun/opt/tflint/tflint args=["/home/zhuoyun/opt/tflint/tflint", "--act-as-bundled-plugin"]
09:43:37 [DEBUG] go-plugin@v1.4.10/client.go:612: plugin started: path=/home/zhuoyun/opt/tflint/tflint pid=1240042
09:43:37 [DEBUG] go-plugin@v1.4.10/client.go:707: waiting for RPC address: path=/home/zhuoyun/opt/tflint/tflint
09:43:37 [DEBUG] go-plugin@v1.4.10/client.go:752: using plugin: version=11
09:43:37 [DEBUG] go-plugin@v1.4.10/client.go:1046: tflint: 09:43:37 [DEBUG] go-plugin@v1.4.10/server.go:404: plugin address: network=unix address=/tmp/plugin2751559580
09:43:37 discovery.go:90: [DEBUG] Find plugin path: /home/zhuoyun/.tflint.d/plugins/github.com/terraform-linters/tflint-ruleset-aws/0.24.1/tflint-ruleset-aws
09:43:37 discovery.go:54: [INFO] Plugin `aws` found
09:43:37 [DEBUG] go-plugin@v1.4.10/client.go:604: starting plugin: path=/home/zhuoyun/.tflint.d/plugins/github.com/terraform-linters/tflint-ruleset-aws/0.24.1/tflint-ruleset-aws args=["/home/zhuoyun/.tflint.d/plugins/github.com/terraform-linters/tflint-ruleset-aws/0.24.1/tflint-ruleset-aws"]
09:43:37 [DEBUG] go-plugin@v1.4.10/client.go:612: plugin started: path=/home/zhuoyun/.tflint.d/plugins/github.com/terraform-linters/tflint-ruleset-aws/0.24.1/tflint-ruleset-aws pid=1240055
09:43:37 [DEBUG] go-plugin@v1.4.10/client.go:707: waiting for RPC address: path=/home/zhuoyun/.tflint.d/plugins/github.com/terraform-linters/tflint-ruleset-aws/0.24.1/tflint-ruleset-aws
09:43:37 [DEBUG] go-plugin@v1.4.10/client.go:1046: tflint-ruleset-aws: 09:43:37 [DEBUG] go-plugin@v1.4.10/server.go:404: plugin address: network=unix address=/tmp/plugin1824306338
09:43:37 [DEBUG] go-plugin@v1.4.10/client.go:752: using plugin: version=11

^C

TFLint Version

0.47.0

Terraform Version

No response

Operating System

bendrucker commented 1 year ago

Hmm, this is unexpected:

press ^C to stop the process Child processes do not exit

By default ^C in a shell signals the entire process group rather than the parent itself. TFLint shouldn't necessarily have to clean up child processes, they can clean up themselves, at least for handling signals from a shell.

This seems to be explained by the plugin system explicitly ignoring SIGINTs:

https://github.com/hashicorp/go-plugin/blob/5dee41c45dcec4c3b00c6d227bec8e5ebea56a8a/server.go#L430-L441

If you enable trace logs I suspect you'd see these "plugin received interrupt signal, ignoring" logs.

This is mildly related:

https://github.com/hashicorp/go-plugin/issues/203

Systems designed for running applications rather than interactive human usage (e.g. init systems, containers) will generally signal the parent and not consider process groups. Container environments will generally SIGTERM pid 1 when stopping a container, e.g., calling docker stop or deleting a pod in Kubernetes. So in that sense a shell isn't an ideal reproduction. Sending either INT or TERM directly to the tflint process using the kill command is a good test.

https://pkg.go.dev/os/signal#hdr-Default_behavior_of_signals_in_Go_programs

It seems like we'll need to handle this signal from the main tflint process and make sure it propagates to the plugin system, hopefully by canceling a context.

The key parts of how Terraform propagates shutdown signals:

https://github.com/hashicorp/terraform/blob/8b210951d963a81d3e947c77858d76229e2c28fe/commands.go#L431-L447 https://github.com/hashicorp/terraform/blob/8b210951d963a81d3e947c77858d76229e2c28fe/internal/command/meta.go#L400-L418

wzyboy commented 1 year ago

Thanks for the triaging.

I'm not sure what signal Neovim sends to tflint --langserver but I bet it's either SIGINT or SIGKILL or SIGTERM.

I did the experiment by sending SIGINT/SIGKILL/SIGTERM to the parent process tfling --langserver and the behaviour is almost the same:

The only difference is that with SIGKILL and SIGTERM, in debug log there is one more line Killed or Terminated, respectively.

bendrucker commented 1 year ago

SIGKILL is not handleable/recoverable so generally a SIGTERM will be sent first. Kubernetes will first send a TERM and then after a configurable grace period will guarantee termination with a KILL:

https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination

I'm not sure what signal Neovim sends

Assuming this would be in response to quitting, e.g., :q. GPT has this to say about how Neovim interacts with LSP plugins:


For a language server-based plugin in Neovim, the termination process involves the following steps:

  1. Neovim sends a BufUnload event to the buffer associated with the plugin.
  2. The plugin, acting as a client to the language server, receives the BufUnload event.
  3. The plugin sends a shutdown request to the language server.
  4. The language server, upon receiving the shutdown request, closes any open connections, stops any ongoing processes, and performs any necessary cleanup.
  5. Once the language server acknowledges the shutdown, the plugin can terminate itself safely.

This workflow ensures that the language server is properly informed about the termination of Neovim and allows it to gracefully shut down, releasing resources and terminating any associated processes.

bendrucker commented 1 year ago

In other words, sounds like there's no process signaling at all involved here and this all happens via LSP commands:

https://github.com/terraform-linters/tflint/blob/8b582cc77d28ddf1489f8b7be65838bb4edb2116/langserver/handler.go#L131-L135

bendrucker commented 1 year ago

And in theory, this is how plugins are supposed to be cleaned up:

https://github.com/terraform-linters/tflint/blob/8b582cc77d28ddf1489f8b7be65838bb4edb2116/cmd/langserver.go#L33C1-L46

bendrucker commented 1 year ago

Do you see the Shutting down... log from the main process? Kill should be called after that (via defer), which should then attempt a graceful exit followed by a SIGKILL to the process if that errors or times out:

https://github.com/hashicorp/go-plugin/blob/a88a423a8813d0b26c8e3219f71b0f30447b5d2e/client.go#L447-L480

wzyboy commented 1 year ago

Do you see the Shutting down... log from the main process?

No. I did four experiments:

  1. stop the process with ^C
  2. kill -SIGINT the main process from another terminal
  3. kill -SIGTERM the main process from another terminal
  4. kill -SIGKILL the main process from another terminal

The debug output of experiments 1 and 2 are identical, just like I posted in the original post.

The debug output of experiments 3 and 4 has one more line Terminated and Killed. There are no other lines.

bendrucker commented 1 year ago

Ok, in that case I guess it would make sense that the default Go behavior is kicking in and the parent process terminates immediately. This does seems like it should work correctly under normal circumstances when the editor sends a shutdown RPC to the language server. I'm not confident that there's a bug here that's specific to LSP mode. Language servers are just long-running and far more likely to trigger this condition.

Signal handling would technically apply to normal mode as well. If you had a huge config or a plugin that was gets stuck, you seemingly can't interrupt that run without orphaning the plugin process, at least until it finishes its work.

wata727 commented 1 year ago

Thank you for the clarification. I think that this should trap signals (SIGINT, SIGTERM) in the same way as Terraform and disconnect the connection gracefully. https://github.com/hashicorp/terraform/blob/8b210951d963a81d3e947c77858d76229e2c28fe/internal/command/meta.go#L442-L446 https://github.com/terraform-linters/tflint/blob/e3e94369cfc379e11d4d2449c05f6e000cae91ae/cmd/langserver.go#L38-L44