Add heartbeat to facilitate graceful shutdown in error scenarios

hashicorp / terraform

Terraform enables you to safely and predictably create, change, and improve infrastructure. It is a source-available tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.

There are several scenarios (for example buggy providers, or poor decisions by the user or the operating system) in which Terraform providers can apparently exit early, or be selected for something like an OOM kill. As a byproduct, Terraform processes can sit waiting on responses for significant amounts of time, or run into similar unexpected situations.

The proposal here is to implement some sort of lightweight heartbeat gRPC endpoint between the provider and Terraform Core. Missing heartbeats could then tell a provider that its parent Terraform process was destroyed unexpectedly, and vice versa. I didn't see any mention of "heartbeat" or "liveness" in the current SDK repo, but this type of solution seems as if it could be implemented entirely transparently to most providers if it lived in the SDK.

I hope this issue makes sense and is in the right place. Here is a screenshot of some ancient terraform processes owned by pid 1 for good measure. It is my belief that I have seen similar behavior from several other plugins. While ideally this would never happen, in reality it does occur and might be remediable.

Based on the description here, it sounds like the problem is not Terraform Core failing to detect a crashed or hung plugin, but the other way around: Terraform Core can potentially crash without terminating its plugin child processes.

A plugin process starts up a server and just waits for Terraform to connect to it, so when Terraform disconnects (either normally or via crashing) the plugin can't tell without additional information whether another connection will be opened or if the parent process is just gone.

However, go-plugin (the underlying library that Terraform plugins are built around) already starts up the standard gRPC heartbeat service on the server side, and has a mechanism on the client to call it:

https://github.com/hashicorp/go-plugin/blob/809113480b559c989ea9cfcff62e9d387961f60b/grpc_server.go#L70-L74
https://github.com/hashicorp/go-plugin/blob/809113480b559c989ea9cfcff62e9d387961f60b/grpc_client.go#L110-L117

Therefore I think a good first step here would be to understand what exactly go-plugin is already doing with that heartbeat mechanism, and whether there's a way we can extend it so that, for example, a plugin server that does not hear a heartbeat message from Terraform Core for some reasonable amount of time can terminate itself. I'm not sure right now whether Ping is something that go-plugin calls periodically itself, or whether the application (Terraform itself) is responsible for calling it.
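To make the "terminate itself after missing heartbeats" idea concrete, here is a minimal sketch of a plugin-side watchdog using only the Go standard library. It is not go-plugin's or the SDK's implementation: the `Watchdog` type, `NewWatchdog`, `Beat`, and `runDemo` are hypothetical names invented for illustration, and the sketch assumes that whatever handles an incoming heartbeat on the plugin side would call `Beat`.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Watchdog runs onExpire at most once if no heartbeat arrives within timeout.
// Hypothetical type for illustration; not part of go-plugin or the SDK.
type Watchdog struct {
	mu      sync.Mutex
	timer   *time.Timer
	timeout time.Duration
	expired bool
}

// NewWatchdog starts the countdown immediately.
func NewWatchdog(timeout time.Duration, onExpire func()) *Watchdog {
	w := &Watchdog{timeout: timeout}
	w.timer = time.AfterFunc(timeout, func() {
		w.mu.Lock()
		already := w.expired
		w.expired = true
		w.mu.Unlock()
		if !already {
			onExpire()
		}
	})
	return w
}

// Beat records a heartbeat and pushes the deadline back by timeout.
// It returns false once the watchdog has been marked expired.
func (w *Watchdog) Beat() bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.expired {
		return false
	}
	w.timer.Reset(w.timeout)
	return true
}

// runDemo sends three heartbeats, then goes silent and waits for expiry.
func runDemo() (accepted int, expired bool, lateBeat bool) {
	done := make(chan struct{})
	w := NewWatchdog(200*time.Millisecond, func() { close(done) })
	for i := 0; i < 3; i++ {
		time.Sleep(20 * time.Millisecond)
		if w.Beat() {
			accepted++
		}
	}
	select {
	case <-done:
		expired = true
	case <-time.After(2 * time.Second): // safety net so the demo cannot hang
	}
	return accepted, expired, w.Beat()
}

func main() {
	accepted, expired, lateBeat := runDemo()
	fmt.Println("beats accepted:", accepted, "expired:", expired, "late beat accepted:", lateBeat)
}
```

In a real plugin the `onExpire` callback would presumably stop the server gracefully and exit the process rather than close a channel. The timeout would need to be generous compared with the heartbeat interval, so that a busy but healthy Terraform Core does not cause a provider to exit in the middle of an operation.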

Add heartbeat to facilitate graceful shutdown in error scenarios #23527