Open terlar opened 1 year ago
Hi @terlar! Thanks for reporting this.
Unfortunately the limit on the possible length of the path to a Unix domain socket is something decided by your operating system and not something Terraform can directly control. Therefore I'm not sure what we could change that would make this better, other than perhaps to handle this error better.
I think the reason this failed so oddly is because the problem is actually happening inside the hashicorp/local
provider plugin, and so it's failing to complete the plugin handshake. Terraform Core therefore only knows that the plugin didn't speak the protocol correctly, and cannot determine why because there's not yet any real communication channel between the two -- the socket it was trying to create is the communication channel.
Therefore I think while this is unfortunate there might not be anything we could do to improve this. While looking to confirm that I was understanding this problem correctly I found https://github.com/golang/go/issues/6895 which seems to confirm that Go is just passing the path on to the bind
function and the OS is the one declaring that it's an invalid argument. On Linux the maximum supported length is 108 characters.
There is a passing mention in that issue about the "abstract namespace" for Unix domain sockets, which is a Linux-specific extension. We could perhaps investigate that as an alternative protocol when running on Linux systems in particular, but since limitations on Unix domain socket path length exist on most (all?) Unix-derived OSes that would not be a complete solution.
Interesting, it was indeed very confusing and a bit tricky to track down. I did see the [bind](bind: invalid argument)
error but wasn't sure where it came from and was not easy to search for.
Perhaps the path exceeding 108
characters could be detected early and raise a warning/affect the final error message. Or would that limit other platforms?
I think the fact that this logic lives inside the provider plugins themselves will be a limiting factor on all solutions: until we have the communication channel available there isn't really any way for the plugin to tell Terraform Core why it's failing, aside from just writing arbitrary logs to stderr (which Terraform Core just copies verbatim into its own logs, because they are human-oriented logs rather than machine-oriented logs.)
I think one possible path here would be to extend the plugin handshake protocol with an explicit failure response. Today the only valid response from a provider during handshake is to announce which Unix socket it's listening on (along with some other session metadata), but we could potentially teach providers to respond to certain handshake-related errors by writing something we can unambiguously recognize as an error instead of the normal handshake message, and therefore let providers pass an arbitrary failure message hopefully saying a little more about why they couldn't initialize.
With that in place, the provider-side protocol implementation code could potentially notice when the socket bind returns "invalid argument" and return a more useful error message, which would hopefully include a hint about temporary directory path lengths if the plugin knows it was trying to create a Unix socket when it got that error.
Another possibility would be for the plugins to try to fall back to using a TCP socket if they fail to bind to a Unix socket with this particular error. The plugin protocol supports both Unix sockets and TCP sockets at the option of the plugin, but today the rule is that TCP sockets only get used when running on Windows while Unix sockets get used on all other platforms. I think if we wanted to go this route we'd need to do some further research to think about whether there would be any unwanted consequences of that fallback -- for example, if running on a system where opening a TCP listen socket on localhost would be problematic for some reason -- but technically I think the protocol already has everything it needs to be able to support that automatic negotiation without any changes to Terraform Core.
Terraform Version
Terraform Configuration Files
Debug Output
Expected Behavior
Validation should work without errors.
Actual Behavior
Plugin resolution fails:
Steps to Reproduce
TMPDIR="$(mktemp --directory --suffix -this-is-a-very-very-very-very-very-very-very-very-very-very-long-long-path)"
export TMPDIR
export TF_LOG=TRACE
terraform init -backend=false
terraform validate
Additional Context
I discovered this when using a build environment that sets a unique TMPDIR for each build and it is based on the name of the build target. Which in my case turned out to be quite long.
References
No response