siderolabs / cluster-api-control-plane-provider-talos

A control plane provider for CAPI + Talos
Mozilla Public License 2.0

GCP Firewall rules not correct #127

Closed xunholy closed 1 year ago

xunholy commented 2 years ago

Using CAPI to create Talos nodes in GCP, @andrewrynhard and I found several I/O timeouts, both in the serial logs from the nodes and in the controller logs:

error copying: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 34.129.80.10:50000: i/o timeout"

This led us to the GCP firewall rules, where we noticed that port 50000 was not open to traffic from outside GCP. That would be perfectly fine if the management cluster also ran in GCP. However, there is a chicken-and-egg situation: the management cluster will, at least temporarily, exist outside GCP at first, and in my scenario it was running locally on KIND.

Hence, when attempting to bootstrap, the firewall rules were blocking my connection.

I also tried nc and telnet, and at first both appeared to respond correctly, but the connection only began working once we added the new firewall rule.

frezbo commented 2 years ago

nc (netcat) run against any major cloud provider's network will report the port as open even if a firewall is blocking it, so to actually test it you'd need to call some API.
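As a rough illustration of going one step beyond a bare TCP connect, the sketch below attempts a TLS handshake after connecting and reports whether the peer actually answered. The function name `probe` and its result labels are made up for this example. It also assumes that a handshake attempt is a good enough stand-in for "calling some API"; a server that requires client certificates may reject the handshake, which is reported as `tls_failed` rather than `tls_ok`.

```python
import socket
import ssl


def probe(host, port, timeout=3.0):
    """Classify how far a connection to host:port gets (hypothetical helper)."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except socket.timeout:
        return "timeout"      # no answer at all, e.g. packets silently dropped
    except ConnectionRefusedError:
        return "refused"      # host reachable, nothing listening on the port
    except OSError:
        return "unreachable"  # routing or name resolution problem

    # Verification is disabled on purpose: the goal is only to see whether
    # the peer speaks TLS back, not to authenticate it.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with ctx.wrap_socket(sock, server_hostname=host):
            return "tls_ok"       # handshake completed, a real server answered
    except socket.timeout:
        return "tcp_only"         # TCP connected, but the peer never replied
    except (ssl.SSLError, OSError):
        return "tls_failed"       # peer replied or hung up, handshake failed
    finally:
        sock.close()


if __name__ == "__main__":
    # The address from the error message in this issue.
    print(probe("34.129.80.10", 50000))
```

A `tcp_only` result is the case described above: the connect looks fine, but nothing on the other side ever answers.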

smira commented 2 years ago

I'm not sure how this can be fixed here: opening these ports is something the GCP CAPI infrastructure provider should do, and it might even be configurable in the infrastructure manifests. CACPPT is a generic provider; it can't open ports on GCP.

andrewrynhard commented 2 years ago

@rsmitty How does this work in CI? Our management cluster is in EM IIRC so it should be a similar setup.

smira commented 2 years ago

Looks like the network is specified as part of the manifest, so it should be pre-configured: https://github.com/siderolabs/cluster-api-templates/blob/main/gcp/standard/standard.yaml#L28-L29

rsmitty commented 2 years ago

Yep, this is exactly it. The network we use for CI has default firewall rules that allow 50000 and 6443.

rsmitty commented 2 years ago

Andrey is also right that this is something that we can't handle from the CACPPT side. The default behavior with the GCP infra provider is that it'll create its own network to use and whatnot, but what it creates doesn't have the firewall rules we need. Thus the reason we "bring our own" network for it.

xunholy commented 2 years ago

So is the recommendation to pre-bake my own networks with port 50000 open? I haven't investigated, but I wonder whether there is a way to define additional firewall rules through the provider, which I know is outside the realm of Talos. What has been done and/or recommended to other users running Talos in GCP to date?

smira commented 2 years ago

Our recommendation is the above: create a pre-configured GCP network and reference it in the cluster manifest. We are not aware of a better way to do that at the moment.
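For anyone pre-creating the network, here is a minimal sketch of the kind of ingress rule discussed in this thread, written as the JSON body for the Compute Engine `firewalls.insert` REST call. The project name, network name, rule name, and source range are placeholders invented for the example, and the helper `talos_firewall_body` is hypothetical. Only ports 50000 and 6443 come from the discussion above.

```python
import json


def talos_firewall_body(project, network, source_ranges):
    """Build a firewall rule body allowing the Talos API and Kubernetes API ports."""
    return {
        "name": f"{network}-allow-talos",  # hypothetical rule name
        "network": f"projects/{project}/global/networks/{network}",
        "direction": "INGRESS",
        # 50000 is the Talos API port, 6443 the Kubernetes API server port.
        "allowed": [{"IPProtocol": "tcp", "ports": ["50000", "6443"]}],
        # Restrict to where the management cluster connects from.
        "sourceRanges": source_ranges,
    }


if __name__ == "__main__":
    # Placeholder values; 203.0.113.7 is from a documentation-only range.
    body = talos_firewall_body("my-project", "talos-capi", ["203.0.113.7/32"])
    print(json.dumps(body, indent=2))
```

Limiting `sourceRanges` to the management cluster's egress address, rather than opening the ports to everyone, seems like the safer default when that address is known.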