okd-project / okd

The self-managing, auto-upgrading, Kubernetes distribution for everyone
https://okd.io
Apache License 2.0

GCP IPI Master nodes can not use cluster services through router #932

Closed bdurrow closed 3 months ago

bdurrow commented 3 years ago

Describe the bug

Since upgrading from 4.7.0-0.okd-2021-09-19-013247 to 4.8.0-0.okd-2021-10-10-030117, master nodes are no longer able to access services that go through the router's GCP-provided load balancer. This is true for both public and private load balancers.

Our use case is that we host a container registry in the cluster, and at least one of our DaemonSets uses an image from that cluster-hosted registry. While master nodes could previously pull from that registry, they now get "Connection Refused" (TCP RST). We have both a public-facing default router and a private router, and neither router's endpoints are accessible with curl from master nodes.
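As a minimal reproduction from a master node (the registry hostname below is a placeholder for our in-cluster registry route, not the real name):

    # On a master node (e.g. via "oc debug node/<master>" or SSH), curl the
    # registry route, which resolves to the router's GCP load balancer IP.
    curl -vk https://registry.apps.example.com/v2/
    # Since the 4.8 upgrade this fails immediately with "Connection refused";
    # the same command from a worker node connects fine.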

I have confirmed that this was working on a 4.7.0-0.okd-2021-09-19-013247 cluster.

Version

OKD GCP IPI 4.8.0-0.okd-2021-10-10-030117

How reproducible

100%

Log bundle

llomgui commented 3 years ago

Hello,

Can you confirm the routes are there?

Did you try to install a new cluster? I tried to install a cluster on GCP, but it failed: https://github.com/openshift/okd/discussions/920#discussioncomment-1502847

bdurrow commented 3 years ago

@llomgui, I have not recently installed a cluster on GCP. The cluster in question is just over a year old. I confirmed with netstat -rn that the routes are as expected; specifically, the default route (Destination 0.0.0.0, Genmask 0.0.0.0) points at the first address in the subnet and uses interface br-ex.
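For illustration, the entry I'm describing looks roughly like this in the netstat -rn output (columns abbreviated, addresses from a made-up 10.0.0.0/24 subnet):

    netstat -rn
    # Kernel IP routing table
    # Destination   Gateway    Genmask    Flags   Iface
    # 0.0.0.0       10.0.0.1   0.0.0.0    UG      br-ex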

vrutkovs commented 3 years ago

Please check if it's fixed in 4.8.0-0.okd-2021-10-24-061736; installation in CI passed.

openshift-bot commented 2 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

bdurrow commented 2 years ago

Sorry I missed that there was a question here. It is still happening as of 4.9.

My cluster has two ingress routers: one public-facing named default and one private-facing named private. OKD configured the Google load balancers.

The public-facing load balancer is configured:

  * Type: Network (target pool-based), TCP
  * IP:Port: some public IP, ports 80-443
  * Health check: GET /healthz via port 32323
  * Health check status: all failing except for nodes with default router pods

I ran three tests from some representative nodes:

  1. curl -v http://Public IP Address/
  2. curl -vk https://Public IP Address/
  3. curl -v http://127.0.0.1:32323/healthz/

| Node type | curl to LB public IP port 80 | curl to LB public IP port 443 | curl -v http://127.0.0.1:32323/healthz/ |
| --- | --- | --- | --- |
| Master node | Connection refused | Connection refused | 503 |
| Worker node running default router pod | Works as expected | Works as expected | 200 |
| Worker node not running default router pod | Works as expected | Works as expected | 503 |

sh-5.1# curl -v http://127.0.0.1:32323/healthz/
*   Trying 127.0.0.1:32323...
* Connected to 127.0.0.1 (127.0.0.1) port 32323 (#0)
> GET /healthz/ HTTP/1.1
> Host: 127.0.0.1:32323
> User-Agent: curl/7.76.1
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 503 Service Unavailable
< Content-Type: application/json
< Date: Wed, 26 Jan 2022 22:22:32 GMT
< Content-Length: 98
< 
* Connection #0 to host 127.0.0.1 left intact
{ "service": { "namespace": "openshift-ingress", "name": "router-default" }, "localEndpoints": 0 }sh-5.1#

The private load balancer is configured:

  * Type: TCP/UDP (Internal)
  * IP:Port: some private IP, ports 80-443
  * Health check: GET /healthz via port 32178
  * Health check status: all failing except for nodes with private router pods

| Node type | curl to LB private IP port 80 | curl to LB private IP port 443 | curl -v http://127.0.0.1:32178/healthz/ |
| --- | --- | --- | --- |
| Master node | Connection refused | Connection refused | 503 |
| Worker node running private router pod | Works as expected | Works as expected | 200 |
| Worker node not running private router pod | Works as expected | Works as expected | 503 |

bdurrow commented 2 years ago

I am not sure because it is pretty hard for me to grok iptables rules, but it looks to me like the problem is related to the gcp-vip-local rules (the workers don't have these rules). My best guess is that packets destined for the load balancer are redirected to localhost before the address translation happens. To test my theory I used curl against the load balancer IP on port 32323. On a master node I get the same kind of response as if I had run curl -v http://127.0.0.1:32323/healthz, but from a worker node I get no response and the connection eventually times out. I am not sure why the masters handle this differently than the workers, but the way the workers do it works and the way the masters do it doesn't.
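For anyone following along, this sketch is how I'm comparing the NAT rules between node types (the gcp-vip-local name is taken from the rules I see on the masters; exact chain names may differ by release):

    # Dump the NAT table rules and look for the chains created for the GCP VIPs.
    iptables -w -t nat -S | grep -i gcp
    # Masters show rules around gcp-vip-local that redirect traffic destined for
    # the load balancer VIP to the local host; workers have no such chains.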

I think that these firewall rules were probably developed with the API server in mind. We don't have a problem there with the default configuration because traffic intended for that load balancer arrives on port 6443 and the master node has a service listening on that port.
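That would also explain why the same redirect is harmless for the API: from a master, a request to the API load balancer VIP just lands on the local kube-apiserver. A rough check (203.0.113.10 stands in for the API load balancer IP):

    # Run from a master node; the placeholder IP is the API load balancer VIP.
    curl -k https://203.0.113.10:6443/healthz
    # Expect an "ok" answer, served by the kube-apiserver listening locally on 6443.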

bdurrow commented 2 years ago

/remove-lifecycle stale

bdurrow commented 2 years ago

This behavior has changed in OKD 4.10 (discovered when upgrading from 4.9.0-0.okd-2022-02-12-140851 to 4.10.0-0.okd-2022-03-07-131213). Workers that are not running the router service now return connection refused (the same behavior we previously saw only on master nodes). I see the same behavior whether I access the public or the private router (we have added an internal router that is available through an internal GCP-provided load balancer, as configured by OpenShift).

It looks like the relevant iptables behavior was previously configured by openshift-gcp-routes.service (executing /opt/libexec/openshift-gcp-routes.sh) on the master nodes, but is now configured by gcp-routes.service (executing /usr/sbin/gcp-routes.sh) on all nodes. Both units are present and enabled on the master nodes, which is probably not the desired state.
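A quick way to see the overlap on a master node (just a sketch, using the unit names above):

    # List the gcp-routes units and their state; on our masters both show up enabled.
    systemctl list-unit-files | grep gcp-routes
    # gcp-routes.service             enabled
    # openshift-gcp-routes.service   enabled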

When I diff /opt/libexec/openshift-gcp-routes.sh and /usr/sbin/gcp-routes.sh it looks as if they share a common ancestor.

To restate how this impacts our use case: We run Nexus in our cluster and host an image in Nexus that is used by a Deployment. The image pull fails on nodes that are not running the router service. A possible workaround would be to use an ImageContentPolicy to redirect to the Nexus service directly, but the nodes don't honor the cluster DNS, and the ImageContentPolicy schema does not allow a component of the hostname to start with a number, so IP addresses are not allowed (we can't even use something like nip.io). Issue filed here.
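To illustrate the DNS part of that (the service name below is hypothetical): the kubelet pulls images using the host's resolver, which is not the cluster DNS, so in-cluster service names don't resolve on the node:

    # On the node: resolv.conf points at infrastructure DNS, not the cluster DNS service.
    cat /etc/resolv.conf
    # A cluster-internal service name therefore does not resolve from the host:
    getent hosts nexus.nexus.svc.cluster.local   # hypothetical name; returns nothing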

I'll try to find where to submit issues for all of these problems I have identified and also try to figure out a way to solve this issue. If anyone could point me to where each of these components lives, I would appreciate it.

openshift-bot commented 2 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 2 years ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten /remove-lifecycle stale