kubeovn / kube-ovn

A Bridge between SDN and Cloud Native (Project under CNCF)
https://kubeovn.github.io/docs/stable/en/
Apache License 2.0

[BUG] TCP DNS Traffic Blocked Despite Security Group Rule Allowing Egress to DNS Service #3998

Open wfnuser opened 2 months ago

wfnuser commented 2 months ago

Kube-OVN Version

v1.12.8

Kubernetes Version

Server Version: v1.26.9

Operation-system/Kernel Version

"Ubuntu 22.04.2 LTS"

Description

We have encountered an issue in our Kubernetes cluster managed by Kube-OVN where a security group (SG) rule is configured to allow egress traffic from a specific pod to the DNS service at the ClusterIP 10.96.0.10. According to our configuration, this rule should permit all traffic to the DNS service; however, we are observing unexpected behavior that differs between protocols, as described below.

Our sg looks like:

apiVersion: kubeovn.io/v1
kind: SecurityGroup
metadata:
  creationTimestamp: "2024-05-07T09:01:09Z"
  generation: 30
  name: user-8281-sg
  resourceVersion: "645915870"
  uid: cfeb4eea-18fb-4d65-9b89-8befd946dd3e
spec:
  allowSameGroupTraffic: true
  egressRules:
  - ipVersion: ipv4
    policy: allow
    priority: 30
    protocol: all
    remoteAddress: 10.96.0.10
    remoteType: address
  - ipVersion: ipv4
    policy: deny
    priority: 31
    protocol: all
    remoteAddress: 10.0.0.0/8
    remoteType: address
  - ipVersion: ipv4
    policy: allow
    priority: 200
    protocol: all
    remoteAddress: 0.0.0.0/0
    remoteType: address

And if we add one more rule for the pod IP (10.16.41.31) behind the DNS service (10.96.0.10):

  - ipVersion: ipv4
    policy: allow
    priority: 30
    protocol: all
    remoteAddress: 10.16.41.31
    remoteType: address

then DNS works again.

I think the real problem is not specific to DNS. If there are pod IPs behind a service IP and you only set allow rules for the service IP, it simply does not work; you have to set allow rules for the pod IPs too.
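For anyone hitting the same thing, a quick way to find which backend pod IPs need the extra allow rules (assuming the standard kube-dns Service in kube-system fronts the CoreDNS pods; adjust the names for your cluster):

# List the pod IPs behind the DNS Service; each of them needs its own allow
# rule (or must be covered by an allowed CIDR) for the workaround above.
kubectl -n kube-system get endpoints kube-dns -o wide

# The ClusterIP that the SG already allows:
kubectl -n kube-system get svc kube-dns

In our case the endpoint list returned 10.16.41.31, which is why the extra rule above fixes DNS.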

Steps To Reproduce

Create an SG like the following:

apiVersion: kubeovn.io/v1
kind: SecurityGroup
metadata:
  creationTimestamp: "2024-05-07T09:01:09Z"
  generation: 30
  name: user-8281-sg
  resourceVersion: "645915870"
  uid: cfeb4eea-18fb-4d65-9b89-8befd946dd3e
spec:
  allowSameGroupTraffic: true
  egressRules:
  - ipVersion: ipv4
    policy: allow
    priority: 30
    protocol: all
    remoteAddress: 10.96.0.10 # DNS service ClusterIP
    remoteType: address
  - ipVersion: ipv4
    policy: deny
    priority: 31
    protocol: all
    remoteAddress: 10.0.0.0/8
    remoteType: address
  - ipVersion: ipv4
    policy: allow
    priority: 200
    protocol: all
    remoteAddress: 0.0.0.0/0
    remoteType: address

Bind it to some pod, as sketched below.
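For completeness, this is roughly how we attach the SG to a test pod. As far as I understand the Kube-OVN security-group annotations (ovn.kubernetes.io/port_security and ovn.kubernetes.io/security_groups), something like the following should work; the pod name and image are just placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: sg-test                                      # placeholder name
  annotations:
    ovn.kubernetes.io/port_security: "true"          # enable port security so SG rules apply
    ovn.kubernetes.io/security_groups: user-8281-sg  # the SG from the spec above
spec:
  containers:
  - name: test
    image: nicolaka/netshoot                         # any image with dig/curl for testing
    command: ["sleep", "infinity"]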


It's very interesting that you can ping, and even dig rds-3r4ybkarqxwg-pxc.user-1993.svc.cluster.local srv, successfully (presumably because ping and dig use ICMP/UDP, while the blocked traffic is TCP).

Current Behavior

Cannot access DNS without also adding the backend pod IP to the SG.

Expected Behavior

Can access DNS with only the service IP allowed in the SG.

wfnuser commented 2 months ago

These are the logs for the ACL rules:

from-lport  2270 (inport == @ovn.sg.user.8281.sg && ip4 && ip4.dst == 10.100.27.20) allow-related log(severity=info)
from-lport  2270 (inport == @ovn.sg.user.8281.sg && ip4 && ip4.dst == 10.16.61.90) allow-related log(severity=info)

And the command I run is:

curl 10.100.27.20

As you can see, from the ACL's perspective only the TCP SYN packet's dst IP is 10.100.27.20, which is the cluster IP; the dst IP of all subsequent TCP packets is somehow converted to 10.16.61.90, which is the pod IP.
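If it helps with debugging, that DNAT should be visible on the OVN load balancer. Assuming the kubectl-ko plugin is installed, something like the following should show the ClusterIP 10.100.27.20 with 10.16.61.90 listed as its backend:

# Dump the load balancers in the OVN northbound DB; the VIP entries list the
# backend pod IPs that Service traffic gets NATed to by the switch LB.
kubectl ko nbctl lb-list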

github-actions[bot] commented 2 weeks ago

Issues go stale after 60d of inactivity. Please comment or re-open the issue if you are still interested in getting this issue fixed.

wfnuser commented 2 weeks ago

Any update on this issue?

bobz965 commented 2 weeks ago

Any update on this issue?

Sorry, too busy to fix this.

bobz965 commented 2 weeks ago

Are you using the default VPC ovn-cluster?

How about setting ENABLE_LB to false?

wfnuser commented 2 weeks ago

Are you using the default VPC ovn-cluster?

How about setting ENABLE_LB to false?

Yep, default VPC.

We have found another way to work around it. Haha. I'm just commenting to remind you that there is an issue; maybe you can check it out when you have time. It seems GitHub will automatically close this issue if I don't.

bobz965 commented 2 weeks ago

In my opinion:

When the LB is enabled, the VIP can be NATed to its backend pod IP by the switch load balancer, so the traffic gets blocked: after the DNAT the destination no longer matches the allow rule for the service IP and instead hits the deny rule for 10.0.0.0/8.

If you disable the LB, the traffic to the VIP will go through the node and be NATed by IPVS instead.
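For reference, a sketch of how one might check (and, if needed, flip) that setting. I'm assuming the switch load balancer is governed by the kube-ovn-controller --enable-lb argument; please verify the flag name against the installed version before changing anything:

# Check whether the OVN switch load balancer is currently enabled.
kubectl -n kube-system get deployment kube-ovn-controller \
  -o jsonpath='{.spec.template.spec.containers[0].args}' | tr ',' '\n' | grep enable-lb

# Assumption: setting --enable-lb=false in the deployment args disables the switch LB,
# so Service VIP traffic is then NATed by kube-proxy (IPVS) on the node instead.
kubectl -n kube-system edit deployment kube-ovn-controller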