Openshift deployment using assisted installer - no network with antrea as primary cni

jsalatiel commented 9 months ago

Describe the bug

Since I could not find any documentation on installing Antrea on OpenShift with the new install method (the OpenShift Assisted Installer), I followed Calico's documentation, with the required adjustments, to install Antrea as the primary CNI. That basically means configuring everything in the Red Hat console, including all manifests from the deploy folder, and, before actually clicking "Install", issuing the following PATCH request.

curl \
  --header "Content-Type: application/json" \
  --request PATCH \
  --data '"{\"networking\":{\"networkType\":\"antrea\"}}"' \
  -H "Authorization: Bearer $TOKEN" \
  "https://$ASSISTED_SERVICE_API/api/assisted-install/v2/clusters/$CLUSTER_ID/install-config"

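A note on the --data value in the command above: it appears to be a JSON string whose content is itself an escaped JSON document, which is easy to get wrong by hand. Below is a minimal shell sketch of building that body with only printf and sed. The variable names are illustrative and not part of any API.

```shell
# Network type to request in the install-config override.
network_type="antrea"

# Inner JSON document (the actual override).
inner=$(printf '{"networking":{"networkType":"%s"}}' "$network_type")

# Escape backslashes and double quotes, then wrap the result in quotes so the
# whole body becomes a single JSON string, matching the curl command above.
escaped=$(printf '%s' "$inner" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
payload="\"$escaped\""

echo "$payload"
```

The resulting $payload can then be passed to curl as --data "$payload" in the same PATCH request.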
The installation finishes successfully and I can see all pods in the Running state.

Antrea also appears to be the primary CNI:

oc describe network.config/cluster
Name:         cluster
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         Network
Metadata:
  Creation Timestamp:  2023-12-27T17:28:50Z
  Generation:          2
  Resource Version:    3345
  UID:                 93a2f6fc-7845-4c40-ba9f-aec70329c729
Spec:
  Cluster Network:
    Cidr:         10.128.0.0/14
    Host Prefix:  23
  External IP:
    Policy:
  Network Type:  antrea
  Service Network:
    172.30.0.0/16
Status:
  Cluster Network:
    Cidr:         10.128.0.0/14
    Host Prefix:  23
  Network Type:   antrea
  Service Network:
    172.30.0.0/16
Events:  <none>

The problem is that all pods (those not on hostNetwork) have no connectivity to anything outside the cluster. A pod can reach itself and its gateway (10.128.0.1), but nothing beyond that.

bash-5.1# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0@if153: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether b6:6c:d6:f8:62:18 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.128.0.148/23 brd 10.128.1.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::b46c:d6ff:fef8:6218/64 scope link
       valid_lft forever preferred_lft forever

bash-5.1# ip route
default via 10.128.0.1 dev eth0
10.128.0.0/23 dev eth0 proto kernel scope link src 10.128.0.148
bash-5.1# ping -c1  10.128.0.1
PING 10.128.0.1 (10.128.0.1) 56(84) bytes of data.
64 bytes from 10.128.0.1: icmp_seq=1 ttl=64 time=1.20 ms

--- 10.128.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 1.200/1.200/1.200/0.000 ms
bash-5.1# ping -w3 -c5 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.

--- 8.8.8.8 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2057ms

bash-5.1# curl -Lv www.google.com
*   Trying 142.250.79.164:80...
*   Trying 2800:3f0:4004:808::2004:80...
* Immediate connect fail for 2800:3f0:4004:808::2004: Network unreachable

Reproduction steps

Use the OpenShift Assisted Installer to install a cluster with Antrea as the primary CNI.
Pods have no network connectivity outside the cluster.

Expected behavior

Network should be fine

Additional context

antctl trace-packet also fails:

antctl trace-packet -S kube-system/pqp -D 8.8.8.8  -f udp,udp_dst=53
syntax error at br-int (or the bridge name was omitted)
ovs-appctl: /var/run/openvswitch/ovs-vswitchd.92.ctl: server returned an error

jsalatiel commented 9 months ago

I have added the support bundle here: https://fastupload.io/bSD9eHRH2c8f0wU/file

tnqn commented 9 months ago

@jsalatiel can you check sysctl net.ipv4.ip_forward on the Nodes? I suspect OpenShift doesn't enable it by default. If it's 0, you can enable it with sysctl -w net.ipv4.ip_forward=1. If this is the cause, I'm wondering whether Antrea should set it by default, since relying on K8s components to do it doesn't seem to work in some cases.
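For anyone else hitting this, the check described above can be sketched as a small script. It reads the same setting through /proc (the standard Linux path behind sysctl net.ipv4.ip_forward). The IP_FORWARD_PATH variable and the function name are only illustrative, added so the check can be tried against a scratch file.

```shell
# Path backing `sysctl net.ipv4.ip_forward`; overridable for dry runs.
IP_FORWARD_PATH="${IP_FORWARD_PATH:-/proc/sys/net/ipv4/ip_forward}"

# Succeeds only if the given file contains exactly "1".
ip_forward_enabled() {
  [ "$(cat "$1" 2>/dev/null)" = "1" ]
}

if ip_forward_enabled "$IP_FORWARD_PATH"; then
  echo "net.ipv4.ip_forward is 1"
else
  # Changing the value needs root, so this sketch only reports what to run.
  echo "net.ipv4.ip_forward is not 1; as root run: sysctl -w net.ipv4.ip_forward=1"
fi
```

Note that sysctl -w only changes the running kernel, so the value does not survive a reboot unless it is also persisted.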

For antctl trace-packet, it may be a bug, I created https://github.com/antrea-io/antrea/issues/5831 to track it.

jsalatiel commented 9 months ago

Hi @tnqn, it worked, thanks! In all my previous tests I was doing a single-node installation. In that mode the installation would finish and I could SSH to the single node, but I would not get connectivity from the pods, as I mentioned in this ticket.

After you mentioned net.ipv4.ip_forward, I tried a 3-node cluster. The installation never finished (it aborted as stalled). So I destroyed the cluster and created a new one, and I noticed that all the nodes also had net.ipv4.ip_forward=0. I manually set it to net.ipv4.ip_forward=1 on each node in the middle of the installation, and the installation finished successfully.

So it would be really nice if Antrea could set net.ipv4.ip_forward=1 by itself, mainly because of the read-only nature of Red Hat CoreOS.
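Until something like that exists, the manual setting has to be persisted somehow. On a generic Linux host the usual way is a drop-in file under /etc/sysctl.d, sketched below. The file name 99-ip-forward.conf and the SYSCTL_DIR variable are hypothetical, and this has not been verified on an OpenShift node, where node files are normally managed through MachineConfig rather than edited by hand.

```shell
# Target directory for the drop-in. /etc/sysctl.d is the standard location;
# it defaults to a temporary directory here so the sketch runs without root.
SYSCTL_DIR="${SYSCTL_DIR:-$(mktemp -d)}"
conf="$SYSCTL_DIR/99-ip-forward.conf"   # hypothetical file name

# Write the setting in sysctl.conf syntax.
printf 'net.ipv4.ip_forward = 1\n' > "$conf"
cat "$conf"

# On a real host (as root, with SYSCTL_DIR=/etc/sysctl.d), reload all
# sysctl configuration files with:
#   sysctl --system
```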

jsalatiel commented 9 months ago

The remaining problem is that Antrea is not certified for OpenShift 4.14.x, so the third-party collaborative support between Red Hat and VMware won't apply if I use Antrea on 4.14. I have opened #99 in vmware/antrea-operator-for-kubernetes for that, although I have no idea how the certification process works.