projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0
6.02k stars 1.34k forks source link

Information about interface activation #8070

Closed ymartin-ovh closed 10 months ago

ymartin-ovh commented 1 year ago

I'm facing sometimes connection issues on init containers. Looking traces I've got, I don't understand why cali* interfaces take times to be UP.

Logs filtered on cali89cc739dd58:

2023-09-29 12:30:13.074 [DEBUG][62] felix/iface_monitor.go 350: Interface changed state ifIndex=181359 ifaceName="cali89cc739dd58" newState="up" oldState="down"
2023-09-29 12:30:14.530 [INFO][62] felix/endpoint_mgr.go 1283: Skipping configuration of interface because it is oper down. ifaceName="cali89cc739dd58"
2023-09-29 12:30:14.551 [DEBUG][62] felix/route_table.go 721: Failed to add route on first attempt, retrying... error=file exists ifaceName="cali89cc739dd58" ifaceRegex="^cali.*" ipVersion=0x4 route={Ifindex: 181359 Dst: 192.168.44.4/32 Src: <nil> Gw: <nil> Flags: [] Table: 254 Realm: 0} tableIndex=254
2023-09-29 12:30:14.551 [DEBUG][62] felix/route_table.go 1200: Interface still exists error=netlink update operation failed ifaceName="cali89cc739dd58" ifaceRegex="^cali.*" ipVersion=0x4 link=&netlink.Veth{LinkAttrs:netlink.LinkAttrs{Index:181359, MTU:1440, TxQLen:1000, Name:"cali89cc739dd58", HardwareAddr:net.HardwareAddr{0xee, 0xee, 0xee, 0xee, 0xee, 0xee}, Flags:0x13, RawFlags:0x11043, ParentIndex:8, MasterIndex:0, Namespace:interface {}(nil), Alias:"", Statistics:(*netlink.LinkStatistics)(0xc001b26480), Promisc:0, Allmulti:0, Multi:1, Xdp:(*netlink.LinkXdp)(0xc007944330), EncapType:"ether", Protinfo:(*netlink.Protinfo)(nil), OperState:0x6, PhysSwitchID:0, NetNsID:82, NumTxQueues:1, NumRxQueues:1, GSOMaxSize:0x10000, GSOMaxSegs:0xffff, GROMaxSize:0x0, Vfs:[]netlink.VfInfo(nil), Group:0x0, Slave:netlink.LinkSlave(nil)}, PeerName:"", PeerHardwareAddr:net.HardwareAddr(nil), PeerNamespace:interface {}(nil)} tableIndex=254
2023-09-29 12:30:14.551 [DEBUG][62] felix/route_table.go 1204: Failed to access interface but it appears to be up; retrying... error=netlink update operation failed ifaceName="cali89cc739dd58" ifaceRegex="^cali.*" ipVersion=0x4 link=&netlink.Veth{LinkAttrs:netlink.LinkAttrs{Index:181359, MTU:1440, TxQLen:1000, Name:"cali89cc739dd58", HardwareAddr:net.HardwareAddr{0xee, 0xee, 0xee, 0xee, 0xee, 0xee}, Flags:0x13, RawFlags:0x11043, ParentIndex:8, MasterIndex:0, Namespace:interface {}(nil), Alias:"", Statistics:(*netlink.LinkStatistics)(0xc001b26480), Promisc:0, Allmulti:0, Multi:1, Xdp:(*netlink.LinkXdp)(0xc007944330), EncapType:"ether", Protinfo:(*netlink.Protinfo)(nil), OperState:0x6, PhysSwitchID:0, NetNsID:82, NumTxQueues:1, NumRxQueues:1, GSOMaxSize:0x10000, GSOMaxSegs:0xffff, GROMaxSize:0x0, Vfs:[]netlink.VfInfo(nil), Group:0x0, Slave:netlink.LinkSlave(nil)}, PeerName:"", PeerHardwareAddr:net.HardwareAddr(nil), PeerNamespace:interface {}(nil)} tableIndex=254
2023-09-29 12:30:17.202 [INFO][62] felix/int_dataplane.go 1893: Received interface update msg=&intdataplane.ifaceStateUpdate{Name:"cali89cc739dd58", State:"down", Index:181359}
2023-09-29 12:30:17.202 [DEBUG][62] felix/endpoint_mgr.go 331: Received message msg=&intdataplane.ifaceStateUpdate{Name:"cali89cc739dd58", State:"down", Index:181359}
2023-09-29 12:30:17.202 [DEBUG][62] felix/endpoint_mgr.go 350: Interface state changed. update=&intdataplane.ifaceStateUpdate{Name:"cali89cc739dd58", State:"down", Index:181359}
2023-09-29 12:30:17.202 [DEBUG][62] felix/wireguard_mgr.go 64: Received message ipVersion=0x4 msg=&intdataplane.ifaceStateUpdate{Name:"cali89cc739dd58", State:"down", Index:181359}
2023-09-29 12:30:17.202 [INFO][62] felix/int_dataplane.go 1913: Received interface addresses update msg=&intdataplane.ifaceAddrsUpdate{Name:"cali89cc739dd58", Addrs:set.Typed[string]{}}
2023-09-29 12:30:17.202 [INFO][62] felix/hostip_mgr.go 84: Interface addrs changed. update=&intdataplane.ifaceAddrsUpdate{Name:"cali89cc739dd58", Addrs:set.Typed[string]{}}
2023-09-29 12:30:17.222 [DEBUG][62] felix/hostip_mgr.go 86: Not a real host interface, ignoring. update=&intdataplane.ifaceAddrsUpdate{Name:"cali89cc739dd58", Addrs:set.Typed[string]{}}
2023-09-29 12:30:17.224 [DEBUG][62] felix/endpoint_mgr.go 353: Interface addrs changed. update=&intdataplane.ifaceAddrsUpdate{Name:"cali89cc739dd58", Addrs:set.Typed[string]{}}
2023-09-29 12:30:17.224 [DEBUG][62] felix/endpoint_mgr.go 355: Workload interface, ignoring. update=&intdataplane.ifaceAddrsUpdate{Name:"cali89cc739dd58", Addrs:set.Typed[string]{}}
2023-09-29 12:30:17.224 [DEBUG][62] felix/endpoint_mgr.go 350: Interface state changed. update=&intdataplane.ifaceStateUpdate{Name:"cali89cc739dd58", State:"up", Index:181359}

Those lines puzzle me:

2023-09-29 12:30:14.530 [INFO][62] felix/endpoint_mgr.go 1283: Skipping configuration of interface because it is oper down. ifaceName="cali89cc739dd58"
2023-09-29 12:30:14.551 [DEBUG][62] felix/route_table.go 721: Failed to add route on first attempt, retrying... error=file exists 

On our cluster deployment, we use calico network policies to manage network flows. Can you tell me that all network policies must be activated / enabled on new pod deployment in order to UP the network interface ?

Your Environment

mgleung commented 1 year ago

@ymartin-ovh , cali interfaces are our virtual interfaces that we create to manage the networking for a container. I imagine that you are seeing some time for them to come up for an init container because init containers are created then removed so we should be creating a cali interface for it so that it has connectivity, and then removing it when the init container is removed.

On our cluster deployment, we use calico network policies to manage network flows. Can you tell me that all network policies must be activated / enabled on new pod deployment in order to UP the network interface ?

I don't think that network policies should be controlling the status of the network interfaces. Is there a specific reason that you've come to that conclusion? I would have thought that if we are looking at an init container, we would see some log messages like this since the interface should be removed when the container dies. @caseydavenport anything else you think we should check?

caseydavenport commented 1 year ago

because init containers are created then removed so we should be creating a cali interface for it so that it has connectivity, and then removing it when the init container is removed.

One thing I'd say - the interface shouldn't be removed after the init container finishes. The netns is shared between the init containers and other containers.

The Calico CNI plugin doesn't set the new interfacte to oper UP until it has completed its configuration of it. This is expected, and the logs you're seeing indicate normal operation of Felix, who will see the interface immediately before the CNI plugin has finished configuring it (you can see they are INFO and DEBUG level and not WARN or ERROR).

caseydavenport commented 1 year ago

Can you tell me that all network policies must be activated / enabled on new pod deployment in order to UP the network interface ?

To answer this specifically - no, we don't wait for all network policies to be enabled before setting oper UP. However, new interfaces will automatically receive no network connectivity until policies are applied for other reasons. Namely because the default rules drop traffic until Felix finishes its programming.

ymartin-ovh commented 1 year ago

@caseydavenport / @mgleung

Thanks for the feedback and details you gave me. I asked about interface state and network policy dependency because in my logs between 12:30:14 and 12:30:17, I see a lot of logs like:

[DEBUG][62] felix/rules.go 169: Hashed rule action=...

For now, I don't understand why interface that should be UP in few first line on logs is still in down state:

2023-09-29 12:30:13.074 [DEBUG][62] felix/iface_monitor.go 350: Interface changed state ifIndex=181359 ifaceName="cali89cc739dd58" newState="up" oldState="down"
2023-09-29 12:30:14.530 [INFO][62] felix/endpoint_mgr.go 1283: Skipping configuration of interface because it is oper down. ifaceName="cali89cc739dd58"

logs about interface: cali89cc739dd58.log

caseydavenport commented 1 year ago

[DEBUG][62] felix/rules.go 169: Hashed rule action=...

These logs just indicate the rules being programmed by Felix. They are at DEBUG level.

Felix is waiting until the CNI plugin marks the interface as UP before programming it.

sridhartigera commented 10 months ago

@ymartin-ovh Closing this issue. Feel free to reach out if you have further questions.