projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

Failed to attach bpf program to pod interface #8897

Closed: xander-sh closed this issue 1 month ago

xander-sh commented 3 months ago

Context

After upgrading Calico to version 3.27.2 in clusters running PostgreSQL managed by Patroni and deployed by the Zalando operator, we started to see errors in the database pods.

2024-06-01T07:40:03+03:00  2024-06-01 04:40:03 UTC [35341]: [1-1] 665aa623.8a0d 0 [unknown] [unknown] [unknown] 10.222.59.107 LOG:  connection received: host=10.222.59.107 port=35708
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03 UTC [35341]: [2-1] 665aa623.8a0d 0 airflow adm [unknown] 10.222.59.107 LOG:  connection authenticated: identity="adm" method=md5 (/home/postgres/pgdata/pgroot/data/pg_hba.conf:12)
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03 UTC [35341]: [3-1] 665aa623.8a0d 0 airflow adm [unknown] 10.222.59.107 LOG:  connection authorized: user=adm database=airflow SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256)
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03 UTC [35341]: [4-1] 665aa623.8a0d 0 airflow adm [unknown] 10.222.59.107 LOG:  disconnection: session time: 0:00:00.008 user=adm database=airflow host=10.222.59.107 port=35708
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03 UTC [35342]: [1-1] 665aa623.8a0e 0 [unknown] [unknown] [unknown] 10.222.20.175 LOG:  connection received: host=10.222.20.175 port=42608
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03 UTC [35342]: [2-1] 665aa623.8a0e 0 airflow adm [unknown] 10.222.20.175 LOG:  connection authenticated: identity="adm" method=md5 (/home/postgres/pgdata/pgroot/data/pg_hba.conf:12)
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03 UTC [35342]: [3-1] 665aa623.8a0e 0 airflow adm [unknown] 10.222.20.175 LOG:  connection authorized: user=adm database=airflow SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256)
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03 UTC [35342]: [4-1] 665aa623.8a0e 0 airflow adm [unknown] 10.222.20.175 LOG:  disconnection: session time: 0:00:00.012 user=adm database=airflow host=10.222.20.175 port=42608
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03 UTC [35344]: [1-1] 665aa623.8a10 0 [unknown] [unknown] [unknown] [local] LOG:  connection received: host=[local]
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03 UTC [35343]: [1-1] 665aa623.8a0f 0 [unknown] [unknown] [unknown] [local] LOG:  connection received: host=[local]
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03 UTC [35345]: [1-1] 665aa623.8a11 0 [unknown] [unknown] [unknown] [local] LOG:  connection received: host=[local]
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03 UTC [35343]: [2-1] 665aa623.8a0f 0 airflow postgres [unknown] [local] LOG:  connection authorized: user=postgres database=airflow application_name=pgq ticker
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03 UTC [35344]: [2-1] 665aa623.8a10 0 postgres postgres [unknown] [local] LOG:  connection authorized: user=postgres database=postgres application_name=pgq ticker
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03 UTC [35345]: [2-1] 665aa623.8a11 0 postgres postgres [unknown] [local] LOG:  connection authorized: user=postgres database=postgres application_name=pgq ticker
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03 UTC [35344]: [3-1] 665aa623.8a10 0 postgres postgres pgq ticker [local] LOG:  disconnection: session time: 0:00:00.002 user=postgres database=postgres host=[local]
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03 UTC [35343]: [3-1] 665aa623.8a0f 0 airflow postgres pgq ticker [local] LOG:  disconnection: session time: 0:00:00.003 user=postgres database=airflow host=[local]
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03 UTC [35345]: [3-1] 665aa623.8a11 0 postgres postgres pgq ticker [local] LOG:  disconnection: session time: 0:00:00.002 user=postgres database=postgres host=[local]
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.583 28 LOG {ticks: 0, maint: 0, retry: 0}
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03,952 INFO: no action. I am (newfmm-db-airflow-1), the leader with the lock
2024-06-01T07:40:18+03:00 2024-06-01 04:40:13,846 INFO: Lock owner: newfmm-db-airflow-1; I am newfmm-db-airflow-1
2024-06-01T07:40:18+03:00 2024-06-01 04:40:18,853 ERROR: Request to server https://10.222.64.1:443 failed: ReadTimeoutError("HTTPSConnectionPool(host='10.222.64.1', port=443): Read timed out. (read timeout=4.999803479760885)",)
2024-06-01T07:40:20+03:00 2024-06-01 04:40:20,501 WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7fef011b63c8>, 'Connection to 10.222.64.1 timed out. (connect timeout=1.115929365158081)')': /api/v1/namespaces/airflow/endpoints/newfmm-db-airflow
2024-06-01T07:40:21+03:00 2024-06-01 04:40:21,619 ERROR: Request to server https://10.222.64.1:443 failed: MaxRetryError("HTTPSConnectionPool(host='10.222.64.1', port=443): Max retries exceeded with url: /api/v1/namespaces/airflow/endpoints/newfmm-db-airflow (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7feef9d858d0>, 'Connection to 10.222.64.1 timed out. (connect timeout=1.115929365158081)'))",)
2024-06-01T07:40:23+03:00 2024-06-01 04:40:23,572 ERROR: Request to server https://10.222.64.1:443 failed: MaxRetryError("HTTPSConnectionPool(host='10.222.64.1', port=443): Max retries exceeded with url: /api/v1/namespaces/airflow/endpoints/newfmm-db-airflow (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x7feef9d855f8>, 'Connection to 10.222.64.1 timed out. (connect timeout=1)'))",)
2024-06-01T07:40:23+03:00 2024-06-01 04:40:23,572 INFO: Demoting self (immediate-nolock)
2024-06-01T07:40:23+03:00 2024-06-01 04:40:23,572 ERROR: failed to update leader lock

At the same time we see errors in calico-node:

2024-06-01T07:40:03+03:00  2024-06-01 04:40:03.948 [INFO][69] felix/calc_graph.go 507: Local endpoint updated id=WorkloadEndpoint(node=stage1-6495d69d65xsx98l-tkf2b, orchestrator=k8s, workload=airflow/newfmm-db-airflow-1, name=eth0)
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.949 [INFO][69] felix/int_dataplane.go 1954: Received *proto.WorkloadEndpointUpdate update from calculation graph msg=id:<orchestrator_id:"k8s" workload_id:"airflow/newfmm-db-airflow-1" endpoint_id:"eth0" > endpoint:<state:"active" name:"cali58c64a0c8be" profile_ids:"kns.airflow" profile_ids:"ksa.airflow.postgres-pod" ipv4_nets:"10.222.59.119/32" > 
2024-06-01T07:40:03+03:00 libbpf: Kernel error message: Filter with specified priority/protocol not found
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.968 [WARNING][69] felix/attach.go 171: Failed to attach to TC section cali_tc_preamble attachPoint=&tc.AttachPoint{AttachPoint:bpf.AttachPoint{Hook:1, PolicyIdx:24, Iface:"cali58c64a0c8be", LogLevel:"off"}, LogFilter:"", LogFilterIdx:-1, Type:"workload", ToOrFrom:"to", HookLayout:hook.Layout{0:45, 2:46, 3:47, 4:48, 6:49, 7:50}, HostIP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xa, 0xdc, 0x1, 0x3}, HostTunnelIP:net.IP(nil), IntfIP:net.IP{0xa9, 0xfe, 0x1, 0x1}, FIB:true, ToHostDrop:false, DSR:false, DSROptoutCIDRs:false, TunnelMTU:0x5aa, VXLANPort:0x12bf, WgPort:0x0, ExtToServiceConnmark:0x0, PSNATStart:0x4e20, PSNATEnd:0x752f, IPv6Enabled:false, RPFEnforceOption:0x2, NATin:0x0, NATout:0x0, UDPOnly:false}
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.968 [WARNING][69] felix/bpf_ep_mgr.go 1659: Failed to add policy to workload, removing from iptables allow list error=attaching program to wep: error attaching tc program invalid argument id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"}
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.968 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-0" ipVersion=0x4 table="filter"
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.968 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-6" ipVersion=0x4 table="filter"
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.968 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-9" ipVersion=0x4 table="filter"
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.968 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-c" ipVersion=0x4 table="filter"
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.968 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch" ipVersion=0x4 table="filter"
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.968 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-2" ipVersion=0x4 table="filter"
2024-06-01T07:40:03+03:00 libbpf: Kernel error message: Filter with specified priority/protocol not found
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.968 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-8" ipVersion=0x4 table="filter"
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.968 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-a" ipVersion=0x4 table="filter"
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.968 [WARNING][69] felix/attach.go 171: Failed to attach to TC section cali_tc_preamble attachPoint=&tc.AttachPoint{AttachPoint:bpf.AttachPoint{Hook:0, PolicyIdx:25, Iface:"cali58c64a0c8be", LogLevel:"off"}, LogFilter:"", LogFilterIdx:-1, Type:"workload", ToOrFrom:"from", HookLayout:hook.Layout{0:39, 2:40, 3:41, 4:42, 6:43, 7:44}, HostIP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xa, 0xdc, 0x1, 0x3}, HostTunnelIP:net.IP(nil), IntfIP:net.IP{0xa9, 0xfe, 0x1, 0x1}, FIB:true, ToHostDrop:false, DSR:false, DSROptoutCIDRs:false, TunnelMTU:0x5aa, VXLANPort:0x12bf, WgPort:0x0, ExtToServiceConnmark:0x0, PSNATStart:0x4e20, PSNATEnd:0x752f, IPv6Enabled:false, RPFEnforceOption:0x2, NATin:0x0, NATout:0x0, UDPOnly:false}
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.968 [WARNING][69] felix/bpf_ep_mgr.go 1675: Failed to apply policy to endpoint, leaving it dirty error=attaching program to wep: error attaching tc program invalid argument name="cali58c64a0c8be" wepID=orchestrator_id:"k8s" workload_id:"airflow/newfmm-db-airflow-1" endpoint_id:"eth0" 
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.968 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-d" ipVersion=0x4 table="filter"
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.969 [INFO][69] felix/endpoint_mgr.go 648: Updating endpoint routes. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"}
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.969 [INFO][69] felix/endpoint_mgr.go 490: Re-evaluated workload endpoint status adminUp=true failed=false known=true operUp=true status="up" workloadEndpointID=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"}
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.968 [INFO][69] felix/endpoint_mgr.go 602: Updating per-endpoint chains. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"}
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.969 [INFO][69] felix/status_combiner.go 58: Storing endpoint status update ipVersion=0x4 status="up" workload=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"}
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.969 [INFO][69] felix/endpoint_mgr.go 1215: Applying /proc/sys configuration to interface. ifaceName="cali58c64a0c8be"
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.976 [INFO][69] felix/status_combiner.go 81: Endpoint up for at least one IP version id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"} ipVersion=0x4 status="up"
2024-06-01T07:40:03+03:00 2024-06-01 04:40:03.976 [INFO][69] felix/status_combiner.go 98: Reporting combined status. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"} status="up"
2024-06-01T07:40:08+03:00 libbpf: Kernel error message: Filter with specified priority/protocol not found
2024-06-01T07:40:08+03:00 2024-06-01 04:40:08.995 [WARNING][69] felix/attach.go 171: Failed to attach to TC section cali_tc_preamble attachPoint=&tc.AttachPoint{AttachPoint:bpf.AttachPoint{Hook:1, PolicyIdx:24, Iface:"cali58c64a0c8be", LogLevel:"off"}, LogFilter:"", LogFilterIdx:-1, Type:"workload", ToOrFrom:"to", HookLayout:hook.Layout{0:45, 2:46, 3:47, 4:48, 6:49, 7:50}, HostIP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xa, 0xdc, 0x1, 0x3}, HostTunnelIP:net.IP(nil), IntfIP:net.IP{0xa9, 0xfe, 0x1, 0x1}, FIB:true, ToHostDrop:false, DSR:false, DSROptoutCIDRs:false, TunnelMTU:0x5aa, VXLANPort:0x12bf, WgPort:0x0, ExtToServiceConnmark:0x0, PSNATStart:0x4e20, PSNATEnd:0x752f, IPv6Enabled:false, RPFEnforceOption:0x2, NATin:0x0, NATout:0x0, UDPOnly:false}
2024-06-01T07:40:08+03:00 2024-06-01 04:40:08.995 [WARNING][69] felix/bpf_ep_mgr.go 1675: Failed to apply policy to endpoint, leaving it dirty error=attaching program to wep: error attaching tc program invalid argument name="cali58c64a0c8be" wepID=orchestrator_id:"k8s" workload_id:"airflow/newfmm-db-airflow-1" endpoint_id:"eth0" 
2024-06-01T07:40:08+03:00 libbpf: Kernel error message: Filter with specified priority/protocol not found
2024-06-01T07:40:08+03:00 2024-06-01 04:40:08.995 [WARNING][69] felix/attach.go 171: Failed to attach to TC section cali_tc_preamble attachPoint=&tc.AttachPoint{AttachPoint:bpf.AttachPoint{Hook:0, PolicyIdx:25, Iface:"cali58c64a0c8be", LogLevel:"off"}, LogFilter:"", LogFilterIdx:-1, Type:"workload", ToOrFrom:"from", HookLayout:hook.Layout{0:39, 2:40, 3:41, 4:42, 6:43, 7:44}, HostIP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xa, 0xdc, 0x1, 0x3}, HostTunnelIP:net.IP(nil), IntfIP:net.IP{0xa9, 0xfe, 0x1, 0x1}, FIB:true, ToHostDrop:false, DSR:false, DSROptoutCIDRs:false, TunnelMTU:0x5aa, VXLANPort:0x12bf, WgPort:0x0, ExtToServiceConnmark:0x0, PSNATStart:0x4e20, PSNATEnd:0x752f, IPv6Enabled:false, RPFEnforceOption:0x2, NATin:0x0, NATout:0x0, UDPOnly:false}

If we check the logs before those errors, we can see that BPF programs had been re-applied to that interface every 10 seconds for a long period of time.

2024-06-01T07:39:33+03:00  2024-06-01 04:39:33.953 [INFO][69] felix/int_dataplane.go 1954: Received *proto.WorkloadEndpointUpdate update from calculation graph msg=id:<orchestrator_id:"k8s" workload_id:"airflow/newfmm-db-airflow-1" endpoint_id:"eth0" > endpoint:<state:"active" name:"cali58c64a0c8be" profile_ids:"kns.airflow" profile_ids:"ksa.airflow.postgres-pod" ipv4_nets:"10.222.59.119/32" > 
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.953 [INFO][69] felix/calc_graph.go 507: Local endpoint updated id=WorkloadEndpoint(node=stage1-6495d69d65xsx98l-tkf2b, orchestrator=k8s, workload=airflow/newfmm-db-airflow-1, name=eth0)
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.966 [INFO][69] felix/bpf_ep_mgr.go 2539: /proc/sys/net/ipv4/conf/cali58c64a0c8be/accept_local set to 1
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.966 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-0" ipVersion=0x4 table="filter"
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.966 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-6" ipVersion=0x4 table="filter"
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.966 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-9" ipVersion=0x4 table="filter"
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.966 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-c" ipVersion=0x4 table="filter"
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.966 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch" ipVersion=0x4 table="filter"
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.966 [INFO][69] felix/bpf_ep_mgr.go 1824: Finished applying BPF programs for workload ifaceName="cali58c64a0c8be" timeTaken=12.953379ms
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.966 [INFO][69] felix/bpf_ep_mgr.go 1644: Updated workload interface. iface="cali58c64a0c8be"
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.966 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-2" ipVersion=0x4 table="filter"
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.966 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-8" ipVersion=0x4 table="filter"
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.966 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-a" ipVersion=0x4 table="filter"
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.966 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-d" ipVersion=0x4 table="filter"
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.967 [INFO][69] felix/endpoint_mgr.go 648: Updating endpoint routes. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"}
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.967 [INFO][69] felix/endpoint_mgr.go 490: Re-evaluated workload endpoint status adminUp=true failed=false known=true operUp=true status="up" workloadEndpointID=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"}
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.966 [INFO][69] felix/endpoint_mgr.go 602: Updating per-endpoint chains. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"}
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.967 [INFO][69] felix/endpoint_mgr.go 1215: Applying /proc/sys configuration to interface. ifaceName="cali58c64a0c8be"
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.967 [INFO][69] felix/status_combiner.go 58: Storing endpoint status update ipVersion=0x4 status="up" workload=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"}
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.971 [INFO][69] felix/status_combiner.go 81: Endpoint up for at least one IP version id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"} ipVersion=0x4 status="up"
2024-06-01T07:39:33+03:00 2024-06-01 04:39:33.971 [INFO][69] felix/status_combiner.go 98: Reporting combined status. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"} status="up"
2024-06-01T07:39:43+03:00 2024-06-01 04:39:43.950 [INFO][69] felix/calc_graph.go 507: Local endpoint updated id=WorkloadEndpoint(node=stage1-6495d69d65xsx98l-tkf2b, orchestrator=k8s, workload=airflow/newfmm-db-airflow-1, name=eth0)
2024-06-01T07:39:43+03:00 2024-06-01 04:39:43.950 [INFO][69] felix/int_dataplane.go 1954: Received *proto.WorkloadEndpointUpdate update from calculation graph msg=id:<orchestrator_id:"k8s" workload_id:"airflow/newfmm-db-airflow-1" endpoint_id:"eth0" > endpoint:<state:"active" name:"cali58c64a0c8be" profile_ids:"kns.airflow" profile_ids:"ksa.airflow.postgres-pod" ipv4_nets:"10.222.59.119/32" > 
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.003 [INFO][69] felix/attach.go 174: Program attached to TC. attachPoint=&tc.AttachPoint{AttachPoint:bpf.AttachPoint{Hook:1, PolicyIdx:24, Iface:"cali58c64a0c8be", LogLevel:"off"}, LogFilter:"", LogFilterIdx:-1, Type:"workload", ToOrFrom:"to", HookLayout:hook.Layout{0:45, 2:46, 3:47, 4:48, 6:49, 7:50}, HostIP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xa, 0xdc, 0x1, 0x3}, HostTunnelIP:net.IP(nil), IntfIP:net.IP{0xa9, 0xfe, 0x1, 0x1}, FIB:true, ToHostDrop:false, DSR:false, DSROptoutCIDRs:false, TunnelMTU:0x5aa, VXLANPort:0x12bf, WgPort:0x0, ExtToServiceConnmark:0x0, PSNATStart:0x4e20, PSNATEnd:0x752f, IPv6Enabled:false, RPFEnforceOption:0x2, NATin:0x0, NATout:0x0, UDPOnly:false}
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.035 [INFO][69] felix/attach.go 174: Program attached to TC. attachPoint=&tc.AttachPoint{AttachPoint:bpf.AttachPoint{Hook:0, PolicyIdx:25, Iface:"cali58c64a0c8be", LogLevel:"off"}, LogFilter:"", LogFilterIdx:-1, Type:"workload", ToOrFrom:"from", HookLayout:hook.Layout{0:39, 2:40, 3:41, 4:42, 6:43, 7:44}, HostIP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xa, 0xdc, 0x1, 0x3}, HostTunnelIP:net.IP(nil), IntfIP:net.IP{0xa9, 0xfe, 0x1, 0x1}, FIB:true, ToHostDrop:false, DSR:false, DSROptoutCIDRs:false, TunnelMTU:0x5aa, VXLANPort:0x12bf, WgPort:0x0, ExtToServiceConnmark:0x0, PSNATStart:0x4e20, PSNATEnd:0x752f, IPv6Enabled:false, RPFEnforceOption:0x2, NATin:0x0, NATout:0x0, UDPOnly:false}
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.039 [INFO][69] felix/bpf_ep_mgr.go 1905: Attached programs to the WEP hook=egress iface="cali58c64a0c8be" type="workload"
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.040 [INFO][69] felix/bpf_ep_mgr.go 1905: Attached programs to the WEP hook=ingress iface="cali58c64a0c8be" type="workload"
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.040 [INFO][69] felix/bpf_ep_mgr.go 1824: Finished applying BPF programs for workload ifaceName="cali58c64a0c8be" timeTaken=89.975517ms
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.041 [INFO][69] felix/bpf_ep_mgr.go 2539: /proc/sys/net/ipv4/conf/cali58c64a0c8be/accept_local set to 1
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.041 [INFO][69] felix/bpf_ep_mgr.go 1644: Updated workload interface. iface="cali58c64a0c8be"
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.041 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-0" ipVersion=0x4 table="filter"
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.041 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-2" ipVersion=0x4 table="filter"
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.041 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-6" ipVersion=0x4 table="filter"
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.041 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-8" ipVersion=0x4 table="filter"
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.041 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-9" ipVersion=0x4 table="filter"
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.041 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-a" ipVersion=0x4 table="filter"
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.041 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-c" ipVersion=0x4 table="filter"
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.041 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch-d" ipVersion=0x4 table="filter"
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.041 [INFO][69] felix/table.go 521: Queueing update of chain. chainName="cali-to-wl-dispatch" ipVersion=0x4 table="filter"
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.041 [INFO][69] felix/endpoint_mgr.go 602: Updating per-endpoint chains. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"}
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.042 [INFO][69] felix/endpoint_mgr.go 648: Updating endpoint routes. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"}
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.044 [INFO][69] felix/endpoint_mgr.go 1215: Applying /proc/sys configuration to interface. ifaceName="cali58c64a0c8be"
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.044 [INFO][69] felix/endpoint_mgr.go 490: Re-evaluated workload endpoint status adminUp=true failed=false known=true operUp=true status="up" workloadEndpointID=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"}
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.044 [INFO][69] felix/status_combiner.go 58: Storing endpoint status update ipVersion=0x4 status="up" workload=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"}
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.049 [INFO][69] felix/status_combiner.go 81: Endpoint up for at least one IP version id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"} ipVersion=0x4 status="up"
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.049 [INFO][69] felix/status_combiner.go 98: Reporting combined status. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"airflow/newfmm-db-airflow-1", EndpointId:"eth0"} status="up"
2024-06-01T07:39:44+03:00 2024-06-01 04:39:44.049 [INFO][69] felix/summary.go 100: Summarising 17 dataplane reconciliation loops over 1m0s: avg=20ms longest=99ms (resync-filter-v4,update-workload-iface)

The interface cali58c64a0c8be lacks some BPF programs. Below, calid2e5dc9 is a healthy interface and cali58c64a0 is the faulty one:

root@stage1-6495d69d65xsx98l-tkf2b:/# bpftool prog | grep calid2e5d
7766942: sched_cls  name p4E_calid2e5dc9  tag 5126a1005cd55b97
7766943: sched_cls  name p4I_calid2e5dc9  tag 456628228769520b
8131461: sched_cls  name p4I_calid2e5dc9  tag 456628228769520b
8131483: sched_cls  name p4E_calid2e5dc9  tag 5126a1005cd55b97
root@stage1-6495d69d65xsx98l-tkf2b:/# bpftool prog | grep cali58c64a0
6479986: sched_cls  name p4i_cali58c64a0  tag 0dbf0d10b02282f0
6479987: sched_cls  name p4e_cali58c64a0  tag b1d839878b133ee8
root@stage1-6495d69d65xsx98l-tkf2b:/# ls  /sys/fs/bpf/tc/cali58c64a0c8be_*
/sys/fs/bpf/tc/cali58c64a0c8be_egr:
cali_jump2

/sys/fs/bpf/tc/cali58c64a0c8be_igr:
cali_jump2
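
To dig further into those pins, the pinned cali_jump2 map is the program array the cali_tc_preamble program tail-calls through, so dumping it shows whether the per-hook programs are wired in (a diagnostic sketch, assuming bpftool is available on the node):

bpftool map dump pinned /sys/fs/bpf/tc/cali58c64a0c8be_igr/cali_jump2
bpftool map dump pinned /sys/fs/bpf/tc/cali58c64a0c8be_egr/cali_jump2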

Using calico-node -bpf we see the following information:

$ kubectl -n kube-system exec calico-node-lggqf -- calico-node -bpf ifstate dump | grep cali58c64a0c8be
Defaulted container "calico-node" out of: calico-node, upgrade-ipam (init), install-cni (init), mount-bpffs (init)
 2656 : {flags: workload,ready, XDPPolicy: -1, IngressPolicy: 24, EgressPolicy: 25, IngressFilter: -1, EgressFilter: -1, name: cali58c64a0c8be}

$ kubectl -n kube-system exec calico-node-lggqf -- calico-node -bpf counters dump --iface=cali58c64a0c8be
Defaulted container "calico-node" out of: calico-node, upgrade-ipam (init), install-cni (init), mount-bpffs (init)
+----------+--------------------------------+-----------+---------+-----+
| CATEGORY |              TYPE              |  INGRESS  | EGRESS  | XDP |
+----------+--------------------------------+-----------+---------+-----+
| Accepted | by another program             |         0 |       0 | N/A |
|          | by failsafe                    |         0 |       0 | N/A |
|          | by policy                      |   1046273 | 1139052 | N/A |
| Dropped  | by policy                      |         0 |       0 | N/A |
|          | failed decapsulation           |         0 |       0 | N/A |
|          | failed encapsulation           |         0 |       0 | N/A |
|          | incorrect checksum             |         0 |       0 | N/A |
|          | malformed IP packets           |         0 |       0 | N/A |
|          | packets with unknown route     |         0 |       0 | N/A |
|          | packets with unknown source    |         0 |       0 | N/A |
|          | packets with unsupported IP    |         0 |       0 | N/A |
|          | options                        |           |         |     |
|          | too short packets              |         0 |       0 | N/A |
| Total    | packets                        | 166970130 | 2528111 | N/A |
+----------+--------------------------------+-----------+---------+-----+
###dumped cali58c64a0c8be counters.

$ kubectl -n kube-system exec calico-node-lggqf -- calico-node -bpf policy dump cali58c64a0c8be all
Defaulted container "calico-node" out of: calico-node, upgrade-ipam (init), install-cni (init), mount-bpffs (init)
IfaceName: cali58c64a0c8be
Hook: tc egress
Error:
Policy Info:
// Start of rule action:"allow" rule_id:"cKE9PSPlsXvdnn0_"
// count = 0
IfaceName: cali58c64a0c8be
Hook: tc ingress
Error:
Policy Info:
// Start of rule action:"allow" rule_id:"QSpjq8vswFLjKjfu"
// count = 53409
2024-06-10 06:55:18.476 [ERROR][11151] confd/policy_debug.go 79: Failed to dump policy info. error=stat /var/run/calico/bpf/policy/cali58c64a0c8be_xdp_v4.json: no such file or directory

Expected Behavior

These errors should not happen, and the connection should stay stable.

Current Behavior

After the events described in the context, the pod is not able to establish a connection to anything anymore.


tomastigera commented 3 months ago
2024-06-01T07:40:08+03:00 libbpf: Kernel error message: Filter with specified priority/protocol not found
2024-06-01T07:40:08+03:00 2024-06-01 04:40:08.995 [WARNING][69] felix/attach.go 171: Failed to attach to TC section cali_tc_preamble attachPoint=&tc.AttachPoint{AttachPoint:bpf.AttachPoint{Hook:1, PolicyIdx:24, Iface:"cali58c64a0c8be", LogLevel:"off"}, LogFilter:"", LogFilterIdx:-1, Type:"workload", ToOrFrom:"to", HookLayout:hook.Layout{0:45, 2:46, 3:47, 4:48, 6:49, 7:50}, HostIP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xa, 0xdc, 0x1, 0x3}, HostTunnelIP:net.IP(nil), IntfIP:net.IP{0xa9, 0xfe, 0x1, 0x1}, FIB:true, ToHostDrop:false, DSR:false, DSROptoutCIDRs:false, TunnelMTU:0x5aa, VXLANPort:0x12bf, WgPort:0x0, ExtToServiceConnmark:0x0, PSNATStart:0x4e20, PSNATEnd:0x752f, IPv6Enabled:false, RPFEnforceOption:0x2, NATin:0x0, NATout:0x0, UDPOnly:false}
2024-06-01T07:40:08+03:00 2024-06-01 04:40:08.995 [WARNING][69] felix/bpf_ep_mgr.go 1675: Failed to apply policy to endpoint, leaving it dirty error=attaching program to wep: error attaching tc program invalid argument name="cali58c64a0c8be" wepID=orchestrator_id:"k8s" workload_id:"airflow/newfmm-db-airflow-1" endpoint_id:"eth0" 
2024-06-01T07:40:08+03:00 libbpf: Kernel error message: Filter with specified priority/protocol not found
2024-06-01T07:40:08+03:00 2024-06-01 04:40:08.995 [WARNING][69] felix/attach.go 171: Failed to attach to TC section cali_tc_preamble attachPoint=&tc.AttachPoint{AttachPoint:bpf.AttachPoint{Hook:0, PolicyIdx:25, Iface:"cali58c64a0c8be", LogLevel:"off"}, LogFilter:"", LogFilterIdx:-1, Type:"workload", ToOrFrom:"from", HookLayout:hook.Layout{0:39, 2:40, 3:41, 4:42, 6:43, 7:44}, HostIP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xa, 0xdc, 0x1, 0x3}, HostTunnelIP:net.IP(nil), IntfIP:net.IP{0xa9, 0xfe, 0x1, 0x1}, FIB:true, ToHostDrop:false, DSR:false, DSROptoutCIDRs:false, TunnelMTU:0x5aa, VXLANPort:0x12bf, WgPort:0x0, ExtToServiceConnmark:0x0, PSNATStart:0x4e20, PSNATEnd:0x752f, IPv6Enabled:false, RPFEnforceOption:0x2, NATin:0x0, NATout:0x0, UDPOnly:false}

I have never seen that libbpf error. Might it be that there is no qdisc set up on the device? Could you use tc to show what qdiscs are attached?
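
For example (a sketch, run on the node): Calico's BPF mode attaches the preamble programs as bpf filters on a clsact qdisc, so a healthy interface should show a filter on both hooks.

tc qdisc show dev cali58c64a0c8be
tc filter show dev cali58c64a0c8be ingress
tc filter show dev cali58c64a0c8be egress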

root@stage1-6495d69d65xsx98l-tkf2b:/# bpftool prog | grep calid2e5d
7766942: sched_cls  name p4E_calid2e5dc9  tag 5126a1005cd55b97
7766943: sched_cls  name p4I_calid2e5dc9  tag 456628228769520b
8131461: sched_cls  name p4I_calid2e5dc9  tag 456628228769520b
8131483: sched_cls  name p4E_calid2e5dc9  tag 5126a1005cd55b97
root@stage1-6495d69d65xsx98l-tkf2b:/# bpftool prog | grep cali58c64a0
6479986: sched_cls  name p4i_cali58c64a0  tag 0dbf0d10b02282f0
6479987: sched_cls  name p4e_cali58c64a0  tag b1d839878b133ee8

It is weird to see the policy programs attached twice. Also, the lowercase [ie] vs uppercase [EI] in the names is strange; it might be some upgrade issue :thinking:

What version did you upgrade from?

tomastigera commented 3 months ago

You can use bpftool net to see which programs are directly connected to the device. Those :arrow_up: policy programs are eventually called in a chain of programs.
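
For instance (a sketch; bpftool net supports filtering by device):

bpftool net show dev cali58c64a0c8be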

tomastigera commented 3 months ago

Given the lowercase e in p4e_cali58c64a0, it seems like you upgraded from 3.25-ish? It looks like some cleanup did not happen.

The error message comes from https://github.com/torvalds/linux/blob/92e5605a199efbaee59fb19e15d6cc2103a04ec2/net/sched/cls_api.c#L2316
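
That code path is hit when a tc filter delete (or get) names a priority/protocol with no filter behind it. A standalone way to provoke the same extack message, as a sketch on a hypothetical test veth (assumption: exact tc output wording may vary with the iproute2 version):

ip link add veth-repro type veth peer name veth-repro-p   # hypothetical test device
tc qdisc add dev veth-repro clsact
tc filter del dev veth-repro ingress protocol ip pref 1   # nothing attached at this pref
ip link del veth-repro                                    # clean up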

Are the failing pods those that existed before the upgrade, or does it happen to new pods as well?

xander-sh commented 3 months ago

We upgraded from Calico 3.25.0 (applied the new manifest without deleting the old version).

Are the failing pods those that existed before the upgrade, or does it happen to new pods as well?

It happens with new pods as well (the StatefulSet recreates them with the same pod name).

You can use bpftool net to see which programs are directly connected to the device

I restarted the StatefulSet pod (the network connection in the pod is normal):

NAMESPACE   NAME                READY    STATUS      RESTARTS     AGE     IP              NODE
airflow     newfmm-db-airflow-0  2/2     Running     0            6m10s   10.222.47.32    stage1-6495d69d65xsx98l-7s8td

❯ k exec -it newfmm-db-airflow-0 sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Defaulted container "postgres" out of: postgres, prometheus-postgres-exporter
# curl -k https://kubernetes.default
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {
  },
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {
  },
  "code": 403
}

root@stage1-6495d69d65xsx98l-7s8td:/# ip ro get 10.222.47.32
10.222.47.32 dev calieca2ba1ed43 src 10.220.1.11 uid 0
    cache
root@stage1-6495d69d65xsx98l-7s8td:/# bpftool net| grep calieca2
calieca2ba1ed43(3984) clsact/ingress cali_tc_preambl:[195661] id 195661
calieca2ba1ed43(3984) clsact/egress cali_tc_preambl:[195660] id 195660

After restarting PostgreSQL, we will watch for it to crash again and collect information from bpftool and tc.
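
When it breaks again, something like this could be captured in one go (a collection sketch; the interface name comes from ip ro get <pod-ip> as above):

IFACE=calieca2ba1ed43                  # the pod's calico interface
tc qdisc show dev "$IFACE"
tc -s filter show dev "$IFACE" ingress
bpftool net show dev "$IFACE"          # programs attached via clsact
bpftool prog | grep "${IFACE:0:11}"    # policy programs embed a truncated iface name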

xander-sh commented 2 months ago

After 4 days, a container in the pod transitioned to an unhealthy state.

❯ k -n airflow get pods -o wide newfmm-db-airflow-0
NAME                  READY   STATUS    RESTARTS   AGE    IP             NODE                            NOMINATED NODE   READINESS GATES
newfmm-db-airflow-0   1/2     Running   0          4d5h   10.222.47.32   stage1-6495d69d65xsx98l-7s8td   <none>           <none>

Some diagnostics:

root@stage1-6495d69d65xsx98l-7s8td:/# ip ro get 10.222.47.32
10.222.47.32 dev calieca2ba1ed43 src 10.220.1.11 uid 0
    cache

root@stage1-6495d69d65xsx98l-7s8td:/# bpftool prog | grep calid2b
# empty :( 
root@stage1-6495d69d65xsx98l-7s8td:/# bpftool net| grep calieca2
calieca2ba1ed43(3984) clsact/ingress cali_tc_preambl:[400338] id 400338
calieca2ba1ed43(3984) clsact/egress cali_tc_preambl:[400339] id 400339

[root@stage1-6495d69d65xsx98l-7s8td /]# tc -s qdisc show dev calieca2ba1ed43
qdisc noqueue 0: root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc clsact ffff: parent ffff:fff1
 Sent 191201052976 bytes 376857785 pkt (dropped 53362, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
# count of dropped packets increases every second
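
To confirm the drops are ongoing, the clsact counter above can be watched (sketch):

watch -n1 'tc -s qdisc show dev calieca2ba1ed43 | grep dropped'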

Logs from calico-node at debug level; more logs are dumped in the attached 0.log.


2024-06-23T18:22:34.654135139+03:00 stdout F 2024-06-23 15:22:34.654 [DEBUG][58967] felix/bpf_ep_mgr.go 1430: Ignoring interface that doesn't match the host data/l3 interface regex iface="calieca2ba1ed43"
2024-06-23T18:22:34.658959833+03:00 stdout F 2024-06-23 15:22:34.658 [DEBUG][58967] felix/attach.go 353: Already have a clsact qdisc on this interface iface="calieca2ba1ed43"
2024-06-23T18:22:34.65897197+03:00 stdout F 2024-06-23 15:22:34.658 [DEBUG][58967] felix/bpf_ep_mgr.go 1896: readiness: 0 iface="calieca2ba1ed43"
2024-06-23T18:22:34.659075818+03:00 stdout F 2024-06-23 15:22:34.658 [DEBUG][58967] felix/bpf_ep_mgr.go 2742: ensureProgramAttached 4 hook=ingress iface="calieca2ba1ed43" type="workload"
2024-06-23T18:22:34.659079655+03:00 stdout F 2024-06-23 15:22:34.658 [DEBUG][58967] felix/map.go 130: needs generic object file from_wep_fib_no_log_co-re.o AttachType=hook.AttachType{Hook:0, Family:4, Type:"workload", LogLevel:"off", FIB:true, ToHostDrop:false, DSR:false}
2024-06-23T18:22:34.659081934+03:00 stdout F 2024-06-23 15:22:34.658 [DEBUG][58967] felix/bpf_ep_mgr.go 1896: readiness: 0 iface="calieca2ba1ed43"
2024-06-23T18:22:34.659166532+03:00 stdout F 2024-06-23 15:22:34.659 [DEBUG][58967] felix/map.go 133: generic object file already loaded from_wep_fib_no_log_co-re.o layout=hook.Layout{0:39, 2:40, 3:41, 4:42, 6:43, 7:44}
2024-06-23T18:22:34.659169858+03:00 stdout F 2024-06-23 15:22:34.659 [DEBUG][58967] felix/bpf_ep_mgr.go 2742: ensureProgramAttached 4 hook=egress iface="calieca2ba1ed43" type="workload"
2024-06-23T18:22:34.659171724+03:00 stdout F 2024-06-23 15:22:34.659 [DEBUG][58967] felix/syscall.go 66: UpdateMapEntry(28, [62 0 0 0], [99 0 0 0])
2024-06-23T18:22:34.65917393+03:00 stdout F 2024-06-23 15:22:34.659 [DEBUG][58967] felix/map.go 130: needs generic object file to_wep_no_log_co-re.o AttachType=hook.AttachType{Hook:1, Family:4, Type:"workload", LogLevel:"off", FIB:true, ToHostDrop:false, DSR:false}
2024-06-23T18:22:34.65921784+03:00 stdout F 2024-06-23 15:22:34.659 [DEBUG][58967] felix/syscall.go 119: Map metadata fd=0x1c mapInfo=&maps.MapInfo{Type:3, KeySize:4, ValueSize:4, MaxEntries:10000}
2024-06-23T18:22:34.65933278+03:00 stdout F 2024-06-23 15:22:34.659 [DEBUG][58967] felix/map.go 133: generic object file already loaded to_wep_no_log_co-re.o layout=hook.Layout{0:45, 2:46, 3:47, 4:48, 6:49, 7:50}
2024-06-23T18:22:34.659336321+03:00 stdout F 2024-06-23 15:22:34.659 [DEBUG][58967] felix/syscall.go 66: UpdateMapEntry(28, [63 0 0 0], [99 0 0 0])
2024-06-23T18:22:34.659417834+03:00 stdout F 2024-06-23 15:22:34.659 [DEBUG][58967] felix/syscall.go 119: Map metadata fd=0x1c mapInfo=&maps.MapInfo{Type:3, KeySize:4, ValueSize:4, MaxEntries:10000}
2024-06-23T18:22:34.664465985+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 291: Found old calico program prog=tc.attachedProg{pref:"32768", handle:"0x1"}
2024-06-23T18:22:34.664703578+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 291: Found old calico program prog=tc.attachedProg{pref:"32768", handle:"0x1"}
2024-06-23T18:22:34.664710538+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_jump2 k 4 v 4
2024-06-23T18:22:34.66471307+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_progs2 k 4 v 4
2024-06-23T18:22:34.664794065+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_v4_arp2 k 8 v 12
2024-06-23T18:22:34.664797995+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_v4_ct3 k 16 v 88
2024-06-23T18:22:34.664800341+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_v4_nat_fe3 k 16 v 20
2024-06-23T18:22:34.66492409+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_v4_nat_be k 8 v 8
2024-06-23T18:22:34.664934394+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_v4_nat_aff k 16 v 16
2024-06-23T18:22:34.664936359+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_jump2 k 4 v 4
2024-06-23T18:22:34.664938181+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_state4 k 4 v 464
2024-06-23T18:22:34.664979423+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_progs2 k 4 v 4
2024-06-23T18:22:34.664989684+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 452: ConfigureProgram HookLayout=hook.Layout{0:39, 2:40, 3:41, 4:42, 6:43, 7:44}
2024-06-23T18:22:34.665120085+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_v4_arp2 k 8 v 12
2024-06-23T18:22:34.665124979+03:00 stdout F 2024-06-23 15:22:34.664 [DEBUG][58967] felix/attach.go 104: Pinning map cali_v4_ct3 k 16 v 88
2024-06-23T18:22:34.665140656+03:00 stdout F 2024-06-23 15:22:34.665 [DEBUG][58967] felix/attach.go 104: Pinning map cali_v4_nat_fe3 k 16 v 20
2024-06-23T18:22:34.665142986+03:00 stdout F 2024-06-23 15:22:34.665 [DEBUG][58967] felix/attach.go 104: Pinning map cali_v4_nat_be k 8 v 8
2024-06-23T18:22:34.665278386+03:00 stdout F 2024-06-23 15:22:34.665 [DEBUG][58967] felix/attach.go 104: Pinning map cali_v4_nat_aff k 16 v 16
2024-06-23T18:22:34.665282781+03:00 stdout F 2024-06-23 15:22:34.665 [DEBUG][58967] felix/attach.go 104: Pinning map cali_state4 k 4 v 464
2024-06-23T18:22:34.665401168+03:00 stdout F 2024-06-23 15:22:34.665 [DEBUG][58967] felix/attach.go 452: ConfigureProgram HookLayout=hook.Layout{0:45, 2:46, 3:47, 4:48, 6:49, 7:50}
2024-06-23T18:22:34.666112544+03:00 stdout F libbpf: Kernel error message: Filter with specified priority/protocol not found
2024-06-23T18:22:34.666293187+03:00 stdout F 2024-06-23 15:22:34.666 [WARNING][58967] felix/attach.go 171: Failed to attach to TC section cali_tc_preamble attachPoint=&tc.AttachPoint{AttachPoint:bpf.AttachPoint{Hook:0, PolicyIdx:62, Iface:"calieca2ba1ed43", LogLevel:"off"}, LogFilte
r:"", LogFilterIdx:-1, Type:"workload", ToOrFrom:"from", HookLayout:hook.Layout{0:39, 2:40, 3:41, 4:42, 6:43, 7:44}, HostIP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xa, 0xdc, 0x1, 0xb}, HostTunnelIP:net.IP(nil), IntfIP:net.IP{0xa9, 0xfe, 0x1, 0x1}, FIB:true, ToHostDrop:false, DSR:false, DSROptoutCIDRs:false, TunnelMTU:0x5aa, VXLANPort:0x12bf, WgPort:0x0, ExtToServiceConnmark:0x0, PSNATStart:0x4e20, PSNATEnd:0x752f, IPv6Enabled:false, RPFEnforceOption:0x2, NATin:0x0, NATout:0x0, UDPOnly:false}
2024-06-23T18:22:34.667112806+03:00 stdout F libbpf: Kernel error message: Filter with specified priority/protocol not found
2024-06-23T18:22:34.66730246+03:00 stdout F 2024-06-23 15:22:34.667 [WARNING][58967] felix/attach.go 171: Failed to attach to TC section cali_tc_preamble attachPoint=&tc.AttachPoint{AttachPoint:bpf.AttachPoint{Hook:1, PolicyIdx:63, Iface:"calieca2ba1ed43", LogLevel:"off"}, LogFilter:"", LogFilterIdx:-1, Type:"workload", ToOrFrom:"to", HookLayout:hook.Layout{0:45, 2:46, 3:47, 4:48, 6:49, 7:50}, HostIP:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xa, 0xdc, 0x1, 0xb}, HostTunnelIP:net.IP(nil), IntfIP:net.IP{0xa9, 0xfe, 0x1, 0x1}, FIB:true, ToHostDrop:false, DSR:false, DSROptoutCIDRs:false, TunnelMTU:0x5aa, VXLANPort:0x12bf, WgPort:0x0, ExtToServiceConnmark:0x0, PSNATStart:0x4e20, PSNATEnd:0x752f, IPv6Enabled:false, RPFEnforceOption:0x2, NATin:0x0, NATout:0x0, UDPOnly:false}
2024-06-23T18:22:34.667438952+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/delta_tracker.go 125: Set bpfMap="cali_iface" k={ifIndex: 3984} v={flags: workload,ready, XDPPolicy: -1, IngressPolicy: 63, EgressPolicy: 62, IngressFilter: -1, EgressFilter: -1, name: calieca2ba1ed43}
2024-06-23T18:22:34.667445162+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/delta_tracker.go 142: Set: Key in dataplane already, ignoring. bpfMap="cali_iface"
2024-06-23T18:22:34.667515079+03:00 stdout F 2024-06-23 15:22:34.667 [WARNING][58967] felix/bpf_ep_mgr.go 1675: Failed to apply policy to endpoint, leaving it dirty error=attaching program to wep: error attaching tc program invalid argument name="calieca2ba1ed43" wepID=orchestrator_id:"k8s" workload_id:"airflow/newfmm-db-airflow-0" endpoint_id:"eth0"
2024-06-23T18:22:34.667528735+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/caching_map.go 153: Applying deletions to DP map. name="cali_iface"
2024-06-23T18:22:34.667533857+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/caching_map.go 129: Applying updates to DP map. name="cali_iface"
2024-06-23T18:22:34.667626354+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/bpf_ep_mgr.go 1327: CompleteDeferredWork done.
2024-06-23T18:22:34.667631888+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/endpoint_mgr.go 445: Reporting endpoint status. dirtyEndpoints=set.Set{}
2024-06-23T18:22:34.667792509+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/ipsets.go 276: Asked to resync with the dataplane on next update. family="inet"
2024-06-23T18:22:34.667796401+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/ipsets.go 215: Asked to resync with the dataplane on next update.
2024-06-23T18:22:34.667798229+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/wireguard.go 656: Wireguard is not enabled, skipping sync ipVersion=0x4
2024-06-23T18:22:34.668229581+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/ipsets.go 342: Resyncing ipsets with dataplane. family="inet"
2024-06-23T18:22:34.668234483+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/delta_tracker.go 227: Loading cache of dataplane state. ipsetFamily="inet"
2024-06-23T18:22:34.668237291+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/delta_tracker.go 291: Updated dataplane state. desiredUpdates=0 inDataplaneNotDesired=0 ipsetFamily="inet" totalNumInDP=0
2024-06-23T18:22:34.668239276+03:00 stdout F 2024-06-23 15:22:34.667 [DEBUG][58967] felix/ipsets.go 243: Doing full resync of BPF IP sets map bpf family="inet"
2024-06-23T18:22:34.668246501+03:00 stdout F 2024-06-23 15:22:34.668 [DEBUG][58967] felix/syscall.go 119: Map metadata fd=0xe mapInfo=&maps.MapInfo{Type:11, KeySize:20, ValueSize:4, MaxEntries:1048576}
2024-06-23T18:22:34.669262736+03:00 stdout F 2024-06-23 15:22:34.669 [DEBUG][58967] felix/ipsets.go 406: List of ipsets: [] family="inet"
2024-06-23T18:22:34.669269369+03:00 stdout F 2024-06-23 15:22:34.669 [DEBUG][58967] felix/delta_tracker.go 227: Loading cache of dataplane state.
2024-06-23T18:22:34.669271576+03:00 stdout F 2024-06-23 15:22:34.669 [DEBUG][58967] felix/delta_tracker.go 291: Updated dataplane state. desiredUpdates=1 inDataplaneNotDesired=0 totalNumInDP=0
2024-06-23T18:22:34.669343478+03:00 stdout F 2024-06-23 15:22:34.669 [DEBUG][58967] felix/delta_tracker.go 227: Loading cache of dataplane state.
2024-06-23T18:22:34.669347348+03:00 stdout F 2024-06-23 15:22:34.669 [DEBUG][58967] felix/delta_tracker.go 291: Updated dataplane state. desiredUpdates=12 inDataplaneNotDesired=0 totalNumInDP=0
tomastigera commented 2 months ago

root@stage1-6495d69d65xsx98l-7s8td:/# ip ro get 10.222.47.32
10.222.47.32 dev calieca2ba1ed43 src 10.220.1.11 uid 0
    cache

root@stage1-6495d69d65xsx98l-7s8td:/# bpftool prog | grep calid2b

empty :(

calid2b: why are you grepping for this unrelated device?

root@stage1-6495d69d65xsx98l-7s8td:/# bpftool net| grep calieca2
calieca2ba1ed43(3984) clsact/ingress cali_tc_preambl:[400338] id 400338
calieca2ba1ed43(3984) clsact/egress cali_tc_preambl:[400339] id 400339

This looks like programs are attached. I will have a look at the logs.

xander-sh commented 2 months ago

why grepping for this unrelated device?

Sorry, that was my mistake. Here is diagnostic info from the other StatefulSet pod, newfmm-db-airflow-1, with IP 10.222.59.94 and Calico interface name cali58c64a0c8be:

❯ k -n airflow get pods -o wide newfmm-db-airflow-1
NAME                  READY   STATUS    RESTARTS   AGE     IP             NODE                            NOMINATED NODE   READINESS GATES
newfmm-db-airflow-1   1/2     Running   0          5d22h   10.222.59.94   stage1-6495d69d65xsx98l-tkf2b   <none>           <none>
root@stage1-6495d69d65xsx98l-tkf2b:/# ip ro get 10.222.59.94
10.222.59.94 dev cali58c64a0c8be src 10.220.1.3 uid 0
    cache
root@stage1-6495d69d65xsx98l-tkf2b:/# bpftool net| grep cali58c64a0c8be
cali58c64a0c8be(2817) clsact/ingress cali_tc_preambl:[9162962] id 9162962
cali58c64a0c8be(2817) clsact/egress cali_tc_preambl:[9162963] id 9162963
root@stage1-6495d69d65xsx98l-tkf2b:/# bpftool prog | grep cali58
6479986: sched_cls  name p4i_cali58c64a0  tag 0dbf0d10b02282f0
6479987: sched_cls  name p4e_cali58c64a0  tag b1d839878b133ee8
legolego621 commented 2 months ago

The problem also exists with controller and node version v3.27.3. Our problem appeared on clusters running Kubernetes 1.22 and 1.27.

ChandonPierre commented 2 months ago

We're seeing the same problem on k8s 1.29 + kernel 6.5 + Calico 3.27.2.

If this is at all helpful:

 bpftool net list | grep cali1c22a1cbbe2
cali1c22a1cbbe2(26) clsact/ingress cali_tc_preambl:[132768] id 132768
cali1c22a1cbbe2(26) clsact/egress cali_tc_preambl:[132769] id 132769

tc filter show dev cali1c22a1cbbe2    # no output

 bpftool prog show | grep -E "132768|132769" -A 4
132768: sched_cls  name cali_tc_preambl  tag 4e084811a5e4fbbf  gpl
        loaded_at 2024-07-12T11:40:22+0000  uid 0
        xlated 2144B  jited 1410B  memlock 4096B  map_ids 247,66441,739,738,66442
        btf_id 33214
132769: sched_cls  name cali_tc_preambl  tag 4e084811a5e4fbbf  gpl
        loaded_at 2024-07-12T11:40:22+0000  uid 0
        xlated 2144B  jited 1410B  memlock 4096B  map_ids 247,66443,739,738,66444
        btf_id 33215

bpftool map show | grep -E "33214|33215" -B 2
66441: array  name tc_pream.rodata  flags 0x480
        key 4B  value 220B  max_entries 1  memlock 4096B
        btf_id 33214
--
66443: array  name tc_pream.rodata  flags 0x480
        key 4B  value 220B  max_entries 1  memlock 4096B
        btf_id 33215

bpftool map dump id 66441
[{
        "value": {
            ".rodata": [{
                    "__globals": {
                        "host_ip": 2311630346,
                        "tunnel_mtu": 1500,
                        "vxlan_port": 4789,
                        "intf_ip": 16907945,
                        "ext_to_svc_mark": 0,
                        "psnat_start": 20000,
                        "psnat_len": 29999,
                        "host_tunnel_ip": 2311630346,
                        "flags": 16,
                        "wg_port": 0,
                        "__pad": 0,
                        "natin_idx": 0,
                        "natout_idx": 0,
                        "iface_name": [99,97,108,105,49,99,50,50,97,49,99,98,98,101,50,0
                        ],
                        "log_filter_jmp": 4294967295,
                        "jumps": [39,37,40,41,42,4294967295,43,44,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295
                        ],
                        "__scratch": []
                    }
                }
            ]
        }
    }
]

bpftool map dump id 66443
[{
        "value": {
            ".rodata": [{
                    "__globals": {
                        "host_ip": 2311630346,
                        "tunnel_mtu": 1500,
                        "vxlan_port": 4789,
                        "intf_ip": 16907945,
                        "ext_to_svc_mark": 0,
                        "psnat_start": 20000,
                        "psnat_len": 29999,
                        "host_tunnel_ip": 2311630346,
                        "flags": 16,
                        "wg_port": 0,
                        "__pad": 0,
                        "natin_idx": 0,
                        "natout_idx": 0,
                        "iface_name": [99,97,108,105,49,99,50,50,97,49,99,98,98,101,50,0
                        ],
                        "log_filter_jmp": 4294967295,
                        "jumps": [45,36,46,47,48,4294967295,49,50,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295,4294967295
                        ],
                        "__scratch": []
                    }
                }
            ]
        }
    }
]
legolego621 commented 1 month ago

We were able to reproduce the problem on v3.27.4 and Kubernetes 1.29.

  1. Take any pod in the cluster.
  2. Patch the pod repeatedly (we patched an annotation on the pod), for example 120 times per minute, to speed up the manifestation of the problem (see the loop sketch at the end of this comment).

The resourceVersion of the pod will be updated:

$ kubectl get pod debug-pod-0 -o yaml | grep resourceVersion
  resourceVersion: "609121899"

And the resourceVersion of the endpoint:

$ kubectl get endpoints debug-pod-0 -o yaml | grep renewTime
  renewTime: "2024-07-25T09:05:50.365691+00:00"
$ kubectl get endpoints debug-pod-0 -o yaml | grep resource
  resourceVersion: "610355031"

  3. After about a day or two, we can see a broken network in the pod, along with logs like these:

$ kubectl logs -n kube-system -l k8s-app=calico-node --tail 10000 | grep "Failed to apply policy to endpoint"
2024-07-26 06:23:18.496 [WARNING][73] felix/bpf_ep_mgr.go 1675: Failed to apply policy to endpoint, leaving it dirty error=attaching program to wep: error attaching tc program invalid argument name="calie87075f38f0" wepID=orchestrator_id:"k8s" workload_id:"test-namespace/debug-pod-0" endpoint_id:"eth0"
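
A minimal loop for step 2 (a sketch, reusing the pod and namespace names from the log line above; roughly two patches per second is about 120 per minute):

# keep bumping an annotation so the pod's resourceVersion changes constantly
i=0
while true; do
  i=$((i + 1))
  kubectl -n test-namespace annotate pod debug-pod-0 --overwrite test="value${i}"
  sleep 0.5
done
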
sridhartigera commented 1 month ago

@legolego621 what was the annotation that you patched?

legolego621 commented 1 month ago

@legolego621 what was the annotation that you patched?

Any annotation. For example, I added test=value1....n. I think it will be the same situation for labels (the main thing is to update the revision of the pod and the endpoint).

sridhartigera commented 1 month ago

Editing the pod labels works.

legolego621 commented 1 month ago

Editing the pod labels works.

The problem manifests itself after about a day when the pod is patched 150 times per minute.

legolego621 commented 1 month ago

We did this artificially to reproduce the problem.

On 29.07.2024, 13:53, "Shaun Crampton" wrote:

when the pod is patched 150 times per minute

What causes the pod to be patched 150 times per minute (or is that just your test to repro the issue)? Calico shouldn't do that, just want to check it's something that you're doing outside Calico.


elijahnyp commented 1 month ago

There are a few things that use annotations to track internal state; one of them is Patroni for HA Postgres.

fasaxc commented 1 month ago

Yep, missed your earlier comment, thanks. I think @sridhartigera has found the root cause now and he's working on a fix. Each time we attach a program, the kernel uses a new priority number and eventually it runs out of priorities.