ovn-org / ovn

Open Virtual Network

extend_table|ERR|table meter-table: out of table ids #259

Open frct1 opened 2 months ago

frct1 commented 2 months ago

Hello. We run fairly large hypervisors (hundreds of small, short-lived VMs) based on OpenStack, and we have started to see DHCP offers never reaching the tap interface, even though the logs show that a DHCPOFFER was sent. Whenever ovn-controller starts, we also see this error log line, which is probably related:

2024-09-16T20:28:49.013Z|00526|extend_table|ERR|table meter-table: out of table ids.

The strange part is that our ovn-controller version is current, and this issue should have been gone as of one of the fall 2023.* releases mentioned here, but it is not.

OpenStack is deployed using kolla-ansible (master), in which ovn-controller is at version 24.03.2 (info). Versions:

ovn-controller 24.03.2
Open vSwitch Library 3.3.0
OpenFlow versions 0x6:0x6
SB DB Schema 20.33.0
ovs-vsctl (Open vSwitch) 3.3.1
DB Schema 8.5.0

What could be a reason for this?

frct1 commented 2 months ago

CCing some folks who might have relevant experience with table sizes: @igsilya @dceara. People in the OpenStack community sent a link to this patchwork entry, which says that the group and meter tables are not limited to 16 bits.

igsilya commented 2 months ago

How many meters do you have configured in OVS? You may also run ovs-ofctl -OOpenFlow15 meter-features br-int to see how many meters your datapath supports. For the kernel datapath, the value is dynamic and depends on how much RAM the system has and some other factors, IIRC, but it's capped at 200K. For userspace datapath it is limited to 256K.

frct1 commented 2 months ago

> How many meters do you have configured in OVS? You may also run ovs-ofctl -OOpenFlow15 meter-features br-int to see how many meters your datapath supports. For the kernel datapath, the value is dynamic and depends on how much RAM the system has and some other factors, IIRC, but it's capped at 200K. For userspace datapath it is limited to 256K.

Old hypervisor where the QoS and DHCP issue is present:

# ovs-ofctl -OOpenFlow15 meter-features br-int
OFPST_METER_FEATURES reply (OF1.5) (xid=0x2):
max_meter:0 max_bands:0 max_color:0
band_types: 0
capabilities: 

# ovs-ofctl -O OpenFlow15 dump-meters br-int | grep "meter" | wc -l
0

Freshly provisioned hypervisor with OVN (no QoS or DHCP issue observed):

# ovs-ofctl -OOpenFlow15 meter-features br-int
OFPST_METER_FEATURES reply (OF1.5) (xid=0x2):
max_meter:200000 max_bands:1 max_color:0
band_types: drop
capabilities: kbps pktps burst stats

# ovs-ofctl -O OpenFlow15 dump-meters br-int | grep "meter" | wc -l
1544

1544 is close to twice the total number of ports created in OpenStack (774 * 2 = 1548), because QoS is configured for both ingress and egress.
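
The port-to-meter arithmetic can be checked with a small shell sketch. The function name meter_gap is made up for illustration, and it assumes exactly one ingress and one egress meter per port, which matches how QoS is configured here:

```shell
# Hypothetical helper: difference between the expected number of meters
# (two per port: ingress + egress) and the number actually installed.
meter_gap() {
    ports=$1
    meters=$2
    echo $(( ports * 2 - meters ))
}

# Numbers from the freshly provisioned hypervisor above.
meter_gap 774 1544   # prints 4
```

A small positive gap like this would be consistent with a few ports that have no QoS policy attached, but that is a guess and not something verified here.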

Versions are the same.

igsilya commented 2 months ago

OK. So, your issue is max_meter:0. It means your datapath (kernel?) doesn't support meters, or for some reason ovs-vswitchd thinks that the datapath doesn't support meters. What is your kernel version? Also, what does ovs-appctl dpif/show-dp-features br-int show? Are there any errors/warnings related to meters in the ovs-vswitchd.log ?
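
To spot affected hosts quickly, the max_meter check can be scripted. This is only a rough sketch: the function names max_meter and check_meters are made up, and it assumes the meter-features output has the same shape as the replies pasted earlier in this thread:

```shell
# Extract the max_meter value from `ovs-ofctl meter-features` output on stdin.
max_meter() {
    sed -n 's/.*max_meter:\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Report whether the datapath advertises any meter support.
check_meters() {
    n=$(max_meter)
    if [ -z "$n" ]; then
        echo "unknown: no max_meter field found"
    elif [ "$n" -eq 0 ]; then
        echo "unsupported: max_meter is 0"
    else
        echo "supported: up to $n meters"
    fi
}

# Usage on a live host (bridge name br-int as in this thread):
#   ovs-ofctl -OOpenFlow15 meter-features br-int | check_meters
```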

frct1 commented 2 months ago

Kernel 5.15 is used across all hypervisors: 5.15.0-107-generic and 5.15.0-122-generic.

> what does ovs-appctl dpif/show-dp-features br-int show

Freshly provisioned:

Masked set action: Yes
Tunnel push pop: No
Ufid: Yes
Truncate action: Yes
Clone action: Yes
Sample nesting: 10
Conntrack eventmask: Yes
Conntrack clear: Yes
Max dp_hash algorithm: 0
Check pkt length action: Yes
Conntrack timeout policy: Yes
Explicit Drop action: No
Optimized Balance TCP mode: No
Conntrack all-zero IP SNAT: Yes
MPLS Label add: Yes
Max VLAN headers: 2
Max MPLS depth: 3
Recirc: Yes
CT state: Yes
CT zone: Yes
CT mark: Yes
CT label: Yes
CT state NAT: Yes
CT orig tuple: Yes
CT orig tuple for IPv6: Yes
IPv6 ND Extension: No

Where the issue is observed:

Masked set action: Yes
Tunnel push pop: No
Ufid: Yes
Truncate action: Yes
Clone action: Yes
Sample nesting: 10
Conntrack eventmask: Yes
Conntrack clear: Yes
Max dp_hash algorithm: 0
Check pkt length action: Yes
Conntrack timeout policy: Yes
Explicit Drop action: No
Optimized Balance TCP mode: No
Conntrack all-zero IP SNAT: Yes
MPLS Label add: Yes
Max VLAN headers: 2
Max MPLS depth: 3
Recirc: Yes
CT state: Yes
CT zone: Yes
CT mark: Yes
CT label: Yes
CT state NAT: Yes
CT orig tuple: Yes
CT orig tuple for IPv6: Yes
IPv6 ND Extension: No
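
The two listings above look the same line for line. One way to confirm that is to save each host's output to a file and diff them. A minimal sketch; the function name and the file names in the usage comment are hypothetical:

```shell
# Compare two saved `ovs-appctl dpif/show-dp-features br-int` outputs.
# Prints "identical" when the files match; otherwise prints "different"
# followed by the diff.
compare_dp_features() {
    if diff "$1" "$2" >/dev/null; then
        echo "identical"
    else
        echo "different"
        diff "$1" "$2"
    fi
}

# Usage (file names are placeholders):
#   compare_dp_features fresh-host.txt broken-host.txt
```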

> Are there any errors/warnings related to meters in the ovs-vswitchd.log ?

Yep, I grepped the log and there are.

First hypervisor with broken metering feature:

2024-09-13T14:11:21.894Z|379262|coverage|INFO|dpif_meter_set             0.0/sec     0.000/sec        0.0000/sec   total: 9658
2024-09-13T14:11:21.894Z|379263|coverage|INFO|dpif_meter_del             0.0/sec     0.000/sec        0.0000/sec   total: 8160
2024-09-13T14:19:31.684Z|00032|dpif_netlink|INFO|dpif_netlink_meter_transact OVS_METER_CMD_SET failed
2024-09-13T14:19:31.684Z|00033|dpif_netlink|INFO|dpif_netlink_meter_transact OVS_METER_CMD_SET failed
2024-09-13T14:19:31.684Z|00034|dpif_netlink|INFO|dpif_netlink_meter_transact get failed
2024-09-13T14:19:31.684Z|00035|dpif_netlink|INFO|The kernel module has a broken meter implementation.
2024-09-13T14:44:42.548Z|00032|dpif_netlink|INFO|dpif_netlink_meter_transact OVS_METER_CMD_SET failed
2024-09-13T14:44:42.548Z|00033|dpif_netlink|INFO|dpif_netlink_meter_transact OVS_METER_CMD_SET failed
2024-09-13T14:44:42.548Z|00034|dpif_netlink|INFO|dpif_netlink_meter_transact get failed
2024-09-13T14:44:42.548Z|00035|dpif_netlink|INFO|The kernel module has a broken meter implementation.

Second hypervisor with broken metering:

2024-09-16T20:28:46.386Z|00032|dpif_netlink|INFO|dpif_netlink_meter_transact OVS_METER_CMD_SET failed
2024-09-16T20:28:46.386Z|00033|dpif_netlink|INFO|dpif_netlink_meter_transact OVS_METER_CMD_SET failed
2024-09-16T20:28:46.386Z|00034|dpif_netlink|INFO|dpif_netlink_meter_transact get failed
2024-09-16T20:28:46.386Z|00035|dpif_netlink|INFO|The kernel module has a broken meter implementation.

September 13 is the first day the metering issue appeared, so metering probably broke that day for some reason.
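
For anyone who wants to find the first occurrence on their own hosts, here is a rough sketch that pulls the timestamp of the earliest "broken meter implementation" message from a log. The function name is made up, and the log path in the usage comment is an assumption (kolla-ansible may place the file elsewhere):

```shell
# Print the timestamp of the first "broken meter implementation" message
# in ovs-vswitchd log text read from stdin. Prints nothing if there is none.
first_broken_meter() {
    grep -F "The kernel module has a broken meter implementation" \
        | head -n 1 \
        | cut -d'|' -f1
}

# Usage (path is an assumption, adjust for your deployment):
#   first_broken_meter < /var/log/openvswitch/ovs-vswitchd.log
```

Rotated logs would need to be checked as well, since the first occurrence may no longer be in the current file.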