garloff opened 5 months ago
Looking inside the octavia_api container, I can see the octavia-api apache2 process leaking file descriptors at a rate of ~140 per hour (when confronted with the openstack-health-monitor workload). So if my theory is correct that some process is limited to 1024 of them, we have roughly 7 hours before we crash (1024 / 140 ≈ 7.3), which is consistent with my observation.
The limit is likely hard-coded (in Python) and cannot be changed (I checked ulimit -n; it's > 1024 both inside and outside the containers). This means we have to find the leak in the API process, or in python3-ovs in poller.py, the only place where select() is used.
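For reference, that 1024 limit is FD_SETSIZE, which select() enforces regardless of the rlimit. A quick Python check (a sketch, assuming ulimit -n is above 1024 so the dup() calls succeed):

import os
import select

# Push the highest fd number past 1023, then hand it to select():
fds = [os.dup(0) for _ in range(1025)]
try:
    select.select([fds[-1]], [], [], 0)
except ValueError as e:
    print(e)   # "filedescriptor out of range in select()"
finally:
    for fd in fds:
        os.close(fd)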
--> https://github.com/openvswitch/ovs/blob/main/python/ovs/poller.py#L64
# eventlet/gevent doesn't support select.poll. If select.poll is used,
# python interpreter is blocked as a whole instead of switching from the
# current thread that is about to block to other runnable thread.
# So emulate select.poll by select.select because using python means that
# performance isn't so important.
---> https://github.com/openvswitch/ovs/blob/main/python/ovs/poller.py#L270-L290
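Condensed, the emulation amounts to something like the following (a simplified sketch of the ovs/poller.py idea, not a verbatim copy), which is exactly what ties the process to FD_SETSIZE:

import select

class SelectBackedPoll:
    """poll()-style interface implemented on top of select.select()."""
    def __init__(self):
        self.rlist, self.wlist = [], []

    def register(self, fd, events):
        if events & select.POLLIN:
            self.rlist.append(fd)
        if events & select.POLLOUT:
            self.wlist.append(fd)

    def poll(self, timeout_ms):
        timeout = (None if timeout_ms is None or timeout_ms < 0
                   else timeout_ms / 1000.0)
        # select.select() raises ValueError as soon as any registered
        # fd is >= FD_SETSIZE (1024) -- the crash we see below.
        r, w, _ = select.select(self.rlist, self.wlist, [], timeout)
        return ([(fd, select.POLLIN) for fd in r]
                + [(fd, select.POLLOUT) for fd in w])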
It's an epoll_create1() system call that is being invoked by Python code without the resulting file descriptor ever being closed. It looks to me like whenever the set of resources that belongs to a load balancer changes, the set of pollable events changes, and a new epoll_create1() call is made so the code can receive events. This creates a new [eventpoll] file descriptor. However, the old one is not closed, so we have our leak. Probably some incorrect usage of ovs/poller.py where we forget to close() some fd, maybe the one passed to fd_wait().
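To illustrate what I suspect is happening (a hypothetical sketch, not the actual ovn-octavia-provider code): a fresh epoll object is created on every change, while the superseded one stays referenced and is never close()d, so each cycle leaks one [eventpoll] fd.

import os
import select

registry = []
for _ in range(5):
    ep = select.epoll()    # epoll_create1() under the hood
    registry.append(ep)    # old instances stay referenced, fds stay open

print(len(os.listdir(f"/proc/{os.getpid()}/fd")))  # grows by one per cycle
# The fix would be to call close() on the superseded instance (or use
# contextlib.closing) before creating the next one.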
Upstream bug report at https://bugs.launchpad.net/octavia/+bug/2063043
I haven't studied the logic of eventlet ... very deeply. Are we just suffering from creating threads that should exit (and thereby clean up their file descriptors)? Looking at the list of file descriptors:
[...]
apache2 2625091 octavia 21u IPv4 880892680 0t0 TCP api.in-a-box.cloud:37214->api.in-a-box.cloud:5000 (CLOSE_WAIT)
apache2 2625091 octavia 22u IPv4 880928052 0t0 TCP api.in-a-box.cloud:53454->api.in-a-box.cloud:mysql (ESTABLISHED)
apache2 2625091 octavia 23r FIFO 0,14 0t0 880607178 pipe
apache2 2625091 octavia 24w FIFO 0,14 0t0 880607178 pipe
apache2 2625091 octavia 25u sock 0,8 0t0 880607182 protocol: TCP
apache2 2625091 octavia 26u IPv4 880893583 0t0 TCP api.in-a-box.cloud:39068->api.in-a-box.cloud:9696 (CLOSE_WAIT)
apache2 2625091 octavia 27u sock 0,8 0t0 880607189 protocol: TCP
apache2 2625091 octavia 28u sock 0,8 0t0 880732423 protocol: TCP
apache2 2625091 octavia 29u sock 0,8 0t0 880724669 protocol: TCP
apache2 2625091 octavia 30u sock 0,8 0t0 880716764 protocol: TCP
apache2 2625091 octavia 31u sock 0,8 0t0 880644259 protocol: TCP
apache2 2625091 octavia 32u IPv4 880939067 0t0 TCP api.in-a-box.cloud:50304->api.in-a-box.cloud:9696 (ESTABLISHED)
apache2 2625091 octavia 33u a_inode 0,15 0 68 [eventpoll]
apache2 2625091 octavia 34u sock 0,8 0t0 880823397 protocol: TCP
apache2 2625091 octavia 35u a_inode 0,15 0 68 [eventpoll]
apache2 2625091 octavia 36u IPv4 880893589 0t0 TCP api.in-a-box.cloud:39094->api.in-a-box.cloud:9696 (CLOSE_WAIT)
apache2 2625091 octavia 37u a_inode 0,15 0 68 [eventpoll]
apache2 2625091 octavia 38u IPv4 880891827 0t0 TCP api.in-a-box.cloud:39140->api.in-a-box.cloud:9696 (CLOSE_WAIT)
apache2 2625091 octavia 39u a_inode 0,15 0 68 [eventpoll]
apache2 2625091 octavia 40u a_inode 0,15 0 68 [eventpoll]
apache2 2625091 octavia 41u a_inode 0,15 0 68 [eventpoll]
apache2 2625091 octavia 42u a_inode 0,15 0 68 [eventpoll]
apache2 2625091 octavia 43u IPv4 880886692 0t0 TCP api.in-a-box.cloud:54838->api.in-a-box.cloud:9696 (CLOSE_WAIT)
apache2 2625091 octavia 44u a_inode 0,15 0 68 [eventpoll]
apache2 2625091 octavia 45u IPv4 880895843 0t0 TCP api.in-a-box.cloud:54860->api.in-a-box.cloud:9696 (CLOSE_WAIT)
apache2 2625091 octavia 46u a_inode 0,15 0 68 [eventpoll]
apache2 2625091 octavia 48u a_inode 0,15 0 68 [eventpoll]
apache2 2625091 octavia 50u a_inode 0,15 0 68 [eventpoll]
apache2 2625091 octavia 52u a_inode 0,15 0 68 [eventpoll]
apache2 2625091 octavia 53u a_inode 0,15 0 68 [eventpoll]
I see not only the leaked [eventpoll] entries but also the occasional stale TCP connection (CLOSE_WAIT) in there.
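To track the leak without lsof, a small helper (assuming the Linux /proc layout; 2625091 is the apache2 worker PID from the listing above) can count the [eventpoll] descriptors directly:

import os

def count_eventpoll(pid):
    fd_dir = f"/proc/{pid}/fd"
    count = 0
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue   # fd vanished while we were iterating
        if target == "anon_inode:[eventpoll]":
            count += 1
    return count

print(count_eventpoll(2625091))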
Testcase:
#!/bin/bash
#
# Test case to reproduce the Bobcat octavia-api OVN fd leak
# OSISM #959: https://github.com/osism/issues/issues/959
# (c) Kurt Garloff <kurt@garloff.de>, 5/2024
# SPDX-License-Identifier: Apache-2.0
#
# This runs an endless loop that creates and deletes LB members,
# triggering a file descriptor leak in octavia-api.
# Run with option -d for debugging
# Run with -u to clean up leftovers from a previous run and exit
# Hit ^C (ideally during the sleep after Member2 has been created)
# to exit with cleanup.
CAT=$(openstack catalog list)
if test $? != 0; then echo "Configure OS_CLOUD and openstack client properly"; exit 1; fi
set -e
if test "$1" = "-d"; then shift; set -x; fi
STOP=0
interrupt()
{
    echo "Interrupt (^C), cleaning up"
    STOP=1
    set +e
}
trap interrupt SIGINT
cleanup()
{
    set +e
    RC=0
    echo -en "Clean LB TESTLB959 "
    openstack loadbalancer delete --cascade TESTLB959 --wait
    #let RC+=$?
    echo -en "\nClean Subnet TESTSUBNET959 "
    openstack subnet delete TESTSUBNET959
    let RC+=$?
    echo -en "\nClean Network TESTNET959 "
    openstack network delete TESTNET959
    let RC+=$?
    echo
    return $RC
}
if test "$1" = "-u"; then cleanup; exit; fi
NET=$(openstack network create TESTNET959 -f value -c id)
echo "Network $NET"
SUBNET=$(openstack subnet create --network TESTNET959 --subnet-range 192.168.59.0/24 TESTSUBNET959 -f value -c id)
echo "Subnet $SUBNET"
LB=$(openstack loadbalancer create --provider ovn --vip-subnet-id=$SUBNET --name TESTLB959 --wait -f value -c id)
echo "LB $LB"
POOL=$(openstack loadbalancer pool create --loadbalancer TESTLB959 --name TESTPOOL959 --protocol TCP --lb-algorithm SOURCE_IP_PORT --wait -f value -c id)
echo "Pool $POOL"
LISTENER=$(openstack loadbalancer listener create --name TESTLISTENER959 --default-pool TESTPOOL959 --protocol TCP --protocol-port 80 TESTLB959 --wait -f value -c id)
echo "Listener $LISTENER"
#HM=$(openstack loadbalancer healthmonitor create --name TESTHM959 --delay 3 --timeout 2 --max-retries 1 --max-retries-down 1 --type TCP TESTPOOL959 --wait -f value -c id)
#echo "HM $HM"
while test "$STOP" = "0"; do
    echo -en "Member 1: "
    MEM1=$(openstack loadbalancer member create --name TESTMEMBER959.1 --address 192.168.59.17 --subnet-id $SUBNET --protocol-port 80 TESTPOOL959 --wait -f value -c id)
    echo -en "$MEM1\nMember 2: "
    MEM2=$(openstack loadbalancer member create --name TESTMEMBER959.2 --address 192.168.59.18 --subnet-id $SUBNET --protocol-port 80 TESTPOOL959 --wait -f value -c id)
    echo "$MEM2"
    sleep 1
    echo -en "Del Member1 $MEM1 "
    openstack loadbalancer member delete TESTPOOL959 $MEM1 --wait
    echo -en "\nDel Member2 $MEM2 "
    openstack loadbalancer member delete TESTPOOL959 $MEM2 --wait
    echo
    sleep 1
done
cleanup
This test script causes my octavia-api service (Bobcat via kolla-ansible) to leak file descriptors at a rate of up to ~30/min. Interestingly, the behavior is bursty. I see lots of pipe file descriptors in the lsof listings. Many (all?) of the pipes go away after a while, so cleanup does happen for them after a few minutes. Once many pipes have been created again and we get near the bottom of the fd list, the highest FD starts increasing again.
The script exhausts the two octavia-api processes' FDs within less than an hour. One can see that the pipe FDs are regularly cleaned up, while the FDs for long-gone connections (socket) and the FDs for eventpoll keep increasing linearly. (We see the sum from the two octavia-api processes.) After a pipe cleanup, the kernel first reuses the freed FDs before the highest FD needs to grow further. Misc are unused FDs plus the handful of FDs (e.g. regular files) that don't fit the classification used here:
The startup phase, zoomed in (eventpoll seems to be roughly 2x the sum of conn+sock):
I'm poking upstream to hopefully get someone to look at this issue (last reply from May 2024).
After a few hours of running openstack-health-monitor against a CiaB-7.0.1 system with the octavia OVN provider (options -LO and -LR, the latter to work around osism issue #921), the octavia_api container gets unhealthy and /var/log/kolla/octavia/octavia-api.log says:
ValueError: filedescriptor out of range in select()
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/var/lib/kolla/venv/lib/python3.10/site-packages/octavia/api/drivers/driver_factory.py", line 41, in get_driver driver = stevedore_driver.DriverManager(
File "/var/lib/kolla/venv/lib/python3.10/site-packages/stevedore/driver.py", line 54, in init super(DriverManager, self).init(
File "/var/lib/kolla/venv/lib/python3.10/site-packages/stevedore/named.py", line 78, in init extensions = self._load_plugins(invoke_on_load,
File "/var/lib/kolla/venv/lib/python3.10/site-packages/stevedore/extension.py", line 218, in _load_plugins self._on_load_failure_callback(self, ep, err)
File "/var/lib/kolla/venv/lib/python3.10/site-packages/stevedore/extension.py", line 206, in _load_plugins ext = self._load_one_plugin(ep,
File "/var/lib/kolla/venv/lib/python3.10/site-packages/stevedore/named.py", line 156, in _load_one_plugin return super(NamedExtensionManager, self)._load_one_plugin(
File "/var/lib/kolla/venv/lib/python3.10/site-packages/stevedore/extension.py", line 242, in _load_one_plugin obj = plugin(*invoke_args, **invoke_kwds)
File "/var/lib/kolla/venv/lib/python3.10/site-packages/ovn_octavia_provider/driver.py", line 48, in init self._ovn_helper = ovn_helper.OvnProviderHelper()
File "/var/lib/kolla/venv/lib/python3.10/site-packages/ovn_octavia_provider/helper.py", line 62, in init self.ovn_nbdb_api = self.ovn_nbdb.start()
File "/var/lib/kolla/venv/lib/python3.10/site-packages/ovn_octavia_provider/ovsdb/impl_idl_ovn.py", line 209, in start return impl_idl_ovn.OvsdbNbOvnIdl(self.conn)
File "/var/lib/kolla/venv/lib/python3.10/site-packages/ovn_octavia_provider/ovsdb/impl_idl_ovn.py", line 142, in init super().init(connection)
File "/var/lib/kolla/venv/lib/python3.10/site-packages/ovn_octavia_provider/ovsdb/impl_idl_ovn.py", line 61, in init super().init(connection)
File "/var/lib/kolla/venv/lib/python3.10/site-packages/ovsdbapp/backend/ovs_idl/init.py", line 40, in init self.start_connection(connection)
ovn_octavia_provider.ovsdb.impl_idl_ovn.OvsdbConnectionUnavailable: OVS database connection to OVN_Northbound failed with error: 'filedescriptor out of range in select()'. Verify that the OVS and OVN services are available and that the 'ovn_nb_connection' and 'ovn_sb_connection' configuration options are correct.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/var/lib/kolla/venv/lib/python3.10/site-packages/wsmeext/pecan.py", line 82, in callfunction result = f(self, *args, **kwargs)
File "/var/lib/kolla/venv/lib/python3.10/site-packages/octavia/api/v2/controllers/member.py", line 163, in post driver = driver_factory.get_driver(provider)
File "/var/lib/kolla/venv/lib/python3.10/site-packages/octavia/api/drivers/driver_factory.py", line 49, in get_driver raise exceptions.ProviderNotFound(prov=provider)
octavia.common.exceptions.ProviderNotFound: Provider 'ovn' was not found. : octavia.common.exceptions.ProviderNotFound: Provider 'ovn' was not found. 2024-04-09 03:01:27.360 733 INFO octavia.api.v2.controllers.listener [None req-f2e033ec-1f78-4862-87b1-75216a83d01e - 2767ef0256804b92ae0e51e3a99f809a - - 93db18cb391748009639058c52577527 93db18cb391748009639058c52577527] Sending delete Listener 9ba08cff-35f5-47b2-a36b-67af56b9f3fc to provider ovn 2024-04-09 03:01:29.143 733 INFO octavia.api.v2.controllers.pool [None req-cce27229-f111-4c2f-a8d2-d567e149baf0 - 2767ef0256804b92ae0e51e3a99f809a - - 93db18cb391748009639058c52577527 93db18cb391748009639058c52577527] Sending delete Pool c517a428-394d-447e-a230-d2c2df965685 to provider ovn 2024-04-09 03:01:39.790 733 INFO octavia.api.v2.controllers.load_balancer [None req-7502dfd2-75b3-41ea-8958-a6e78a069518 - 2767ef0256804b92ae0e51e3a99f809a - - 93db18cb391748009639058c52577527 93db18cb391748009639058c52577527] Sending delete Load Balancer d5ae73d4-4aa4-4cbb-a947-e291446b2ca4 to provider ovn