osism / issues

This repository is used for bug reports that are cross-project or not bound to a specific repository (or to an unknown repository).
https://www.osism.tech

octavia/ovn: filedescriptor out of range in select() #959

Open garloff opened 5 months ago

garloff commented 5 months ago

After a few hours of running openstack-health-monitor against a CiaB-7.0.1 system with octavia OVN provider (options -LO and -LR, the latter to work around osism issue #921):

ValueError: filedescriptor out of range in select()

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/kolla/venv/lib/python3.10/site-packages/octavia/api/drivers/driver_factory.py", line 41, in get_driver
    driver = stevedore_driver.DriverManager(
  File "/var/lib/kolla/venv/lib/python3.10/site-packages/stevedore/driver.py", line 54, in __init__
    super(DriverManager, self).__init__(
  File "/var/lib/kolla/venv/lib/python3.10/site-packages/stevedore/named.py", line 78, in __init__
    extensions = self._load_plugins(invoke_on_load,
  File "/var/lib/kolla/venv/lib/python3.10/site-packages/stevedore/extension.py", line 218, in _load_plugins
    self._on_load_failure_callback(self, ep, err)
  File "/var/lib/kolla/venv/lib/python3.10/site-packages/stevedore/extension.py", line 206, in _load_plugins
    ext = self._load_one_plugin(ep,
  File "/var/lib/kolla/venv/lib/python3.10/site-packages/stevedore/named.py", line 156, in _load_one_plugin
    return super(NamedExtensionManager, self)._load_one_plugin(
  File "/var/lib/kolla/venv/lib/python3.10/site-packages/stevedore/extension.py", line 242, in _load_one_plugin
    obj = plugin(*invoke_args, **invoke_kwds)
  File "/var/lib/kolla/venv/lib/python3.10/site-packages/ovn_octavia_provider/driver.py", line 48, in __init__
    self._ovn_helper = ovn_helper.OvnProviderHelper()
  File "/var/lib/kolla/venv/lib/python3.10/site-packages/ovn_octavia_provider/helper.py", line 62, in __init__
    self.ovn_nbdb_api = self.ovn_nbdb.start()
  File "/var/lib/kolla/venv/lib/python3.10/site-packages/ovn_octavia_provider/ovsdb/impl_idl_ovn.py", line 209, in start
    return impl_idl_ovn.OvsdbNbOvnIdl(self.conn)
  File "/var/lib/kolla/venv/lib/python3.10/site-packages/ovn_octavia_provider/ovsdb/impl_idl_ovn.py", line 142, in __init__
    super().__init__(connection)
  File "/var/lib/kolla/venv/lib/python3.10/site-packages/ovn_octavia_provider/ovsdb/impl_idl_ovn.py", line 61, in __init__
    super().__init__(connection)
  File "/var/lib/kolla/venv/lib/python3.10/site-packages/ovsdbapp/backend/ovs_idl/__init__.py", line 40, in __init__
    self.start_connection(connection)
    raise connection_exception from e
ovn_octavia_provider.ovsdb.impl_idl_ovn.OvsdbConnectionUnavailable: OVS database connection to OVN_Northbound failed with error: 'filedescriptor out of range in select()'. Verify that the OVS and OVN services are available and that the 'ovn_nb_connection' and 'ovn_sb_connection' configuration options are correct.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

File "/var/lib/kolla/venv/lib/python3.10/site-packages/wsmeext/pecan.py", line 82, in callfunction result = f(self, *args, **kwargs)

File "/var/lib/kolla/venv/lib/python3.10/site-packages/octavia/api/v2/controllers/member.py", line 163, in post driver = driver_factory.get_driver(provider)

File "/var/lib/kolla/venv/lib/python3.10/site-packages/octavia/api/drivers/driver_factory.py", line 49, in get_driver raise exceptions.ProviderNotFound(prov=provider)

octavia.common.exceptions.ProviderNotFound: Provider 'ovn' was not found. : octavia.common.exceptions.ProviderNotFound: Provider 'ovn' was not found. 2024-04-09 03:01:27.360 733 INFO octavia.api.v2.controllers.listener [None req-f2e033ec-1f78-4862-87b1-75216a83d01e - 2767ef0256804b92ae0e51e3a99f809a - - 93db18cb391748009639058c52577527 93db18cb391748009639058c52577527] Sending delete Listener 9ba08cff-35f5-47b2-a36b-67af56b9f3fc to provider ovn 2024-04-09 03:01:29.143 733 INFO octavia.api.v2.controllers.pool [None req-cce27229-f111-4c2f-a8d2-d567e149baf0 - 2767ef0256804b92ae0e51e3a99f809a - - 93db18cb391748009639058c52577527 93db18cb391748009639058c52577527] Sending delete Pool c517a428-394d-447e-a230-d2c2df965685 to provider ovn 2024-04-09 03:01:39.790 733 INFO octavia.api.v2.controllers.load_balancer [None req-7502dfd2-75b3-41ea-8958-a6e78a069518 - 2767ef0256804b92ae0e51e3a99f809a - - 93db18cb391748009639058c52577527 93db18cb391748009639058c52577527] Sending delete Load Balancer d5ae73d4-4aa4-4cbb-a947-e291446b2ca4 to provider ovn


* Octavia no longer works; restarting the container helps, though.

Maybe we are indeed running out of file descriptors that can be selected on? Two decades ago, the `FD_SETSIZE` usable for `select()` in glibc was only 1024, and the error message reminds me of that situation.
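
For context, a minimal sketch (not from the issue; the use of /dev/null and the fd counts are illustrative assumptions) that reproduces this ValueError once a descriptor number exceeds glibc's FD_SETSIZE of 1024:

# Sketch: trigger "filedescriptor out of range in select()" by pushing new fd
# numbers past FD_SETSIZE (1024 on glibc). Needs RLIMIT_NOFILE above ~1100.
import os
import resource
import select
import socket

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
if soft < 2048 and (hard == resource.RLIM_INFINITY or hard >= 2048):
    resource.setrlimit(resource.RLIMIT_NOFILE, (2048, hard))   # raise soft limit

placeholders = [os.open("/dev/null", os.O_RDONLY) for _ in range(1100)]
sock = socket.socket()               # this socket now gets an fd number > 1023
try:
    select.select([sock], [], [], 0)
except ValueError as exc:
    print(exc)                       # filedescriptor out of range in select()
finally:
    sock.close()
    for fd in placeholders:
        os.close(fd)
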
garloff commented 5 months ago

Looking inside the octavia_api container, I can see the octavia-api apache2 process leaking file descriptors at a rate of ~140 per hour (when confronted with the openstack-health-monitor workload). So if my theory is correct that the process is limited to 1024 of them, we have roughly 7 hours before we crash, which is consistent with my observation.
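
The rate can also be sampled without lsof; here is a small, hypothetical monitoring helper (not from the issue; the PID is taken from the lsof listing further down, the interval is an arbitrary assumption) that counts entries in /proc/<pid>/fd:

# Hypothetical helper: sample a process's open-fd count via /proc and print an
# estimated leak rate per hour. Needs to run as the same user as the process
# (or as root) to read /proc/<pid>/fd.
import os
import time

PID = 2625091        # example: the apache2/octavia-api PID from the lsof listing
INTERVAL = 60        # seconds between samples

def open_fds(pid: int) -> int:
    return len(os.listdir(f"/proc/{pid}/fd"))

previous = open_fds(PID)
while True:
    time.sleep(INTERVAL)
    current = open_fds(PID)
    rate = (current - previous) * 3600 / INTERVAL
    print(f"{time.strftime('%H:%M:%S')} open fds: {current} (~{rate:+.0f}/h)")
    previous = current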

berendt commented 5 months ago

The limit is likely hard-coded (in Python) and cannot be changed (I checked ulimit -n; it is > 1024 both inside and outside the containers). This means we have to find the leak in the API process, or in python3-ovs in poller.py: select() is used only there.
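
A quick check from Python (a sketch, not from the issue) illustrates that RLIMIT_NOFILE, which ulimit -n reports, is independent of the FD_SETSIZE cap on select.select():

# Sketch: the process fd limit is not what caps select(); glibc's FD_SETSIZE
# (1024) is, and it only affects select.select(), not poll()/epoll().
import resource
import select

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE (ulimit -n): soft={soft} hard={hard}")   # typically well above 1024

# poll()/epoll() have no FD_SETSIZE cap, but ovs/poller.py falls back to
# select.select() when running under eventlet (see the code comment quoted below).
print("select.poll available: ", hasattr(select, "poll"))
print("select.epoll available:", hasattr(select, "epoll"))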

berendt commented 5 months ago

--> https://github.com/openvswitch/ovs/blob/main/python/ovs/poller.py#L64

# eventlet/gevent doesn't support select.poll. If select.poll is used,
# python interpreter is blocked as a whole instead of switching from the
# current thread that is about to block to other runnable thread.
# So emulate select.poll by select.select because using python means that
# performance isn't so important.

---> https://github.com/openvswitch/ovs/blob/main/python/ovs/poller.py#L270-L290

garloff commented 5 months ago

It's an epoll_create1() system call that is invoked by python3 code, and the resulting file descriptor is never closed. It looks to me like whenever the set of resources belonging to a loadbalancer changes, the set of pollable events changes and a new epoll_create1() call is made so the code can receive events. This creates a new [eventpoll] file descriptor; the old one, however, is not closed, so we have our leak.
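
To illustrate that pattern, a minimal sketch (purely illustrative, not the actual provider code): each unclosed select.epoll() object holds on to one [eventpoll] descriptor, exactly the kind of entry that piles up in the lsof listing later in this thread.

# Sketch: select.epoll() performs an epoll_create1() syscall and keeps an
# anonymous [eventpoll] fd open until .close() is called.
import os
import select

def open_fd_count() -> int:
    return len(os.listdir("/proc/self/fd"))

leaked = []
print("before:", open_fd_count())
for _ in range(5):
    ep = select.epoll()      # epoll_create1() -> new [eventpoll] fd
    leaked.append(ep)        # kept around without close(): the leak pattern
print("after: ", open_fd_count())   # grew by 5

for ep in leaked:
    ep.close()               # closing the epoll object releases the fd
print("closed:", open_fd_count())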

garloff commented 5 months ago

Probably some incorrect usage of ovs/poller.py where we forget to close() some fd; maybe the one passed to fd_wait().

garloff commented 5 months ago

Upstream bug report at https://bugs.launchpad.net/octavia/+bug/2063043

garloff commented 5 months ago

I haven't studied the logic of eventlet ... very deeply. Are we simply suffering from creating threads that should exit (and thereby clean up their file descriptors) but never do? Looking at the list of file descriptors:

[...]
apache2 2625091 octavia   21u     IPv4          880892680      0t0       TCP api.in-a-box.cloud:37214->api.in-a-box.cloud:5000 (CLOSE_WAIT)
apache2 2625091 octavia   22u     IPv4          880928052      0t0       TCP api.in-a-box.cloud:53454->api.in-a-box.cloud:mysql (ESTABLISHED)
apache2 2625091 octavia   23r     FIFO               0,14      0t0 880607178 pipe
apache2 2625091 octavia   24w     FIFO               0,14      0t0 880607178 pipe
apache2 2625091 octavia   25u     sock                0,8      0t0 880607182 protocol: TCP
apache2 2625091 octavia   26u     IPv4          880893583      0t0       TCP api.in-a-box.cloud:39068->api.in-a-box.cloud:9696 (CLOSE_WAIT)
apache2 2625091 octavia   27u     sock                0,8      0t0 880607189 protocol: TCP
apache2 2625091 octavia   28u     sock                0,8      0t0 880732423 protocol: TCP
apache2 2625091 octavia   29u     sock                0,8      0t0 880724669 protocol: TCP
apache2 2625091 octavia   30u     sock                0,8      0t0 880716764 protocol: TCP
apache2 2625091 octavia   31u     sock                0,8      0t0 880644259 protocol: TCP
apache2 2625091 octavia   32u     IPv4          880939067      0t0       TCP api.in-a-box.cloud:50304->api.in-a-box.cloud:9696 (ESTABLISHED)
apache2 2625091 octavia   33u  a_inode               0,15        0        68 [eventpoll]
apache2 2625091 octavia   34u     sock                0,8      0t0 880823397 protocol: TCP
apache2 2625091 octavia   35u  a_inode               0,15        0        68 [eventpoll]
apache2 2625091 octavia   36u     IPv4          880893589      0t0       TCP api.in-a-box.cloud:39094->api.in-a-box.cloud:9696 (CLOSE_WAIT)
apache2 2625091 octavia   37u  a_inode               0,15        0        68 [eventpoll]
apache2 2625091 octavia   38u     IPv4          880891827      0t0       TCP api.in-a-box.cloud:39140->api.in-a-box.cloud:9696 (CLOSE_WAIT)
apache2 2625091 octavia   39u  a_inode               0,15        0        68 [eventpoll]
apache2 2625091 octavia   40u  a_inode               0,15        0        68 [eventpoll]
apache2 2625091 octavia   41u  a_inode               0,15        0        68 [eventpoll]
apache2 2625091 octavia   42u  a_inode               0,15        0        68 [eventpoll]
apache2 2625091 octavia   43u     IPv4          880886692      0t0       TCP api.in-a-box.cloud:54838->api.in-a-box.cloud:9696 (CLOSE_WAIT)
apache2 2625091 octavia   44u  a_inode               0,15        0        68 [eventpoll]
apache2 2625091 octavia   45u     IPv4          880895843      0t0       TCP api.in-a-box.cloud:54860->api.in-a-box.cloud:9696 (CLOSE_WAIT)
apache2 2625091 octavia   46u  a_inode               0,15        0        68 [eventpoll]
apache2 2625091 octavia   48u  a_inode               0,15        0        68 [eventpoll]
apache2 2625091 octavia   50u  a_inode               0,15        0        68 [eventpoll]
apache2 2625091 octavia   52u  a_inode               0,15        0        68 [eventpoll]
apache2 2625091 octavia   53u  a_inode               0,15        0        68 [eventpoll]

I see not only the leaked [eventpoll] descriptors but also the occasional lingering TCP connection in there.

garloff commented 4 months ago

Testcase:

#!/bin/bash
#
# Test case to reproduce the bobcat octavia-api ovn fd leak
# OSISM #959: https://github.com/osism/issues/issues/959
# (c) Kurt Garloff <kurt@garloff.de>, 5/2024
# SPDX-License-Identifier: Apache-2.0
#
# This runs an endless loop that creates and deletes LB members,
# triggering a file descriptor leak in octavia-api.
# Run with option -d for debugging.
# Run with -u to clean up leftovers from a previous run and exit.
# Hit ^C (ideally during the sleep after Member2 has been created)
# to exit with cleanup.

CAT=$(openstack catalog list)
if test $? != 0; then echo "Configure OS_CLOUD and openstack client properly"; exit 1; fi

set -e
if test "$1" = "-d"; then shift; set -x; fi

STOP=0
interrupt()
{
        echo "Interrupt (^C), cleaning up"
        STOP=1
        set +e
}
trap interrupt SIGINT

cleanup()
{
        set +e
        RC=0
        echo -en "Clean LB TESTLB959 "
        openstack loadbalancer delete --cascade TESTLB959 --wait
        #let RC+=$?
        echo -en "\nClean Subnet TESTSUBNET959 "
        openstack subnet delete TESTSUBNET959
        let RC+=$?
        echo -en "\nClean Network TESTNET959 "
        openstack network delete TESTNET959
        let RC+=$?
        echo
        return $RC
}

if test "$1" = "-u"; then cleanup; exit; fi

NET=$(openstack network create TESTNET959 -f value -c id)
echo "Network $NET"
SUBNET=$(openstack subnet create --network TESTNET959 --subnet-range 192.168.59.0/24 TESTSUBNET959 -f value -c id)
echo "Subnet $SUBNET"
LB=$(openstack loadbalancer create --provider ovn --vip-subnet-id=$SUBNET --name TESTLB959 --wait -f value -c id)
echo "LB $LB"
POOL=$(openstack loadbalancer pool create --loadbalancer TESTLB959 --name TESTPOOL959 --protocol TCP --lb-algorithm SOURCE_IP_PORT --wait -f value -c id)
echo "Pool $POOL"
LISTENER=$(openstack loadbalancer listener create --name TESTLISTENER959 --default-pool TESTPOOL959 --protocol TCP --protocol-port 80 TESTLB959 --wait -f value -c id)
echo "Listener $LISTENER"
#HM=$(openstack loadbalancer healthmonitor create --name TESTHM959 --delay 3 --timeout 2 --max-retries 1 --max-retries-down 1 --type TCP TESTPOOL959 --wait -f value -c id)
#echo "HM $HM"
while test "$STOP" = "0"; do
        echo -en "Member 1: "
        MEM1=$(openstack loadbalancer member create --name TESTMEMBER959.1 --address 192.168.59.17 --subnet-id $SUBNET --protocol-port 80 TESTPOOL959 --wait -f value -c id)
        echo -en "$MEM1\nMember 2: "
        MEM2=$(openstack loadbalancer member create --name TESTMEMBER959.2 --address 192.168.59.18 --subnet-id $SUBNET --protocol-port 80 TESTPOOL959 --wait -f value -c id)
        echo "$MEM2"
        sleep 1
        echo -en "Del Member1 $MEM1 "
        openstack loadbalancer member delete TESTPOOL959 $MEM1 --wait
        echo -en "\nDel Member2 $MEM2 "
        openstack loadbalancer member delete TESTPOOL959 $MEM2 --wait
        echo
        sleep 1
done
cleanup

garloff commented 4 months ago

This test script causes my octavia-api service (Bobcat via kolla-ansible) to leak file descriptors at a rate of up to ~30/min. Interestingly, the behavior is bursty. I see lots of pipe file descriptors in the lsof listings. Many (all?) of the pipes go away after a while, so cleanup does happen for them after a few minutes. Once we start creating many pipes again and get near the end of the fd list, the highest FD starts increasing again.

garloff commented 4 months ago

[Graph: maxfd] The script exhausts the two octavia-api processes' FDs in less than an hour. One can see that the pipe FDs are regularly cleaned up, while the FDs for long-gone connections (socket) and the FDs for eventpoll keep increasing linearly. (We see the sum over the two octavia-api processes.) After a pipe cleanup, the kernel first reuses the freed FDs before the highest FD needs to grow further. "Misc" covers unused FDs and the handful of FDs (e.g. regular files) that don't fit into the classification used here.
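
The same classification can be reproduced without lsof; a sketch (the PID is an assumption; it needs to run as the same user as the process or as root) that buckets a process's descriptors by reading the /proc/<pid>/fd symlinks:

# Sketch: classify a process's open fds via /proc. Note that this lumps live
# TCP connections and orphaned sockets together as "socket"; lsof is needed to
# tell them apart as done for the graph above.
import os
from collections import Counter

def classify_fds(pid: int) -> Counter:
    counts = Counter()
    for fd in os.listdir(f"/proc/{pid}/fd"):
        try:
            target = os.readlink(f"/proc/{pid}/fd/{fd}")
        except OSError:
            continue  # fd was closed between listdir() and readlink()
        if "[eventpoll]" in target:
            counts["eventpoll"] += 1
        elif target.startswith("socket:"):
            counts["socket"] += 1
        elif target.startswith("pipe:"):
            counts["pipe"] += 1
        else:
            counts["misc"] += 1
    return counts

print(classify_fds(2625091))   # PID from the earlier lsof listing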

garloff commented 4 months ago

[Graph: maxfd2] The startup phase, zoomed in. (eventpoll seems to be roughly 2x the sum of conn+sock.)

garloff commented 4 months ago

[Attachment: maxfd.csv] The raw data file, generated by watching lsof.

artificial-intelligence commented 1 week ago

I'm poking upstream to maybe get another look at this issue (the last reply there is from May 2024).