lsst-uk / somerville-operations

User issue reporting and tracking for the Somerville Cloud

Controller nodes hitting limit on open connections #138

Closed: astrodb closed this issue 8 months ago

astrodb commented 10 months ago

Stelios reported an issue creating clusters and was getting the following error message:

Resource CREATE failed: RemoteError: resources.kube_cluster_deploy: Remote error: DBConnectionError (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '10.19.3.200' (timed out)")

Looking at the system log on sv-ctrl-1, I see the following warnings:

kernel: nf_conntrack: nf_conntrack: table full, dropping packet

Searching for that message online indicates that the limit on tracked connections (nf_conntrack_max) is being hit and exceeded: https://kodeslogic.medium.com/how-to-fix-nf-conntrack-table-full-dropping-packet-a5fedc6c463d
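To confirm that the table really is near its limit, the current entry count can be compared with the maximum. The following is a minimal sketch, assuming the standard procfs files nf_conntrack_count and nf_conntrack_max are present on the controller; the function names and the 80% warning threshold are illustrative choices, not something taken from this issue.

```python
from pathlib import Path

# Standard procfs location of the netfilter connection-tracking counters.
PROC_DIR = Path("/proc/sys/net/netfilter")


def conntrack_usage(proc_dir: Path = PROC_DIR) -> tuple[int, int, float]:
    """Return (count, maximum, fraction used) for the conntrack table."""
    count = int((proc_dir / "nf_conntrack_count").read_text().strip())
    maximum = int((proc_dir / "nf_conntrack_max").read_text().strip())
    return count, maximum, count / maximum


def describe(count: int, maximum: int, warn_at: float = 0.8) -> str:
    """Format a one-line summary; warn_at is an arbitrary example threshold."""
    fraction = count / maximum
    status = "WARNING" if fraction >= warn_at else "ok"
    return f"{status}: {count}/{maximum} tracked connections ({fraction:.0%})"
```

On a controller, `print(describe(*conntrack_usage()[:2]))` would give a quick reading; a value close to 100% is consistent with the "table full, dropping packet" warnings.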

astrodb commented 10 months ago

To alleviate the problem I've applied the fix to all three controllers. First, running this command:

echo 524288 > /proc/sys/net/netfilter/nf_conntrack_max

Then adding net.netfilter.nf_conntrack_max = 524288 to /etc/sysctl.conf
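Since the change has to be made in two places (the running kernel and /etc/sysctl.conf), it may be worth checking that the persisted value is what was intended. This is a rough sketch, assuming the usual `key = value` layout of sysctl.conf with `#` or `;` comment lines; `sysctl_setting` is a hypothetical helper, and it assumes that a later assignment of the same key overrides an earlier one.

```python
from typing import Optional


def sysctl_setting(conf_text: str, key: str) -> Optional[str]:
    """Return the value assigned to key in sysctl.conf-style text, or None.

    If the key is assigned more than once, the last assignment is returned.
    """
    value = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(("#", ";")):
            continue
        name, sep, val = line.partition("=")
        if sep and name.strip() == key:
            value = val.strip()
    return value
```

For example, `sysctl_setting(open("/etc/sysctl.conf").read(), "net.netfilter.nf_conntrack_max")` should return `"524288"` on each controller once the line has been added.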

astrodb commented 10 months ago

That didn't fully fix the problem.

Continued investigation shows errors with MySQL on sv-ctrl-2. Cinder logs there report:

/var/log/kolla/cinder/cinder-scheduler.log:2023-11-23 13:29:11.614 7 ERROR cinder.service [req-5635adff-ce1e-4e1a-8b1f-eee0ca429cff - - - - -] model server went away: oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')

Other servers show this in their mariadb.log files:

2023-11-23 13:21:08 60918867 [Warning] Aborted connection 60918867 to db: 'barbican' user: 'barbican' host: 'sv-ctrl-2' (Got an error reading communication packets)
2023-11-23 13:21:34 60919501 [Warning] Aborted connection 60919501 to db: 'neutron' user: 'neutron' host: 'sv-ctrl-2' (Got an error reading communication packets)
2023-11-23 13:22:15 60920309 [Warning] Aborted connection 60920309 to db: 'neutron' user: 'neutron' host: 'sv-ctrl-2' (Got an error reading communication packets)
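To see whether the aborted connections all come from one host, the warnings can be tallied per database and client host. This is a minimal sketch that assumes the log lines have the same shape as the three shown above; the regular expression and the name `tally_aborted` are illustrative and not part of MariaDB or Kolla.

```python
import re
from collections import Counter

# Matches the "Aborted connection" warnings in the form quoted in this issue.
ABORTED = re.compile(
    r"\[Warning\] Aborted connection \d+ to db: '(?P<db>[^']*)' "
    r"user: '(?P<user>[^']*)' host: '(?P<host>[^']*)'"
)


def tally_aborted(log_text: str) -> Counter:
    """Count aborted-connection warnings, keyed by (database, client host)."""
    return Counter((m["db"], m["host"]) for m in ABORTED.finditer(log_text))
```

Run over the full mariadb.log from each controller, the counts would show whether sv-ctrl-2 is the only client affected or whether other hosts are dropping connections too.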

astrodb commented 10 months ago

@GregBlow ran kayobe overcloud database recover, which has cleared the error messages in the mariadb logs. But the mariadb-cluster logs are still reporting:

2023/11/23 14:17:31 socat[1112761] W read(7, 0x563689098000, 8192): Connection reset by peer
2023/11/23 14:17:31 socat[1112761] W read(6, 0x563689098000, 8192): Connection reset by peer
2023/11/23 14:17:31 socat[1112767] W read(7, 0x563689098000, 8192): Connection reset by peer
2023/11/23 14:17:31 socat[1112767] W read(6, 0x563689098000, 8192): Connection reset by peer