sysown / proxysql

High-performance MySQL proxy with a GPL license.
http://www.proxysql.com
GNU General Public License v3.0

proxysql high cpu usage after 24 hours #826

Closed ayder closed 6 years ago

ayder commented 7 years ago

ProxySQL CPU usage increases greatly within 24 hours with only 2 connections active. The server is RHEL 6.8 running the 2.6.32-642.4.2.el6.x86_64 kernel and proxysql-1.3.0h-1.x86_64.

 
top - 15:42:22 up 68 days, 13 min,  1 user,  load average: 1.44, 1.38, 1.36
Tasks: 111 total,   1 running, 110 sleeping,   0 stopped,   0 zombie
Cpu(s): 24.3%us, 75.6%sy,  0.0%ni,  0.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1914216k total,  1223260k used,   690956k free,   173528k buffers
Swap:  2093052k total,    11216k used,  2081836k free,   711984k cached

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3554 root      20   0  222m  31m 2632 S 199.5  1.7 391:54.63 proxysql
 1520 root      20   0  930m 6424 2980 S  0.3  0.3  49:55.22 dad
root      3553     1  0 Dec06 ?        00:00:00 proxysql -c /etc/proxysql.cnf -D /var/lib/proxysql
root      3554  3553 26 Dec06 ?        06:41:30 proxysql -c /etc/proxysql.cnf -D /var/lib/proxysql
renecannao commented 7 years ago

It seems I need to merge this checker with the checker from Percona.

Anyway, if the CPU spikes again, is there anything you would like me to capture?

Yes. Please refer to this comment. Thanks.
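
(For reference, a minimal sketch of the kind of capture used later in this thread, assuming pt-pmp from Percona Toolkit and perf are installed; the 60-second window is arbitrary:)

# Grab the PID of the newest proxysql process, sample its stacks, then profile it with perf
PID=$(pgrep -nx proxysql)
pt-pmp --pid "$PID" --iterations 30
perf record -p "$PID" -g -- sleep 60
perf report --stdio > report.txt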

bweston92 commented 7 years ago

I'm getting this within a few hours of starting up, for about a month now.

[root@LION:~]# mysql -u admin -padmin -h 127.0.0.1 -P6032 -e "SELECT @@version" -NB
1.3.4-0-gd158707
[root@LION:~]# ps aux | grep proxysql
root     28041  0.0  0.0  16800 13148 ?        Ssl  Mar16   0:26 /srv/proxysql_exporter -web.listen-address=:42004
root     28482  0.0  0.0  34452  2972 ?        S    Mar16   0:00 proxysql -c /etc/proxysql.cnf -D /var/lib/proxysql
root     28484 87.9  0.4 336200 71332 ?        Sl   Mar16 5223:20 proxysql -c /etc/proxysql.cnf -D /var/lib/proxysql
[root@LION:~]# uname -a
Linux LION 4.9.0-2-amd64 #1 SMP Debian 4.9.13-1 (2017-02-27) x86_64 GNU/Linux
[root@LION:~]# mysql -u admin -padmin -h 127.0.0.1 -P6032 -e "SELECT * FROM scheduler" -NB
1   1   5000    /var/lib/proxysql/proxysql_galera_checker.sh    1   2   1   1   /var/lib/proxysql/proxysql_galera_checker.log   
[root@LION:~]# mysql -u admin -padmin -h 127.0.0.1 -P6032 -e "SELECT * FROM runtime_scheduler" -NB
1   1   5000    /var/lib/proxysql/proxysql_galera_checker.sh    1   2   1   1   /var/lib/proxysql/proxysql_galera_checker.log

[screenshot from 2017-03-20 14-54-13]

renecannao commented 7 years ago

@bweston92 : from the graphs, if I am reading them correctly, it seems you have spiky CPU usage, not constantly high CPU usage. This could be anything, even traffic related. Still, in case it is proxysql_galera_checker.sh, can you modify it as reported in https://github.com/sysown/proxysql/issues/826#issuecomment-286611276 ? Thanks

While I am on this issue... @doddys : any more CPU weirdness after the change?

bweston92 commented 7 years ago

@renecannao during the spike we have at most 10 users a minute. Not only that, I'd expect some correlation with mysqld.

[screenshot from 2017-03-20 15-58-41]

doddys commented 7 years ago

@renecannao I have not encountered the same problem anymore, but I have upgraded to 1.3.4 due to some crashes during a high number of connections. I will report back if the problem reappears.

bweston92 commented 7 years ago

@renecannao also, when I look at top it is constantly at 100%; however, everything still works fine.

renecannao commented 7 years ago

@doddys : great!

@bweston92 : ok, this needs further investigation. Can you please collect some metrics as described in https://github.com/sysown/proxysql/issues/826#issuecomment-265441199 ? Thanks

bweston92 commented 7 years ago

Ok, see the attached report.txt.

pt-pmp --pid 28484 --iterations 30

[root@LION:~]# pt-pmp --pid 28484 --iterations 30
Mon 20 Mar 16:06:17 GMT 2017
84  ../sysdeps/unix/syscall-template.S: No such file or directory.
    240 pthread_cond_wait,wqueue::remove,ConsumerThread::run,start_thread,clone
    120 epoll_wait,MySQL_Thread::run,mysql_worker_thread_func_idles,start_thread,clone
    106 poll,MySQL_Thread::run,mysql_worker_thread_func,start_thread,clone
     30 poll,admin_main_loop,start_thread,clone
     30 nanosleep,usleep,Query_Cache::purgeHash_thread,mysql_shared_query_cache_funct,start_thread,clone
     30 nanosleep,usleep,MySQL_Monitor::run,libstdc++::??(libstdc++.so.6),start_thread,clone
     30 nanosleep,usleep,MySQL_Monitor::monitor_replication_lag,monitor_replication_lag_pthread,start_thread,clone
     30 nanosleep,usleep,MySQL_Monitor::monitor_read_only,monitor_read_only_pthread,start_thread,clone
     30 nanosleep,usleep,MySQL_Monitor::monitor_ping,monitor_ping_pthread,start_thread,clone
     30 nanosleep,usleep,MySQL_Monitor::monitor_connect,monitor_connect_pthread,start_thread,clone
     30 nanosleep,usleep,main
      4 __waitpid,waitpid_thread,start_thread,clone
      4 nanosleep,usleep,MySQL_Thread::run,mysql_worker_thread_func,start_thread,clone
      4 MySQL_Thread::run,mysql_worker_thread_func,start_thread,clone
      2 __random,rand,MySQL_Thread::run,mysql_worker_thread_func,start_thread,clone
      2 __libc_send,my_send_async(mysql_async.c:174),vio_write(violite.c:441),net_real_write(net.c:438),net_flush(net.c:255),net_write_command(net.c:333),mthd_my_send_cmd(libmariadb.c:612),mysql_real_query(libmariadb.c:2538),mysql_real_query_start_internal(mysql_async.c:490),my_context_spawn(my_context.c:205)
      1 poll,child_mysql,start_thread,clone
      1 MySQL_Data_Stream::set_pollout,MySQL_Thread::run,mysql_worker_thread_func,start_thread,clone
      1 __libc_send,my_send_async(mysql_async.c:174),vio_write(violite.c:441),net_real_write(net.c:438),net_flush(net.c:255),net_write_command(net.c:333),mthd_my_send_cmd(libmariadb.c:612),send_change_user_packet(my_auth.c:239),client_mpvio_write_packet(my_auth.c:239),native_password_auth_client(my_auth.c:132),run_plugin_auth(my_auth.c:646),mysql_change_user(libmariadb.c:2222),mysql_change_user_start_internal(mysql_async.c:846),my_context_spawn(my_context.c:205)
renecannao commented 7 years ago

@bweston92 : According to pt-pmp, proxysql is mainly sending traffic to the backend.

But the output of perf is really, really odd. clear_page_c_e is a kernel call related to memory management, and SpookyHash::Final should be extremely fast! Is this a VM? Which Linux version?

bweston92 commented 7 years ago

@renecannao nope bare metal.

[root@LION:~]# cat /etc/issue
Debian GNU/Linux 9 \n \l

[root@LION:~]# uname -a
Linux LION 4.9.0-2-amd64 #1 SMP Debian 4.9.13-1 (2017-02-27) x86_64 GNU/Linux

[root@LION:~]# cat /proc/cpuinfo | grep -c hyper
0
renecannao commented 7 years ago

Let's start by tuning ProxySQL to use the correct number of threads. As your system seems to have 4 cores, and there are a lot of services running, I would configure ProxySQL to use only one core by running this on the Admin interface:

SET mysql-threads=1;
SAVE MYSQL VARIABLES TO DISK;
PROXYSQL RESTART;

Thanks
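
(A sketch of how to confirm the new value after the restart, assuming the admin credentials and port used earlier in this thread:)

# Query the Admin interface for the runtime value of mysql-threads
mysql -u admin -padmin -h 127.0.0.1 -P6032 -e "SELECT * FROM global_variables WHERE variable_name='mysql-threads'"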

bweston92 commented 7 years ago

By "there is a lot" you mean ProxySQL and MySQL alone, right? Also, the CPU has just jumped right down; I will report back if anything goes back up. What may have caused this?

renecannao commented 7 years ago

By "a lot" I mean that maybe there is more than ProxySQL and MySQL : I see also mongod and corosync using their amount of CPU. I am not sure what causes the issue, but the output of perf report is really odd. Let's start configuring the right number of threads, and let's see if the issue is related or not

bweston92 commented 7 years ago

Ok, I left it overnight and now it is back.

Threads are still set to one:

+----------------------------------------+----------------------------------------------+
| Variable_name                          | Value                                        |
+----------------------------------------+----------------------------------------------+
| mysql-threads                          | 1                                            |
+----------------------------------------+----------------------------------------------+
[root@LION:~]# systemctl status proxysql
● proxysql.service - Cluster Controlled proxysql
   Loaded: loaded (/etc/init.d/proxysql; generated; vendor preset: enabled)
  Drop-In: /run/systemd/system/proxysql.service.d
           └─50-pacemaker.conf
   Active: active (running) since Thu 2017-03-16 11:52:19 GMT; 4 days ago
     Docs: man:systemd-sysv-generator(8)
  Process: 28473 ExecStart=/etc/init.d/proxysql start (code=exited, status=0/SUCCESS)
    Tasks: 13 (limit: 4915)
   Memory: 326.0M
      CPU: 4d 2h 1min 13.877s
   CGroup: /system.slice/proxysql.service
           ├─28482 proxysql -c /etc/proxysql.cnf -D /var/lib/proxysql
           └─28484 proxysql -c /etc/proxysql.cnf -D /var/lib/proxysql

[screenshot from 2017-03-21 08-22-09] [screenshot from 2017-03-21 08-23-03]

renecannao commented 7 years ago

@bweston92 : I am really sorry to see that you still have these spikes. Did you modify proxysql_galera_checker.sh as suggested before?

After you modify proxysql_galera_checker.sh, please collect metrics again with perf, but this time using this command:

perf record -a

Let it run for some time, say 15-30 minutes. Then again:

perf report > report.txt

Thank you
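
(A minimal sketch for bounding the capture window instead of stopping it by hand, assuming perf is installed; the duration is arbitrary:)

# Record all CPUs for 30 minutes, then write a plain-text report
perf record -a -- sleep 1800
perf report --stdio > report.txt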

bweston92 commented 7 years ago

report-latest.txt

renecannao commented 7 years ago

A bit of bash kung fu:

for i in `head -n 43723 report-latest.txt | tail -n 43712 | grep -v '0.00%' | awk '{print $2}' | sort | uniq -c | awk '{print $2}'` ; do echo -n `cut -b 1-29 report-latest.txt | grep "$i " | tr '%' ' ' | awk '{print $1}' | grep -v '0.00' | awk '{ SUM += $1} END { print SUM }'` ; echo "% $i" ; done | sort -n -r
37.04% swapper
22.33% proxysql
15.15% mysqld
5.61% corosync
4.26% mongod
2.74% rsync
1.93% kthreadd
1.72% proxysql_galera
1.38% node_exporter
0.9% kworker/3:0
0.81% mysql
0.77% rcu_sched
0.45% kworker/2:1
0.44% systemd
0.44% modprobe
0.41% jbd2/dm-7-8
0.33% ha_logd
0.33% flannel
0.32% date
0.28% pcs
0.21% bash
0.21% IPaddr2
0.19% ps
0.19% awk
0.17% docker
0.17% dbus-daemon
0.15% proxysql_export
0.14% mongodb_exporte
0.14% kworker/2:1H
0.13% ksoftirqd/2
0.11% systemctl
0.11% lrmd
0.11% attrd
0.1% tail
0.08% pacemakerd
0.08% crmd
0.07% kworker/3:1H
0.04% timeout
0.04% mkdir
0.04% grep
0.03% wsrep_sst_rsync
0.03% systemd-journal
0.03% proxySQLWatcher
0.03% jbd2/dm-2-8
0.03% iptables
0.02% sshd
0.02% send_arp
0.02% ntpd

The sum of these values is 99.43%, so the rounding error is minimal! The above already shows something interesting: proxysql is not the process/service using the most CPU.

A bit more shell scripting, this time specific to proxysql, shows that a lot of its time is spent in kernel space:

grep "proxysql " report-latest.txt | grep -v '0.00%' > proxysql_report-latest.txt
$ for i in `cat proxysql_report-latest.txt | awk '{print $3}' | sort | uniq -c | awk '{print $2}' | tr -d '[]'` ; do echo -n `awk '{print $1" "$3}' proxysql_report-latest.txt | grep $i | awk '{print $1}' | tr -d '%' | awk '{ SUM += $1} END { print SUM }'` ; echo "% $i" ; done | sort -n -r
13.35% kernel.kallsyms
6.9% proxysql
1.42% libpthread-2.24.so
0.31% unknown
0.29% libc-2.24.so
0.03% vdso
0.03% libstdc++.so.6.0.22

In user space, the amount of CPU time spent by proxysql is less than that spent by mysqld:

grep "mysqld " report-latest.txt | grep -v '0.00%' > mysqld_report-latest.txt
for i in `cat mysqld_report-latest.txt | awk '{print $3}' | sort | uniq -c | awk '{print $2}' | tr -d '[]'` ; do echo -n `awk '{print $1" "$3}' mysqld_report-latest.txt | grep $i | awk '{print $1}' | tr -d '%' | awk '{ SUM += $1} END { print SUM }'` ; echo "% $i" ; done | sort -n -r
8.68% mysqld
3.51% kernel.kallsyms
1.24% libc-2.19.so
0.98% unknown
0.27% libgalera_smm.so
0.18% libaio.so.1.0.1
0.15% libstdc++.so.6.0.20
0.11% libpthread-2.19.so
0.03% ld-2.19.so
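
(As an aside, perf can produce a similar per-process and per-shared-object breakdown directly from the same perf.data; a sketch:)

# Group samples by command name and DSO instead of by symbol
perf report --sort comm,dso --stdio > breakdown.txt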

But what concerns me more is the first line of perf report:

    12.55%  swapper          [kernel.kallsyms]            [k] intel_idle

I suspect that the odd CPU behaviour you are seeing is caused by power management. Is it possible to disable any power management in this box? Thanks
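
(A sketch of how to check whether frequency scaling is in play, assuming the cpufreq sysfs interface and the cpupower tool are available:)

# Current scaling governor per core
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Switch all cores to the performance governor
cpupower frequency-set -g performance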

bweston92 commented 7 years ago

@renecannao ok, however:

Only when the machine is running ProxySQL does this issue occur.

renecannao commented 7 years ago

@bweston92 , to be clear, I am not saying "there is nothing wrong with ProxySQL". I am saying that the behavior you are seeing (and that I am not able to reproduce on hundreds of different servers) may be triggered by something specific in your system, and we need to understand what triggers it.

oytuntez commented 7 years ago

We also encountered this, with CPU usage around 200%.

I realized the log file keeps recording this every 2 seconds:

2017-03-26 21:34:04 [INFO] New mysql_replication_hostgroups table
writer_hostgroup: 1 , reader_hostgroup: 2, TMS Base Writer and Reader Group
writer_hostgroup: 3 , reader_hostgroup: 4, TMS TM Writer and Reader Group
HID: 1 , address: HOST1 , port: 3306 , weight: 1 , status: OFFLINE_HARD , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 2 , address: HOST1_RO , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 2 , address: HOST1 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 3 , address: HOST2 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 4 , address: HOST2_RO , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 4 , address: HOST2 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 100 , address: HOST3 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 1 , address: HOST1 , port: 3306 , weight: 1 , status: OFFLINE_HARD , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 2 , address: HOST1_RO , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 2 , address: HOST1 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 3 , address: HOST2 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 4 , address: HOST2_RO , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 4 , address: HOST2 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 100 , address: HOST3 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 1 , address: HOST1 , port: 3306 , weight: 1 , status: OFFLINE_HARD , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 2 , address: HOST1_RO , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 2 , address: HOST1 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 3 , address: HOST2 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 4 , address: HOST2_RO , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 4 , address: HOST2 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 100 , address: HOST3 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 1 , address: HOST1 , port: 3306 , weight: 1 , status: OFFLINE_HARD , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 2 , address: HOST1_RO , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 2 , address: HOST1 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 3 , address: HOST2 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 4 , address: HOST2_RO , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 4 , address: HOST2 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 100 , address: HOST3 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
2017-03-26 21:34:06 [INFO] New mysql_replication_hostgroups table
writer_hostgroup: 1 , reader_hostgroup: 2, TMS Base Writer and Reader Group
writer_hostgroup: 3 , reader_hostgroup: 4, TMS TM Writer and Reader Group

OFFLINE_HARD looks weird here; the host is accessible otherwise.
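
(A sketch of how to see how ProxySQL currently classifies each backend, via the Admin interface as used elsewhere in this thread:)

# List live backend status as seen by ProxySQL at runtime
mysql -u admin -padmin -h 127.0.0.1 -P6032 -e "SELECT hostgroup_id, hostname, port, status FROM runtime_mysql_servers"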

oytuntez commented 7 years ago

This is the top output:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  997 root      20   0  276160  37768   5052 S 199.7  0.5  14575:48 proxysql
renecannao commented 7 years ago

@oytuntez: the output you provided is not really helpful in debugging the problem. This issue is full of examples of how to collect metrics: the output of top gives no information. The same goes for the entries in the error log: copying the last 200-300 lines gives more information than "this is recorded every 2 seconds", because the log may also show what caused these lines to be written.

oytuntez commented 7 years ago

I will collect more once it occurs again. This was in production, so I didn't have much time.

NextStepGuru commented 7 years ago

I'm also experiencing this issue. CPU drops to 0% for a few seconds, swap spikes to 100% usage, then CPU spikes to 100% across all cores. I have a fairly simple setup: 3 servers, singlewrite mode. Sometimes it goes days without an issue, other times only a few hours. In all cases, when the CPU drops to 0%, proxysql stops handling requests and then the CPU spikes.

I first thought it was a configuration error, so I reinstalled the entire OS and set up proxysql again. But it started happening again last night. I'm using proxysql version 1.3.6-1 on Ubuntu 16.04.2 LTS. I am running 5.7.18-15-57-log Percona XtraDB Cluster (GPL), Release rel15, Revision 7693d6e, WSREP version 29.20, wsrep_29.20.

What information would you need from me to help you debug if this is a proxysql issue?

renecannao commented 7 years ago

@NextStepGuru : try 1.3.7; this patch may solve random hangs. But what concerns me in your comment is swap spiking to 100%: this could cause unpredictable consequences. Is it proxysql that is swapping? How much memory is it using?
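
(A sketch of how to check whether the proxysql processes themselves are being swapped, using standard /proc fields:)

# Show resident and swapped memory for each proxysql process
for pid in $(pgrep -x proxysql); do
  echo "== PID $pid =="
  grep -E 'VmRSS|VmSwap' /proc/$pid/status
done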

bertho-zero commented 4 years ago

I have deployed multi-cluster MySQL with Jelastic Cloud and also have peaks at 200% CPU for no apparent reason.

The servers are isolated, with only ProxySQL on them; the only connection that uses it is mine, and ProxySQL still ends up running away with the CPU.

I am using version 2.0.12 with mysql 8.0.20.