Closed — ayder closed this issue 6 years ago
It seems I need to merge this checker with the checker from Percona.
Anyway, if the CPU spikes again, is there anything you would like me to capture?
Yes. Please refer to this comment. Thanks.
I'm getting this within a few hours of starting up, for about a month now.
[root@LION:~]# mysql -u admin -padmin -h 127.0.0.1 -P6032 -e "SELECT @@version" -NB
1.3.4-0-gd158707
[root@LION:~]# ps aux | grep proxysql
root 28041 0.0 0.0 16800 13148 ? Ssl Mar16 0:26 /srv/proxysql_exporter -web.listen-address=:42004
root 28482 0.0 0.0 34452 2972 ? S Mar16 0:00 proxysql -c /etc/proxysql.cnf -D /var/lib/proxysql
root 28484 87.9 0.4 336200 71332 ? Sl Mar16 5223:20 proxysql -c /etc/proxysql.cnf -D /var/lib/proxysql
[root@LION:~]# uname -a
Linux LION 4.9.0-2-amd64 #1 SMP Debian 4.9.13-1 (2017-02-27) x86_64 GNU/Linux
[root@LION:~]# mysql -u admin -padmin -h 127.0.0.1 -P6032 -e "SELECT * FROM scheduler" -NB
1 1 5000 /var/lib/proxysql/proxysql_galera_checker.sh 1 2 1 1 /var/lib/proxysql/proxysql_galera_checker.log
[root@LION:~]# mysql -u admin -padmin -h 127.0.0.1 -P6032 -e "SELECT * FROM runtime_scheduler" -NB
1 1 5000 /var/lib/proxysql/proxysql_galera_checker.sh 1 2 1 1 /var/lib/proxysql/proxysql_galera_checker.log
@bweston92 : from the graphs, if I am reading them correctly, it seems you have spiky CPU usage, not constantly high CPU usage. This could be anything, even traffic related.
Yet, in case it is proxysql_galera_checker.sh, can you modify it as reported in https://github.com/sysown/proxysql/issues/826#issuecomment-286611276 ?
Thanks
While I am on this issue... @doddys : any more CPU weirdness after the change?
@renecannao the spikes happen when we have at most 10 users a minute. Not only that, I'd expect some correlation with mysqld.
@renecannao I have not encountered the same problem anymore, but I have upgraded to 1.3.4 due to some crashes during a high number of connections. I will report back when the problem reappears.
@renecannao also, when I look at top it is just constantly 100%; however, everything still works fine.
@doddys : great!
@bweston92 : ok, this needs further investigation. Can you please collect some metrics as described in https://github.com/sysown/proxysql/issues/826#issuecomment-265441199 ? Thanks
Ok, see the attached report.txt.
[root@LION:~]# pt-pmp --pid 28484 --iterations 30
Mon 20 Mar 16:06:17 GMT 2017
84 ../sysdeps/unix/syscall-template.S: No such file or directory.
240 pthread_cond_wait,wqueue::remove,ConsumerThread::run,start_thread,clone
120 epoll_wait,MySQL_Thread::run,mysql_worker_thread_func_idles,start_thread,clone
106 poll,MySQL_Thread::run,mysql_worker_thread_func,start_thread,clone
30 poll,admin_main_loop,start_thread,clone
30 nanosleep,usleep,Query_Cache::purgeHash_thread,mysql_shared_query_cache_funct,start_thread,clone
30 nanosleep,usleep,MySQL_Monitor::run,libstdc++::??(libstdc++.so.6),start_thread,clone
30 nanosleep,usleep,MySQL_Monitor::monitor_replication_lag,monitor_replication_lag_pthread,start_thread,clone
30 nanosleep,usleep,MySQL_Monitor::monitor_read_only,monitor_read_only_pthread,start_thread,clone
30 nanosleep,usleep,MySQL_Monitor::monitor_ping,monitor_ping_pthread,start_thread,clone
30 nanosleep,usleep,MySQL_Monitor::monitor_connect,monitor_connect_pthread,start_thread,clone
30 nanosleep,usleep,main
4 __waitpid,waitpid_thread,start_thread,clone
4 nanosleep,usleep,MySQL_Thread::run,mysql_worker_thread_func,start_thread,clone
4 MySQL_Thread::run,mysql_worker_thread_func,start_thread,clone
2 __random,rand,MySQL_Thread::run,mysql_worker_thread_func,start_thread,clone
2 __libc_send,my_send_async(mysql_async.c:174),vio_write(violite.c:441),net_real_write(net.c:438),net_flush(net.c:255),net_write_command(net.c:333),mthd_my_send_cmd(libmariadb.c:612),mysql_real_query(libmariadb.c:2538),mysql_real_query_start_internal(mysql_async.c:490),my_context_spawn(my_context.c:205)
1 poll,child_mysql,start_thread,clone
1 MySQL_Data_Stream::set_pollout,MySQL_Thread::run,mysql_worker_thread_func,start_thread,clone
1 __libc_send,my_send_async(mysql_async.c:174),vio_write(violite.c:441),net_real_write(net.c:438),net_flush(net.c:255),net_write_command(net.c:333),mthd_my_send_cmd(libmariadb.c:612),send_change_user_packet(my_auth.c:239),client_mpvio_write_packet(my_auth.c:239),native_password_auth_client(my_auth.c:132),run_plugin_auth(my_auth.c:646),mysql_change_user(libmariadb.c:2222),mysql_change_user_start_internal(mysql_async.c:846),my_context_spawn(my_context.c:205)
@bweston92 : according to pt-pmp, proxysql is mainly sending traffic to the backend. But the output of perf is really odd: clear_page_c_e is a kernel call related to memory management, and SpookyHash::Final should be extremely fast! Is this a VM? Which Linux version?
@renecannao nope bare metal.
[root@LION:~]# cat /etc/issue
Debian GNU/Linux 9 \n \l
[root@LION:~]# uname -a
Linux LION 4.9.0-2-amd64 #1 SMP Debian 4.9.13-1 (2017-02-27) x86_64 GNU/Linux
[root@LION:~]# cat /proc/cpuinfo | grep -c hyper
0
Let's start by tuning ProxySQL to use the correct number of threads. As your system seems to have 4 cores and there are a lot of services running, I would configure ProxySQL to use only one core, by running this on the Admin interface:
SET mysql-threads=1;
SAVE MYSQL VARIABLES TO DISK;
PROXYSQL RESTART;
Thanks
By "there are a lot" you mean ProxySQL and MySQL alone, right? Also, the CPU has just dropped right down; I will report back if anything goes back up. What may have caused this?
By "a lot" I mean that maybe there is more than ProxySQL and MySQL: I also see mongod and corosync using their share of CPU.
I am not sure what causes the issue, but the output of perf report is really odd.
Let's start by configuring the right number of threads, and see whether the issue is related or not.
Ok left it over night and now it is back.
Threads is still set to one:
+----------------------------------------+----------------------------------------------+
| Variable_name | Value |
+----------------------------------------+----------------------------------------------+
| mysql-threads | 1 |
+----------------------------------------+----------------------------------------------+
[root@LION:~]# systemctl status proxysql
● proxysql.service - Cluster Controlled proxysql
Loaded: loaded (/etc/init.d/proxysql; generated; vendor preset: enabled)
Drop-In: /run/systemd/system/proxysql.service.d
└─50-pacemaker.conf
Active: active (running) since Thu 2017-03-16 11:52:19 GMT; 4 days ago
Docs: man:systemd-sysv-generator(8)
Process: 28473 ExecStart=/etc/init.d/proxysql start (code=exited, status=0/SUCCESS)
Tasks: 13 (limit: 4915)
Memory: 326.0M
CPU: 4d 2h 1min 13.877s
CGroup: /system.slice/proxysql.service
├─28482 proxysql -c /etc/proxysql.cnf -D /var/lib/proxysql
└─28484 proxysql -c /etc/proxysql.cnf -D /var/lib/proxysql
@bweston92 : I am really sorry to see that you still have these spikes.
Did you modify proxysql_galera_checker.sh as suggested before? After you modify it, please collect metrics again with perf, but this time using this command:
perf record -a
Let it run for some time, like 15-30 minutes. Then again:
perf report > report.txt
Thank you
A bit of bash kung fu:
for i in `head -n 43723 report-latest.txt | tail -n 43712 | grep -v '0.00%' | awk '{print $2}' | sort | uniq -c | awk '{print $2}'` ; do echo -n `cut -b 1-29 report-latest.txt | grep "$i " | tr '%' ' ' | awk '{print $1}' | grep -v '0.00' | awk '{ SUM += $1} END { print SUM }'` ; echo "% $i" ; done | sort -n -r
37.04% swapper
22.33% proxysql
15.15% mysqld
5.61% corosync
4.26% mongod
2.74% rsync
1.93% kthreadd
1.72% proxysql_galera
1.38% node_exporter
0.9% kworker/3:0
0.81% mysql
0.77% rcu_sched
0.45% kworker/2:1
0.44% systemd
0.44% modprobe
0.41% jbd2/dm-7-8
0.33% ha_logd
0.33% flannel
0.32% date
0.28% pcs
0.21% bash
0.21% IPaddr2
0.19% ps
0.19% awk
0.17% docker
0.17% dbus-daemon
0.15% proxysql_export
0.14% mongodb_exporte
0.14% kworker/2:1H
0.13% ksoftirqd/2
0.11% systemctl
0.11% lrmd
0.11% attrd
0.1% tail
0.08% pacemakerd
0.08% crmd
0.07% kworker/3:1H
0.04% timeout
0.04% mkdir
0.04% grep
0.03% wsrep_sst_rsync
0.03% systemd-journal
0.03% proxySQLWatcher
0.03% jbd2/dm-2-8
0.03% iptables
0.02% sshd
0.02% send_arp
0.02% ntpd
The sum of these values is 99.43%, so the rounding error is minimal. The above already shows something interesting: proxysql is not the process/service using the most CPU.
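The long one-liner above can be expressed more readably as a small function that sums the overhead column per command. This is a sketch; it assumes the default `perf report --stdio` column layout (overhead%, command, shared object, symbol), and the sample report file below is synthetic, only to demonstrate the output shape:

```shell
# Sum perf overhead percentages per command name (column 2).
sum_by_command() {
    grep '%' "$1" | grep -v '^#' |
      awk '{ gsub(/%/, "", $1); sum[$2] += $1 }
           END { for (c in sum) printf "%.2f%% %s\n", sum[c], c }' |
      sort -nr
}

# Synthetic sample, only to demonstrate the output shape:
cat > /tmp/report-sample.txt <<'EOF'
    12.55%  swapper   [kernel.kallsyms]  [k] intel_idle
     8.68%  mysqld    mysqld             [.] do_command
     6.90%  proxysql  proxysql           [.] MySQL_Thread::run
     3.51%  mysqld    [kernel.kallsyms]  [k] entry_SYSCALL_64
EOF

sum_by_command /tmp/report-sample.txt
# prints:
#   12.55% swapper
#   12.19% mysqld
#   6.90% proxysql
```

Against a real `report.txt` you would just call `sum_by_command report.txt`.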
A bit more shell scripting, looking specifically at proxysql: a lot of its time is spent in kernel space:
grep "proxysql " report-latest.txt | grep -v '0.00%' > proxysql_report-latest.txt
$ for i in `cat proxysql_report-latest.txt | awk '{print $3}' | sort | uniq -c | awk '{print $2}' | tr -d '[]'` ; do echo -n `awk '{print $1" "$3}' proxysql_report-latest.txt | grep $i | awk '{print $1}' | tr -d '%' | awk '{ SUM += $1} END { print SUM }'` ; echo "% $i" ; done | sort -n -r
13.35% kernel.kallsyms
6.9% proxysql
1.42% libpthread-2.24.so
0.31% unknown
0.29% libc-2.24.so
0.03% vdso
0.03% libstdc++.so.6.0.22
In user space, the amount of CPU time spent by proxysql is less than that spent by mysqld:
grep "mysqld " report-latest.txt | grep -v '0.00%' > mysqld_report-latest.txt
for i in `cat mysqld_report-latest.txt | awk '{print $3}' | sort | uniq -c | awk '{print $2}' | tr -d '[]'` ; do echo -n `awk '{print $1" "$3}' mysqld_report-latest.txt | grep $i | awk '{print $1}' | tr -d '%' | awk '{ SUM += $1} END { print SUM }'` ; echo "% $i" ; done | sort -n -r
8.68% mysqld
3.51% kernel.kallsyms
1.24% libc-2.19.so
0.98% unknown
0.27% libgalera_smm.so
0.18% libaio.so.1.0.1
0.15% libstdc++.so.6.0.20
0.11% libpthread-2.19.so
0.03% ld-2.19.so
But what concerns me more is the first line of perf report:
12.55% swapper [kernel.kallsyms] [k] intel_idle
I suspect that the odd CPU behaviour you are seeing is caused by power management. Is it possible to disable any power management on this box? Thanks
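One concrete thing to check is the CPU frequency scaling governor. A sketch (the sysfs paths are the standard Linux cpufreq locations; they may be absent on some kernels or VMs):

```shell
# Print the scaling governor of each core; an 'ondemand' or 'powersave'
# governor can produce exactly this kind of erratic CPU accounting.
check_governors() {
    found=0
    for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
        if [ -r "$g" ]; then
            echo "$g: $(cat "$g")"
            found=1
        fi
    done
    # Fall back to a message so the function always produces output.
    [ "$found" -eq 1 ] || echo "no cpufreq sysfs entries found"
}

check_governors

# To pin the CPUs to full speed (requires root and the cpupower tool):
#   cpupower frequency-set -g performance
```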
@renecannao ok, however:
Only when the machine is running ProxySQL does this issue occur.
@bweston92 , to be clear, I am not saying "nothing wrong with ProxySQL" . I am saying that the behavior you are seeing (and that I am not able to reproduce in hundreds of different servers) may be triggered by something specific in your system, and we need to understand what triggers it.
We also encountered this, CPU usage around 200%.
I noticed the log file keeps recording this every 2 seconds:
2017-03-26 21:34:04 [INFO] New mysql_replication_hostgroups table
writer_hostgroup: 1 , reader_hostgroup: 2, TMS Base Writer and Reader Group
writer_hostgroup: 3 , reader_hostgroup: 4, TMS TM Writer and Reader Group
HID: 1 , address: HOST1 , port: 3306 , weight: 1 , status: OFFLINE_HARD , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 2 , address: HOST1_RO , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 2 , address: HOST1 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 3 , address: HOST2 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 4 , address: HOST2_RO , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 4 , address: HOST2 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 100 , address: HOST3 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 1 , address: HOST1 , port: 3306 , weight: 1 , status: OFFLINE_HARD , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 2 , address: HOST1_RO , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 2 , address: HOST1 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 3 , address: HOST2 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 4 , address: HOST2_RO , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 4 , address: HOST2 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 100 , address: HOST3 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 1 , address: HOST1 , port: 3306 , weight: 1 , status: OFFLINE_HARD , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 2 , address: HOST1_RO , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 2 , address: HOST1 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 3 , address: HOST2 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 4 , address: HOST2_RO , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 4 , address: HOST2 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 100 , address: HOST3 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 1 , address: HOST1 , port: 3306 , weight: 1 , status: OFFLINE_HARD , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 2 , address: HOST1_RO , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 2 , address: HOST1 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 3 , address: HOST2 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 4 , address: HOST2_RO , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 4 , address: HOST2 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
HID: 100 , address: HOST3 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment:
2017-03-26 21:34:06 [INFO] New mysql_replication_hostgroups table
writer_hostgroup: 1 , reader_hostgroup: 2, TMS Base Writer and Reader Group
writer_hostgroup: 3 , reader_hostgroup: 4, TMS TM Writer and Reader Group
OFFLINE_HARD looks weird here; the host is otherwise accessible.
This is the top output:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
997 root 20 0 276160 37768 5052 S 199.7 0.5 14575:48 proxysql
@oytuntez: the output you provided is not really helpful in debugging the problem.
This issue is full of examples of how to collect metrics: the output of top gives no information.
Similarly for the entries in the error log: copying the last 200-300 lines gives more information than "this is recorded every 2 seconds", because the log may also record what caused these lines to be written.
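Capturing the tail of the error log can be scripted. A sketch (the log path below is the package default and may differ on your install):

```shell
# Save the last 300 lines of the ProxySQL error log for attaching to a report.
# Takes an optional path argument; defaults to the packaged log location.
save_log_tail() {
    log="${1:-/var/lib/proxysql/proxysql.log}"
    if [ -r "$log" ]; then
        tail -n 300 "$log"
    else
        echo "log not readable at $log" >&2
        return 1
    fi
}

save_log_tail > /tmp/proxysql_log_tail.txt 2>&1 || true
```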
I will collect more once it occurs again. This was in production, so I didn't have much time.
I'm also experiencing this issue. CPU drops to 0% for a few seconds, swap spikes to 100% usage, then CPU spikes to 100% across all cores. I have a fairly simple setup: 3 servers, singlewrite mode. Sometimes it goes days without an issue, other times a few hours. In all cases, when the CPU drops to 0%, proxysql stops handling requests and then the CPU spikes.
I first thought it was a configuration error, so I reinstalled the entire OS and set up proxysql again. But it started happening again last night. I'm using proxysql version 1.3.6-1 on Ubuntu 16.04.2 LTS. I am running 5.7.18-15-57-log Percona XtraDB Cluster (GPL), Release rel15, Revision 7693d6e, WSREP version 29.20, wsrep_29.20.
What information would you need from me to help you debug if this is a proxysql issue?
@NextStepGuru : try 1.3.7, this patch may solve random hangs. But what concerns me in your comment is swap spiking to 100%: this could cause unpredictable consequences. Is it proxysql that is swapping? How much memory is it using?
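Whether proxysql itself is swapping can be read straight from /proc. A sketch (VmSwap appears in /proc/&lt;pid&gt;/status on Linux 2.6.34+ and reports 0 kB when nothing belonging to the process is swapped out):

```shell
# Print the resident-set and swap usage of a process from /proc/<pid>/status.
mem_usage() {
    awk '/^(VmRSS|VmSwap):/ { print $1, $2, $3 }' "/proc/$1/status"
}

# Demonstrated against the current shell. For proxysql, note that there are
# usually two processes (the angel parent and the worker child), so check both:
#   for p in $(pidof proxysql); do echo "pid $p:"; mem_usage "$p"; done
mem_usage $$
```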
I have deployed a multi-cluster MySQL setup with Jelastic Cloud and also see peaks at 200% CPU for no apparent reason.
The servers are isolated, ProxySQL is the only thing running on them, the only connection that uses it is mine, and ProxySQL always ends up racing.
I am using version 2.0.12 with mysql 8.0.20.
ProxySQL CPU usage greatly increases over 24 hours with only 2 connections active. The server is RHEL 6.8 with the 2.6.32-642.4.2.el6.x86_64 kernel, running proxysql-1.3.0h-1.x86_64.