proxysql high cpu usage after 24 hours

ayder commented 7 years ago

Proxysql CPU usage greatly increase in 24 hours with only 2 connections active. Server is RHEL 6.8 running 2.6.32-642.4.2.el6.x86_64 kernel running proxysql-1.3.0h-1.x86_64

 
top - 15:42:22 up 68 days, 13 min,  1 user,  load average: 1.44, 1.38, 1.36
Tasks: 111 total,   1 running, 110 sleeping,   0 stopped,   0 zombie
Cpu(s): 24.3%us, 75.6%sy,  0.0%ni,  0.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1914216k total,  1223260k used,   690956k free,   173528k buffers
Swap:  2093052k total,    11216k used,  2081836k free,   711984k cached

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3554 root      20   0  222m  31m 2632 S 199.5  1.7 391:54.63 proxysql
 1520 root      20   0  930m 6424 2980 S  0.3  0.3  49:55.22 dad

root      3553     1  0 Dec06 ?        00:00:00 proxysql -c /etc/proxysql.cnf -D /var/lib/proxysql
root      3554  3553 26 Dec06 ?        06:41:30 proxysql -c /etc/proxysql.cnf -D /var/lib/proxysql

renecannao commented 7 years ago

Hi Sinan,

Thank you for the report, seems like a bug: it is spinning somewhere. Can you please:

check if there is anything odd reported in error log /var/lib/proxysql/proxysql.log
run the follow command with pt-pmp:
```
sudo pt-pmp --pid 3554 --iterations 30
```
run the follow command with perf, let it run for few second before killing it (ctrl+c):
```
sudo perf record -p 3554
```
Once perf record is completed, please run the follow and attach report.txt
```
sudo perf report > report.txt
```

Thanks

ayder commented 7 years ago

I haven't notice any errors in proxysql.log, only some users activity logged.

Report file is attached.

report.txt

renecannao commented 7 years ago

Thanks. Could you please attach also the output of pt-pmp ? From the output of perf it seems it is spinning on poll().

Is the two connections in use stop sending traffic, does the CPU usage go to zero?

ayder commented 7 years ago

Sorry, I have already restarted the application.

Now pt-pmp is returning

Unhandled dwarf expression opcode 0xf3

rapidly

I'll check to drop connections when this occurs agains

geotro commented 7 years ago

I'm seeing similar with the latest 1.3.1 release. Attached is the result from a run under low load. I'll rerun this later under our full query load where we see two things happening:

CPU usage seems to creep up over time, starting at around 200-300% reported by top, to 700% or more hours later under the same query load
Once CPU usage is up we also notice a large increase in memory use by the proxysql process, which stays constantly low for many hours after a fresh start, until something happens and it balloons

Some stats: in full query load we have 20-40k qps, generating ~700MBit/s+ of db traffic. We recently switched to compressing mysql traffic to keep things in check until we optimize our db usage to reduce the amount of data we push into mysql and read out of it is a good test bed for proxysql performance.

Attached is a report you can see in the mean time but we'll rerun it later when more load is introduced

rep.txt

taylor840326 commented 7 years ago

I met same problem with 1.3.1 release. The proxysql running about 2 hours. I found CPU usage very high.

ee90a5ab-a333-42ac-8bec-88ea36dd3a46

proxysql.log include some errors message:

2016-12-29 10:30:01 sqlite3db.cpp:56:execute(): [ERROR] SQLITE error: cannot start a transaction within a transaction --- BEGIN 2016-12-29 10:30:08 sqlite3db.cpp:56:execute(): [ERROR] SQLITE error: cannot commit - no transaction is active --- COMMIT 2016-12-29 10:30:08 sqlite3db.cpp:56:execute(): [ERROR] SQLITE error: cannot start a transaction within a transaction --- BEGIN 2016-12-29 10:30:17 sqlite3db.cpp:56:execute(): [ERROR] SQLITE error: cannot commit - no transaction is active --- COMMIT 2016-12-29 10:39:01 sqlite3db.cpp:56:execute(): [ERROR] SQLITE error: cannot start a transaction within a transaction --- BEGIN 2016-12-29 10:39:02 sqlite3db.cpp:56:execute(): [ERROR] SQLITE error: cannot commit - no transaction is active --- COMMIT

Attached perf command report file.

report.txt

renecannao commented 7 years ago

This CPU usage seems a spike caused by some spinloop. @taylor840326 : are you graphing also ProxySQL's metrics? Would be interesting to find a correlation between that spike and other metrics.

Thanks

taylor840326 commented 7 years ago

I use telegraf+grafana monitoring CPU load

taylor840326 commented 7 years ago

@renecannao version 1.2.6 is ok.

renecannao commented 7 years ago

@taylor840326 : which version is not ok? When is not ok, do you have any profiling from perf or pt-pmp ?

taylor840326 commented 7 years ago

@renecannao I have tested v1.2.6/v1.3.1/v1.3.2 version, all the version have same problem. I'm not sure when the problem happended. Maybe 2 hours or 6 hours or 12 hours.

attached the perf report file. perf_1.2.6.txt

renecannao commented 7 years ago

@taylor840326 : this perf report doesn't seem related to what reported in #870 . Can you please run pt-pmp against proxysql? Also please share your configuration (from mysql_servers , global_variables and mysql_query_rules)?

taylor840326 commented 7 years ago

@renecannao ok, I will use pt-pmp comamnd next time.

ayder commented 7 years ago

proxysql-1.3.3 rpm running on RHEL 6.8 has the same issues

My Problem occurs mainly when there is a lag between the proxsql and mysql serves. ( Our case Proxysql is in branch office, and mysql is at production site)

# Samples: 3  of event 'cpu-clock'
# Event count (approx.): 750000
#
# Overhead  Command   Shared Object      Symbol
# ........  ........  .................  ...........
#
   100.00%  proxysql  [kernel.kallsyms]  [k] do_wait

renecannao commented 7 years ago

@ayder , can you please run pt-pmp against proxysql? When running ps you will notice two proxysql processes, one is the parent and the other is the child. Assuming 12345 is the pid of the child process, you need to run something like:

pt-pmp --iterations 30 -p 12345

I really want to close this issue, but I need to understand where it is using a lot of CPU. Thank you for your help.

ayder commented 7 years ago

The output of pt-pmp command against the child process of pt-pmp is Unhandled dwarf expression opcode 0xf3

    240 pthread_cond_wait,remove,ConsumerThread::run,start_thread(libpthread.so.0),clone(libc.so.6)
    120 epoll_wait(libc.so.6),MySQL_Thread::run,mysql_worker_thread_func_idles,start_thread(libpthread.so.0),clone(libc.so.6)
    118 poll(libc.so.6),MySQL_Thread::run,mysql_worker_thread_func,start_thread(libpthread.so.0),clone(libc.so.6)
     30 poll(libc.so.6),admin_main_loop,start_thread(libpthread.so.0),clone(libc.so.6)
     30 nanosleep(libc.so.6),usleep(libc.so.6),Query_Cache::purgeHash_thread,mysql_shared_query_cache_funct,start_thread(libpthread.so.0),clone(libc.so.6)
     30 nanosleep(libc.so.6),usleep(libc.so.6),MySQL_Monitor::run,libstdc++::??(libstdc++.so.6),start_thread(libpthread.so.0),clone(libc.so.6)
     30 nanosleep(libc.so.6),usleep(libc.so.6),MySQL_Monitor::monitor_replication_lag,monitor_replication_lag_pthread,start_thread(libpthread.so.0),clone(libc.so.6)
     30 nanosleep(libc.so.6),usleep(libc.so.6),MySQL_Monitor::monitor_read_only,monitor_read_only_pthread,start_thread(libpthread.so.0),clone(libc.so.6)
     30 nanosleep(libc.so.6),usleep(libc.so.6),MySQL_Monitor::monitor_ping,monitor_ping_pthread,start_thread(libpthread.so.0),clone(libc.so.6)
     30 nanosleep(libc.so.6),usleep(libc.so.6),MySQL_Monitor::monitor_connect,monitor_connect_pthread,start_thread(libpthread.so.0),clone(libc.so.6)
     30 nanosleep(libc.so.6),usleep(libc.so.6),main
      2 ::??,::??,::??

renecannao commented 7 years ago

@ayder , thank you very much! From the output it seems that all the threads are waiting. The threads in MySQL_Monitor are sleeping; The Admin thread is waiting on poll(); The 4 main MySQL_Thread are waiting on poll(); The 4 auxiliary MySQL_Thread are waiting on epoll(); The 8 ConsumerThread() are waiting in the queue. I am not sure what it is causing the CPU spike: from the above everything seems quiet, very idle.

Maybe it is possible that your system isn't able to manage so many threads. Can you please reconfigure proxysql to use just one thread? Thanks

set mysql-threads=1;
save mysql variables to disk;
proxysql restart;

ayder commented 7 years ago

After setting the mysql-threads =1 Variable Cpu usage is now %100 of 1 core of 2 core vm. (That was %200 before utilizing all the cores)

renecannao commented 7 years ago

Without restarting proxysql, can you please try again to run pt-pmp , maybe with an increased number of iteration:

pt-pmp --iterations 60 -p 12345

Thank you.

ayder commented 7 years ago

Oh the output this time is:

/usr/bin/proxysql (deleted): No such file or directory. Cannot access memory at address 0x687fa290 60 ::??

I have not deleted the binary nor changed it. [root@XXDBAPPSL01 ~]# ls -l /usr/bin/proxysql -rwxr-xr-x 1 root root 13087538 Jan 21 00:10 /usr/bin/proxysql

renecannao commented 7 years ago

@ayder : this probably mean that proxysql was upgraded. You can verify this comparing the out of these two commands:

/usr/bin/proxysql --version

mysql -u admin -padmin -h 127.0.0.1 -P6032 -e "SELECT @@version" -NB

ayder commented 7 years ago

Well versions consistent, no upgrade has done. I'll restart the aplicattion and wait again

[salyuruk@XXDBAPPSL01 ~]$ mysql --defaults-file=.proxysql.cfg -p -e "SELECT @@version" -NB Enter password: 1.3.3-0-gf2b677a [salyuruk@OSLDBAPPSL01 ~]$ proxysql --version ProxySQL version 1.3.3-0-gf2b677a, codename Truls

renecannao commented 7 years ago

This is getting really odd. Another hypothesis is that there are two proxies running. Can you please run this:

ps aux | grep proxysql

ayder commented 7 years ago

[root@XXDBAPPSL01 ~]# ps aux | grep proxysql root 19223 0.0 0.2 51260 5124 ? S Feb10 0:00 proxysql -c /etc/proxysql.cnf -D /var/lib/proxysql root 19224 16.9 1.6 172348 31804 ? Sl Feb10 1217:10 proxysql -c /etc/proxysql.cnf -D /var/lib/proxysql root 60320 0.0 0.0 103324 848 pts/1 S+ 15:18 0:00 grep proxysql

Nope

renecannao commented 7 years ago

Honestly, I am out of ideas for now. The previous output of pt-pmp doesn't show anything odd, and now /usr/bin/proxysql (deleted): No such file or directory is very odd too. Is there anything unusual reported in dmesg or syslog or proxysql log file? What about lsof -n -p 19224 ?

ayder commented 7 years ago

[root@XXDBAPPSL01 ~]# lsof -n -p 19224 COMMAND PID USER FD TYPE proxysql 19224 root cwd DIR proxysql 19224 root rtd DIR proxysql 19224 root txt REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root mem REG proxysql 19224 root 0r CHR proxysql 19224 root 1w REG proxysql 19224 root 2w REG proxysql 19224 root 3r FIFO proxysql 19224 root 4w FIFO proxysql 19224 root 5u REG proxysql 19224 root 6u REG proxysql 19224 root 7r FIFO proxysql 19224 root 8w FIFO proxysql 19224 root 9r FIFO proxysql 19224 root 10w FIFO proxysql 19224 root 11u IPv4 proxysql 19224 root 12u REG proxysql 19224 root proxysql 19224 root 14u IPv4 proxysql 19224 root proxysql 19224 root 16u IPv4 proxysql 19224 root 17u IPv4 proxysql 19224 root 18u IPv4 proxysql 19224 root 19u sock proxysql 19224 root 20u IPv4 proxysql 19224 root 21u IPv4 DEVICE SIZE/OFF NODE NAME 253,0 4096 139114 /var/lib/proxysql 253,0 4096 2 / 253,0 12982906 390056 /usr/bin/proxysql (deleted) 253,0 1969952 395943 /usr/lib64/libcrypto.so.1.0.1e 253,0 448344 392447 /usr/lib64/libssl.so.1.0.1e 253,0 161704 396570 /lib64/ld-2.12.so 253,0 1930416 398992 /lib64/libc-2.12.so 253,0 146592 399382 /lib64/libpthread-2.12.so 253,0 23088 399347 /lib64/libdl-2.12.so 253,0 47760 399391 /lib64/librt-2.12.so 253,0 600048 399989 /lib64/libm-2.12.so 253,0 91096 392580 /lib64/libz.so.1.2.3 253,0 124640 392013 /lib64/libselinux.so.1 253,0 114496 399177 /lib64/libresolv-2.12.so 253,0 93320 391716 /lib64/libgcc_s-4.4.7-20120601.so.1 253,0 989840 400216 /usr/lib64/libstdc++.so.6.0.13 253,0 17256 392093 /lib64/libcom_err.so.2.1 253,0 12592 406828 /lib64/libkeyutils.so.1.3 253,0 177520 392097 /lib64/libk5crypto.so.3.1 253,0 46368 392717 /lib64/libkrb5support.so.0.1 253,0 280520 392095 /lib64/libgssapi_krb5.so.2.2 253,0 948840 392087 /lib64/libkrb5.so.3.3 253,0 27896 391297 /lib64/libnss_dns-2.12.so 253,0 66432 391299 /lib64/libnss_files-2.12.so 1,3 0t0 4217 /dev/null 253,0 584020 129989 /var/lib/proxysql/proxysql.log 253,0 584020 129989 /var/lib/proxysql/proxysql.log 0,8 0t0 20510499 pipe 0,8 0t0 20510499 pipe 253,0 33792 139105 /var/lib/proxysql/proxysql.db 253,0 33792 139105 /var/lib/proxysql/proxysql.db 0,8 0t0 20510502 pipe 0,8 0t0 20510502 pipe 0,8 0t0 20510503 pipe 0,8 0t0 20510503 pipe 20510505 0t0 TCP *:6033 (LISTEN) 0,9 0 4213 [eventpoll] 13u unix 0xffff88007d1c4540 0t0 20510506 /tmp/proxysql.sock 20510518 0t0 TCP 127.0.0.1:6032 (LISTEN) 15u unix 0xffff88007d006800 0t0 20510519 /tmp/proxysql_admin.sock 21257635 0t0 TCP 10.35.100.197:36666->10.35.78.78:mysql (ESTABLISHED) 21274282 0t0 TCP 10.35.100.197:59106->10.35.71.23:mysql (ESTABLISHED) 21235705 0t0 TCP 10.35.100.197:6033->10.238.236.15:49559 (ESTABLISHED) 0,6 0t0 21235708 can't identify protocol 21269066 0t0 TCP 10.35.100.197:6033->10.238.236.15:56218 (ESTABLISHED) 21269069 0t0 TCP 10.35.100.197:37218->10.35.78.78:mysql (ESTABLISHED)

alejg commented 7 years ago

Hi, Today we put proxysql on our production server, and we are facing the same issue. We haven't loaded any query rules yet

> select @@version 1.3.4-0-gd158707

proxysql

select * from global_variables global.txt

renecannao commented 7 years ago

@alejg: can you please run pt-pmp as described above? Also, would be possible to generate a core dump of the running process using gcore ? Thanks

alejg commented 7 years ago

Yes, proxysql-pt-pmp.txt I'll send the core dump to an email (at sysown.com) ok?

Thanks!

renecannao commented 7 years ago

Thank you @alejg . My email is rene.cannao@gmail.com . One more question: which package are you using? Which OS?

But before sending the core... Looking the output of pt-pmp I have the feeling that proxysql is not spinning in a loop without reason, but actually serving a lot of traffic:

49 send (to backend)
7 send (to client)
4 recv (from client)
3 recv (from backend)
1 send (to backend)
1 recv (from backend)

From the screenshot I can also deduce that mysqld running on the same box is quite busy too. Therefore it is possible that the high CPU utilization you are seeing in ProxySQL is not because of a bug, but because it is really doing a lot of work. Can you please also run these commands and let me know the output?

mysql -u admin -padmin -h 127.0.0.1 -P6032 -e "SELECT * FROM stats_mysql_connection_pool; SHOW MYSQL STATUS;"
sleep 30
mysql -u admin -padmin -h 127.0.0.1 -P6032 -e "SELECT * FROM stats_mysql_connection_pool; SHOW MYSQL STATUS;"

Thanks

alejg commented 7 years ago

Thank you @renecannao!! CentOS Linux release 7.2.1511 proxysql-1.3.4-1-centos7.x86_64.rpm

You are right, is a very busy server which spends most of his time doing IO. Worked fine for several hours, now the CPU raised up the load average above 18.

Thanks!

renecannao commented 7 years ago

Continuing this thread via email. I will update the issue with further finding.

doddys commented 7 years ago

I am also having this problem intermittently. My Server is idle with

Tasks: 126 total,   1 running, 125 sleeping,   0 stopped,   0 zombie
Cpu(s): 38.9%us, 58.8%sy,  0.0%ni,  1.0%id,  0.0%wa,  0.0%hi,  0.1%si,  1.2%st
Mem:   8059196k total,  7313936k used,   745260k free,   140376k buffers
Swap:        0k total,        0k used,        0k free,   974752k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                          
30931 proxysql  20   0  318m  86m 2624 S 363.7  1.1 120:20.48 proxysql                                                                                                         
19023 root      20   0  325m  57m 4832 S 14.3  0.7  89:14.53 mysqld_exporter                                                                                                   
28096 mysql     20   0 14.1g 1.2g  33m S  9.6 15.2   5:55.45 mysqld                                                                                                            
 1372 root      20   0  343m  24m 3288 S  8.0  0.3 797:42.80 node_exporter

# sudo strace -c -p 30931
Process 30931 attached
^CProcess 30931 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 98.72    4.380782      273799        16           nanosleep
  1.28    0.056584       56584         1           restart_syscall
------ ----------- ----------- --------- --------- ----------------
100.00    4.437366                    17           total

restarting proxysql solves the issue temporarily

renecannao commented 7 years ago

@doddys , if you run strace without -f flag, strace will only check the main thread and nothing more: this is exactly what you got, a strace of the main thread that is doing nothing more but nanosleep() .

How frequently does it happen? Would it be possible to generate a core dump of the running process using gcore ? From the output above it seems it uses really little memory, so a dump should take really a small amount of time (possibly a fraction of second). Thanks

doddys commented 7 years ago

let me try to get a core dump and get back to you

On Mar 14, 2017, at 8:14 PM, René Cannaò notifications@github.com wrote:

@doddys https://github.com/doddys , if you run strace without -f flag, strace will only check the main thread and nothing more: this is exactly what you got, a strace of the main thread that is doing nothing more but nanosleep() .

How frequently does it happen? Would it be possible to generate a core dump of the running process using gcore https://sourceware.org/gdb/onlinedocs/gdb/gcore-man.html ? From the output above it seems it uses really little memory, so a dump should take really a small amount of time (possibly a fraction of second). Thanks

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/sysown/proxysql/issues/826#issuecomment-286416977, or mute the thread https://github.com/notifications/unsubscribe-auth/AQmB1O57zQ8Kl_xBoS3TxTGV022njG44ks5rlpLNgaJpZM4LGjqA.

doddys commented 7 years ago

@renecannao this is the dump that I get from

gcore -o proxysql.dump 30931

please rename to tgz and extract

proxysql.dump.30931.pdf

renecannao commented 7 years ago

@doddys : thanks a lot. To confirm, this dump is generated while CPU usage was high, or idle? It needs to be while CPU usage is high. Also, can you please let me know exactly which ProxySQL package you have installed? Thanks

doddys commented 7 years ago

@renecannao Yes, it is generated during the cpu ~300% like the top above

# rpm -qa | grep proxy
proxysql-1.3.2-1.1.x86_64
# uname -a
Linux centos-8gb-sgp1-01 2.6.32-642.6.2.el6.x86_64 #1 SMP Wed Oct 26 06:52:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

It happens a few times already but I never pay attention when and how. Let me know know if you need further info. Thanks

renecannao commented 7 years ago

Perfect! I started looking into it, but it took me some time to figure out that the version installed is from Percona repository and from GitHub. What version of CentOS are you using exactly? I tried on CentOS 6.7 but seems not the right one. Thank you

doddys commented 7 years ago

@renecannao yes it is from percona repo. I am using

# cat /etc/redhat-release 
CentOS release 6.8 (Final)

i just notice that scheduler is invoking duplicate process like below

proxysql  7864 30931  1 13:52 ?        00:00:00 /bin/bash /var/lib/proxysql/proxysql_galera_checker_v2.sh 3 2 1 1 /var/lib/proxysql/galera_hg3.log
proxysql  7866 30931  1 13:52 ?        00:00:00 /bin/bash /var/lib/proxysql/proxysql_galera_checker_v2.sh 4 2 1 1 /var/lib/proxysql/galera_hg4.log
proxysql  7949  7866  0 13:52 ?        00:00:00 /bin/bash /var/lib/proxysql/proxysql_galera_checker_v2.sh 4 2 1 1 /var/lib/proxysql/galera_hg4.log
proxysql  7953  7864  0 13:52 ?        00:00:00 /bin/bash /var/lib/proxysql/proxysql_galera_checker_v2.sh 3 2 1 1 /var/lib/proxysql/galera_hg3.log
proxysql  7955  7949  0 13:52 ?        00:00:00 /bin/bash /var/lib/proxysql/proxysql_galera_checker_v2.sh 4 2 1 1 /var/lib/proxysql/galera_hg4.log
proxysql  7956  7955  0 13:52 ?        00:00:00 timeout 10 mysql -nNE -uproxysql -h 10.130.34.56 -P 3306 -e SHOW STATUS LIKE 'wsrep_local_state'
proxysql  7957  7955  0 13:52 ?        00:00:00 tail -1
proxysql  7958  7956  5 13:52 ?        00:00:00 mysql -nNE -uproxysql -h 10.130.34.56 -P 3306 -e SHOW STATUS LIKE 'wsrep_local_state'
proxysql  7960  7953  0 13:52 ?        00:00:00 /bin/bash /var/lib/proxysql/proxysql_galera_checker_v2.sh 3 2 1 1 /var/lib/proxysql/galera_hg3.log
proxysql  7961  7960  0 13:52 ?        00:00:00 timeout 10 mysql -nNE -uproxysql -h 10.130.34.56 -P 3306 -e SHOW STATUS LIKE 'wsrep_local_state'
proxysql  7962  7960  0 13:52 ?        00:00:00 tail -1
proxysql  7963  7961  4 13:52 ?        00:00:00 mysql -nNE -uproxysql -h 10.130.34.56 -P 3306 -e SHOW STATUS LIKE 'wsrep_local_state'
root      7965 16029  0 13:52 pts/0    00:00:00 grep proxysql
proxysql 30930     1  0 11:23 ?        00:00:00 /usr/bin/proxysql -c /etc/proxysql.cnf -D /var/lib/proxysql
proxysql 30931 30930 99 11:23 ?        03:56:50 /usr/bin/proxysql -c /etc/proxysql.cnf -D /var/lib/proxysql

Is this normal?

btw for whatever reason the cpu drop back to normal now. it lasted for about 1 hour

renecannao commented 7 years ago

Ok, I will start looking into the dump using 6.8 . With regards to the duplicate processes, there is no mechanism to limit the number of processes started by scheduler. What's the output of SELECT * FROM scheduler ?

doddys commented 7 years ago

below are the ouput

mysql> select * from scheduler;
+----+--------+-------------+-------------------------------------------------+------+------+------+------+-----------------------------------------------+---------+
| id | active | interval_ms | filename                                        | arg1 | arg2 | arg3 | arg4 | arg5                                          | comment |
+----+--------+-------------+-------------------------------------------------+------+------+------+------+-----------------------------------------------+---------+
| 1  | 0      | 10000       | /usr/bin/proxysql_galera_checker                | 0    | 0    | 3    | 1    | /var/lib/proxysql/proxysql_galera_checker.log |         |
| 2  | 0      | 10000       | /usr/bin/proxysql_galera_checker                | 2    | 2    | 3    | 1    | /var/lib/proxysql/galera_hg2.log              |         |
| 3  | 1      | 10000       | /var/lib/proxysql/proxysql_galera_checker_v2.sh | 3    | 2    | 1    | 1    | /var/lib/proxysql/galera_hg3.log              |         |
| 4  | 1      | 10000       | /var/lib/proxysql/proxysql_galera_checker_v2.sh | 4    | 2    | 1    | 1    | /var/lib/proxysql/galera_hg4.log              |         |
+----+--------+-------------+-------------------------------------------------+------+------+------+------+-----------------------------------------------+---------+
4 rows in set (0.00 sec)

mysql> select * from runtime_scheduler;
+----+--------+-------------+-------------------------------------------------+------+------+------+------+-----------------------------------------------+---------+
| id | active | interval_ms | filename                                        | arg1 | arg2 | arg3 | arg4 | arg5                                          | comment |
+----+--------+-------------+-------------------------------------------------+------+------+------+------+-----------------------------------------------+---------+
| 1  | 0      | 10000       | /usr/bin/proxysql_galera_checker                | 0    | 0    | 3    | 1    | /var/lib/proxysql/proxysql_galera_checker.log |         |
| 2  | 0      | 10000       | /usr/bin/proxysql_galera_checker                | 2    | 2    | 3    | 1    | /var/lib/proxysql/galera_hg2.log              |         |
| 3  | 1      | 10000       | /var/lib/proxysql/proxysql_galera_checker_v2.sh | 3    | 2    | 1    | 1    | /var/lib/proxysql/galera_hg3.log              |         |
| 4  | 1      | 10000       | /var/lib/proxysql/proxysql_galera_checker_v2.sh | 4    | 2    | 1    | 1    | /var/lib/proxysql/galera_hg4.log              |         |
+----+--------+-------------+-------------------------------------------------+------+------+------+------+-----------------------------------------------+---------+
4 rows in set (0.00 sec)

as you can see I have 2 active schedulers, but it spawn more than 1 process per scheduler at the same time.

renecannao commented 7 years ago

maybe it is not spawning more than 1 process per scheduler at the same time. It should spawn a process ever 10 seconds, and check the backends. The fact that from ps you can see two checks against 10.130.34.56 is curious: a check should be very very fast, so if you manage to catch two checks against the same hosts maybe the checks are being slow, and a scheduled process takes longer than 10 seconds to complete, and therefore there is an overlap.

doddys commented 7 years ago

The check is fast. The script finished within 1 second. the weird thing eventhough two process running the same script, I can only see a single set of log entry

Tue Mar 14 14:22:04 UTC 2017 ###### proxysql_galera_checker.sh SUMMARY ######
Tue Mar 14 14:22:04 UTC 2017 Hostgroup writers 4
Tue Mar 14 14:22:04 UTC 2017 Hostgroup readers 2
Tue Mar 14 14:22:04 UTC 2017 Number of writers 1
Tue Mar 14 14:22:04 UTC 2017 Writers are readers 1
Tue Mar 14 14:22:04 UTC 2017 log file /var/lib/proxysql/galera_hg4.log
Tue Mar 14 14:22:05 UTC 2017 ###### HANDLE WRITER NODES ######
Tue Mar 14 14:22:05 UTC 2017 --> Checking WRITE server 4:10.130.34.56:3306, current status ONLINE, wsrep_local_state 4
Tue Mar 14 14:22:05 UTC 2017 server 4:10.130.34.56:3306 is already ONLINE: 1 of 1 write nodes
Tue Mar 14 14:22:05 UTC 2017 --> Checking WRITE server 4:10.130.33.225:3306, current status OFFLINE_SOFT, wsrep_local_state 4
Tue Mar 14 14:22:05 UTC 2017 server 4:10.130.33.225:3306 is already OFFLINE_SOFT, max write nodes reached (1)
Tue Mar 14 14:22:05 UTC 2017 --> Checking WRITE server 4:10.130.33.238:3306, current status OFFLINE_SOFT, wsrep_local_state 4
Tue Mar 14 14:22:05 UTC 2017 server 4:10.130.33.238:3306 is already OFFLINE_SOFT, max write nodes reached (1)
Tue Mar 14 14:22:05 UTC 2017 ###### HANDLE READER NODES ######
Tue Mar 14 14:22:05 UTC 2017 --> Checking READ server 2:10.130.34.56:3306, current status ONLINE, wsrep_local_state 4
Tue Mar 14 14:22:05 UTC 2017 server 2:10.130.34.56:3306 is already ONLINE
Tue Mar 14 14:22:05 UTC 2017 --> Checking READ server 2:10.130.33.225:3306, current status ONLINE, wsrep_local_state 4
Tue Mar 14 14:22:05 UTC 2017 server 2:10.130.33.225:3306 is already ONLINE
Tue Mar 14 14:22:05 UTC 2017 --> Checking READ server 2:10.130.33.238:3306, current status ONLINE, wsrep_local_state 4
Tue Mar 14 14:22:05 UTC 2017 server 2:10.130.33.238:3306 is already ONLINE
Tue Mar 14 14:22:05 UTC 2017 ###### SUMMARY ######
Tue Mar 14 14:22:05 UTC 2017 --> Number of writers that are 'ONLINE': 1 : hostgroup: 4
Tue Mar 14 14:22:05 UTC 2017 --> Number of readers that are 'ONLINE': 3 : hostgroup: 2
Tue Mar 14 14:22:05 UTC 2017 ###### Loading mysql_servers config into runtime ######

Tue Mar 14 14:22:15 UTC 2017 ###### proxysql_galera_checker.sh SUMMARY ######
Tue Mar 14 14:22:15 UTC 2017 Hostgroup writers 4
Tue Mar 14 14:22:15 UTC 2017 Hostgroup readers 2
Tue Mar 14 14:22:15 UTC 2017 Number of writers 1
Tue Mar 14 14:22:15 UTC 2017 Writers are readers 1
Tue Mar 14 14:22:15 UTC 2017 log file /var/lib/proxysql/galera_hg4.log
Tue Mar 14 14:22:15 UTC 2017 ###### HANDLE WRITER NODES ######
Tue Mar 14 14:22:15 UTC 2017 --> Checking WRITE server 4:10.130.34.56:3306, current status ONLINE, wsrep_local_state 4
Tue Mar 14 14:22:15 UTC 2017 server 4:10.130.34.56:3306 is already ONLINE: 1 of 1 write nodes
Tue Mar 14 14:22:15 UTC 2017 --> Checking WRITE server 4:10.130.33.225:3306, current status OFFLINE_SOFT, wsrep_local_state 4
Tue Mar 14 14:22:15 UTC 2017 server 4:10.130.33.225:3306 is already OFFLINE_SOFT, max write nodes reached (1)
Tue Mar 14 14:22:15 UTC 2017 --> Checking WRITE server 4:10.130.33.238:3306, current status OFFLINE_SOFT, wsrep_local_state 4
Tue Mar 14 14:22:15 UTC 2017 server 4:10.130.33.238:3306 is already OFFLINE_SOFT, max write nodes reached (1)
Tue Mar 14 14:22:15 UTC 2017 ###### HANDLE READER NODES ######
Tue Mar 14 14:22:15 UTC 2017 --> Checking READ server 2:10.130.34.56:3306, current status ONLINE, wsrep_local_state 4
Tue Mar 14 14:22:15 UTC 2017 server 2:10.130.34.56:3306 is already ONLINE
Tue Mar 14 14:22:15 UTC 2017 --> Checking READ server 2:10.130.33.225:3306, current status ONLINE, wsrep_local_state 4
Tue Mar 14 14:22:15 UTC 2017 server 2:10.130.33.225:3306 is already ONLINE
Tue Mar 14 14:22:15 UTC 2017 --> Checking READ server 2:10.130.33.238:3306, current status ONLINE, wsrep_local_state 4
Tue Mar 14 14:22:15 UTC 2017 server 2:10.130.33.238:3306 is already ONLINE
Tue Mar 14 14:22:15 UTC 2017 ###### SUMMARY ######
Tue Mar 14 14:22:15 UTC 2017 --> Number of writers that are 'ONLINE': 1 : hostgroup: 4
Tue Mar 14 14:22:15 UTC 2017 --> Number of readers that are 'ONLINE': 3 : hostgroup: 2
Tue Mar 14 14:22:15 UTC 2017 ###### Loading mysql_servers config into runtime ######

renecannao commented 7 years ago

@doddys : I couldn't yet find anything odd in the dump. It seems that the instance is indeed idle:

the uptime of this proxy was 6930 seconds
the number of queries executed is 17229
there were 412 connections

What bugs me is servers_table_version = 2544 : that should means that LOAD MYSQL SERVERS TO RUNTIME was executed 2543 times, or about once every 2.72 seconds.

Can you please share proxysql's error log? (/var/lib/proxysql/proxysql.log) I would like to dig into it.

Thank you

doddys commented 7 years ago

@renecannao I did notice that there are a lot of reload in the proxysql.log but I am not sure it was normal or not.

please rename to tgz and extract.

proxysql.log.pdf

renecannao commented 7 years ago

@doddys : there is a bug in proxysql_galera_checker.sh , because LOAD MYSQL SERVERS TO RUNTIME is always called even if no changes are performed. Speaking with @geotro few days ago, I think we have a solution for this. Calling LOAD MYSQL SERVERS TO RUNTIME has some costs, and it is possible that calling it too often has unexpected side effect. I will investigate more. Thank you for the file!

renecannao commented 7 years ago

@doddys , interesting finding! Number of LOAD MYSQL SERVERS per hour:

$ egrep '^2017-03-14 .*LOAD MYSQL SERVERS' proxysql.log | cut -b 1-13 | sort | uniq -c
 714 2017-03-14 00
 706 2017-03-14 01
 704 2017-03-14 02
 704 2017-03-14 03
 706 2017-03-14 04
 702 2017-03-14 05
 704 2017-03-14 06
 704 2017-03-14 07
 842 2017-03-14 08
1408 2017-03-14 09
1416 2017-03-14 10
1408 2017-03-14 11
1350 2017-03-14 12
1011 2017-03-14 13
 708 2017-03-14 14
 706 2017-03-14 15
 710 2017-03-14 16
 710 2017-03-14 17
 714 2017-03-14 18
 712 2017-03-14 19
 712 2017-03-14 20
 714 2017-03-14 21
 714 2017-03-14 22
  74 2017-03-14 23

From the above, it seems that something happens between 8am and 9am, and something changed again between 1pm and 2pm. This is indeed the case:

$ egrep -v '^HID' proxysql.log | egrep -v '(Received LOAD MYSQL SERVERS TO RUNTIME command|New mysql_replication_hostgroups table)' | egrep '^2017-03-14.*INFO'
2017-03-14 08:47:18 [INFO] Received save scheduler from mem command
2017-03-14 08:47:25 [INFO] Received load scheduler from mem command
2017-03-14 08:47:36 [INFO] Received save mysql servers from mem command
2017-03-14 08:47:47 [INFO] Received load mysql servers from mem command
2017-03-14 08:49:46 [INFO] Received save mysql users from mem command
2017-03-14 08:49:59 [INFO] Received load mysql users from mem command
2017-03-14 08:51:18 [INFO] Received load mysql query rules from mem command
2017-03-14 08:51:29 [INFO] Received save mysql query rules from mem command
2017-03-14 08:54:19 [INFO] Received load scheduler from mem command
2017-03-14 08:54:24 [INFO] Received save scheduler from mem command
2017-03-14 08:56:42 [INFO] Received load scheduler from mem command
2017-03-14 08:56:47 [INFO] Received save scheduler from mem command
2017-03-14 08:59:38 [INFO] Received load scheduler from mem command
2017-03-14 08:59:41 [INFO] Received save scheduler from mem command
2017-03-14 11:23:31 [INFO] Starting shutdown...
2017-03-14 11:23:32 [INFO] Shutdown completed!
2017-03-14 11:23:32 [INFO] Exiting...
2017-03-14 11:23:32 [INFO] Shutdown angel process
2017-03-14 11:23:32 [INFO] Starting ProxySQL
2017-03-14 11:23:32 [INFO] Sucessfully started
2017-03-14 11:23:32 [INFO] Angel process started ProxySQL process 30931
2017-03-14 12:51:55 [INFO] Received load scheduler from mem command
2017-03-14 12:52:08 [INFO] Received save scheduler to mem command
2017-03-14 13:52:14 [INFO] Received load scheduler from mem command
2017-03-14 13:52:19 [INFO] Received save scheduler from mem command

Scheduler entries changed:

between 8:47 and 8:59am: they probably doubled
at 12:51:55 : probably no real changes
at 13:52:14 : they probably halves

ProxySQL was restarted at 11:23:32 , that means that the core dump you sent was generated at 13:19:02 (6930 seconds after 11:23:32).

In https://github.com/sysown/proxysql/issues/826#issuecomment-286429201 you commented that :

you were seeing double processes at 13:52
the issue is now solved, after about 1 hour

This is actually very interesting, because commands related to the scheduler were executed at 12:52ish and 13:52ish : exactly one hour apart. Do you recall: a) what did you change in the scheduler at these times? b) if you run ps before of after running the commands on the scheduler?

Do you mind sharing the files /var/lib/proxysql/galera_hg*.log ? I would like to see if I can find anything else interesting during this time window.

renecannao commented 7 years ago

Looking at the number of LOAD MYSQL SERVERS per hour I actually missed a detail, and I should have checked more granularity, using:

egrep '^2017-03-14 1[1234].*LOAD MYSQL SERVERS' proxysql.log | cut -b 1-16 | sort | uniq -c

The number of LOAD MYSQL SERVERS per minute:

24 calls/min till 12:51
18 calls/min from 12:52 till 13:51
12 calls/min from 13:52

@doddys : can you please try this modified version of proxysql_galera_checker.sh ? proxysql_galera_checker-v2.sh.zip

@taylor840326 , @aydancoskun : are you also using proxysql_galera_checker.sh ?

doddys commented 7 years ago

@renecannao I did make some config changes after proxy restart at 11 ish ...

I applied the script that your provided. I notice a few things differ from previous behavior

password is pass through command which give a warning mysql: [Warning] Using a password on the command line interface can be insecure.
ONLINE writer is not flagged based on weight anymore.

# watch -n 1 'mysql --login-path=proxysql -t -e "select * from stats_mysql_connection_pool where hostgroup in (2,3,4,94,92,93) order by hostgroup,srv_host ;" -e " select hostgroup_id,hostname,status,weight,comment from mysql_servers where hostgroup_id in (2,3,4)  order by hostgroup_id,hostname ;"'

Every 1.0s: mysql --login-path=proxysql -t -e "select * from stats_mysql_connection_pool where hostgroup in (2,3,4,94,92,93) order by hostgrou...  Wed Mar 15 02:23:21 2017

+-----------+---------------+----------+--------------+----------+----------+--------+---------+---------+-----------------+-----------------+------------+
| hostgroup | srv_host      | srv_port | status       | ConnUsed | ConnFree | ConnOK | ConnERR | Queries | Bytes_data_sent | Bytes_data_recv | Latency_ms |
+-----------+---------------+----------+--------------+----------+----------+--------+---------+---------+-----------------+-----------------+------------+
| 2         | 10.130.33.225 | 3306     | ONLINE       | 0        | 0        | 0      | 0       | 0   | 0               | 0               | 205        |
| 2         | 10.130.33.238 | 3306     | ONLINE       | 0        | 0        | 0      | 0       | 0   | 0               | 0               | 1384       |
| 2         | 10.130.34.56  | 3306     | ONLINE       | 0        | 13       | 13     | 0       | 19678   | 519672          | 186768          | 447        |
| 3         | 10.130.33.225 | 3306     | OFFLINE_SOFT | 0        | 0        | 0      | 0       | 0   | 0               | 0               | 205        |
| 3         | 10.130.33.238 | 3306     | ONLINE       | 200  | 0        | 594    | 0       | 3806    | 460314          | 2670117         | 1384       |
| 3         | 10.130.34.56  | 3306     | OFFLINE_SOFT | 0        | 0        | 0      | 0       | 0   | 0               | 0               | 447        |
| 4         | 10.130.33.225 | 3306     | OFFLINE_SOFT | 0        | 0        | 0      | 0       | 0   | 0               | 0               | 205        |
| 4         | 10.130.33.238 | 3306     | OFFLINE_SOFT | 0        | 0        | 0      | 0       | 0   | 0               | 0               | 1384       |
| 4         | 10.130.34.56  | 3306     | ONLINE       | 0        | 0        | 0      | 0       | 0   | 0               | 0               | 447        |
+-----------+---------------+----------+--------------+----------+----------+--------+---------+---------+-----------------+-----------------+------------+
+--------------+---------------+--------------+------------+-----------+
| hostgroup_id | hostname      | status       | weight     | comment   |
+--------------+---------------+--------------+------------+-----------+
| 2            | 10.130.33.225 | ONLINE       | 1          | PRIMARY   |
| 2            | 10.130.33.238 | ONLINE       | 1          | BACKUP    |
| 2            | 10.130.34.56  | ONLINE       | 1000000000 | BACKUP    |
| 3            | 10.130.33.225 | ONLINE       | 1          | BACKUP    |
| 3            | 10.130.33.238 | OFFLINE_SOFT | 1000000000 | PRIMARY   |
| 3            | 10.130.34.56  | OFFLINE_SOFT | 1          | BACKUP    |
| 4            | 10.130.33.225 | ONLINE       | 100000     | SECONDARY |
| 4            | 10.130.33.238 | OFFLINE_SOFT | 1          | BACKUP    |
| 4            | 10.130.34.56  | OFFLINE_SOFT | 100000000  | PRIMARY   |
+--------------+---------------+--------------+------------+-----------+

Server logs status is proxysql.log seems to dissapear

attached is the galera_hg logs

galera_hg.pdf

anyway, if the cpu spike again, is there anything you would like me to capture?

sysown / proxysql

proxysql high cpu usage after 24 hours #826