openark / orchestrator

MySQL replication topology management and HA
Apache License 2.0

Intermittent "invalid connection" #559

Open cezmunsta opened 6 years ago

cezmunsta commented 6 years ago

The handling of a lost database connection seems to behave unexpectedly when interacting via the client. This is an intermittent, yet recurring, issue that I have noticed:

$ sudo orchestrator-client -c stop-slave -i 192.168.56.21:3306
invalid connection

Extracts of the log show that it seems to be related to packets.go:

Jul 16 09:53:16 monitor-ci orchestrator[933]: [martini] Started GET /api/stop-replica/192.168.56.21/3306 for 127.0.0.1:56304
Jul 16 09:53:16 monitor-ci orchestrator[933]: [mysql] 2018/07/16 09:53:16 packets.go:36: unexpected EOF
Jul 16 09:53:16 monitor-ci orchestrator[933]: 2018-07-16 09:53:16 ERROR invalid connection
Jul 16 09:53:16 monitor-ci orchestrator[933]: 2018-07-16 09:53:16 ERROR invalid connection
Jul 16 09:53:16 monitor-ci orchestrator[933]: [martini] Completed 500 Internal Server Error in 60.786965ms

A quick retry will normally respond as expected:

$ sudo orchestrator-client -c stop-slave -i 192.168.56.21:3306 
192.168.56.21:3306

An extract of the log shows the absence of the ERROR invalid connection message, despite an error in connection.go:

Jul 16 09:53:19 monitor-ci orchestrator[933]: [martini] Started GET /api/stop-replica/192.168.56.21/3306 for 127.0.0.1:56310
Jul 16 09:53:19 monitor-ci orchestrator[933]: [mysql] 2018/07/16 09:53:19 connection.go:311: invalid connection
Jul 16 09:53:19 monitor-ci orchestrator[933]: 2018-07-16 09:53:19 INFO Stopped slave on 192.168.56.21:3306, Self:mysql-bin.000044:194, Exec:mysql-bin.000045:194
Jul 16 09:53:19 monitor-ci orchestrator[933]: [martini] Completed 200 OK in 70.114802ms

A TCP dump suggests that the error appears when the connection is reset; the 0x00000004 flag value in the capture below is a bare RST from the server:

# tshark -r /tmp/pcap.out -e frame.time -e ip.src -e tcp.srcport -e ip.dst -e tcp.dstport -e tcp.flags -Tfields  | egrep 004\$
"Jul 16, 2018 09:53:16.196603000 UTC"   192.168.56.21   3306    192.168.56.10   43300   0x00000004
"Jul 16, 2018 09:53:16.196642000 UTC"   192.168.56.21   3306    192.168.56.10   43300   0x00000004
"Jul 16, 2018 09:53:16.196648000 UTC"   192.168.56.21   3306    192.168.56.10   43300   0x00000004
"Jul 16, 2018 09:53:16.196726000 UTC"   192.168.56.21   3306    192.168.56.10   43300   0x00000004

The connectivity timeouts configured in MySQL are:

> show global variables where variable_name rlike "^(wait|net_read|net_write)_timeout";
net_read_timeout        30
net_write_timeout       60
wait_timeout    60

In case anything in the orchestrator configuration is relevant, here are all of the time-based settings:

$ sudo egrep 'Second|Minute|Hour|Day' /etc/orchestrator.conf.json
  "InstancePollSeconds": 10,
  "UnseenInstanceForgetHours": 24,
  "SnapshotTopologiesIntervalHours": 0,
  "InstanceBulkOperationsWaitTimeoutSeconds": 10,
  "ExpiryHostnameResolvesMinutes": 60,
  "ReasonableReplicationLagSeconds": 10,
  "ReasonableMaintenanceReplicationLagSeconds": 300,
  "CandidateInstanceExpireMinutes": 60,
  "AgentPollMinutes": 60,
  "UnseenAgentForgetHours": 6,
  "StaleSeedFailMinutes": 60,
  "FailureDetectionPeriodBlockMinutes": 60,
  "RecoveryPeriodBlockSeconds": 3600,
  "MasterFailoverLostInstancesDowntimeMinutes": 0,
  "PostponeSlaveRecoveryOnLagMinutes": 0,
dbnski commented 6 years ago

A low wait_timeout value causes the problem. While Orchestrator usually keeps its database connections busy, which prevents server-side timeouts, I found that occasionally a random connection goes unused for a longer period, and that is likely when these intermittent errors occur. My workaround was to add a new Orchestrator configuration variable that is passed to SetConnMaxLifetime(), which sets the maximum connection lifetime in the Go MySQL driver. Effectively, Go then continuously recycles its connections and no server-side timeouts can occur. However, this probably needs a better solution.
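For illustration only, a minimal sketch (not Orchestrator's actual code; the DSN and the 30-second value are made up) of telling a database/sql pool to recycle connections before the server-side wait_timeout can close them:

package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql" // MySQL driver used by Orchestrator
)

func main() {
	// DSN is illustrative only.
	db, err := sql.Open("mysql", "orchestrator:secret@tcp(127.0.0.1:3306)/")
	if err != nil {
		log.Fatal(err)
	}
	// With wait_timeout=60 on the server, recycle pooled connections well
	// before the server can drop them; the default (0) reuses them forever.
	db.SetConnMaxLifetime(30 * time.Second)
}

With the lifetime kept below wait_timeout, the pool should never hand out a connection the server has already reset.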

pythian-ciciliani commented 6 years ago

A few additional details observed during my invalid connection error investigation:

[root@myserver ~]# tshark -d tcp.port==3306,mysql -i eth0 'port 3306 and dst host 10.20.0.9' 
Running as user "root" and group "root". This could be dangerous.
Capturing on 'eth0'

........
330 98.597415229 10.60.0.3 -> 10.20.0.9 TCP 54 3306 > 58114 [RST] Seq=1 Win=0 Len=0
331 98.597436206 10.60.0.3 -> 10.20.0.9 TCP 54 3306 > 58114 [RST] Seq=1 Win=0 Len=0
332 98.597446900 10.60.0.3 -> 10.20.0.9 TCP 54 3306 > 58114 [RST] Seq=1 Win=0 Len=0
333 98.599564225 10.60.0.3 -> 10.20.0.9 MySQL 245 Response
334 98.600035296 10.60.0.3 -> 10.20.0.9 MySQL 671 Response
........
347 98.602735670 10.60.0.3 -> 10.20.0.9 MySQL 190 Response
348 98.603062052 10.60.0.3 -> 10.20.0.9 MySQL 131 Response
349 98.628864686 10.60.0.3 -> 10.20.0.9 TCP 74 3306 > 58180 [SYN, ACK] Seq=0 Ack=1 Win=28160 Len=0 MSS=1420 SACK_PERM=1 TSval=604389905 TSecr=2840752718 WS=128
350 98.629440145 10.60.0.3 -> 10.20.0.9 MySQL 151 Server Greeting proto=10 version=5.7.20-19-log
351 98.629792334 10.60.0.3 -> 10.20.0.9 TCP 66 3306 > 58180 [ACK] Seq=86 Ack=93 Win=28160 Len=0 TSval=604389906 TSecr=2840752720
352 98.629888371 10.60.0.3 -> 10.20.0.9 MySQL 77 Response OK
.......
383 107.996282468 10.60.0.3 -> 10.20.0.9 MySQL 131 Response
384 107.996421178 10.60.0.3 -> 10.20.0.9 MySQL 158 Response
385 107.997194466 10.60.0.3 -> 10.20.0.9 MySQL 190 Response
386 107.997562224 10.60.0.3 -> 10.20.0.9 MySQL 129 Response
387 108.640109964 10.60.0.3 -> 10.20.0.9 TCP 66 3306 > 58180 [FIN, ACK] Seq=3347 Ack=131 Win=28160 Len=0 TSval=604399916 TSecr=2840752721

The problem went away once the change to go/db/db.go described in #577 was implemented.

cezmunsta commented 5 years ago

I still appear to be able to trigger this issue:

$ sudo /usr/local/orchestrator/orchestrator --version
3.0.13
603ef0df49be700607c371f9fc6fa4d12615cd21

$ sudo orchestrator-client -c stop-slave -i 10.xxx.xxx.20
10.xxx.xxx.20:3306

$ sudo orchestrator-client -c start-slave -i 10.xxx.xxx.20                                                                                                                    
10.xxx.xxx.20:3306

$ sudo orchestrator-client -c topology -i 10.xxx.xxx.20
10.xxx.xxx.24:3306   [0s,ok,5.7.23-log,rw,ROW,>>]
+ 10.xxx.xxx.20:3306 [0s,ok,5.7.22-22-log,ro,ROW,>>]

$ sudo orchestrator-client -c graceful-master-takeover -i 10.xxx.xxx.20
10.xxx.xxx.20:3306

$ sudo orchestrator-client -c topology -i 10.xxx.xxx.20                                                                                                                       
10.xxx.xxx.20:3306   [0s,ok,5.7.22-22-log,rw,ROW,>>]
- 10.xxx.xxx.24:3306 [null,nonreplicating,5.7.23-log,ro,ROW,>>,downtimed]

$ sudo orchestrator-client -c end-downtime -i 10.xxx.xxx.24
10.xxx.xxx.24:3306

$ sudo orchestrator-client -c start-slave -i 10.xxx.xxx.24                                                                                                                    
invalid connection

$ db_connect 10.xxx.xxx.24 -Bse "select @@global.wait_timeout"
30

$ db_connect 10.xxx.xxx.20 -Bse "select @@global.wait_timeout"
30

$ sudo egrep -i timeout\|poll\|expire /etc/orchestrator.conf.json
  "InstancePollSeconds": 10,
  "InstanceBulkOperationsWaitTimeoutSeconds": 10,
  "CandidateInstanceExpireMinutes": 60,
  "AgentPollMinutes": 60,

$ sudo orchestrator-client -c start-slave -i 10.xxx.xxx.24                                                                                                                          
invalid connection

$ sudo fgrep MySQLConnectionLifetimeSeconds /etc/orchestrator.conf.json 
  "MySQLConnectionLifetimeSeconds": 43200,
cezmunsta commented 5 years ago

If I set the lifetime to less than the wait_timeout, I do not seem to be able to trigger the invalid connection issue, which makes sense.
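For example (value illustrative), with the servers above running wait_timeout = 30, a setting such as the following keeps connections from outliving the server-side timeout:

  "MySQLConnectionLifetimeSeconds": 25,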

The best solution here would seem to be to auto-detect wait_timeout and set the connection lifetime to just less than it, thus closing connections ahead of the server-side timeout. If MySQLConnectionLifetimeSeconds is set, then use that value to set wait_timeout = ? on the connection and decrement it to set the connection lifetime.
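A rough sketch of the auto-detect half of this, assuming a plain database/sql handle (the helper name and the one-second margin are arbitrary, not the actual fix):

package connlife

import (
	"database/sql"
	"time"
)

// setLifetimeFromWaitTimeout reads the server-side idle timeout and recycles
// pooled connections just before it could fire.
func setLifetimeFromWaitTimeout(db *sql.DB) error {
	var waitTimeout int
	if err := db.QueryRow("SELECT @@global.wait_timeout").Scan(&waitTimeout); err != nil {
		return err
	}
	lifetime := time.Duration(waitTimeout-1) * time.Second
	if lifetime <= 0 {
		lifetime = time.Second
	}
	db.SetConnMaxLifetime(lifetime)
	return nil
}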

shlomi-noach commented 5 years ago

> auto-detect wait_timeout and set the connection lifetime to just less than it

Good idea!

shlomi-noach commented 5 years ago

Please see https://github.com/github/orchestrator/pull/746