cezmunsta opened this issue 6 years ago
A low wait_timeout value causes the problem. While Orchestrator uses its database connections often enough to prevent most server-side timeouts, I found that occasionally a random connection goes unused for a longer period, which is likely when these intermittent problems occur. My workaround was to add a new configuration variable to Orchestrator that is passed to SetConnMaxLifetime(), which sets the maximum connection lifetime in the Go MySQL driver. Effectively, Go then recycles its connections continuously and no server-side timeouts can occur. However, this probably needs a better solution.
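For illustration, a minimal sketch of that workaround using database/sql, assuming a configuration value along the lines of MySQLConnectionLifetimeSeconds has already been loaded; the surrounding plumbing is invented for the example and is not Orchestrator's actual code:

```go
package db

import (
	"database/sql"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

// mysqlConnectionLifetimeSeconds stands in for a loaded configuration value
// such as MySQLConnectionLifetimeSeconds; illustrative only.
var mysqlConnectionLifetimeSeconds = 25

func openBackendDB(dsn string) (*sql.DB, error) {
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		return nil, err
	}
	if mysqlConnectionLifetimeSeconds > 0 {
		// Recycle pooled connections before the server-side wait_timeout can
		// close them, so the driver never hands out a dead connection.
		db.SetConnMaxLifetime(time.Duration(mysqlConnectionLifetimeSeconds) * time.Second)
	}
	return db, nil
}
```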
A few additional details observed during my "invalid connection" error investigation:

A "Got an error reading communication packets" error is logged, coming from the RAFT leader, every time an "invalid connection" message is displayed in the Orchestrator GUI.

[root@myserver ~]# tshark -d tcp.port==3306,mysql -i eth0 'port 3306 and dst host 10.20.0.9'
Running as user "root" and group "root". This could be dangerous.
Capturing on 'eth0'
........
330 98.597415229 10.60.0.3 -> 10.20.0.9 TCP 54 3306 > 58114 [RST] Seq=1 Win=0 Len=0
331 98.597436206 10.60.0.3 -> 10.20.0.9 TCP 54 3306 > 58114 [RST] Seq=1 Win=0 Len=0
332 98.597446900 10.60.0.3 -> 10.20.0.9 TCP 54 3306 > 58114 [RST] Seq=1 Win=0 Len=0
333 98.599564225 10.60.0.3 -> 10.20.0.9 MySQL 245 Response
334 98.600035296 10.60.0.3 -> 10.20.0.9 MySQL 671 Response
........
347 98.602735670 10.60.0.3 -> 10.20.0.9 MySQL 190 Response
348 98.603062052 10.60.0.3 -> 10.20.0.9 MySQL 131 Response
349 98.628864686 10.60.0.3 -> 10.20.0.9 TCP 74 3306 > 58180 [SYN, ACK] Seq=0 Ack=1 Win=28160 Len=0 MSS=1420 SACK_PERM=1 TSval=604389905 TSecr=2840752718 WS=128
350 98.629440145 10.60.0.3 -> 10.20.0.9 MySQL 151 Server Greeting proto=10 version=5.7.20-19-log
351 98.629792334 10.60.0.3 -> 10.20.0.9 TCP 66 3306 > 58180 [ACK] Seq=86 Ack=93 Win=28160 Len=0 TSval=604389906 TSecr=2840752720
352 98.629888371 10.60.0.3 -> 10.20.0.9 MySQL 77 Response OK
.......
383 107.996282468 10.60.0.3 -> 10.20.0.9 MySQL 131 Response
384 107.996421178 10.60.0.3 -> 10.20.0.9 MySQL 158 Response
385 107.997194466 10.60.0.3 -> 10.20.0.9 MySQL 190 Response
386 107.997562224 10.60.0.3 -> 10.20.0.9 MySQL 129 Response
387 108.640109964 10.60.0.3 -> 10.20.0.9 TCP 66 3306 > 58180 [FIN, ACK] Seq=3347 Ack=131 Win=28160 Len=0 TSval=604399916 TSecr=2840752721
The problem went away when the change in go/db/db.go described in #577 was implemented.
I still appear to be able to trigger this issue:
$ sudo /usr/local/orchestrator/orchestrator --version
3.0.13
603ef0df49be700607c371f9fc6fa4d12615cd21
$ sudo orchestrator-client -c stop-slave -i 10.xxx.xxx.20
10.xxx.xxx.20:3306
$ sudo orchestrator-client -c start-slave -i 10.xxx.xxx.20
10.xxx.xxx.20:3306
$ sudo orchestrator-client -c topology -i 10.xxx.xxx.20
10.xxx.xxx.24:3306 [0s,ok,5.7.23-log,rw,ROW,>>]
+ 10.xxx.xxx.20:3306 [0s,ok,5.7.22-22-log,ro,ROW,>>]
$ sudo orchestrator-client -c graceful-master-takeover -i 10.xxx.xxx.20
10.xxx.xxx.20:3306
$ sudo orchestrator-client -c topology -i 10.xxx.xxx.20
10.xxx.xxx.20:3306 [0s,ok,5.7.22-22-log,rw,ROW,>>]
- 10.xxx.xxx.24:3306 [null,nonreplicating,5.7.23-log,ro,ROW,>>,downtimed]
$ sudo orchestrator-client -c end-downtime -i 10.xxx.xxx.24
10.xxx.xxx.24:3306
$ sudo orchestrator-client -c start-slave -i 10.xxx.xxx.24
invalid connection
$ db_connect 10.xxx.xxx.24 -Bse "select @@global.wait_timeout"
30
$ db_connect 10.xxx.xxx.20 -Bse "select @@global.wait_timeout"
30
$ sudo egrep -i timeout\|poll\|expire /etc/orchestrator.conf.json
"InstancePollSeconds": 10,
"InstanceBulkOperationsWaitTimeoutSeconds": 10,
"CandidateInstanceExpireMinutes": 60,
"AgentPollMinutes": 60,
$ sudo orchestrator-client -c start-slave -i 10.xxx.xxx.24
invalid connection
$ sudo fgrep MySQLConnectionLifetimeSeconds /etc/orchestrator.conf.json
"MySQLConnectionLifetimeSeconds": 43200,
If I set the lifetime to less than the wait_timeout, I do not seem to be able to trigger the "invalid connection" issue, which makes sense.
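For instance, with wait_timeout at 30 on these hosts, dropping the lifetime below it in /etc/orchestrator.conf.json (25 here is just an example value, not a recommendation) appears to avoid handing out stale connections:

```json
"MySQLConnectionLifetimeSeconds": 25,
```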
The best solution here would seem to be to auto-detect the wait_timeout and set the connection lifetime to just less than it, thus closing connections ahead of the server-side timeouts. If MySQLConnectionLifetimeSeconds is set, then use that value for SET wait_timeout = ? on the connection and decrement it to set the connection lifetime.
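A rough sketch of the auto-detection half of that suggestion, assuming a freshly opened *sql.DB; the helper name and the precedence handling are illustrative, not existing Orchestrator code. Pushing the configured value back to the server via SET wait_timeout is left out, since it would have to be applied to every pooled connection (for example as a session variable in the DSN):

```go
package db

import (
	"database/sql"
	"time"
)

// applyConnMaxLifetime detects the backend's wait_timeout and keeps pooled
// connections slightly shorter-lived than it, so the server never closes a
// connection that the pool still considers usable.
func applyConnMaxLifetime(sqlDB *sql.DB, configuredLifetimeSeconds int) error {
	var waitTimeout int
	if err := sqlDB.QueryRow("SELECT @@global.wait_timeout").Scan(&waitTimeout); err != nil {
		return err
	}
	lifetime := waitTimeout - 1
	// An explicit MySQLConnectionLifetimeSeconds lower than the detected
	// wait_timeout still takes precedence.
	if configuredLifetimeSeconds > 0 && configuredLifetimeSeconds < lifetime {
		lifetime = configuredLifetimeSeconds
	}
	// Recycle connections just ahead of the server-side timeout.
	sqlDB.SetConnMaxLifetime(time.Duration(lifetime) * time.Second)
	return nil
}
```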
> auto-detect the wait_timeout and set the connection lifetime to just less than it
Good idea!
The handling of the loss of a database connection seems to behave unexpectedly whilst interacting via the client. This is an intermittent, yet persistent issue that I have noticed:
Extracts of the log show that it seems to be related to packets.go:

A quick retry will normally respond as expected:

An extract of the log shows the absence of the ERROR invalid connection message, despite an error in connection.go.
Performing a TCP dump seems to imply that when the connection is reset, the error appears:
The connectivity timeouts configured in MySQL are:
In case there is anything related to the orchestrator configuration, here are all of the time-based settings: