salarali opened 3 weeks ago
It could be related to the fact that, after some time, all session migrations suddenly stop.
Add `from={the backend instance}` to the PromQL in the Grafana panel to filter the migrations from this backend. E.g. `sum(increase(tiproxy_balance_migrate_total{k8s_cluster="$k8s_cluster", tidb_cluster="$tidb_cluster", instance=~"$instance", from="127.0.0.1:3080"}[1m])) by (reason)`
You can also search for `update backend` in the TiProxy log to see the status changes of backends. E.g.
[update backend] [namespace=default] [backend_addr=10.90.2.153:4000] [prev=healthy] [cur="down, err: http status 500: connect status port failed"]
indicates that the backend 10.90.2.153:4000 is currently down because the status port 10080 cannot be connected. I hope that helps.
So I see that the connections got moved away from it because of the memory factor, but after that there are no logs related to that backend at all. No `update backend`. The node is healthy as well, and its memory use is very low, so there is no reason why more connections should not be moved over to it.
Can you query the metrics that TiProxy has read? In the Grafana panel Server -> Owner, get the instance with the label `metric_reader`, then run `curl 127.0.0.1:3080/api/backend/metrics` on it. The result contains the current metrics of all backends. It may be hard to read; you can paste it here.
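If it's too noisy, a quick sketch like this can filter the output down to one backend (the address is just an example; substitute the backend you care about):

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	const backendAddr = "10.90.2.153" // example; use the backend's IP
	resp, err := http.Get("http://127.0.0.1:3080/api/backend/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	sc := bufio.NewScanner(resp.Body)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // allow long metric lines
	for sc.Scan() {
		if strings.Contains(sc.Text(), backendAddr) {
			fmt.Println(sc.Text()) // only the metrics mentioning this backend
		}
	}
}
```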
Let me check those out.
I also added logs here: https://github.com/pingcap/tiproxy/blob/main/pkg/balance/router/router_score.go#L311
```go
if balanceCount > 0 {
	// log every rebalance decision with its source, target, and reason
	router.logger.Info("backends to balance", zap.String("from", busiestBackend.Addr()), zap.String("to", idlestBackend.Addr()), zap.String("reason", reason))
}
```
And I am getting these logs there:
Aug 28 03:27:56 ip-172-30-94-236.ec2.internal tiproxy[3947488]: [2024/08/28 03:27:56.159 +00:00] [INFO] [main.nsmgr.router] [router/router_score.go:329] [backends to balance] [namespace=default] [from=172.30.72.136:5678] [to=172.30.89.188:5678] [reason=status]
Aug 28 03:27:56 ip-172-30-94-236.ec2.internal tiproxy[3947488]: [2024/08/28 03:27:56.169 +00:00] [INFO] [main.nsmgr.router] [router/router_score.go:329] [backends to balance] [namespace=default] [from=172.30.72.136:5678] [to=172.30.89.188:5678] [reason=status]
Aug 28 03:27:56 ip-172-30-94-236.ec2.internal tiproxy[3947488]: [2024/08/28 03:27:56.179 +00:00] [INFO] [main.nsmgr.router] [router/router_score.go:329] [backends to balance] [namespace=default] [from=172.30.72.136:5678] [to=172.30.89.188:5678] [reason=status]
Aug 28 03:27:56 ip-172-30-94-236.ec2.internal tiproxy[3947488]: [2024/08/28 03:27:56.189 +00:00] [INFO] [main.nsmgr.router] [router/router_score.go:329] [backends to balance] [namespace=default] [from=172.30.72.136:5678] [to=172.30.89.188:5678] [reason=status]
Aug 28 03:27:56 ip-172-30-94-236.ec2.internal tiproxy[3947488]: [2024/08/28 03:27:56.199 +00:00] [INFO] [main.nsmgr.router] [router/router_score.go:329] [backends to balance] [namespace=default] [from=172.30.72.136:5678] [to=172.30.89.188:5678] [reason=status]
Aug 28 03:27:56 ip-172-30-94-236.ec2.internal tiproxy[3947488]: [2024/08/28 03:27:56.209 +00:00] [INFO] [main.nsmgr.router] [router/router_score.go:329] [backends to balance] [namespace=default] [from=172.30.72.136:5678] [to=172.30.89.188:5678] [reason=status]
Aug 28 03:27:56 ip-172-30-94-236.ec2.internal tiproxy[3947488]: [2024/08/28 03:27:56.219 +00:00] [INFO] [main.nsmgr.router] [router/router_score.go:329] [backends to balance] [namespace=default] [from=172.30.72.136:5678] [to=172.30.89.188:5678] [reason=status]
Aug 28 03:27:56 ip-172-30-94-236.ec2.internal tiproxy[3947488]: [2024/08/28 03:27:56.229 +00:00] [INFO] [main.nsmgr.router] [router/router_score.go:329] [backends to balance] [namespace=default] [from=172.30.72.136:5678] [to=172.30.89.188:5678] [reason=status]
Aug 28 03:27:56 ip-172-30-94-236.ec2.internal tiproxy[3947488]: [2024/08/28 03:27:56.239 +00:00] [INFO] [main.nsmgr.router] [router/router_score.go:329] [backends to balance] [namespace=default] [from=172.30.72.136:5678] [to=172.30.89.188:5678] [reason=status]
172.30.72.136 is dead and gone now. I guess that is the reason balancing just stops after a while.
Please try to connect to the 4000 and 10080 ports: `curl 127.0.0.1:10080/status`.
BTW, you don't have to add logs; you can just turn the log level to debug.
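(For reference, the log level is set under the `[log]` section of the TiProxy configuration, e.g. `level = "debug"`.)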
Curl commands from the tiproxy node:
curl 172.30.72.136:1008/status
curl: (7) Failed to connect to 172.30.72.136 port 1008 after 3131 ms: Couldn't connect to server
This is the dead node that it keeps trying to move connections away from. Note that I use port 1008 instead of 10080.
Metrics from tiproxy:
I didn't see 172.30.72.136 in the metrics. Maybe the instance is unavailable.
Yeah. The server is actually not there anymore; the autoscaling removed it. But for some reason, TiProxy was unable to detect that. I see this in the TiProxy logs:
Aug 28 01:16:58 ip-172-30-94-236.ec2.internal tiproxy[3947488]: [2024/08/28 01:16:58.584 +00:00] [INFO] [main.nsmgr.observer] [observer/backend_observer.go:200] [update backend] [namespace=default] [backend_addr=172.30.72.136:5678] [prev="down, err: Get \"http://172.30.72.136:1008/status\": dial tcp 172.30.72.136:1008: connect: connection refused: connect status port failed"] [cur=healthy]
But I still see this afterward:
Aug 28 01:17:05 ip-172-30-94-236.ec2.internal tiproxy[3947488]: [2024/08/28 01:17:05.010 +00:00] [INFO] [main.nsmgr.router] [router/router_score.go:329] [backends to balance] [namespace=default] [from=172.30.76.23:5678] [to=172.30.72.136:5678] [reason=cpu]
Aug 28 01:17:05 ip-172-30-94-236.ec2.internal tiproxy[3947488]: [2024/08/28 01:17:05.019 +00:00] [INFO] [main.nsmgr.router] [router/router_score.go:329] [backends to balance] [namespace=default] [from=172.30.76.23:5678] [to=172.30.72.136:5678] [reason=cpu]
So for some reason, its list of backends didn't update properly.
The first log indicates that the instance was unhealthy but is now healthy: `prev` indicates the previous status and `cur` indicates the current status. I guess there should be another `update backend` afterward saying that it's down again.
Right. You are correct. Sorry.
Aug 28 03:08:07 ip-172-30-94-236.ec2.internal tiproxy[3947488]: [2024/08/28 03:08:07.288 +00:00] [INFO] [main.nsmgr.observer] [observer/backend_observer.go:215] [update backend] [namespace=default] [backend_addr=172.30.72.136:5678] [prev=healthy] [cur="down, err: http status 500: connect status port failed"]
Aug 28 03:08:07 ip-172-30-94-236.ec2.internal tiproxy[3947488]: [2024/08/28 03:08:07.289 +00:00] [INFO] [main.nsmgr.router] [router/router_score.go:329] [backends to balance] [namespace=default] [from=172.30.72.136:5678] [to=172.30.77.5:5678] [reason=status]
Aug 28 03:08:07 ip-172-30-94-236.ec2.internal tiproxy[3947488]: [2024/08/28 03:08:07.299 +00:00] [INFO] [main.nsmgr.router] [router/router_score.go:329] [backends to balance] [namespace=default] [from=172.30.72.136:5678] [to=172.30.81.38:5678] [reason=status]
Aug 28 03:08:07 ip-172-30-94-236.ec2.internal tiproxy[3947488]: [2024/08/28 03:08:07.309 +00:00] [INFO] [main.nsmgr.router] [router/router_score.go:329] [backends to balance] [namespace=default] [from=172.30.72.136:5678] [to=172.30.81.38:5678] [reason=status]
Aug 28 03:08:07 ip-172-30-94-236.ec2.internal tiproxy[3947488]: [2024/08/28 03:08:07.319 +00:00] [INFO] [main.nsmgr.router] [router/router_score.go:329] [backends to balance] [namespace=default] [from=172.30.72.136:5678] [to=172.30.81.38:5678] [reason=status]
After this failure, the reason did change to `status`. If there are no more problems, I'll close this issue.
So, the issue is still there. Even though TiProxy detected that the backend is down, it's still trying to route from it. My naive guess is that this did not get run: https://github.com/pingcap/tiproxy/blob/main/pkg/balance/router/router_score.go#L298
I'm still testing this out to see if it's actually the issue, but one possible problem I can see is here: https://github.com/pingcap/tiproxy/blob/main/pkg/balance/router/router_score.go#L277-L283
As far as I can see, the backend won't be removed if there are still connections associated with it. Is that a possibility? I will let you know if I confirm this hypothesis.
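To spell the hypothesis out, here is a minimal runnable sketch of how I read that condition (my own types and names, not the actual TiProxy code):

```go
package main

import "fmt"

// Illustrative types only; the real router keeps richer state.
type backend struct {
	healthy   bool
	connScore int
	connCount int // stands in for connList.Len()
}

// removeIfEmpty mirrors the condition I think I'm seeing around
// router_score.go#L277-L283: an unhealthy backend is only dropped once
// nothing references it, so a dead backend that still holds connections
// would stay in the routing table indefinitely.
func removeIfEmpty(backends map[string]*backend) {
	for addr, b := range backends {
		if !b.healthy && b.connCount == 0 && b.connScore <= 0 {
			delete(backends, addr)
		}
	}
}

func main() {
	backends := map[string]*backend{
		"172.30.72.136:5678": {healthy: false, connScore: -3, connCount: 5},
	}
	removeIfEmpty(backends)
	fmt.Println(len(backends)) // still 1: the dead backend was not removed
}
```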
It seems like it is possible. I put some extra logs in there and I see this:
Aug 28 04:55:09 ip-172-30-94-236.ec2.internal tiproxy[4064141]: [2024/08/28 04:55:09.214 +00:00] [INFO] [main.nsmgr.router] [router/router_score.go:291] [current backends] [namespace=default] [addr=172.30.77.5:5678] [healthy=false] [connScore=-3] [connListLen=5]
I put this log:
```go
for addr, backend := range router.backends {
	// dump each backend's health and connection accounting
	router.logger.Info("current backends", zap.String("addr", addr), zap.Bool("healthy", backend.Healthy()), zap.Int("connScore", backend.connScore), zap.Int("connListLen", backend.connList.Len()))
}
```
in this function: https://github.com/pingcap/tiproxy/blob/main/pkg/balance/router/router_score.go#L250
> Even though TiProxy detected that the backend is down, it's still trying to route from it.

That's expected: TiProxy should migrate connections away from the unhealthy backend. Why shouldn't it? The problem should be that the connection migrations never succeed. If so, the errors should be logged as `redirect connection failed`.
> [connScore=-3] [connListLen=5]

This is unexpected. The `connScore` should always be >= 0: connScore = connections on it + connections migrating to it - connections migrating from it, and the connections migrating from it should never outnumber the connections on it.
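For example, even if the backend with connListLen=5 had no inbound migrations, reaching connScore=-3 would require 8 outbound migrations from a backend that only holds 5 connections, which should be impossible if the accounting is correct.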
Now that you can reproduce the problem, can you help to locate the reason?
Right, I am getting those errors.
Aug 28 10:19:30 ip-172-30-94-236.ec2.internal tiproxy[4092693]: [2024/08/28 10:19:30.290 +00:00] [WARN] [main.proxy.conn.be] [backend/backend_conn_mgr.go:598] [redirect connection failed] [connID=1237819] [client_addr=172.30.72.32:34772] [addr=0.0.0.0:5678] [ns=default] [from=172.30.78.14:5678] [to=172.30.87.124:5678] [redirect_err="this is an error from the backend connection: failed to read the connection: EOF"] [redirect_errVerbose="this is an error from the backend connection: failed to read the connection: EOF\ngithub.com/pingcap/tiproxy/pkg/proxy/net.(*basicReadWriter).Read\n\t/Users/salar/repos/tiproxy/pkg/proxy/net/packetio.go:105\ngithub.com/pingcap/tiproxy/pkg/proxy/net.ReadFull\n\t/Users/salar/repos/tiproxy/pkg/proxy/net/packetio.go:183\ngithub.com/pingcap/tiproxy/pkg/proxy/net.(*PacketIO).readOnePacket\n\t/Users/salar/repos/tiproxy/pkg/proxy/net/packetio.go:247\ngithub.com/pingcap/tiproxy/pkg/proxy/net.(*PacketIO).ReadPacket\n\t/Users/salar/repos/tiproxy/pkg/proxy/net/packetio.go:270\ngithub.com/pingcap/tiproxy/pkg/proxy/backend.(*CmdProcessor).query\n\t/Users/salar/repos/tiproxy/pkg/proxy/backend/cmd_processor_query.go:30\ngithub.com/pingcap/tiproxy/pkg/proxy/backend.(*BackendConnManager).querySessionStates\n\t/Users/salar/repos/tiproxy/pkg/proxy/backend/backend_conn_mgr.go:431\ngithub.com/pingcap/tiproxy/pkg/proxy/backend.(*BackendConnManager).tryRedirect\n\t/Users/salar/repos/tiproxy/pkg/proxy/backend/backend_conn_mgr.go:510\ngithub.com/pingcap/tiproxy/pkg/proxy/backend.(*BackendConnManager).processSignals.func1\n\t/Users/salar/repos/tiproxy/pkg/proxy/backend/backend_conn_mgr.go:458\ngithub.com/pingcap/tiproxy/pkg/proxy/backend.(*BackendConnManager).processSignals\n\t/Users/salar/repos/tiproxy/pkg/proxy/backend/backend_conn_mgr.go:460\ngithub.com/pingcap/tiproxy/pkg/proxy/backend.(*BackendConnManager).Connect.func1\n\t/Users/salar/repos/tiproxy/pkg/proxy/backend/backend_conn_mgr.go:214\ngithub.com/pingcap/tiproxy/lib/util/waitgroup.(*WaitGroup).RunWithRecover.func1\n\t/Users/salar/repos/tiproxy/lib/util/waitgroup/waitgroup.go:38\nruntime.goexit\n\t/opt/homebrew/Cellar/go/1.22.4/libexec/src/runtime/asm_arm64.s:1222"]
Aug 28 10:19:30 ip-172-30-94-236.ec2.internal tiproxy[4092693]: [2024/08/28 10:19:30.343 +00:00] [WARN] [main.proxy.conn.be] [backend/backend_conn_mgr.go:598] [redirect connection failed] [connID=1239985] [client_addr=172.30.72.32:36640] [addr=0.0.0.0:5678] [ns=default] [from=172.30.78.14:5678] [to=172.30.87.124:5678] [redirect_err="this is an error from the backend connection: failed to read the connection: EOF"] [redirect_errVerbose="this is an error from the backend connection: failed to read the connection: EOF\ngithub.com/pingcap/tiproxy/pkg/proxy/net.(*basicReadWriter).Read\n\t/Users/salar/repos/tiproxy/pkg/proxy/net/packetio.go:105\ngithub.com/pingcap/tiproxy/pkg/proxy/net.ReadFull\n\t/Users/salar/repos/tiproxy/pkg/proxy/net/packetio.go:183\ngithub.com/pingcap/tiproxy/pkg/proxy/net.(*PacketIO).readOnePacket\n\t/Users/salar/repos/tiproxy/pkg/proxy/net/packetio.go:247\ngithub.com/pingcap/tiproxy/pkg/proxy/net.(*PacketIO).ReadPacket\n\t/Users/salar/repos/tiproxy/pkg/proxy/net/packetio.go:270\ngithub.com/pingcap/tiproxy/pkg/proxy/backend.(*CmdProcessor).query\n\t/Users/salar/repos/tiproxy/pkg/proxy/backend/cmd_processor_query.go:30\ngithub.com/pingcap/tiproxy/pkg/proxy/backend.(*BackendConnManager).querySessionStates\n\t/Users/salar/repos/tiproxy/pkg/proxy/backend/backend_conn_mgr.go:431\ngithub.com/pingcap/tiproxy/pkg/proxy/backend.(*BackendConnManager).tryRedirect\n\t/Users/salar/repos/tiproxy/pkg/proxy/backend/backend_conn_mgr.go:510\ngithub.com/pingcap/tiproxy/pkg/proxy/backend.(*BackendConnManager).processSignals.func1\n\t/Users/salar/repos/tiproxy/pkg/proxy/backend/backend_conn_mgr.go:458\ngithub.com/pingcap/tiproxy/pkg/proxy/backend.(*BackendConnManager).processSignals\n\t/Users/salar/repos/tiproxy/pkg/proxy/backend/backend_conn_mgr.go:460\ngithub.com/pingcap/tiproxy/pkg/proxy/backend.(*BackendConnManager).Connect.func1\n\t/Users/salar/repos/tiproxy/pkg/proxy/backend/backend_conn_mgr.go:214\ngithub.com/pingcap/tiproxy/lib/util/waitgroup.(*WaitGroup).RunWithRecover.func1\n\t/Users/salar/repos/tiproxy/lib/util/waitgroup/waitgroup.go:38\nruntime.goexit\n\t/opt/homebrew/Cellar/go/1.22.4/libexec/src/runtime/asm_arm64.s:1222"]
I will debug more and try to figure out the reason.
These errors should not be related to the problem. When the query fails, the connections are disconnected and then the router clears them. https://github.com/pingcap/tiproxy/blob/v1.2.0/pkg/proxy/backend/backend_conn_mgr.go#L513
One thing I wanted to run by you: the reason connScore goes negative might be this: https://github.com/pingcap/tiproxy/blob/main/pkg/balance/router/router_score.go#L385
It decrements and increments the score, but the redirect can fail: https://github.com/pingcap/tiproxy/blob/main/pkg/proxy/backend/backend_conn_mgr.go#L595
With a fix for that, connScore is never negative for me, but I am still seeing connListLen greater than 0 for bad nodes, so I am still investigating that.
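Here is a minimal sketch of what I mean, including the rollback I added (the names are mine, not the real code):

```go
package main

import (
	"errors"
	"fmt"
)

type backend struct {
	connScore int
}

// redirect sketches the suspected accounting bug: connScore is moved from
// `from` to `to` up front, but if the redirect itself fails and nothing
// rolls the scores back, repeated failures drive the source backend's
// connScore below zero.
func redirect(from, to *backend, do func() error) {
	from.connScore--
	to.connScore++
	if err := do(); err != nil {
		// the fix I tried: restore the scores when the redirect fails
		from.connScore++
		to.connScore--
	}
}

func main() {
	from := &backend{connScore: 2}
	to := &backend{}
	fail := func() error { return errors.New("failed to read the connection: EOF") }
	for i := 0; i < 5; i++ {
		redirect(from, to, fail)
	}
	fmt.Println(from.connScore) // stays 2 with the rollback; -3 without it
}
```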
Yes, that's possible; it should be handled. Appreciate that.
Bug Report
I sometimes have an issue where one node fails to be balanced properly. It happens randomly, and I see no errors in the logs. The connection count on that node remains very low for a long time. How should I go about debugging this? I checked the server and it is up and running.
2. What did you expect to see? (Required)
3. What did you see instead? (Required)
4. What is your version? (Required)
1.2.0