Open icymoon opened 6 years ago
@icymoon How about your script to monitor sa_pool
or other details ? is it easy to reproduce or not ?
@ywc689 pls help to check the msg issue.
It easy to reproduce with workload. Monitors call dpip, ipvsadm command with different parameters.
每段开始出问题的时间,都是以这样的消息开头的 MSGMGR: multicast_msg_send: mc_msg(type:11, cid:0->slaves) timeout(2000 us), drop... SERVICE: dp_vs_copy_stats: send message fail. 或者 msg_send: uc_msg(type:8, cid:0->13, flags=2) timeout(2000 us), drop... NETIF: [netif_sockopt_get] msg dropped for netif sockmsg opt 502)
前者的概率更大些似乎
As discussed with @icymoon , to reproduce it:
dpip addr show
with dead loop: while [ true ]; dpip addr show; done > xxx.log
reproduced with this work load: # ./wrk -t26 -c5000 -d15m http://xxxx/ Running 15m test @ http://xxx/ 26 threads and 5000 connections Thread Stats Avg Stdev Max +/- Stdev Latency 4.01ms 48.57ms 2.00s 99.11% Req/Sec 34.96k 19.43k 155.07k 73.96% 813153656 requests in 15.00m, 583.85GB read Socket errors: connect 3332, read 0, write 0, timeout 16561 Requests/sec: 903409.67 Transfer/sec: 664.22MB
master-to-all-slaves round trip message
, i.e. the master lcore send messages to all slaves, and wait until all slaves reply with a response message if the message is blockable. So the message would time out if any lcore get slower than 2000us/2, which is very likely to happen in the stress test. Increase the msg timeout in the dpvs.conf
may helps.repeated sequence number
is a result of that dpvs
is too busy to process the frequently generated dpip
commands of the same type. We can make use of the seq
field in dpvs_msg
to solve the problem. But the root cause still is dpvs
get slower than dpip
.谢谢@ywc689 或者我只尝试读per core的计数然后在用户态加起来,可以规避这个问题?
@icymoon 你把dpvs.conf配置文件中ctrl_defs/lcore_msg/sync_msg_timeout_us这个超时时间调大一些,不让系统出现msg timeout,就应该可以拿到统计数据了。 这个问题可能是由于个别lcore处理循环中出现了卡顿或者压力过大,我们会进一步排查原因。
谢谢@ywc689 统计数据抖一下没关系,最多是图不好看,我最怕的是处理卡顿意味着有稳定性或者性能隐患。。。
Here is log, several types: multicast_msg_send: mc_msg(type:12, cid:0->slaves) timeout(2000 us), drop... SAPOOL: sa_pool_stats: mc msg send fail: msg dropped MSGMGR: multicast_msg_send: repeated sequence number for multicast msg: type 12, seq 0 SAPOOL: sa_pool_stats: mc msg send fail: invalid parameter MSGMGR: multicast_msg_send: repeated sequence number for multicast msg: type 12, seq 0 SAPOOL: sa_pool_stats: mc msg send fail: invalid parameter
NETIF: [netif_sockopt_get] msg dropped for netif sockmsg opt 502)
MSGMGR: sockopt_ctl: socket msg callback failed
MSGMGR: [sockopt_msg_send:msg#502] errcode set in sockopt msg reply: msg dropped