Open askfongjojo opened 4 days ago
Intra-VPC single-thread iperf3 throughput and number of retransmits are still on par with the previous build on rack2. Another disk I/O SQL server workload (with sysbench as load generator co-located with the database in the same VM) also hasn't shown perf degradation.
I also checked the TCP socket queues on the load generator and the MongoDB primary. On the loadgen, the Send-Q for each thread's connection stays between 0 and 1400 bytes and hasn't grown over time, whereas the DB primary's queues are near zero:
ubuntu@loadgen:/opt/local/ycsb/ycsb-mongodb-binding-0.17.0$ netstat -an | egrep 'Address|ESTABLISH'
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 36 172.30.0.6:22 172.20.17.42:51281 ESTABLISHED
tcp6 0 1393 172.30.0.6:51728 172.30.0.7:27017 ESTABLISHED
tcp6 0 0 172.30.0.6:51712 172.30.0.7:27017 ESTABLISHED
tcp6 0 1393 172.30.0.6:51724 172.30.0.7:27017 ESTABLISHED
ubuntu@primary:~$ netstat -an | egrep 'Address|ESTABLISH'
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 127.0.1.1:27017 127.0.0.1:54468 ESTABLISHED
tcp 0 0 172.30.0.7:27017 172.30.0.6:42206 ESTABLISHED
tcp 0 0 172.30.0.7:27017 172.30.0.51:59938 ESTABLISHED
tcp 0 0 172.30.0.7:43630 172.30.0.50:27017 ESTABLISHED
tcp 0 0 172.30.0.7:27017 172.30.0.50:50604 ESTABLISHED
tcp 0 0 172.30.0.7:27017 172.30.0.50:50590 ESTABLISHED
tcp 0 0 172.30.0.7:27017 172.30.0.51:59916 ESTABLISHED
tcp 0 0 172.30.0.7:27017 172.30.0.50:50566 ESTABLISHED
tcp 0 0 172.30.0.7:44978 172.30.0.51:27017 ESTABLISHED
tcp 0 0 172.30.0.7:36776 172.30.0.50:27017 ESTABLISHED
tcp 0 0 127.0.1.1:27017 127.0.0.1:54452 ESTABLISHED
tcp 0 0 172.30.0.7:27017 172.30.0.51:59922 ESTABLISHED
tcp 0 0 172.30.0.7:27017 172.30.0.50:50558 ESTABLISHED
tcp 0 0 127.0.0.1:54468 127.0.1.1:27017 ESTABLISHED
tcp 0 0 127.0.0.1:40132 127.0.1.1:27017 ESTABLISHED
tcp 0 36 172.30.0.7:22 172.20.17.42:52164 ESTABLISHED
tcp 0 0 172.30.0.7:27017 172.30.0.51:59910 ESTABLISHED
tcp 0 0 127.0.0.1:54452 127.0.1.1:27017 ESTABLISHED
tcp 0 0 172.30.0.7:44992 172.30.0.51:27017 ESTABLISHED
tcp 0 0 172.30.0.7:43638 172.30.0.50:27017 ESTABLISHED
tcp 0 0 172.30.0.7:43618 172.30.0.50:27017 ESTABLISHED
tcp 0 0 172.30.0.7:27017 172.30.0.6:42204 ESTABLISHED
tcp 0 0 172.30.0.7:27017 172.30.0.6:42216 ESTABLISHED
tcp 0 0 172.30.0.7:27017 172.30.0.50:50620 ESTABLISHED
tcp 0 0 172.30.0.7:27017 172.30.0.51:50834 ESTABLISHED
tcp 0 0 172.30.0.7:27017 172.30.0.50:50582 ESTABLISHED
tcp 0 0 172.30.0.7:45008 172.30.0.51:27017 ESTABLISHED
tcp 0 0 172.30.0.7:27017 172.30.0.50:50586 ESTABLISHED
tcp 0 0 172.30.0.7:27017 172.30.0.51:59896 ESTABLISHED
tcp 0 0 172.30.0.7:45022 172.30.0.51:27017 ESTABLISHED
tcp 0 0 127.0.1.1:27017 127.0.0.1:40132 ESTABLISHED
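To track the loadgen backlog over time instead of eyeballing it, the `netstat` output can be reduced to the Send-Q values (reported in bytes for established sockets) of the connections to the MongoDB port. This is a minimal sketch, not part of the original investigation: the helper names `parse_netstat` and `send_q_to_port` are made up for illustration, and the sample rows are copied from the loadgen output above.

```python
from typing import List, NamedTuple


class Conn(NamedTuple):
    proto: str
    recv_q: int
    send_q: int
    local: str
    remote: str


def parse_netstat(text: str) -> List[Conn]:
    """Parse `netstat -an` rows for established TCP connections."""
    conns = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have six columns; the header splits into eight and is skipped.
        if len(fields) != 6 or fields[5] != "ESTABLISHED":
            continue
        conns.append(Conn(fields[0], int(fields[1]), int(fields[2]), fields[3], fields[4]))
    return conns


def send_q_to_port(conns: List[Conn], port: int) -> List[int]:
    """Send-Q values of connections whose remote port matches `port`."""
    return [c.send_q for c in conns if c.remote.rsplit(":", 1)[1] == str(port)]


# Sample rows copied from the loadgen output above.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 36 172.30.0.6:22 172.20.17.42:51281 ESTABLISHED
tcp6 0 1393 172.30.0.6:51728 172.30.0.7:27017 ESTABLISHED
tcp6 0 0 172.30.0.6:51712 172.30.0.7:27017 ESTABLISHED
tcp6 0 1393 172.30.0.6:51724 172.30.0.7:27017 ESTABLISHED
"""

conns = parse_netstat(SAMPLE)
mongo_send_q = send_q_to_port(conns, 27017)
print(mongo_send_q, max(mongo_send_q))  # [1393, 0, 1393] 1393
```

Running this on periodic snapshots would show whether the per-connection backlog is growing; in the snapshot above it tops out at 1393 bytes.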
One thing of interest is that the READ and UPDATE latency/IOPS of the MongoDB workload are still on par with the previous runs. INSERT is the only type of transaction that has degraded. I think crucible is mostly cleared as the source of the issue, so I'm moving this issue into omicron.
Also dumping the opte stats here, although I haven't observed anything out of the ordinary.
BRM42220031 # opteadm dump-layer -p opte7 nat
Port opte7 - Layer nat
======================================================================
Inbound Flows
----------------------------------------------------------------------
PROTO SRC IP SPORT DST IP DPORT HITS ACTION
Outbound Flows
----------------------------------------------------------------------
PROTO SRC IP SPORT DST IP DPORT HITS ACTION
Inbound Rules
----------------------------------------------------------------------
ID PRI HITS PREDICATES ACTION
6 10 13 inner.ip.dst=172.20.26.17 "Stateful: 172.30.0.6 <=> (external)"
DEF -- 414 -- "allow"
Outbound Rules
----------------------------------------------------------------------
ID PRI HITS PREDICATES ACTION
18 10 34 inner.ether.ether_type=IPv4 "Stateful: 172.30.0.6 <=> 172.20.26.17"
meta: router-target=ig=a4361b0b-b461-4674-a9ea-80296755f302
19 100 0 inner.ether.ether_type=IPv4 "Stateful: 172.20.26.13:16384-32767"
meta: router-target=ig=a4361b0b-b461-4674-a9ea-80296755f302
20 255 0 meta: router-target-class=ig "Deny"
DEF -- 477 -- "allow"
BRM42220031 # opteadm dump-layer -p opte7 router
Port opte7 - Layer router
======================================================================
Inbound Flows
----------------------------------------------------------------------
PROTO SRC IP SPORT DST IP DPORT HITS ACTION
Outbound Flows
----------------------------------------------------------------------
PROTO SRC IP SPORT DST IP DPORT HITS ACTION
Inbound Rules
----------------------------------------------------------------------
ID PRI HITS PREDICATES ACTION
DEF -- 534 -- "allow"
Outbound Rules
----------------------------------------------------------------------
ID PRI HITS PREDICATES ACTION
4 27 0 inner.ip.dst=192.168.96.0/24 "Meta: Target = Subnet: 192.168.96.0/24"
2 27 0 inner.ip.dst=192.168.64.0/24 "Meta: Target = Subnet: 192.168.64.0/24"
0 27 0 inner.ip.dst=192.168.32.0/24 "Meta: Target = Subnet: 192.168.32.0/24"
3 31 477 inner.ip.dst=172.30.0.0/22 "Meta: Target = Subnet: 172.30.0.0/22"
5 75 120 inner.ip.dst=0.0.0.0/0 "Meta: Target = IG(Some(a4361b0b-b461-4674-a9ea-80296755f302))"
9 139 0 inner.ip6.dst=fddb:bb4e:6c24:62cc::/64 "Meta: Target = Subnet: fddb:bb4e:6c24:62cc::/64"
7 139 0 inner.ip6.dst=fddb:bb4e:6c24:7631::/64 "Meta: Target = Subnet: fddb:bb4e:6c24:7631::/64"
6 139 0 inner.ip6.dst=fddb:bb4e:6c24:168c::/64 "Meta: Target = Subnet: fddb:bb4e:6c24:168c::/64"
1 139 0 inner.ip6.dst=fddb:bb4e:6c24::/64 "Meta: Target = Subnet: fddb:bb4e:6c24::/64"
8 267 0 inner.ip6.dst=::/0 "Meta: Target = IG(Some(a4361b0b-b461-4674-a9ea-80296755f302))"
DEF -- 0 -- "deny"
BRM42220017 # opteadm dump-layer -p opte8 nat
Port opte8 - Layer nat
======================================================================
Inbound Flows
----------------------------------------------------------------------
PROTO SRC IP SPORT DST IP DPORT HITS ACTION
Outbound Flows
----------------------------------------------------------------------
PROTO SRC IP SPORT DST IP DPORT HITS ACTION
Inbound Rules
----------------------------------------------------------------------
ID PRI HITS PREDICATES ACTION
13 10 1 inner.ip.dst=172.20.26.195 "Stateful: 172.30.0.7 <=> (external)"
DEF -- 12221 -- "allow"
Outbound Rules
----------------------------------------------------------------------
ID PRI HITS PREDICATES ACTION
39 10 10 inner.ether.ether_type=IPv4 "Stateful: 172.30.0.7 <=> 172.20.26.195"
meta: router-target=ig=a4361b0b-b461-4674-a9ea-80296755f302
40 100 0 inner.ether.ether_type=IPv4 "Stateful: 172.20.26.192:32768-49151"
meta: router-target=ig=a4361b0b-b461-4674-a9ea-80296755f302
41 255 0 meta: router-target-class=ig "Deny"
DEF -- 12237 -- "allow"
BRM42220017 # opteadm dump-layer -p opte8 router
Port opte8 - Layer router
======================================================================
Inbound Flows
----------------------------------------------------------------------
PROTO SRC IP SPORT DST IP DPORT HITS ACTION
Outbound Flows
----------------------------------------------------------------------
PROTO SRC IP SPORT DST IP DPORT HITS ACTION
Inbound Rules
----------------------------------------------------------------------
ID PRI HITS PREDICATES ACTION
DEF -- 12362 -- "allow"
Outbound Rules
----------------------------------------------------------------------
ID PRI HITS PREDICATES ACTION
8 27 0 inner.ip.dst=192.168.96.0/24 "Meta: Target = Subnet: 192.168.96.0/24"
4 27 0 inner.ip.dst=192.168.64.0/24 "Meta: Target = Subnet: 192.168.64.0/24"
2 27 0 inner.ip.dst=192.168.32.0/24 "Meta: Target = Subnet: 192.168.32.0/24"
0 31 12257 inner.ip.dst=172.30.0.0/22 "Meta: Target = Subnet: 172.30.0.0/22"
3 75 122 inner.ip.dst=0.0.0.0/0 "Meta: Target = IG(Some(a4361b0b-b461-4674-a9ea-80296755f302))"
7 139 0 inner.ip6.dst=fddb:bb4e:6c24:62cc::/64 "Meta: Target = Subnet: fddb:bb4e:6c24:62cc::/64"
6 139 0 inner.ip6.dst=fddb:bb4e:6c24:168c::/64 "Meta: Target = Subnet: fddb:bb4e:6c24:168c::/64"
5 139 0 inner.ip6.dst=fddb:bb4e:6c24::/64 "Meta: Target = Subnet: fddb:bb4e:6c24::/64"
1 139 0 inner.ip6.dst=fddb:bb4e:6c24:7631::/64 "Meta: Target = Subnet: fddb:bb4e:6c24:7631::/64"
9 267 0 inner.ip6.dst=::/0 "Meta: Target = IG(Some(a4361b0b-b461-4674-a9ea-80296755f302))"
DEF -- 0 -- "deny"
A workload I've been using for release-to-release performance comparison shows major degradation. The workload comprises a load generator running YCSB and a 3-node MongoDB cluster. The four VMs are located on 4 different sleds, and the traffic among them is confined to the VPC they are on:
These were the typical rates of INSERT previously, on omicron commit e7d32ae2375b0231193f1dc84271f900915b2d6b (the workload took no more than 3 mins to complete):

The same workload on omicron commit 41d7c9b0c110e6d3690bf96bb969b74f8c385bf6 runs more than 40 times slower (it's been running for two hours and still hasn't completed):

I ran a fio regression test and the disk I/O numbers are roughly the same between the two commits. I'll check the VPC network throughput next to see if there is any change.
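For reference, the "more than 40 times slower" figure follows directly from the two durations quoted above. Because the slow run had not finished at the time of writing, the ratio is only a lower bound. A small sketch of the arithmetic (the function name `min_slowdown` is made up for illustration):

```python
def min_slowdown(baseline_minutes: float, elapsed_minutes: float) -> float:
    """Lower bound on the slowdown factor for a run that has not finished yet."""
    return elapsed_minutes / baseline_minutes


# Figures from the report: at most 3 min on the old commit,
# 2 h elapsed and still running on the new one.
factor = min_slowdown(baseline_minutes=3, elapsed_minutes=2 * 60)
print(f"at least {factor:.0f}x slower")  # at least 40x slower
```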