derekbit closed this issue 2 years ago
- [x] Where is the reproduce steps/test steps documented?
  The reproduce steps/test steps are at:
- [ ] Is there a workaround for the issue? If so, where is it documented?
  The workaround is at:
- [x] Does the PR include the explanation for the fix or the feature?
- [ ] Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
  The PR for the YAML change is at:
  The PR for the chart change is at:
- [x] Have the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including `backport-needed/*`)?
  The PR is at https://github.com/longhorn/longhorn-engine/pull/708
- [x] Which areas/issues this PR might have potential impacts on?
  Area: datapath performance
  Issues:
- [ ] If labeled: `require/LEP` Has the Longhorn Enhancement Proposal PR submitted?
  The LEP PR is at
- [ ] If labeled: `area/ui` Has the UI issue filed or ready to be merged (including `backport-needed/*`)?
  The UI issue/PR is at
- [ ] If labeled: `require/doc` Has the necessary document PR submitted or merged (including `backport-needed/*`)?
  The documentation issue/PR is at
- [ ] If labeled: `require/automation-e2e` Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including `backport-needed/*`)?
  The automation skeleton PR is at
  The automation test case PR is at
  The issue of automation test case implementation is at (please create by the template)
- [ ] If labeled: `require/automation-engine` Has the engine integration test been merged (including `backport-needed/*`)?
  The engine automation PR is at
- [ ] If labeled: `require/manual-test-plan` Has the manual test plan been documented?
  The updated manual test plan is at
- [ ] If the fix introduces the code for backward compatibility Has a separate issue been filed with the label `release/obsolete-compatibility`?
  The compatibility issue is filed at
Performance update
We could reduce the system calls in https://github.com/rancher/liblonghorn/blob/master/src/longhorn_rpc_protocol.c as well. The Go code uses buffered I/O to cut down the number of syscalls; in C we could implement our own buffering or use `fread` and `fwrite`. If this benefits longhorn-engine, doing the same in liblonghorn should bring an improvement as well.
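For reference, a minimal Go sketch of the buffered-I/O pattern mentioned above (not the actual wire.go implementation; the `sendRequest` helper and the header layout are hypothetical): wrapping the connection in a `bufio.Writer` keeps the per-field header writes in user space and flushes header plus payload to the socket in far fewer write syscalls.

```go
package wiresketch

import (
	"bufio"
	"encoding/binary"
	"net"
)

// sendRequest is a hypothetical helper showing the buffered-write idea:
// every header field is written into the bufio buffer, and a single
// Flush pushes header + payload to the socket instead of issuing one
// write syscall per field. In practice the bufio.Writer would be
// created once per connection and reused.
func sendRequest(conn net.Conn, seq uint32, payload []byte) error {
	bw := bufio.NewWriterSize(conn, 64*1024)

	// These land in the in-memory buffer, not in individual syscalls.
	if err := binary.Write(bw, binary.LittleEndian, seq); err != nil {
		return err
	}
	if err := binary.Write(bw, binary.LittleEndian, uint32(len(payload))); err != nil {
		return err
	}
	if _, err := bw.Write(payload); err != nil {
		return err
	}

	// Typically a single write(2) for the whole message.
	return bw.Flush()
}
```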
Run on AWS with c5d.2xlarge instances.
longhorn-engine master is at "Reduce write and read calls while processing requests" (commit id: 4abd8ae).
replica = 1
| | | IOPS read | IOPS write | Bandwidth read (KiB/s) | Bandwidth write (KiB/s) | Latency read (ns) | Latency write (ns) |
|---|---|---|---|---|---|---|---|
| v1.3.0 (replica = 1) | rand | 19,967 | 13,646 | 362,171 | 161,959 | 535,450 | 520,486 |
| | seq | 34,761 | 26,139 | 362,232 | 164,389 | 555,369 | 538,585 |
| master (replica = 1) | rand | 20,416 | 19,070 | 362,190 | 161,951 | 525,037 | 468,895 |
| | seq | 35,260 | 27,415 | 362,228 | 164,372 | 495,173 | 471,372 |
| Improvement Percentage | rand | 2.25% | 39.75% | 0.01% | 0% | 1.94% | 9.91% |
| | seq | 1.44% | 4.88% | 0% | -0.01% | 10.84% | 12.48% |
replica = 2
| | | IOPS read | IOPS write | Bandwidth read (KiB/s) | Bandwidth write (KiB/s) | Latency read (ns) | Latency write (ns) |
|---|---|---|---|---|---|---|---|
| v1.3.0 (replica = 2) | rand | 19,850 | 15,883 | 734,019 | 159,977 | 441,971 | 428,192 |
| | seq | 32,468 | 26,577 | 595,618 | 164,425 | 449,099 | 418,352 |
| master (replica = 2) | rand | 21,425 | 12,689 | 740,504 | 159,983 | 504,160 | 525,191 |
| | seq | 34,191 | 21,177 | 621,723 | 164,384 | 507,674 | 500,220 |
| Improvement Percentage | rand | 7.93% | -20.11% | 0.88% | 0% | -14.07% | -22.65% |
| | seq | 5.31% | -20.32% | 4.38% | -0.02% | -13.04% | -19.57% |
replica = 3
| | | IOPS read | IOPS write | Bandwidth read (KiB/s) | Bandwidth write (KiB/s) | Latency read (ns) | Latency write (ns) |
|---|---|---|---|---|---|---|---|
| v1.3.0 (replica = 3) | rand | 22,026 | 12,668 | 879,245 | 161,959 | 417,503 | 400,442 |
| | seq | 35,995 | 21,938 | 587,617 | 164,391 | 433,612 | 415,655 |
| master (replica = 3) | rand | 21,101 | 12,735 | 769,681 | 158,520 | 488,201 | 528,469 |
| | seq | 35,198 | 21,942 | 493,682 | 167,881 | 486,001 | 538,332 |
| Improvement Percentage | rand | -4.2% | 0.53% | -12.46% | -2.12% | -16.93% | -31.97% |
| | seq | -2.21% | 0.02% | -15.99% | 2.12% | -12.08% | -29.51% |
For replica = 1, it has better performance as expected, but for replica = 2 or 3 it somehow performs worse. I'll rerun the tests to confirm.
Weird. Cannot imagine the huge change between the two versions. BTW, did you use the direct attached NVMe device?
Yes, c5d.2xlarge has a 200GB NVMe device.
Got reasonable results after rerunning the tests on Equinix c3.small.x86 instances (3-node cluster, kbench test size = 30G).
For replica count < cluster size, the extra nodes should be cordoned so the volume and its replicas are fixed on the same node and we can get reasonable results. This is missing in the current longhorn-benchmark-test; another ticket has been opened to track it.
| | | IOPS read | IOPS write | Bandwidth read (KiB/s) | Bandwidth write (KiB/s) | Latency read (ns) | Latency write (ns) |
|---|---|---|---|---|---|---|---|
| v1.3.0 (replica = 1) | rand | 30,705 | 6,085 | 314,128 | 325,342 | 302,825 | 374,126 |
| | seq | 52,962 | 12,235 | 468,951 | 392,869 | 231,370 | 376,705 |
| master (replica = 1) | rand | 31,023 | 6,776 | 309,539 | 320,136 | 299,175 | 363,420 |
| | seq | 54,408 | 13,240 | 476,167 | 390,982 | 231,119 | 363,423 |
| Improvement Percentage | rand | 1.04% | 11.36% | -1.46% | -1.6% | 1.21% | 2.86% |
| | seq | 2.73% | 8.21% | 1.54% | -0.48% | 0.11% | 3.53% |
| | | IOPS read | IOPS write | Bandwidth read (KiB/s) | Bandwidth write (KiB/s) | Latency read (ns) | Latency write (ns) |
|---|---|---|---|---|---|---|---|
| v1.3.0 (replica = 2) | rand | 32,493 | 6,931 | 496,584 | 346,184 | 1,630,497 | 1,476,661 |
| | seq | 59,548 | 13,374 | 654,711 | 318,191 | 1,656,486 | 668,856 |
| master (replica = 2) | rand | 31,070 | 6,763 | 509,312 | 348,053 | 1,624,069 | 537,037 |
| | seq | 57,462 | 13,834 | 668,694 | 367,917 | 1,643,925 | 628,580 |
| Improvement Percentage | rand | -4.38% | -2.42% | 2.56% | 0.54% | 0.39% | 63.63% |
| | seq | -3.5% | 3.44% | 2.14% | 15.63% | 0.76% | 6.02% |
| | | IOPS read | IOPS write | Bandwidth read (KiB/s) | Bandwidth write (KiB/s) | Latency read (ns) | Latency write (ns) |
|---|---|---|---|---|---|---|---|
| v1.3.0 (replica = 3) | rand | 31,407 | 6,310 | 599,436 | 343,027 | 1,876,947 | 1,803,525 |
| | seq | 60,649 | 13,312 | 764,964 | 268,320 | 1,864,948 | 1,832,466 |
| master (replica = 3) | rand | 30,842 | 6,875 | 609,687 | 344,434 | 1,849,015 | 1,821,225 |
| | seq | 56,611 | 13,608 | 784,128 | 285,779 | 1,820,308 | 1,840,424 |
| Improvement Percentage | rand | -1.8% | 8.95% | 1.71% | 0.41% | 1.49% | -0.98% |
| | seq | -6.66% | 2.22% | 2.51% | 6.51% | 2.39% | -0.43% |
What's the task? Please describe
Reduce syscalls while reading and writing requests in longhorn-engine
Describe the items of the task (DoD, definition of done) you'd like
Writing or reading a request header issues multiple syscalls, which increases the latency of the data I/O. The task is to reduce the number of syscalls and memory allocations.
https://github.com/longhorn/longhorn-engine/blob/master/pkg/dataconn/wire.go#L25 https://github.com/longhorn/longhorn-engine/blob/master/pkg/dataconn/wire.go#L52
From the benchmarking result, the read/write latencies decrease by 5-10% and the write bandwidth increases by ~3%.
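A hypothetical sketch of the kind of change described above, assuming an illustrative wire format (message type, sequence number, offset, size) rather than the real Longhorn one, and a made-up `messenger` type: the fixed-size header is encoded into a reusable buffer and sent/received in a single call, instead of one syscall and one allocation per field.

```go
package wiresketch

import (
	"encoding/binary"
	"io"
)

const headerSize = 4 + 4 + 8 + 4 // type, seq, offset, size (illustrative layout)

type messenger struct {
	rw  io.ReadWriter
	hdr [headerSize]byte // reused for every message: no per-field allocation
}

// writeMsg encodes the whole header into one slice and issues two writes
// (header, payload) instead of one write per field.
func (m *messenger) writeMsg(msgType, seq uint32, offset uint64, data []byte) error {
	binary.LittleEndian.PutUint32(m.hdr[0:4], msgType)
	binary.LittleEndian.PutUint32(m.hdr[4:8], seq)
	binary.LittleEndian.PutUint64(m.hdr[8:16], offset)
	binary.LittleEndian.PutUint32(m.hdr[16:20], uint32(len(data)))
	if _, err := m.rw.Write(m.hdr[:]); err != nil {
		return err
	}
	_, err := m.rw.Write(data)
	return err
}

// readMsg pulls the fixed-size header back with a single io.ReadFull
// and decodes the fields from the reused buffer.
func (m *messenger) readMsg() (msgType, seq uint32, offset uint64, data []byte, err error) {
	if _, err = io.ReadFull(m.rw, m.hdr[:]); err != nil {
		return
	}
	msgType = binary.LittleEndian.Uint32(m.hdr[0:4])
	seq = binary.LittleEndian.Uint32(m.hdr[4:8])
	offset = binary.LittleEndian.Uint64(m.hdr[8:16])
	size := binary.LittleEndian.Uint32(m.hdr[16:20])
	data = make([]byte, int(size))
	_, err = io.ReadFull(m.rw, data)
	return
}
```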