
[IMPROVEMENT] Reduce syscalls while reading and writing requests in longhorn-engine (engine <-> replica) #4122

Closed (derekbit closed this 2 years ago)

derekbit commented 2 years ago

What's the task? Please describe

Reduce syscalls while reading and writing requests in longhorn-engine

Describe the items of the task (DoD, definition of done) you'd like

Writing or reading a request header currently issues multiple syscalls, which increases data I/O latency. The task is to reduce the number of syscalls and memory allocations; a rough sketch of the approach is shown below.

https://github.com/longhorn/longhorn-engine/blob/master/pkg/dataconn/wire.go#L25 https://github.com/longhorn/longhorn-engine/blob/master/pkg/dataconn/wire.go#L52

From preliminary benchmark results, read/write latencies decrease by 5-10% and write bandwidth increases by ~3%.
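For illustration, here is a minimal Go sketch of the idea (the header struct, field layout, and function names are hypothetical, not the actual dataconn wire format): instead of issuing one `binary.Write`/`binary.Read` per header field, encode or decode the whole header through a single reusable buffer so each header costs at most one read/write call and no per-field allocations.

```go
package wire

import (
	"encoding/binary"
	"errors"
	"io"
)

// Hypothetical request header used only for this sketch.
type header struct {
	MagicVersion uint16
	Seq          uint32
	Type         uint32
	Offset       int64
	Size         uint32
}

const headerSize = 2 + 4 + 4 + 8 + 4

// Naive version: one binary.Write per field. On an unbuffered
// connection each call can become its own write(2), and each call
// also allocates a small temporary encoding buffer.
func writeHeaderNaive(w io.Writer, h *header) error {
	for _, v := range []interface{}{h.MagicVersion, h.Seq, h.Type, h.Offset, h.Size} {
		if err := binary.Write(w, binary.LittleEndian, v); err != nil {
			return err
		}
	}
	return nil
}

// Batched version: encode the whole header into a caller-provided,
// reusable buffer and issue a single Write.
func writeHeaderBatched(w io.Writer, buf []byte, h *header) error {
	if len(buf) < headerSize {
		return errors.New("header buffer too small")
	}
	binary.LittleEndian.PutUint16(buf[0:2], h.MagicVersion)
	binary.LittleEndian.PutUint32(buf[2:6], h.Seq)
	binary.LittleEndian.PutUint32(buf[6:10], h.Type)
	binary.LittleEndian.PutUint64(buf[10:18], uint64(h.Offset))
	binary.LittleEndian.PutUint32(buf[18:22], h.Size)
	_, err := w.Write(buf[:headerSize])
	return err
}

// Batched read: fill the whole header with one io.ReadFull instead of
// one binary.Read per field.
func readHeaderBatched(r io.Reader, buf []byte, h *header) error {
	if len(buf) < headerSize {
		return errors.New("header buffer too small")
	}
	if _, err := io.ReadFull(r, buf[:headerSize]); err != nil {
		return err
	}
	h.MagicVersion = binary.LittleEndian.Uint16(buf[0:2])
	h.Seq = binary.LittleEndian.Uint32(buf[2:6])
	h.Type = binary.LittleEndian.Uint32(buf[6:10])
	h.Offset = int64(binary.LittleEndian.Uint64(buf[10:18]))
	h.Size = binary.LittleEndian.Uint32(buf[18:22])
	return nil
}
```

Since the caller keeps the buffer alive across requests, header encoding and decoding add no allocations on the hot path.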


longhorn-io-github-bot commented 2 years ago

Pre Ready-For-Testing Checklist

derekbit commented 2 years ago

Performance update (benchmark results attached as an image).

keithalucas commented 2 years ago

We could reduce the system calls in https://github.com/rancher/liblonghorn/blob/master/src/longhorn_rpc_protocol.c as well. The Go code uses buffered I/O to reduce the number of syscalls. In C we could implement our own buffering or use fread and fwrite. If this benefits longhorn-engine, doing the same in liblonghorn should yield a similar improvement.
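For reference, a minimal Go sketch of the buffered-I/O pattern mentioned above (the type name and buffer size are illustrative, not the actual longhorn-engine code): the header and payload are staged in an in-memory buffer and flushed once, so the whole request typically costs a single write syscall. A C implementation in liblonghorn could follow the same shape with its own buffer or with fread/fwrite.

```go
package wire

import (
	"bufio"
	"net"
)

// requestWriter is a hypothetical helper; longhorn-engine's real Wire
// type differs, but the syscall-reduction idea is the same.
type requestWriter struct {
	buf *bufio.Writer
}

func newRequestWriter(conn net.Conn) *requestWriter {
	// 64 KiB is an arbitrary illustrative buffer size.
	return &requestWriter{buf: bufio.NewWriterSize(conn, 64*1024)}
}

// send stages the header and payload in the in-memory buffer and
// flushes once, so the request usually costs one write(2) instead of
// one syscall per Write on the raw connection.
func (w *requestWriter) send(header, payload []byte) error {
	if _, err := w.buf.Write(header); err != nil {
		return err
	}
	if _, err := w.buf.Write(payload); err != nil {
		return err
	}
	return w.buf.Flush()
}
```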

yangchiu commented 2 years ago

Run on AWS with c5d.2xlarge instances.

longhorn-engine master is at "Reduce write and read calls while processing requests" (commit id: 4abd8ae).

replica = 1

|  |  | IOPS (read) | IOPS (write) | Bandwidth read (KiB/s) | Bandwidth write (KiB/s) | Latency read (ns) | Latency write (ns) |
|---|---|---|---|---|---|---|---|
| v1.3.0 (replica = 1) | rand | 19,967 | 13,646 | 362,171 | 161,959 | 535,450 | 520,486 |
|  | seq | 34,761 | 26,139 | 362,232 | 164,389 | 555,369 | 538,585 |
| master (replica = 1) | rand | 20,416 | 19,070 | 362,190 | 161,951 | 525,037 | 468,895 |
|  | seq | 35,260 | 27,415 | 362,228 | 164,372 | 495,173 | 471,372 |
| Improvement Percentage | rand | 2.25% | 39.75% | 0.01% | 0% | 1.94% | 9.91% |
|  | seq | 1.44% | 4.88% | 0% | -0.01% | 10.84% | 12.48% |

replica = 2

|  |  | IOPS (read) | IOPS (write) | Bandwidth read (KiB/s) | Bandwidth write (KiB/s) | Latency read (ns) | Latency write (ns) |
|---|---|---|---|---|---|---|---|
| v1.3.0 (replica = 2) | rand | 19,850 | 15,883 | 734,019 | 159,977 | 441,971 | 428,192 |
|  | seq | 32,468 | 26,577 | 595,618 | 164,425 | 449,099 | 418,352 |
| master (replica = 2) | rand | 21,425 | 12,689 | 740,504 | 159,983 | 504,160 | 525,191 |
|  | seq | 34,191 | 21,177 | 621,723 | 164,384 | 507,674 | 500,220 |
| Improvement Percentage | rand | 7.93% | -20.11% | 0.88% | 0% | -14.07% | -22.65% |
|  | seq | 5.31% | -20.32% | 4.38% | -0.02% | -13.04% | -19.57% |

replica = 3

|  |  | IOPS (read) | IOPS (write) | Bandwidth read (KiB/s) | Bandwidth write (KiB/s) | Latency read (ns) | Latency write (ns) |
|---|---|---|---|---|---|---|---|
| v1.3.0 (replica = 3) | rand | 22,026 | 12,668 | 879,245 | 161,959 | 417,503 | 400,442 |
|  | seq | 35,995 | 21,938 | 587,617 | 164,391 | 433,612 | 415,655 |
| master (replica = 3) | rand | 21,101 | 12,735 | 769,681 | 158,520 | 488,201 | 528,469 |
|  | seq | 35,198 | 21,942 | 493,682 | 167,881 | 486,001 | 538,332 |
| Improvement Percentage | rand | -4.2% | 0.53% | -12.46% | -2.12% | -16.93% | -31.97% |
|  | seq | -2.21% | 0.02% | -15.99% | 2.12% | -12.08% | -29.51% |

For replica = 1, performance is better as expected, but for replica = 2 or 3 it is somehow worse. I'll rerun the tests to confirm.

derekbit commented 2 years ago

Weird. I can't explain such a large change between the two versions. BTW, did you use the directly attached NVMe device?

yangchiu commented 2 years ago

> Weird. I can't explain such a large change between the two versions. BTW, did you use the directly attached NVMe device?

Yes, c5d.2xlarge has a 200GB NVMe device.

Got reasonable results after rerunning the tests on Equinix c3.small.x86 instances (3-node cluster, kbench test size = 30G).

When the replica count is smaller than the cluster size, the extra nodes should be cordoned (e.g. with `kubectl cordon <node>`) so that the volume and its replicas are pinned to the same nodes and the results are comparable. This step is missing in the current longhorn-benchmark-test; another ticket has been opened to track it.

|  |  | IOPS (read) | IOPS (write) | Bandwidth read (KiB/s) | Bandwidth write (KiB/s) | Latency read (ns) | Latency write (ns) |
|---|---|---|---|---|---|---|---|
| v1.3.0 (replica = 1) | rand | 30,705 | 6,085 | 314,128 | 325,342 | 302,825 | 374,126 |
|  | seq | 52,962 | 12,235 | 468,951 | 392,869 | 231,370 | 376,705 |
| master (replica = 1) | rand | 31,023 | 6,776 | 309,539 | 320,136 | 299,175 | 363,420 |
|  | seq | 54,408 | 13,240 | 476,167 | 390,982 | 231,119 | 363,423 |
| Improvement Percentage | rand | 1.04% | 11.36% | -1.46% | -1.6% | 1.21% | 2.86% |
|  | seq | 2.73% | 8.21% | 1.54% | -0.48% | 0.11% | 3.53% |

|  |  | IOPS (read) | IOPS (write) | Bandwidth read (KiB/s) | Bandwidth write (KiB/s) | Latency read (ns) | Latency write (ns) |
|---|---|---|---|---|---|---|---|
| v1.3.0 (replica = 2) | rand | 32,493 | 6,931 | 496,584 | 346,184 | 1,630,497 | 1,476,661 |
|  | seq | 59,548 | 13,374 | 654,711 | 318,191 | 1,656,486 | 668,856 |
| master (replica = 2) | rand | 31,070 | 6,763 | 509,312 | 348,053 | 1,624,069 | 537,037 |
|  | seq | 57,462 | 13,834 | 668,694 | 367,917 | 1,643,925 | 628,580 |
| Improvement Percentage | rand | -4.38% | -2.42% | 2.56% | 0.54% | 0.39% | 63.63% |
|  | seq | -3.5% | 3.44% | 2.14% | 15.63% | 0.76% | 6.02% |

|  |  | IOPS (read) | IOPS (write) | Bandwidth read (KiB/s) | Bandwidth write (KiB/s) | Latency read (ns) | Latency write (ns) |
|---|---|---|---|---|---|---|---|
| v1.3.0 (replica = 3) | rand | 31,407 | 6,310 | 599,436 | 343,027 | 1,876,947 | 1,803,525 |
|  | seq | 60,649 | 13,312 | 764,964 | 268,320 | 1,864,948 | 1,832,466 |
| master (replica = 3) | rand | 30,842 | 6,875 | 609,687 | 344,434 | 1,849,015 | 1,821,225 |
|  | seq | 56,611 | 13,608 | 784,128 | 285,779 | 1,820,308 | 1,840,424 |
| Improvement Percentage | rand | -1.8% | 8.95% | 1.71% | 0.41% | 1.49% | -0.98% |
|  | seq | -6.66% | 2.22% | 2.51% | 6.51% | 2.39% | -0.43% |