
[IMPROVEMENT] Reduce syscalls while reading and writing requests in longhorn-engine (engine <-> replica) #4122

Closed (derekbit closed this 2 years ago)

derekbit commented 2 years ago

What's the task? Please describe

Reduce syscalls while reading and writing requests in longhorn-engine

Describe the items of the task (DoD, definition of done) you'd like

Writing or reading a request header currently issues multiple syscalls, which increases data I/O latency. The task is to reduce the number of syscalls and memory allocations; a rough sketch of the approach is shown below.

https://github.com/longhorn/longhorn-engine/blob/master/pkg/dataconn/wire.go#L25 https://github.com/longhorn/longhorn-engine/blob/master/pkg/dataconn/wire.go#L52

From preliminary benchmark results, read/write latencies decrease by 5-10% and write bandwidth increases by ~3%.
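For illustration, here is a minimal Go sketch of the idea (the header struct, field layout, and function names are hypothetical, not the actual dataconn wire format): instead of issuing one `binary.Write`/`binary.Read` per header field, encode or decode the whole header through a single reusable buffer so each header costs at most one read/write call and no per-field allocations.

```go
package wire

import (
	"encoding/binary"
	"errors"
	"io"
)

// Hypothetical request header used only for this sketch.
type header struct {
	MagicVersion uint16
	Seq          uint32
	Type         uint32
	Offset       int64
	Size         uint32
}

const headerSize = 2 + 4 + 4 + 8 + 4

// Naive version: one binary.Write per field. On an unbuffered
// connection each call can become its own write(2), and each call
// also allocates a small temporary encoding buffer.
func writeHeaderNaive(w io.Writer, h *header) error {
	for _, v := range []interface{}{h.MagicVersion, h.Seq, h.Type, h.Offset, h.Size} {
		if err := binary.Write(w, binary.LittleEndian, v); err != nil {
			return err
		}
	}
	return nil
}

// Batched version: encode the whole header into a caller-provided,
// reusable buffer and issue a single Write.
func writeHeaderBatched(w io.Writer, buf []byte, h *header) error {
	if len(buf) < headerSize {
		return errors.New("header buffer too small")
	}
	binary.LittleEndian.PutUint16(buf[0:2], h.MagicVersion)
	binary.LittleEndian.PutUint32(buf[2:6], h.Seq)
	binary.LittleEndian.PutUint32(buf[6:10], h.Type)
	binary.LittleEndian.PutUint64(buf[10:18], uint64(h.Offset))
	binary.LittleEndian.PutUint32(buf[18:22], h.Size)
	_, err := w.Write(buf[:headerSize])
	return err
}

// Batched read: fill the whole header with one io.ReadFull instead of
// one binary.Read per field.
func readHeaderBatched(r io.Reader, buf []byte, h *header) error {
	if len(buf) < headerSize {
		return errors.New("header buffer too small")
	}
	if _, err := io.ReadFull(r, buf[:headerSize]); err != nil {
		return err
	}
	h.MagicVersion = binary.LittleEndian.Uint16(buf[0:2])
	h.Seq = binary.LittleEndian.Uint32(buf[2:6])
	h.Type = binary.LittleEndian.Uint32(buf[6:10])
	h.Offset = int64(binary.LittleEndian.Uint64(buf[10:18]))
	h.Size = binary.LittleEndian.Uint32(buf[18:22])
	return nil
}
```

Since the caller keeps the buffer alive across requests, header encoding and decoding add no allocations on the hot path.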


longhorn-io-github-bot commented 2 years ago

Pre Ready-For-Testing Checklist

derekbit commented 2 years ago

Performance update (benchmark results attached as an image).

keithalucas commented 2 years ago

We could reduce the system calls in https://github.com/rancher/liblonghorn/blob/master/src/longhorn_rpc_protocol.c as well. The Go code uses buffered I/O to reduce the number of syscalls. In C we could implement our own buffering or use fread and fwrite. If this benefits longhorn-engine, doing the same in liblonghorn should yield a similar improvement.
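For reference, a minimal Go sketch of the buffered-I/O pattern mentioned above (the type name and buffer size are illustrative, not the actual longhorn-engine code): the header and payload are staged in an in-memory buffer and flushed once, so the whole request typically costs a single write syscall. A C implementation in liblonghorn could follow the same shape with its own buffer or with fread/fwrite.

```go
package wire

import (
	"bufio"
	"net"
)

// requestWriter is a hypothetical helper; longhorn-engine's real Wire
// type differs, but the syscall-reduction idea is the same.
type requestWriter struct {
	buf *bufio.Writer
}

func newRequestWriter(conn net.Conn) *requestWriter {
	// 64 KiB is an arbitrary illustrative buffer size.
	return &requestWriter{buf: bufio.NewWriterSize(conn, 64*1024)}
}

// send stages the header and payload in the in-memory buffer and
// flushes once, so the request usually costs one write(2) instead of
// one syscall per Write on the raw connection.
func (w *requestWriter) send(header, payload []byte) error {
	if _, err := w.buf.Write(header); err != nil {
		return err
	}
	if _, err := w.buf.Write(payload); err != nil {
		return err
	}
	return w.buf.Flush()
}
```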

yangchiu commented 2 years ago

Run on AWS with c5d.2xlarge instances.

longhorn-engine master is at "Reduce write and read calls while processing requests" (commit id: 4abd8ae).

replica = 1

|  |  | IOPS (read) | IOPS (write) | Bandwidth read (KiB/s) | Bandwidth write (KiB/s) | Latency read (ns) | Latency write (ns) |
|---|---|---|---|---|---|---|---|
| v1.3.0 (replica = 1) | rand | 19,967 | 13,646 | 362,171 | 161,959 | 535,450 | 520,486 |
|  | seq | 34,761 | 26,139 | 362,232 | 164,389 | 555,369 | 538,585 |
| master (replica = 1) | rand | 20,416 | 19,070 | 362,190 | 161,951 | 525,037 | 468,895 |
|  | seq | 35,260 | 27,415 | 362,228 | 164,372 | 495,173 | 471,372 |
| Improvement Percentage | rand | 2.25% | 39.75% | 0.01% | 0% | 1.94% | 9.91% |
|  | seq | 1.44% | 4.88% | 0% | -0.01% | 10.84% | 12.48% |

replica = 2

|  |  | IOPS (read) | IOPS (write) | Bandwidth read (KiB/s) | Bandwidth write (KiB/s) | Latency read (ns) | Latency write (ns) |
|---|---|---|---|---|---|---|---|
| v1.3.0 (replica = 2) | rand | 19,850 | 15,883 | 734,019 | 159,977 | 441,971 | 428,192 |
|  | seq | 32,468 | 26,577 | 595,618 | 164,425 | 449,099 | 418,352 |
| master (replica = 2) | rand | 21,425 | 12,689 | 740,504 | 159,983 | 504,160 | 525,191 |
|  | seq | 34,191 | 21,177 | 621,723 | 164,384 | 507,674 | 500,220 |
| Improvement Percentage | rand | 7.93% | -20.11% | 0.88% | 0% | -14.07% | -22.65% |
|  | seq | 5.31% | -20.32% | 4.38% | -0.02% | -13.04% | -19.57% |

replica = 3

|  |  | IOPS (read) | IOPS (write) | Bandwidth read (KiB/s) | Bandwidth write (KiB/s) | Latency read (ns) | Latency write (ns) |
|---|---|---|---|---|---|---|---|
| v1.3.0 (replica = 3) | rand | 22,026 | 12,668 | 879,245 | 161,959 | 417,503 | 400,442 |
|  | seq | 35,995 | 21,938 | 587,617 | 164,391 | 433,612 | 415,655 |
| master (replica = 3) | rand | 21,101 | 12,735 | 769,681 | 158,520 | 488,201 | 528,469 |
|  | seq | 35,198 | 21,942 | 493,682 | 167,881 | 486,001 | 538,332 |
| Improvement Percentage | rand | -4.2% | 0.53% | -12.46% | -2.12% | -16.93% | -31.97% |
|  | seq | -2.21% | 0.02% | -15.99% | 2.12% | -12.08% | -29.51% |

For replica = 1, performance is better as expected, but for replica = 2 or 3 it is somehow worse. I'll rerun the tests to confirm.

derekbit commented 2 years ago

Weird. I can't explain such a large change between the two versions. BTW, did you use the directly attached NVMe device?

yangchiu commented 2 years ago

> Weird. I can't explain such a large change between the two versions. BTW, did you use the directly attached NVMe device?

Yes, c5d.2xlarge has a 200GB NVMe device.

Got reasonable results after rerunning the tests on Equinix c3.small.x86 instances (3-node cluster, kbench test size = 30G).

When the replica count is smaller than the cluster size, the extra nodes should be cordoned (e.g. with `kubectl cordon <node>`) so that the volume and its replicas are pinned to the same nodes and the results are comparable. This step is missing in the current longhorn-benchmark-test; another ticket has been opened to track it.

|  |  | IOPS (read) | IOPS (write) | Bandwidth read (KiB/s) | Bandwidth write (KiB/s) | Latency read (ns) | Latency write (ns) |
|---|---|---|---|---|---|---|---|
| v1.3.0 (replica = 1) | rand | 30,705 | 6,085 | 314,128 | 325,342 | 302,825 | 374,126 |
|  | seq | 52,962 | 12,235 | 468,951 | 392,869 | 231,370 | 376,705 |
| master (replica = 1) | rand | 31,023 | 6,776 | 309,539 | 320,136 | 299,175 | 363,420 |
|  | seq | 54,408 | 13,240 | 476,167 | 390,982 | 231,119 | 363,423 |
| Improvement Percentage | rand | 1.04% | 11.36% | -1.46% | -1.6% | 1.21% | 2.86% |
|  | seq | 2.73% | 8.21% | 1.54% | -0.48% | 0.11% | 3.53% |

|  |  | IOPS (read) | IOPS (write) | Bandwidth read (KiB/s) | Bandwidth write (KiB/s) | Latency read (ns) | Latency write (ns) |
|---|---|---|---|---|---|---|---|
| v1.3.0 (replica = 2) | rand | 32,493 | 6,931 | 496,584 | 346,184 | 1,630,497 | 1,476,661 |
|  | seq | 59,548 | 13,374 | 654,711 | 318,191 | 1,656,486 | 668,856 |
| master (replica = 2) | rand | 31,070 | 6,763 | 509,312 | 348,053 | 1,624,069 | 537,037 |
|  | seq | 57,462 | 13,834 | 668,694 | 367,917 | 1,643,925 | 628,580 |
| Improvement Percentage | rand | -4.38% | -2.42% | 2.56% | 0.54% | 0.39% | 63.63% |
|  | seq | -3.5% | 3.44% | 2.14% | 15.63% | 0.76% | 6.02% |

|  |  | IOPS (read) | IOPS (write) | Bandwidth read (KiB/s) | Bandwidth write (KiB/s) | Latency read (ns) | Latency write (ns) |
|---|---|---|---|---|---|---|---|
| v1.3.0 (replica = 3) | rand | 31,407 | 6,310 | 599,436 | 343,027 | 1,876,947 | 1,803,525 |
|  | seq | 60,649 | 13,312 | 764,964 | 268,320 | 1,864,948 | 1,832,466 |
| master (replica = 3) | rand | 30,842 | 6,875 | 609,687 | 344,434 | 1,849,015 | 1,821,225 |
|  | seq | 56,611 | 13,608 | 784,128 | 285,779 | 1,820,308 | 1,840,424 |
| Improvement Percentage | rand | -1.8% | 8.95% | 1.71% | 0.41% | 1.49% | -0.98% |
|  | seq | -6.66% | 2.22% | 2.51% | 6.51% | 2.39% | -0.43% |