klingtnet / rosc

An OSC library for Rust.
Apache License 2.0

Improved Encoder API #44

Closed Barinzaya closed 1 year ago

Barinzaya commented 1 year ago

This PR reworks the implementation of the encoder module to accomplish 2 goals:

  1. Improve the performance of the encoder logic. The existing implementation performs many small allocations because it creates and passes around small Vec and String objects internally. The reworked implementation instead passes a reference to the output, and all data is written directly to it; no internal buffers are allocated. The resulting speedup is significant: 3-50x in my results from the stress-test benchmarks that I created alongside these changes (sample results below). In a practical project in which I'm using rosc, these changes reduced the time spent encoding OSC packets to about 1/8 of what it was before (from ~150 us per packet to ~18 us) with no code changes in that project.

  2. Make the output more flexible. This PR adds an Output trait which gives the user control over what is done with the output. Output is implemented for Vec<u8>, so a packet can easily be encoded into a Vec. With the std feature, there is also a WriteOutput newtype which lets any std::io::Seek + std::io::Write type (a Cursor, File, etc.) be used as an Output, so OSC data can be encoded directly to it. These changes also add encode_into and encode_string_into functions, which take a mutable reference to any object implementing the Output trait, giving the user control over how the encoded data is handled.

The existing public functions have not had their signatures altered; only new public items have been added, so according to the SemVer Compatibility guide these changes qualify as a minor change. They also do not incur an MSRV bump (1.52 is the minimum supported version according to cargo-msrv, both with and without these changes), so no major version bump is required.

This PR also fixes the unit tests failing to build without the std feature. All existing unit tests pass, both with and without std; no new unit tests have been added.

Benchmark results on my machine for comparison (cargo bench, times in ns/iter):

| Benchmark | master (Windows) | encoder-improvements (Windows) | Relative Speed (Windows) | master (Linux) | encoder-improvements (Linux) | Relative Speed (Linux) |
|---|---:|---:|---:|---:|---:|---:|
| bench_encode_args_array | 90,150 ± 4,137 | 5,929 ± 51 | 15.2x | 39,945 ± 615 | 4,610 ± 52 | 8.7x |
| bench_encode_args_blob | 73,101 ± 989 | 8,940 ± 69 | 8.2x | 32,807 ± 201 | 9,931 ± 85 | 3.3x |
| bench_encode_args_bool | 37,263 ± 665 | 4,932 ± 107 | 7.6x | 16,431 ± 179 | 4,846 ± 50 | 3.4x |
| bench_encode_args_double | 69,613 ± 1,721 | 5,506 ± 39 | 12.6x | 29,751 ± 131 | 5,366 ± 22 | 5.5x |
| bench_encode_args_float | 69,645 ± 1,475 | 5,413 ± 22 | 12.9x | 29,924 ± 336 | 4,974 ± 15 | 6.0x |
| bench_encode_args_int | 70,911 ± 6,504 | 5,751 ± 43 | 12.3x | 29,742 ± 280 | 4,264 ± 12 | 7.0x |
| bench_encode_args_long | 70,653 ± 1,034 | 5,425 ± 46 | 13.0x | 29,634 ± 88 | 5,064 ± 38 | 5.9x |
| bench_encode_args_nil | 37,474 ± 918 | 4,983 ± 47 | 7.5x | 16,188 ± 110 | 4,797 ± 18 | 3.4x |
| bench_encode_args_string | 141,297 ± 2,025 | 9,083 ± 50 | 13.9x | 46,884 ± 108 | 9,801 ± 34 | 4.8x |
| bench_encode_bundles | 294,400 ± 23,868 | 5,844 ± 48 | 50.4x | 99,883 ± 1,409 | 4,639 ± 16 | 21.5x |
| bench_encode_bundles_into_new_vec | - | 6,825 ± 70 | - | - | 6,159 ± 281 | - |
| bench_encode_bundles_into_reused_vec | - | 5,282 ± 52 | - | - | 6,152 ± 57 | - |
| bench_encode_huge_bundle | 2,932,320 ± 91,668 | 562,780 ± 11,300 | 5.2x | 1,270,594 ± 8,991 | 121,758 ± 1,017 | 10.4x |
| bench_encode_huge_bundle_into_new_vec | - | 565,861 ± 13,714 | - | - | 131,854 ± 862 | - |
| bench_encode_huge_bundle_into_reused_vec | - | 137,702 ± 713 | - | - | 132,321 ± 832 | - |
| bench_encode_messages | 374,490 ± 3,665 | 10,903 ± 241 | 34.3x | 135,246 ± 949 | 9,637 ± 41 | 14.0x |
| bench_encode_messages_into_new_vec | - | 11,988 ± 55 | - | - | 11,539 ± 61 | - |
| bench_encode_messages_into_reused_vec | - | 10,386 ± 54 | - | - | 11,355 ± 53 | - |
| bench_encode_nested_bundles | 8,009 ± 136 | 640 ± 25 | 12.5x | 2,842 ± 25 | 240 ± 6 | 11.8x |

I find the discrepancy between the encode benchmarks and the into_new_vec benchmarks interesting, because they should be doing almost exactly the same thing, yet the encode benchmark is noticeably faster in most cases (the huge_bundle benchmarks being the exception).

Barinzaya commented 1 year ago

Sure thing! Let me know if you have any comments or suggestions.

klingtnet commented 1 year ago

@Barinzaya That's awesome work you did there 🐎 !

I'll try to find some time in the next days, maybe next week, to get a deeper look into the changes and give it a proper review. Thanks in advance!

Edit: I deleted the original comment, because I accidentally made that using my work account.

Barinzaya commented 1 year ago

I simplified the packets encoded by the encode_args benchmarks to focus them more on the arguments (previously bundle(1 message(args)), now just message(args)). This improved performance on the branch further, which is expected since there are fewer steps to the encoding, but to my surprise it didn't seem to have any effect on the benchmark times on master. That makes the improvement numbers look nicer, but also makes me wonder if the benchmarks I added are missing some case that accounts for the difference.

Just wanted to make an open note of it, since it gives me the feeling of fluffing the numbers, which wasn't the intention at all.

klingtnet commented 1 year ago

@Barinzaya Can you share the command (if there is some) that you used to create the benchmark table from https://github.com/klingtnet/rosc/pull/44#issue-1617160005 ?

Did you use https://github.com/BurntSushi/cargo-benchcmp ?

Update

I tried cargo-benchcmp and what I did was

$ git rebase --exec 'cargo bench | tee bench-$(git rev-parse HEAD)' origin/master

to run a benchmark for each commit. Afterwards you can do something like

$ cargo benchcmp 3146ce4cf539bf88e638261ee1d6bd9c104420aa fabeccd7d73c8416706804007acb0e0d89c6f673 --regressions

to check for regressions. However, the benchmark could not be run for all of the commits, so I think we might need to squash some of them. Also, I haven't been able to reproduce speedups similar to yours yet.

Barinzaya commented 1 year ago

As for the benchmark comparison table, I created it manually: I ran cargo bench on master and copied the output, then switched to encoder-improvements and did the same. I made a copy of the encoder_bench file with the encode_into tests removed for use on master. The relative speeds were also computed manually (old time / new time).

I probably could have at least used a spreadsheet or something, but there were only a few benchmarks when I started and I never bothered to set one up. There might be other, more suitable tools out there too.

Barinzaya commented 1 year ago

Commits have been squashed.

Unsquashed commits are still available in the encoder-improvements+unsquashed branch on the fork's repository.

klingtnet commented 1 year ago

Looks good to me. I'm going to merge this in the next few days and then prepare a release. @Barinzaya Good work 👍🏻

Barinzaya commented 1 year ago

I added one more commit with an encoder unit test that uses encode_into to write to a Vec via a Cursor, using the WriteOutput struct. I feel this is somewhat important, as the Output implementation for WriteOutput is not covered by the existing unit tests.

That makes this 5 commits, one per file touched. I can squash them down a bit further (e.g. combine the commit for the unit test change with the commit for the encoder changes) if preferred.