johannesvollmer / exrs

100% Safe Rust OpenEXR file library

Use loole to improve channel performance #228

Closed mahdi-shojaee closed 6 months ago

mahdi-shojaee commented 7 months ago

Hello! This pull request replaces Flume with Loole, my recent safe implementation of an MPMC channel, which has passed all of Flume's tests for the APIs it implements. As a result, both channel-related benchmarks in read.rs show the following improvements:

# Using flume
test read_single_image_rle_all_channels               ... bench:  14,087,088 ns/iter (+/- 1,028,918)
test read_single_image_rle_non_parallel_all_channels  ... bench:  22,124,520 ns/iter (+/- 148,387)

# Using loole
test read_single_image_rle_all_channels               ... bench:   7,956,153 ns/iter (+/- 748,611)
test read_single_image_rle_non_parallel_all_channels  ... bench:  16,360,325 ns/iter (+/- 77,868)

Loole's GitHub: https://github.com/mahdi-shojaee/loole

Thanks

johannesvollmer commented 7 months ago

Very interesting. I'm curious how it's possible to be faster than flume, is there something I can read about how loole achieves this awesome performance? :)

mahdi-shojaee commented 7 months ago

Thanks! Not yet; currently, the only documentation is the source code itself. To summarize, though, some potential reasons for the performance difference include the absence of cloning, minimal abstraction, minimal moves, and possibly a more efficient algorithm (although I haven't reviewed Flume's source code). In the future, the sync operations (send and recv) could conceivably be optimized further by improving the signal implementation.

johannesvollmer commented 7 months ago

Cool thanks :) I'll have a closer look at this tomorrow hopefully. Thanks so far :)

mahdi-shojaee commented 7 months ago

You are welcome.

mahdi-shojaee commented 6 months ago

Hello, thank you for reviewing my pull request; I appreciate the feedback. However, I have discovered an issue with the benchmark code that causes inconsistent, unreliable results: even with flume itself, running the benchmark repeatedly produces randomly faster or slower numbers, and the same happens with loole. Thanks again.

mahdi-shojaee commented 6 months ago

If this is correct, I think this PR is no longer valid.

johannesvollmer commented 6 months ago

Hmm, interesting. Sorry, I haven't yet had the time to look into this PR, please accept my apologies. I'd like to run both benchmarks myself, and I'll report what happens on my machine.

johannesvollmer commented 6 months ago

-- loole --
test read_f16_as_f16_uncompressed_1thread ... bench:   8,388,010 ns/iter (+/- 3,397,072)
test read_f16_as_f16_zip_1thread          ... bench:  17,489,190 ns/iter (+/- 3,253,988)
test read_f16_as_f16_zip_nthreads         ... bench:   9,099,960 ns/iter (+/- 3,757,732)
test read_f16_as_f32_uncompressed_1thread ... bench:   8,234,400 ns/iter (+/- 3,470,654)
test read_f16_as_f32_zip_1thread          ... bench:  17,849,420 ns/iter (+/- 1,432,790)
test read_f16_as_f32_zip_nthreads         ... bench:   8,874,635 ns/iter (+/- 3,389,003)
test read_f16_as_u32_uncompressed_1thread ... bench:   8,163,710 ns/iter (+/- 2,859,007)
test read_f32_as_f16_uncompressed_1thread ... bench:  29,717,520 ns/iter (+/- 2,255,945)
test read_f32_as_f16_zips_1thread         ... bench:  87,762,410 ns/iter (+/- 2,405,089)
test read_f32_as_f16_zips_nthreads        ... bench:  32,962,650 ns/iter (+/- 2,531,739)
test read_f32_as_f32_uncompressed_1thread ... bench:  29,741,710 ns/iter (+/- 2,462,639)
test read_f32_as_f32_zips_1thread         ... bench:  88,301,830 ns/iter (+/- 1,456,787)
test read_f32_as_f32_zips_nthreads        ... bench:  33,265,970 ns/iter (+/- 1,701,066)
test read_f32_as_u32_uncompressed_1thread ... bench:  29,956,220 ns/iter (+/- 2,818,988)

-- flume --
test read_f16_as_f16_uncompressed_1thread ... bench:   8,104,510 ns/iter (+/- 3,180,752)
test read_f16_as_f16_zip_1thread          ... bench:  17,785,760 ns/iter (+/- 2,046,426)
test read_f16_as_f16_zip_nthreads         ... bench:   9,142,180 ns/iter (+/- 3,819,020)
test read_f16_as_f32_uncompressed_1thread ... bench:   8,028,740 ns/iter (+/- 2,876,322)
test read_f16_as_f32_zip_1thread          ... bench:  17,656,880 ns/iter (+/- 1,436,841)
test read_f16_as_f32_zip_nthreads         ... bench:   8,718,960 ns/iter (+/- 3,392,016)
test read_f16_as_u32_uncompressed_1thread ... bench:   8,115,830 ns/iter (+/- 2,718,979)
test read_f32_as_f16_uncompressed_1thread ... bench:  29,787,260 ns/iter (+/- 2,147,523)
test read_f32_as_f16_zips_1thread         ... bench:  87,121,620 ns/iter (+/- 3,145,823)
test read_f32_as_f16_zips_nthreads        ... bench:  33,109,410 ns/iter (+/- 2,087,108)
test read_f32_as_f32_uncompressed_1thread ... bench:  30,033,080 ns/iter (+/- 40,816,735)
test read_f32_as_f32_zips_1thread         ... bench:  88,689,050 ns/iter (+/- 6,816,531)
test read_f32_as_f32_zips_nthreads        ... bench:  33,697,520 ns/iter (+/- 9,957,318)
test read_f32_as_u32_uncompressed_1thread ... bench:  30,318,000 ns/iter (+/- 4,575,068)

These are the numbers on my machine. Your insights seem plausible: there is no meaningful difference between the two. The channel is most likely not the bottleneck; the per-thread pixel data processing is.

mahdi-shojaee commented 6 months ago

That makes sense, thanks for clarifying.