microsoft / STL

MSVC's implementation of the C++ Standard Library.

Improve `search`/`find_end` perf by dropping `memcmp` #4654

Closed AlexGuteniev closed 1 week ago

AlexGuteniev commented 1 month ago

Resolves #2453

These benchmark results are no longer relevant, as the PR intention has changed.

Before:

```
---------------------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
---------------------------------------------------------------------------------
c_strstr                       184 ns         119 ns      8960000
ranges_search                 6276 ns        3899 ns       344615
ranges_search                 6358 ns        3625 ns       331852
ranges_search                 5652 ns        3692 ns       431622
ranges_search                 7510 ns        4269 ns       373333
search_default_searcher       1720 ns         921 ns      1120000
search_default_searcher       2497 ns        1438 ns      1000000
search_default_searcher       2224 ns        1001 ns      1280000
search_default_searcher       2762 ns        1297 ns      1000000
```

After:

```
---------------------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
---------------------------------------------------------------------------------
c_strstr                       183 ns         102 ns     10000000
ranges_search                  678 ns         355 ns      4266667
ranges_search                 1285 ns         698 ns      1723077
ranges_search                 2317 ns        1111 ns      1280000
ranges_search                 4736 ns        2537 ns       597333
search_default_searcher        669 ns         353 ns      4072727
search_default_searcher       1293 ns         582 ns      1906383
search_default_searcher       2261 ns        1400 ns       814545
search_default_searcher       4756 ns        2511 ns       497778
```

`strstr` is given as a reference in the benchmark; it is not affected by the optimization.

It may be impossible to reach `strstr` performance, as it uses `pcmpistri` (and reads beyond the last element, since `pcmpistri` is not very useful otherwise). We could try `pcmpestri` for the 8-bit and 16-bit cases, but it may still not be as efficient as `strstr`. I'd prefer to try this additional optimization in a follow-up PR, though.

AlexGuteniev commented 1 month ago

The difference in the "before" results between `ranges::search` and `search(..., default_searcher)` is due to different implementations: `search(..., default_searcher)` doesn't try to use `memcmp` on every iteration, it compares elements directly. Apparently, this is faster.

It is curious that `search(..., default_searcher)` is faster for the 64-bit element type even before vectorization.

I'd like someone else to confirm the results, and a maintainers' decision on what to do with this.

AlexGuteniev commented 1 month ago

A possible way to handle this is to remove the 32-bit and 64-bit optimization/vectorization attempts altogether, so that the `ranges::search` case (along with the usual `std::search`) would behave the same as `search(..., default_searcher)`.

AlexGuteniev commented 1 month ago

And if we are to keep vectorization only for 8-bit and 16-bit elements, we might as well drop the current implementation and not review/commit it in the first place, if SSE4.2 `pcmpestri` smells like it would be faster.

StephanTLavavej commented 2 weeks ago

I pushed changes to address the issues that I found, but I still need to think about whether we should be vectorizing this at all. I'm leaning towards ripping out the existing memcmp. Stay tuned...

StephanTLavavej commented 2 weeks ago

Ok, after looking at the benchmarks, I've taken the radical step of reverting both the vectorization changes and the existing memcmp "optimizations" that were massively pessimizing your reasonable benchmark example. My measurements:

```
Benchmark                                    main    Vector     Plain
c_strstr                                   144 ns    151 ns    145 ns
classic_search<std::uint8_t>              8935 ns   1726 ns   1754 ns
classic_search<std::uint16_t>             9732 ns   3017 ns   1739 ns
classic_search<std::uint32_t>             9029 ns   4829 ns   1912 ns
classic_search<std::uint64_t>             8970 ns   8471 ns   2527 ns
ranges_search<std::uint8_t>               8916 ns   1723 ns   1784 ns
ranges_search<std::uint16_t>              8932 ns   3012 ns   1785 ns
ranges_search<std::uint32_t>              8303 ns   4840 ns   2460 ns
ranges_search<std::uint64_t>              9061 ns   8577 ns   1860 ns
search_default_searcher<std::uint8_t>     1858 ns   1720 ns   1883 ns
search_default_searcher<std::uint16_t>    2525 ns   3016 ns   1861 ns
search_default_searcher<std::uint32_t>    1863 ns   4812 ns   1869 ns
search_default_searcher<std::uint64_t>    1873 ns   8440 ns   2592 ns
```

This compares main (with only the benchmark added), this vectorization PR before my revert, and finally "Plain" which lacks both vectorization and memcmp.

There's some noise here (e.g. search_default_searcher is unchanged between main and Plain, but there were random 2500 ns spikes), but the pattern is quite clear: memcmp is almost a 5x penalty for all element sizes, vectorization helps smaller elements, but plain code results in great performance across the board.

Note that dropping the memcmp paths from _Equal_rev_pred_unchecked and _Equal_rev_pred has very limited impact. _Equal_rev_pred_unchecked is called by classic/parallel search/find_end, and _Equal_rev_pred is called by ranges search/find_end. (This makes sense, since find_end is the "opposite" of search.)

The equal etc. algorithms aren't affected by this. We should probably evaluate their use of memcmp, but I expect the behavior to be quite different (and I hope that memcmp is beneficial). It's search/find_end that are unusual because they want to repeatedly compare a needle, which has very different performance characteristics.

In addition to keeping the benchmark from this vectorization attempt, I've also kept the correctness test (even though search is no longer vectorized), since it's quick to run, and we might figure out how to vectorize this beneficially in the future.

AlexGuteniev commented 2 weeks ago

> The equal etc. algorithms aren't affected by this. We should probably evaluate their use of memcmp, but I expect the behavior to be quite different (and I hope that memcmp is beneficial)

Agreed. `mismatch` / `lexicographical_compare` did benefit from a similar optimization.

StephanTLavavej commented 2 weeks ago

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej commented 1 week ago

Thanks for investigating and improving the performance here even if the final result was very different from the initial vision! :crystal_ball: :magic_wand: :rocket: