microsoft / STL

MSVC's implementation of the C++ Standard Library.

Improve `search`/`find_end` perf by dropping `memcmp` #4654

Closed AlexGuteniev closed 1 week ago

AlexGuteniev commented 1 month ago

Resolves #2453

These benchmark results are no longer relevant, as the PR intention has changed.

Before:

```
---------------------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
---------------------------------------------------------------------------------
c_strstr                       184 ns         119 ns      8960000
ranges_search                 6276 ns        3899 ns       344615
ranges_search                 6358 ns        3625 ns       331852
ranges_search                 5652 ns        3692 ns       431622
ranges_search                 7510 ns        4269 ns       373333
search_default_searcher       1720 ns         921 ns      1120000
search_default_searcher       2497 ns        1438 ns      1000000
search_default_searcher       2224 ns        1001 ns      1280000
search_default_searcher       2762 ns        1297 ns      1000000
```

After:

```
---------------------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
---------------------------------------------------------------------------------
c_strstr                       183 ns         102 ns     10000000
ranges_search                  678 ns         355 ns      4266667
ranges_search                 1285 ns         698 ns      1723077
ranges_search                 2317 ns        1111 ns      1280000
ranges_search                 4736 ns        2537 ns       597333
search_default_searcher        669 ns         353 ns      4072727
search_default_searcher       1293 ns         582 ns      1906383
search_default_searcher       2261 ns        1400 ns       814545
search_default_searcher       4756 ns        2511 ns       497778
```

`strstr` is given as a reference in the benchmark; it is not affected by the optimization.

It may be impossible to reach `strstr` performance, as it uses `pcmpistri` (and reads beyond the last element, since `pcmpistri` is not very useful otherwise). We could try `pcmpestri` for the 8-bit and 16-bit cases, but it may still not be as efficient as `strstr`. I'd prefer to try this additional optimization in a follow-up PR, though.

AlexGuteniev commented 1 month ago

The difference in the "before" results between `ranges::search` and `search(..., default_searcher)` is due to different implementations: `search(..., default_searcher)` doesn't try to use `memcmp` on every iteration, it compares elements directly. Apparently, this is faster.

It is curious that `search(..., default_searcher)` is faster for the 64-bit element type even before vectorization.

I'd like someone else to confirm the results, and a maintainers' decision on what to do with this.

AlexGuteniev commented 1 month ago

A possible way to handle this is to remove the 32-bit and 64-bit optimization/vectorization attempts altogether, so that the `ranges::search` case (along with the usual `std::search`) would behave the same as `search(..., default_searcher)`.

AlexGuteniev commented 1 month ago

And if we are to keep vectorization only for 8-bit and 16-bit elements, we might as well drop the current implementation and not review/commit it in the first place, if SSE4.2 `pcmpestri` smells like it would be faster.

StephanTLavavej commented 2 weeks ago

I pushed changes to address the issues that I found, but I still need to think about whether we should be vectorizing this at all. I'm leaning towards ripping out the existing memcmp. Stay tuned...

StephanTLavavej commented 2 weeks ago

Ok, after looking at the benchmarks, I've taken the radical step of reverting both the vectorization changes and the existing memcmp "optimizations" that were massively pessimizing your reasonable benchmark example. My measurements:

```
Benchmark                                    main    Vector     Plain
c_strstr                                   144 ns    151 ns    145 ns
classic_search<std::uint8_t>              8935 ns   1726 ns   1754 ns
classic_search<std::uint16_t>             9732 ns   3017 ns   1739 ns
classic_search<std::uint32_t>             9029 ns   4829 ns   1912 ns
classic_search<std::uint64_t>             8970 ns   8471 ns   2527 ns
ranges_search<std::uint8_t>               8916 ns   1723 ns   1784 ns
ranges_search<std::uint16_t>              8932 ns   3012 ns   1785 ns
ranges_search<std::uint32_t>              8303 ns   4840 ns   2460 ns
ranges_search<std::uint64_t>              9061 ns   8577 ns   1860 ns
search_default_searcher<std::uint8_t>     1858 ns   1720 ns   1883 ns
search_default_searcher<std::uint16_t>    2525 ns   3016 ns   1861 ns
search_default_searcher<std::uint32_t>    1863 ns   4812 ns   1869 ns
search_default_searcher<std::uint64_t>    1873 ns   8440 ns   2592 ns
```

This compares main (with only the benchmark added), this vectorization PR before my revert, and finally "Plain" which lacks both vectorization and memcmp.

There's some noise here (e.g. search_default_searcher is unchanged between main and Plain, but there were random 2500 ns spikes), but the pattern is quite clear: memcmp is almost a 5x penalty for all element sizes, vectorization helps smaller elements, but plain code results in great performance across the board.

Note that dropping the memcmp paths from _Equal_rev_pred_unchecked and _Equal_rev_pred has very limited impact. _Equal_rev_pred_unchecked is called by classic/parallel search/find_end, and _Equal_rev_pred is called by ranges search/find_end. (This makes sense, since find_end is the "opposite" of search.)

The equal etc. algorithms aren't affected by this. We should probably evaluate their use of memcmp, but I expect the behavior to be quite different (and I hope that memcmp is beneficial). It's search/find_end that are unusual because they want to repeatedly compare a needle, which has very different performance characteristics.

In addition to keeping the benchmark from this vectorization attempt, I've also kept the correctness test (even though search is no longer vectorized), since it's quick to run, and we might figure out how to vectorize this beneficially in the future.

AlexGuteniev commented 2 weeks ago

> The equal etc. algorithms aren't affected by this. We should probably evaluate their use of memcmp, but I expect the behavior to be quite different (and I hope that memcmp is beneficial)

Agreed. `mismatch` / `lexicographical_compare` did benefit from a similar optimization.

StephanTLavavej commented 2 weeks ago

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej commented 1 week ago

Thanks for investigating and improving the performance here even if the final result was very different from the initial vision! :crystal_ball: :magic_wand: :rocket: