AlexGuteniev closed this pull request 1 week ago.
The difference in "before" results between `ranges::search` and `search(..., default_searcher)` is due to different implementations: `search(..., default_searcher)` doesn't try to use `memcmp` on every iteration, it compares elements directly. Apparently, this is faster.
It is curious that `search(..., default_searcher)` is faster for the 64-bit type even before vectorization. I'd like someone else to confirm these results, and a maintainers' decision on what to do with this.
A possible way to handle it is to remove the 32-bit and 64-bit optimization/vectorization attempts altogether, so that the `ranges::search` case (along with the usual `std::search`) would behave the same as `search(..., default_searcher)`. And if we are to keep vectorization only for 8-bit and 16-bit elements, we might drop the current implementation and not review/commit it in the first place, if SSE4.2 `pcmpestri` smells like it would be faster.
I pushed changes to address the issues that I found, but I still need to think about whether we should be vectorizing this at all. I'm leaning towards ripping out the existing `memcmp` optimization. Stay tuned...
Ok, after looking at the benchmarks, I've taken the radical step of reverting both the vectorization changes and the existing `memcmp` "optimizations" that were massively pessimizing your reasonable benchmark example. My measurements:
Benchmark | main | Vector | Plain
---|---|---|---
`c_strstr` | 144 ns | 151 ns | 145 ns
`classic_search<std::uint8_t>` | 8935 ns | 1726 ns | 1754 ns
`classic_search<std::uint16_t>` | 9732 ns | 3017 ns | 1739 ns
`classic_search<std::uint32_t>` | 9029 ns | 4829 ns | 1912 ns
`classic_search<std::uint64_t>` | 8970 ns | 8471 ns | 2527 ns
`ranges_search<std::uint8_t>` | 8916 ns | 1723 ns | 1784 ns
`ranges_search<std::uint16_t>` | 8932 ns | 3012 ns | 1785 ns
`ranges_search<std::uint32_t>` | 8303 ns | 4840 ns | 2460 ns
`ranges_search<std::uint64_t>` | 9061 ns | 8577 ns | 1860 ns
`search_default_searcher<std::uint8_t>` | 1858 ns | 1720 ns | 1883 ns
`search_default_searcher<std::uint16_t>` | 2525 ns | 3016 ns | 1861 ns
`search_default_searcher<std::uint32_t>` | 1863 ns | 4812 ns | 1869 ns
`search_default_searcher<std::uint64_t>` | 1873 ns | 8440 ns | 2592 ns
This compares `main` (with only the benchmark added), "Vector" (this vectorization PR before my revert), and finally "Plain", which lacks both vectorization and `memcmp`.
There's some noise here (e.g. `search_default_searcher` is unchanged between `main` and Plain, but there were random 2500 ns spikes), but the pattern is quite clear: `memcmp` is almost a 5x penalty for all element sizes, vectorization helps smaller elements, but plain code delivers great performance across the board.
Note that dropping the `memcmp` paths from `_Equal_rev_pred_unchecked` and `_Equal_rev_pred` has very limited impact. `_Equal_rev_pred_unchecked` is called by classic/parallel `search`/`find_end`, and `_Equal_rev_pred` is called by ranges `search`/`find_end`. (This makes sense, since `find_end` is the "opposite" of `search`.)
The `equal` etc. algorithms aren't affected by this. We should probably evaluate their use of `memcmp`, but I expect the behavior to be quite different (and I hope that `memcmp` is beneficial there). It's `search`/`find_end` that are unusual because they repeatedly compare a needle, which has very different performance characteristics.
In addition to keeping the benchmark from this vectorization attempt, I've also kept the correctness test (even though `search` is no longer vectorized), since it's quick to run, and we might figure out how to vectorize this beneficially in the future.
> The `equal` etc. algorithms aren't affected by this. We should probably evaluate their use of `memcmp`, but I expect the behavior to be quite different (and I hope that `memcmp` is beneficial)
Agreed. `mismatch`/`lexicographical_compare` did benefit from a similar optimization.
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.
Thanks for investigating and improving the performance here even if the final result was very different from the initial vision! :crystal_ball: :magic_wand: :rocket:
Resolves #2453
These benchmark results are no longer relevant, as the PR's intention has changed.
Before:

```
---------------------------------------------------------------------------------
Benchmark                                       Time             CPU   Iterations
---------------------------------------------------------------------------------
c_strstr                                      184 ns          119 ns      8960000
ranges_search
```

`strstr` is given for a reference in the benchmark; it is not affected by the optimization. It may be impossible to reach `strstr` performance, as it uses `pcmpistri` (and reads beyond the last element, as `pcmpistri` is not very useful otherwise). We can try `pcmpestri` for the 8-bit and 16-bit cases, but it may still not be as efficient as `strstr`. I'd prefer to try this additional optimization in a next PR, though.