Vectorize `std::search` of 1 and 2 bytes elements with `pcmpestr*`

AlexGuteniev commented 4 days ago

Different approach for both search and inner comparison (SSE4.2 instead of AVX2). This time the results are better.

For now 1 and 2 bytes element only. The same slightly modified approach can be used for 4 and 8 bytes elements, but need to test if there would be still a performance gain.

In benchmark results 0 is small needle, 1 is large needle.

Benchmark	main	ths
c_strstr/0	186 ns	184 ns
c_strstr/1	213 ns	213 ns
classic_search/0	2045 ns	270 ns
classic_search/1	2221 ns	302 ns
classic_search/0	1588 ns	531 ns
classic_search/1	1766 ns	586 ns
ranges_search/0	1748 ns	268 ns
ranges_search/1	1989 ns	306 ns
ranges_search/0	1673 ns	585 ns
ranges_search/1	1843 ns	600 ns
search_default_searcher/0	1494 ns	269 ns
search_default_searcher/1	1626 ns	309 ns
search_default_searcher/0	2002 ns	528 ns
search_default_searcher/1	2286 ns	599 ns

AlexGuteniev commented 4 days ago

The previous attempt was #4654 and it ended up being just memcmp removal; see https://github.com/microsoft/STL/pull/4654#issuecomment-2159504811

StephanTLavavej commented 4 days ago

:warning: Note to self, check the benchmarks on my machine.

AlexGuteniev commented 3 days ago

Large needles are expected to be not much worse performance, but there would be a different branch with a somewhat different approach. I can do them in this PR and not subsequent PR, if you prefer that.

StephanTLavavej commented 1 day ago

This PR isn't too massive, I think it would make sense to add large needle logic here.

AlexGuteniev commented 15 hours ago

This PR isn't too massive, I think it would make sense to add large needle logic here.

Updated. Can raise the bet even more by adding find_end or adding search_n or by trying to make 4 and 8 byte elements (although I'm skeptical on larger elements).

StephanTLavavej commented 13 hours ago

/azp run

azure-pipelines[bot] commented 13 hours ago

Azure Pipelines successfully started running 2 pipeline(s).

microsoft / STL

Vectorize `std::search` of 1 and 2 bytes elements with `pcmpestr*` #4745