This is a rephrasing of #4456, incorporating all progress made so far.
count and count_if can be auto-vectorized as follows:
For sizeof(difference_type) == sizeof(T), they are already auto-vectorized.
For sizeof(difference_type) < sizeof(T), an approach similar to #4627 can be used.
For sizeof(difference_type) > sizeof(T), an approach similar to #4627 can also be used, but it will not cover some large array sizes. To cover large array sizes, the range can additionally be split into smaller ranges, chosen so that within each smaller range T is enough to represent the count.
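The range-splitting idea for sizeof(difference_type) > sizeof(T) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the approach of #4627 and not STL code: chunked_count is a hypothetical name, the element type is assumed to be a non-bool integral type, and an unsigned counter of the same width as T is used within each chunk. Whether a given compiler actually vectorizes the inner loop is not guaranteed.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <type_traits>
#include <vector>

// Hypothetical helper (not part of the STL): counts elements equal to `value`
// in [first, last), accumulating per chunk in a counter as wide as T.
template <class T>
std::ptrdiff_t chunked_count(const T* first, const T* last, const T value) {
    static_assert(std::is_integral_v<T> && !std::is_same_v<T, bool>,
        "this sketch assumes a non-bool integral element type");

    using Counter = std::make_unsigned_t<T>; // same width as T

    // Largest chunk for which the narrow counter cannot overflow,
    // even if every element in the chunk matches.
    constexpr std::size_t max_chunk = std::numeric_limits<Counter>::max();

    std::ptrdiff_t total = 0;
    while (first != last) {
        const std::size_t remaining = static_cast<std::size_t>(last - first);
        const std::size_t chunk     = std::min(remaining, max_chunk);

        // Inner loop: element and counter have the same width, which is the
        // case described above as already auto-vectorized.
        Counter partial = 0;
        for (std::size_t i = 0; i < chunk; ++i) {
            partial = static_cast<Counter>(partial + (first[i] == value));
        }

        // Widen once per chunk instead of once per element.
        total += static_cast<std::ptrdiff_t>(partial);
        first += chunk;
    }
    return total;
}
```

For a 1-byte T this limits each chunk to 255 elements, so the chunk size is a trade-off between counter width and the per-chunk widening overhead; a real implementation might pick a different inner counter type.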
For count_if, this would be the only feasible way to vectorize, as predicates cannot be used in a separately compiled implementation, and we don't want complex manual vectorization with intrinsics in headers for throughput reasons.
For count, this can still be an alternative to manual vectorization. When compiling with /arch:AVX2, auto-vectorization seems to perform not much worse than the existing manual vectorization for large ranges, but significantly worse for small ranges with large tails (auto-vectorization does not use masks to process the tail). So we can:
Add auto-vectorization as an alternative to manual vectorization, for when the latter is not available (ARM64, or opt-out from _USE_STD_VECTOR_ALGORITHMS)
Use auto-vectorization as the only implementation (lose some perf for tails, but have a unified vectorization implementation)