AlexGuteniev opened this issue 2 months ago
We talked about this at the weekly maintainer meeting and we agree:
I've reported the compiler issues:

- `std::copy`: DevCom-10760476
- `vector::assign`: DevCom-10760481

There isn't much hope that the compiler will improve. While reporting one of the issues, I found a similar problem, DevCom-1262302, and it is Closed - Lower Priority.
Summary
I observed that assigning or copying integer elements between vectors via STL algorithms does not get vectorized when the element bit width changes, whereas a manually written index-based loop is vectorized in all reasonable conversion cases.
The question is what to do with it.
While it may not be worth optimizing every algorithm whose input and output element sizes differ, plain assignment/copying is common and probably deserves optimization.
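For illustration, the three forms compared in this issue might look like this for a narrowing `std::int32_t` to `std::int16_t` conversion. The function names are hypothetical; per the observation above, only the index-based loop (`copy_raw`) is reported to be auto-vectorized by MSVC.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// "assign": vector::assign with a pair of iterators.
// The int32 -> int16 conversion is implicit here (MSVC may warn with C4244).
void copy_assign(const std::vector<std::int32_t>& src, std::vector<std::int16_t>& dst) {
    dst.assign(src.begin(), src.end());
}

// "copy alg": std::copy into a destination that is already sized.
void copy_alg(const std::vector<std::int32_t>& src, std::vector<std::int16_t>& dst) {
    dst.resize(src.size());
    std::copy(src.begin(), src.end(), dst.begin());
}

// "copy raw": hand-written loop over element indices.
void copy_raw(const std::vector<std::int32_t>& src, std::vector<std::int16_t>& dst) {
    dst.resize(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        dst[i] = static_cast<std::int16_t>(src[i]);
    }
}
```

All three produce the same contents; the difference reported in this issue is only in the generated code.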
Benchmark results overview
The following cases are tested:

- `vector::assign` using a pair of iterators (**assign**)
- `std::copy` of vector iterators (**copy alg**)
- a `for` loop with element-wise copying by vector index (**copy raw**)

They are tested on x64 with the default architecture option, and also with `/d2archSSE42` and with `/arch:AVX2`.
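A rough sketch of how the three cases could be timed for one source/destination element-type pair. This is a hypothetical harness (`time_ms` and `run_cases` are made-up names), not the issue's benchmark program. It uses `std::chrono` rather than a benchmark framework, so the compiler could in principle optimize parts of the loops away, and its numbers are not comparable with the reported results. Building it once per architecture option (default, `/d2archSSE42`, `/arch:AVX2`) would mirror the tested configurations.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical helper: runs fn() `reps` times and returns elapsed milliseconds.
template <class Fn>
double time_ms(Fn fn, int reps) {
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) {
        fn();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Times "assign", "copy alg" and "copy raw" for Src -> Dst and returns true
// if all three produced the same destination contents.
template <class Dst, class Src>
bool run_cases(std::size_t n, int reps) {
    std::vector<Src> src(n);
    for (std::size_t i = 0; i < n; ++i) {
        src[i] = static_cast<Src>(i % 100); // small values fit every integer type
    }
    std::vector<Dst> d_assign;
    std::vector<Dst> d_alg(n);
    std::vector<Dst> d_raw(n);

    // assign: vector::assign with a pair of iterators
    const double t_assign = time_ms([&] { d_assign.assign(src.begin(), src.end()); }, reps);
    // copy alg: std::copy of vector iterators
    const double t_alg = time_ms([&] { std::copy(src.begin(), src.end(), d_alg.begin()); }, reps);
    // copy raw: for loop with element-wise copying by index
    const double t_raw = time_ms([&] {
        for (std::size_t i = 0; i < n; ++i) {
            d_raw[i] = static_cast<Dst>(src[i]);
        }
    }, reps);

    std::printf("%zu -> %zu bytes: assign %.3f ms, copy alg %.3f ms, copy raw %.3f ms\n",
                sizeof(Src), sizeof(Dst), t_assign, t_alg, t_raw);
    return d_assign == d_alg && d_alg == d_raw;
}
```

For example, `run_cases<std::int16_t, std::int32_t>(1 << 20, 100)` would time a narrowing copy, and swapping the template arguments would time the widening one.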
There are the following
memcpy
memcpy
cases, and also in some other casesmemcpy
cases

Benchmark results
Bold means noticeably better than the default (no arch option), or, for SSE42 and AVX2, noticeably better than the previous arch level. **assign** does not vary between arch levels.
Benchmark program
```c++
// Copyright (c) Microsoft Corporation.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

#pragma warning(disable : 4244)

#include
```

Explanation
`_Uninitialized_copy[_meow]`, used in `vector::assign`, and `_Copy_meow`, used in `std::copy`, use metaprogramming to call `memmove`/`memcpy`, but otherwise have simple loops. The compiler is somehow confused by these loops and doesn't always vectorize them.

Possible solutions
The following makes sense to me
I don't think manually vectorizing every conversion is a good idea, as there are too many of them, though it would have the advantage of runtime CPU detection.
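The Explanation section says the STL helpers use metaprogramming to pick `memmove`/`memcpy` and otherwise fall back to a simple loop. A simplified, hypothetical model of that shape might look like the sketch below. `copy_contiguous_sketch` and `assign_sketch` are made-up names, this is not the actual `_Copy_meow` or `_Uninitialized_copy` code, and it assumes contiguous (pointer) ranges. Its fallback is written as an index-based loop with a precomputed count, the form the Summary reports as vectorized in hand-written code; whether MSVC vectorizes this exact function is an assumption that would need checking.

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical model, not MSVC STL code: same-type trivially copyable
// elements go to memmove; everything else is converted element by element.
template <class Src, class Dst>
Dst* copy_contiguous_sketch(const Src* first, const Src* last, Dst* dest) {
    const std::size_t count = static_cast<std::size_t>(last - first);
    if constexpr (std::is_same_v<std::remove_cv_t<Src>, Dst> && std::is_trivially_copyable_v<Dst>) {
        // Identical representation on both sides: a raw byte copy suffices.
        // Guard avoids passing a possibly-null pointer to memmove.
        if (count != 0) {
            std::memmove(dest, first, count * sizeof(Dst));
        }
    } else {
        // Bit width (or type) differs: index-based conversion loop.
        for (std::size_t i = 0; i < count; ++i) {
            dest[i] = static_cast<Dst>(first[i]);
        }
    }
    return dest + count;
}

// Hypothetical vector-level wrapper. Unlike the real vector::assign, it
// value-initializes the destination via resize() before overwriting it.
template <class Dst, class Src>
void assign_sketch(std::vector<Dst>& dst, const std::vector<Src>& src) {
    dst.resize(src.size());
    copy_contiguous_sketch(src.data(), src.data() + src.size(), dst.data());
}
```

The point of the sketch is only where an index-based fallback would sit relative to the existing `memmove` dispatch; it does not address runtime CPU detection.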