AlexGuteniev closed this 1 week ago
I also dropped the `bool _Unused` parameter.
With the extra dispatcher, the inlining decisions change: the dispatcher is now inlined into the exported functions, along with the scalar implementation, while the vector implementations are tail-called, and signature variations are unlikely to prevent that.
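A minimal sketch of the dispatch shape described above (names like `min_element_dispatch` and the `use_avx2` detection stub are illustrative, not the actual STL code): the exported function inlines the ISA check and the scalar fallback, while the vector path stays out of line and is reached by a plain tail call.

```cpp
#include <cstddef>
#include <cstdint>

namespace {
    // Stand-in for runtime CPU feature detection (hypothetical).
    bool use_avx2() { return true; }

    // Out-of-line "vector" implementation; in the real code this would use
    // AVX2 intrinsics. Marked noinline so the caller can tail-call it.
    __attribute__((noinline))
    const int32_t* min_element_vec(const int32_t* first, const int32_t* last) {
        const int32_t* best = first;
        for (const int32_t* p = first; p != last; ++p)
            if (*p < *best) best = p;
        return best;
    }
} // namespace

// "Exported" function: the dispatcher and the scalar fallback are inlined
// here; the vector path is a tail call, so the argument registers pass
// through unchanged and no extra frame is needed for it.
const int32_t* min_element_dispatch(const int32_t* first, const int32_t* last) {
    if (first == last)
        return last;                           // trivial case, inlined
    if (use_avx2())
        return min_element_vec(first, last);   // tail call to the vector path
    const int32_t* best = first;               // inlined scalar fallback
    for (const int32_t* p = first; p != last; ++p)
        if (*p < *best) best = p;
    return best;
}
```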
Results as a table
Benchmark | Before | After |
---|---|---|
bm<uint8_t, 8021, Op::Min> | 248 ns | 170 ns |
bm<uint8_t, 8021, Op::Max> | 238 ns | 170 ns |
bm<uint8_t, 8021, Op::Both> | 371 ns | 295 ns |
bm<uint8_t, 8021, Op::Min_val> | 131 ns | 76.7 ns |
bm<uint8_t, 8021, Op::Max_val> | 128 ns | 75.4 ns |
bm<uint8_t, 8021, Op::Both_val> | 4197 ns | 4312 ns |
bm<uint16_t, 8021, Op::Min> | 459 ns | 336 ns |
bm<uint16_t, 8021, Op::Max> | 445 ns | 318 ns |
bm<uint16_t, 8021, Op::Both> | 685 ns | 553 ns |
bm<uint16_t, 8021, Op::Min_val> | 247 ns | 142 ns |
bm<uint16_t, 8021, Op::Max_val> | 252 ns | 139 ns |
bm<uint16_t, 8021, Op::Both_val> | 4239 ns | 4306 ns |
bm<uint32_t, 8021, Op::Min> | 979 ns | 615 ns |
bm<uint32_t, 8021, Op::Max> | 932 ns | 623 ns |
bm<uint32_t, 8021, Op::Both> | 1439 ns | 1071 ns |
bm<uint32_t, 8021, Op::Min_val> | 501 ns | 258 ns |
bm<uint32_t, 8021, Op::Max_val> | 494 ns | 258 ns |
bm<uint32_t, 8021, Op::Both_val> | 673 ns | 374 ns |
bm<uint64_t, 8021, Op::Min> | 4252 ns | 3540 ns |
bm<uint64_t, 8021, Op::Max> | 4360 ns | 3468 ns |
bm<uint64_t, 8021, Op::Both> | 4397 ns | 4271 ns |
bm<uint64_t, 8021, Op::Min_val> | 3844 ns | 2917 ns |
bm<uint64_t, 8021, Op::Max_val> | 3857 ns | 2974 ns |
bm<uint64_t, 8021, Op::Both_val> | 3862 ns | 3090 ns |
bm<int8_t, 8021, Op::Min> | 246 ns | 177 ns |
bm<int8_t, 8021, Op::Max> | 235 ns | 177 ns |
bm<int8_t, 8021, Op::Both> | 361 ns | 288 ns |
bm<int8_t, 8021, Op::Min_val> | 126 ns | 77.4 ns |
bm<int8_t, 8021, Op::Max_val> | 128 ns | 74.1 ns |
bm<int8_t, 8021, Op::Both_val> | 3842 ns | 3843 ns |
bm<int16_t, 8021, Op::Min> | 460 ns | 339 ns |
bm<int16_t, 8021, Op::Max> | 445 ns | 321 ns |
bm<int16_t, 8021, Op::Both> | 683 ns | 553 ns |
bm<int16_t, 8021, Op::Min_val> | 251 ns | 140 ns |
bm<int16_t, 8021, Op::Max_val> | 249 ns | 139 ns |
bm<int16_t, 8021, Op::Both_val> | 3318 ns | 3377 ns |
bm<int32_t, 8021, Op::Min> | 965 ns | 620 ns |
bm<int32_t, 8021, Op::Max> | 903 ns | 625 ns |
bm<int32_t, 8021, Op::Both> | 1405 ns | 1059 ns |
bm<int32_t, 8021, Op::Min_val> | 497 ns | 254 ns |
bm<int32_t, 8021, Op::Max_val> | 505 ns | 254 ns |
bm<int32_t, 8021, Op::Both_val> | 690 ns | 379 ns |
bm<int64_t, 8021, Op::Min> | 4466 ns | 3532 ns |
bm<int64_t, 8021, Op::Max> | 4385 ns | 3462 ns |
bm<int64_t, 8021, Op::Both> | 4845 ns | 4130 ns |
bm<int64_t, 8021, Op::Min_val> | 5156 ns | 3011 ns |
bm<int64_t, 8021, Op::Max_val> | 4003 ns | 2945 ns |
bm<int64_t, 8021, Op::Both_val> | 3847 ns | 3264 ns |
bm<float, 8021, Op::Min> | 1965 ns | 1176 ns |
bm<float, 8021, Op::Max> | 2014 ns | 1208 ns |
bm<float, 8021, Op::Both> | 2254 ns | 1358 ns |
bm<float, 8021, Op::Min_val> | 1870 ns | 894 ns |
bm<float, 8021, Op::Max_val> | 1838 ns | 891 ns |
bm<float, 8021, Op::Both_val> | 1886 ns | 949 ns |
bm<double, 8021, Op::Min> | 3931 ns | 2230 ns |
bm<double, 8021, Op::Max> | 4004 ns | 2421 ns |
bm<double, 8021, Op::Both> | 4809 ns | 2765 ns |
bm<double, 8021, Op::Min_val> | 3776 ns | 1860 ns |
bm<double, 8021, Op::Max_val> | 3769 ns | 1869 ns |
bm<double, 8021, Op::Both_val> | 3850 ns | 1939 ns |
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.
AVX2 fast, AVX2 furious! :car: :blue_car: :racing_car:
Resolves #2803
This is not the final optimization: at the very least, we should use AVX masks here too. But this one is complex enough already, so the rest will come in follow-up PR(s).
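To illustrate what "use AVX masks" would look like, here is a hedged sketch (not the actual PR code; `min_avx2` and the mask construction are my own illustration) of handling the tail of an `int32_t` range with `_mm256_maskload_epi32` instead of a scalar loop:

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: AVX2 min over int32_t with a masked tail load,
// so no scalar loop is needed for the last (n % 8) elements.
__attribute__((target("avx2")))
int32_t min_avx2(const int32_t* data, size_t n) {
    __m256i vmin = _mm256_set1_epi32(INT32_MAX);
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        vmin = _mm256_min_epi32(vmin,
            _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i)));
    }
    const size_t rem = n - i;
    if (rem != 0) {
        // Build a per-lane mask: lane j is active iff j < rem.
        alignas(32) int32_t mask_src[8];
        for (int j = 0; j < 8; ++j)
            mask_src[j] = j < static_cast<int>(rem) ? -1 : 0;
        const __m256i mask =
            _mm256_load_si256(reinterpret_cast<const __m256i*>(mask_src));
        __m256i tail = _mm256_maskload_epi32(data + i, mask);
        // Masked-off lanes read as zero; neutralize them with INT32_MAX.
        tail = _mm256_blendv_epi8(_mm256_set1_epi32(INT32_MAX), tail, mask);
        vmin = _mm256_min_epi32(vmin, tail);
    }
    // Horizontal reduction of the 8 lanes.
    alignas(32) int32_t lanes[8];
    _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), vmin);
    int32_t result = lanes[0];
    for (int j = 1; j < 8; ++j)
        if (lanes[j] < result) result = lanes[j];
    return result;
}
```

The mask buffer here is built with a scalar loop for clarity; a real implementation would more likely load the mask from a static table indexed by `rem`.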
I also notice that the `Both_val` 8-bit and 16-bit cases are slow. Vectorization is not engaged for them; that is a separate issue from the AVX work.