qinxianyuzi opened 2 years ago
Are you configuring Visual Studio with /arch:AVX? GCC and clang can use __attribute__((target("avx2"))) on individual functions, but I don't think Microsoft's VS supports that, so you have to manually opt in to SIMD acceleration. If you don't opt in, you'll get the slower (non-SIMD) fallback code.
If that doesn't help, can you attach the C:/Users/huangry/Desktop/8/IMG_1071.PNG file so I can try to reproduce the slowness?
Are you configuring Visual Studio with /arch:AVX?
Oh, also, for MSVC, make sure that you're compiling an optimized build, not a debug build. I think this is the /O2 option (that's: slash, letter-O, number-2), or its GUI equivalent, but I might be wrong (I don't use Microsoft's toolchain day-to-day).
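For what it's worth, a typical command line combining both flags might look like this (file name made up; in the IDE, a Release configuration plus "Enable Enhanced Instruction Set: AVX2" should be roughly equivalent):

```shell
:: Hypothetical MSVC invocation: optimized build plus AVX2 code generation.
cl /O2 /arch:AVX2 decode_png.cpp
```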
Thanks, I'm trying to configure Visual Studio with AVX2. Maybe clang will turn out to be indispensable.
This is the PNG file.
I configured Visual Studio with /arch:AVX2, but it doesn't work.
Does "it doesn't work" mean that it didn't get faster, or does it mean that you got a compiler error message, or does it mean something else? If it's an error message, can you copy/paste it here?
It didn't get faster.
OK. Does /arch:AVX without the 2 do anything? Do you also pass /O2? It might be easier if you say what compiler flags you are passing.
Is clang faster or is it also as slow?
With clang, it is 1.2x faster than OpenCV.
Hi. I tried it on large datasets. The program has some internal overhead, but anyway ...
First dataset: 1984x1984x1540, 16-bit grayscale (a series of 1540 images; all times include overhead):
OpenCV/libpng: 75s
WIC (Windows Imaging Component)/file: 66s
WIC/memory: 58s (the file reader had some overhead reading 26MB PNG files from HDD, so it turned out to be faster to read the whole file and use the memory decoder)
Wuffs: fails (no 16-bit support; converted to 8-bit, leaving half of the output buffer empty)
Second dataset: 2048x2048x2048, 8-bit grayscale synthetic data, each PNG roughly 14kB - basically repeating b&w patterns:
Wuffs: 18s total with overhead, 9.4s in the decoder
WIC/memory: 10s total, 2.7s in the decoder (3.5x faster!!!)
OpenCV/libpng: 21s total, 5.4s in the decoder (worse overhead due to another app layer)
About /arch:AVX ... it may do something, but MSVC is very good at finding reasons why it won't optimize loops, and those reasons can be printed using the /Qvec-report:2 option in C++ / All Options / Additional Options.
The bottleneck is obviously wuffs_base__io_writer__limited_copy_u32_from_history_fast for very compressible data, which gives us
1>c:\dev-c\****\imageoperations\include\imageoperations\wuffs-v0.3.h(10427) : info C5002: loop not vectorized due to reason '1301'
1>c:\dev-c\****\imageoperations\include\imageoperations\wuffs-v0.3.h(10432) : info C5002: loop not vectorized due to reason '1301'
And from https://docs.microsoft.com/en-us/cpp/error-messages/tool-errors/vectorizer-and-parallelizer-messages?view=msvc-170#BKMK_ReasonCode130x , 1301 = Loop stride isn't +1.
Here is an example of code that it can optimize (if OutputType is the same size or shorter; otherwise it fails with reason code 1203, but the calling logic chooses an OutputType that won't overflow):
// Accumulates n input samples into the output buffer.
// The unit-stride loop is the shape MSVC's auto-vectorizer accepts.
template<typename OutputType, typename InputType>
void updateBufferFromBlock(void *output, const void *input, size_t n)
{
    const InputType* pIn = static_cast<const InputType*>(input);
    OutputType* pOut = static_cast<OutputType*>(output);
    for (size_t i = 0; i < n; i++) {
        pOut[i] += static_cast<OutputType>(pIn[i]);
    }
}
Top-down function times for a realistic dataset: https://i.imgur.com/UD5a7MF.jpg (compiled with /O2 /arch:AVX), and a comparison with other decoders: https://imgur.com/a/ZEtojo9
TL;DR: either write/generate code using AVX intrinsic instructions or don't pre-optimize it for MSVC. Windows Imaging Component seems fastest, but it works only on Windows (since Vista, or Seven ... idk).
FWIW, this patch:
diff --git a/release/c/wuffs-unsupported-snapshot.c b/release/c/wuffs-unsupported-snapshot.c
index 717414f8..ef2105cb 100644
--- a/release/c/wuffs-unsupported-snapshot.c
+++ b/release/c/wuffs-unsupported-snapshot.c
@@ -11743,13 +11743,8 @@ wuffs_base__io_writer__limited_copy_u32_from_history_fast(uint8_t** ptr_iop_w,
uint32_t distance) {
uint8_t* p = *ptr_iop_w;
uint8_t* q = p - distance;
- uint32_t n = length;
- for (; n >= 3; n -= 3) {
- *p++ = *q++;
- *p++ = *q++;
- *p++ = *q++;
- }
- for (; n; n--) {
+ size_t n = length;
+ for (size_t i = 0; i < n; i++) {
*p++ = *q++;
}
*ptr_iop_w = p;
looks like your updateBufferFromBlock suggestion, but the benchmark results are mixed: clang11 gets worse, gcc10 gets better.
name old speed new speed delta
wuffs_deflate_decode_1k_full_init/clang11 181MB/s ± 1% 179MB/s ± 1% -1.36% (p=0.008 n=5+5)
wuffs_deflate_decode_1k_part_init/clang11 215MB/s ± 0% 206MB/s ± 0% -4.53% (p=0.008 n=5+5)
wuffs_deflate_decode_10k_full_init/clang11 388MB/s ± 0% 362MB/s ± 1% -6.64% (p=0.008 n=5+5)
wuffs_deflate_decode_10k_part_init/clang11 398MB/s ± 0% 370MB/s ± 0% -7.14% (p=0.008 n=5+5)
wuffs_deflate_decode_100k_just_one_read/clang11 496MB/s ± 0% 489MB/s ± 0% -1.47% (p=0.008 n=5+5)
wuffs_deflate_decode_100k_many_big_reads/clang11 313MB/s ± 0% 302MB/s ± 0% -3.40% (p=0.008 n=5+5)
wuffs_deflate_decode_1k_full_init/gcc10 177MB/s ± 0% 179MB/s ± 1% ~ (p=0.056 n=5+5)
wuffs_deflate_decode_1k_part_init/gcc10 206MB/s ± 0% 209MB/s ± 0% +1.51% (p=0.008 n=5+5)
wuffs_deflate_decode_10k_full_init/gcc10 384MB/s ± 0% 386MB/s ± 0% +0.73% (p=0.008 n=5+5)
wuffs_deflate_decode_10k_part_init/gcc10 393MB/s ± 0% 397MB/s ± 0% +1.08% (p=0.008 n=5+5)
wuffs_deflate_decode_100k_just_one_read/gcc10 496MB/s ± 0% 523MB/s ± 0% +5.30% (p=0.008 n=5+5)
wuffs_deflate_decode_100k_many_big_reads/gcc10 314MB/s ± 0% 336MB/s ± 1% +6.96% (p=0.008 n=5+5)
mimic_deflate_decode_1k_full_init/gcc10 229MB/s ± 1% 228MB/s ± 0% ~ (p=0.310 n=5+5)
mimic_deflate_decode_10k_full_init/gcc10 275MB/s ± 0% 275MB/s ± 0% ~ (p=0.310 n=5+5)
mimic_deflate_decode_100k_just_one_read/gcc10 336MB/s ± 0% 335MB/s ± 0% -0.37% (p=0.008 n=5+5)
mimic_deflate_decode_100k_many_big_reads/gcc10 263MB/s ± 0% 264MB/s ± 0% ~ (p=0.310 n=5+5)
In any case, I'm not sure if AVX-ness (or not) would really help here. The destination and source byte slices can overlap, often by only a few bytes, in which case you can't just do a simple memcpy 32 bytes at a time.
Wuffs: fails (no 16-bit support; converted to 8-bit, leaving half of the output buffer empty)
Wuffs should be able to decode to WUFFS_BASE__PIXEL_FORMAT__Y_16LE or WUFFS_BASE__PIXEL_FORMAT__Y_16BE, but you have to opt into that (instead of defaulting to WUFFS_BASE__PIXEL_FORMAT__BGRA_PREMUL). If you're using Wuffs' C++ API, that involves overriding the SelectPixfmt method (like example/sdl-imageviewer/sdl-imageviewer.cc does).
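A rough, uncompiled sketch of what that override can look like with the wuffs_aux C++ API in v0.3 (class name is made up; the sdl-imageviewer example linked above is the authoritative version):

```cpp
// Sketch only: assumes wuffs v0.3's wuffs_aux DecodeImageCallbacks API.
class Gray16Callbacks : public wuffs_aux::DecodeImageCallbacks {
  wuffs_base__pixel_format  //
  SelectPixfmt(const wuffs_base__image_config& image_config) override {
    // Opt in to 16-bit grayscale instead of the BGRA_PREMUL default.
    return wuffs_base__make_pixel_format(WUFFS_BASE__PIXEL_FORMAT__Y_16LE);
  }
};
```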
I don't have MSVC myself, but for those who do, I'm curious if commit c226ed60f557876ace30a3c8e8b637ea80aadc2f noticeably improves PNG decode speed.
I'm sorry, I'm a little busy this week; hopefully I will get to this issue next week.
@pavel-perina any news?
Hello, thanks for helping. I'm trying to use Wuffs to open PNG files within a C++ project. I use VS2017 to compile the code, but PNG decoding is slower than OpenCV. OpenCV: 65ms, Wuffs: 93ms.