qinxianyuzi opened 2 years ago
Are you configuring Visual Studio with /arch:AVX? GCC and clang can use __attribute__((target("avx2"))) on individual functions, but I don't think Microsoft's VS supports that, so you have to manually opt in to SIMD acceleration. If you don't opt in, you'll get the slower (non-SIMD) fallback code.
If that doesn't help, can you attach the C:/Users/huangry/Desktop/8/IMG_1071.PNG file so I can try to reproduce the slowness?
Are you configuring Visual Studio with /arch:AVX?
Oh, also, for MSVC, make sure that you're compiling an optimized build, not a debug build. I think this is the /O2 option (that's: slash, letter-O, number-2), or its GUI equivalent, but I might be wrong (I don't use Microsoft's toolchain day-to-day).
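For what it's worth, a typical command line combining both flags might look like this (file name made up; in the IDE, a Release configuration plus "Enable Enhanced Instruction Set: AVX2" should be roughly equivalent):

```shell
:: Hypothetical MSVC invocation: optimized build plus AVX2 code generation.
cl /O2 /arch:AVX2 decode_png.cpp
```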
Thanks, I'm trying to configure Visual Studio with AVX2. Maybe clang will turn out to be indispensable.
This is the PNG file.
I configured Visual Studio with /arch:AVX2, but it doesn't work.
Does "it doesn't work" mean that it didn't get faster, or does it mean that you got a compiler error message, or does it mean something else? If it's an error message, can you copy/paste it here?
It didn't get faster.
OK. Does /arch:AVX without the 2 do anything? Do you also pass /O2? It might be easier if you say what compiler flags you are passing.
Is clang faster or is it also as slow?
With clang, it is 1.2x faster than OpenCV.
Hi. I tried it on large datasets. The program has some internal overhead, but anyway ...
First dataset: 1984x1984x1540, 16-bit grayscale (a series of 1540 images; all times include overhead):
OpenCV/libpng: 75s
WIC (Windows Imaging Component)/file: 66s
WIC/memory: 58s (the file reader had some overhead reading 26MB PNG files from HDD, so it turned out to be faster to read the whole file and use the memory decoder)
Wuffs: fails (no 16-bit support; converted to 8-bit, leaving half of the output buffer empty)
Second dataset: 2048x2048x2048, 8-bit grayscale synthetic data, each PNG roughly 14kB - basically repeating b&w patterns:
Wuffs: 18s total with overhead, 9.4s in the decoder
WIC/memory: 10s total, 2.7s in the decoder (3.5x faster!!!)
OpenCV/libpng: 21s total, 5.4s in the decoder (worse overhead due to another app layer)
About /arch:AVX ... it may do something, but MSVC is very good at finding reasons why it won't optimize loops, and those reasons can be printed using the /Qvec-report:2 option in C++ / All Options / Additional Options.
The bottleneck is obviously wuffs_base__io_writer__limited_copy_u32_from_history_fast for very compressible data, which gives us
1>c:\dev-c\****\imageoperations\include\imageoperations\wuffs-v0.3.h(10427) : info C5002: loop not vectorized due to reason '1301'
1>c:\dev-c\****\imageoperations\include\imageoperations\wuffs-v0.3.h(10432) : info C5002: loop not vectorized due to reason '1301'
And from https://docs.microsoft.com/en-us/cpp/error-messages/tool-errors/vectorizer-and-parallelizer-messages?view=msvc-170#BKMK_ReasonCode130x , 1301 = Loop stride isn't +1.
Here is an example of code that it can optimize (if OutputType is the same size or shorter; otherwise it fails with reason code 1203, but the calling logic chooses an OutputType that won't overflow):
// Accumulates n input samples into the output buffer.
// The unit-stride loop is the shape MSVC's auto-vectorizer accepts.
template<typename OutputType, typename InputType>
void updateBufferFromBlock(void *output, const void *input, size_t n)
{
    const InputType* pIn = static_cast<const InputType*>(input);
    OutputType* pOut = static_cast<OutputType*>(output);
    for (size_t i = 0; i < n; i++) {
        pOut[i] += static_cast<OutputType>(pIn[i]);
    }
}
Top-down function times for a realistic dataset: https://i.imgur.com/UD5a7MF.jpg (compiled with /O2 /arch:AVX), and a comparison with other decoders: https://imgur.com/a/ZEtojo9
TL;DR: either write/generate code using AVX intrinsic instructions or don't pre-optimize it for MSVC. Windows Imaging Component seems fastest, but it works only on Windows (since Vista, or Seven ... idk).
FWIW, this patch:
diff --git a/release/c/wuffs-unsupported-snapshot.c b/release/c/wuffs-unsupported-snapshot.c
index 717414f8..ef2105cb 100644
--- a/release/c/wuffs-unsupported-snapshot.c
+++ b/release/c/wuffs-unsupported-snapshot.c
@@ -11743,13 +11743,8 @@ wuffs_base__io_writer__limited_copy_u32_from_history_fast(uint8_t** ptr_iop_w,
uint32_t distance) {
uint8_t* p = *ptr_iop_w;
uint8_t* q = p - distance;
- uint32_t n = length;
- for (; n >= 3; n -= 3) {
- *p++ = *q++;
- *p++ = *q++;
- *p++ = *q++;
- }
- for (; n; n--) {
+ size_t n = length;
+ for (size_t i = 0; i < n; i++) {
*p++ = *q++;
}
*ptr_iop_w = p;
looks like your updateBufferFromBlock suggestion, but the benchmark results are mixed: clang11 gets worse, gcc10 gets better.
name old speed new speed delta
wuffs_deflate_decode_1k_full_init/clang11 181MB/s ± 1% 179MB/s ± 1% -1.36% (p=0.008 n=5+5)
wuffs_deflate_decode_1k_part_init/clang11 215MB/s ± 0% 206MB/s ± 0% -4.53% (p=0.008 n=5+5)
wuffs_deflate_decode_10k_full_init/clang11 388MB/s ± 0% 362MB/s ± 1% -6.64% (p=0.008 n=5+5)
wuffs_deflate_decode_10k_part_init/clang11 398MB/s ± 0% 370MB/s ± 0% -7.14% (p=0.008 n=5+5)
wuffs_deflate_decode_100k_just_one_read/clang11 496MB/s ± 0% 489MB/s ± 0% -1.47% (p=0.008 n=5+5)
wuffs_deflate_decode_100k_many_big_reads/clang11 313MB/s ± 0% 302MB/s ± 0% -3.40% (p=0.008 n=5+5)
wuffs_deflate_decode_1k_full_init/gcc10 177MB/s ± 0% 179MB/s ± 1% ~ (p=0.056 n=5+5)
wuffs_deflate_decode_1k_part_init/gcc10 206MB/s ± 0% 209MB/s ± 0% +1.51% (p=0.008 n=5+5)
wuffs_deflate_decode_10k_full_init/gcc10 384MB/s ± 0% 386MB/s ± 0% +0.73% (p=0.008 n=5+5)
wuffs_deflate_decode_10k_part_init/gcc10 393MB/s ± 0% 397MB/s ± 0% +1.08% (p=0.008 n=5+5)
wuffs_deflate_decode_100k_just_one_read/gcc10 496MB/s ± 0% 523MB/s ± 0% +5.30% (p=0.008 n=5+5)
wuffs_deflate_decode_100k_many_big_reads/gcc10 314MB/s ± 0% 336MB/s ± 1% +6.96% (p=0.008 n=5+5)
mimic_deflate_decode_1k_full_init/gcc10 229MB/s ± 1% 228MB/s ± 0% ~ (p=0.310 n=5+5)
mimic_deflate_decode_10k_full_init/gcc10 275MB/s ± 0% 275MB/s ± 0% ~ (p=0.310 n=5+5)
mimic_deflate_decode_100k_just_one_read/gcc10 336MB/s ± 0% 335MB/s ± 0% -0.37% (p=0.008 n=5+5)
mimic_deflate_decode_100k_many_big_reads/gcc10 263MB/s ± 0% 264MB/s ± 0% ~ (p=0.310 n=5+5)
In any case, I'm not sure if AVX-ness (or not) would really help here. The destination and source byte slices can overlap, often by only a few bytes, in which case you can't just do a simple memcpy 32 bytes at a time.
Wuffs: fails (no 16-bit support; converted to 8-bit, leaving half of the output buffer empty)
Wuffs should be able to decode to WUFFS_BASE__PIXEL_FORMAT__Y_16LE or WUFFS_BASE__PIXEL_FORMAT__Y_16BE, but you have to opt into that (instead of defaulting to WUFFS_BASE__PIXEL_FORMAT__BGRA_PREMUL). If you're using Wuffs' C++ API, that involves overriding the SelectPixfmt method (like example/sdl-imageviewer/sdl-imageviewer.cc does).
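A rough, uncompiled sketch of what that override can look like with the wuffs_aux C++ API in v0.3 (class name is made up; the sdl-imageviewer example linked above is the authoritative version):

```cpp
// Sketch only: assumes wuffs v0.3's wuffs_aux DecodeImageCallbacks API.
class Gray16Callbacks : public wuffs_aux::DecodeImageCallbacks {
  wuffs_base__pixel_format  //
  SelectPixfmt(const wuffs_base__image_config& image_config) override {
    // Opt in to 16-bit grayscale instead of the BGRA_PREMUL default.
    return wuffs_base__make_pixel_format(WUFFS_BASE__PIXEL_FORMAT__Y_16LE);
  }
};
```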
I don't have MSVC myself, but for those who do, I'm curious if commit c226ed60f557876ace30a3c8e8b637ea80aadc2f noticeably improves PNG decode speed.
I'm sorry, I'm a little busy this week; hopefully I will get to this issue next week.
@pavel-perina any news?
Hello, thanks for helping. I'm trying to use Wuffs to open PNG files within a C++ project. I use VS2017 to compile the code, but PNG decoding is slower than OpenCV. OpenCV: 65ms, Wuffs: 93ms.