Wuffs is significantly slower than OpenCV 4.9.0 when decoding a 7680x4320 PNG image

zchrissirhcz commented 3 weeks ago

Problem

When decoding a large image (height=4320, width=7680, channels=4, data type = uint8_t), wuffs is much slower than OpenCV 4.9.0 on an Apple M1 (Mac mini).

Time cost

7680x4320 image

decoder                                          time cost
opencv 4.9.0                                     270 ms
wuffs latest ("wuffs-unsupported-snapshot.c")    370 ms
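To put the two timings in perspective, they can be converted into decode throughput. The sketch below is only arithmetic on the numbers reported above; it assumes the output is counted as 4 bytes per pixel for both decoders, and the function names are made up for illustration.

```cpp
#include <cstdint>
#include <cstdio>

// Bytes in the decoded image: width * height * channels, one byte per channel.
constexpr std::uint64_t decoded_bytes(std::uint64_t width, std::uint64_t height,
                                      std::uint64_t channels) {
  return width * height * channels;
}

// Throughput of decoded output in MB/s, with 1 MB = 1e6 bytes.
constexpr double megabytes_per_second(std::uint64_t bytes, double milliseconds) {
  return (static_cast<double>(bytes) / 1e6) / (milliseconds / 1e3);
}

// Prints the throughput implied by the two reported timings.
inline void print_throughput() {
  const std::uint64_t bytes = decoded_bytes(7680, 4320, 4);
  std::printf("decoded size: %llu bytes\n",
              static_cast<unsigned long long>(bytes));
  std::printf("opencv 4.9.0: %.1f MB/s\n", megabytes_per_second(bytes, 270.0));
  std::printf("wuffs:        %.1f MB/s\n", megabytes_per_second(bytes, 370.0));
  std::printf("wuffs / opencv time ratio: %.2f\n", 370.0 / 270.0);
}
```

Under that assumption the decoded image is about 132.7 MB, so the reported times correspond to roughly 490 MB/s for OpenCV and 360 MB/s for wuffs.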

OpenCV 4.9.0 details

brew install opencv

which is built on libpng 1.6.43:

  Media I/O: 
    ZLib:                        /Library/Developer/CommandLineTools/SDKs/MacOSX14.sdk/usr/lib/libz.tbd (ver 1.2.12)
    JPEG:                        /opt/homebrew/lib/libjpeg.dylib (ver 80)
    WEBP:                        /opt/homebrew/lib/libwebp.dylib (ver encoder: 0x020f)
    PNG:                         /opt/homebrew/lib/libpng.dylib (ver 1.6.43)
    TIFF:                        /opt/homebrew/lib/libtiff.dylib (ver 42 / 4.6.0)
    JPEG 2000:                   OpenJPEG (ver 2.5.2)
    OpenEXR:                     OpenEXR::OpenEXR (ver 3.2.4)
    HDR:                         YES
    SUNRASTER:                   YES
    PXM:                         YES
    PFM:                         YES

Exact code I use

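#include <algorithm>  // std::copy_n
#include <cstdio>     // printf
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

#include <opencv2/opencv.hpp>
// birch::AutoTimer comes from the poster's own "birch" package; its header is
// not shown in the original post.

namespace ncv {
cv::Mat read_png(const std::string filename);
}  // namespace ncv
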
// Copyright 2023 The Wuffs Authors.
//
// Licensed under the Apache License, Version 2.0 <LICENSE-APACHE or
// https://www.apache.org/licenses/LICENSE-2.0> or the MIT license
// <LICENSE-MIT or https://opensource.org/licenses/MIT>, at your
// option. This file may not be copied, modified, or distributed
// except according to those terms.
//
// SPDX-License-Identifier: Apache-2.0 OR MIT

// ----------------

/*
toy-aux-image demonstrates using the wuffs_aux::DecodeImage C++ function to
decode an in-memory compressed image. In this example, the compressed image is
hard-coded to a specific image: a JPEG encoding of the first frame of the
test/data/muybridge.gif animated image.

To run:

$CXX toy-aux-image.cc && ./a.out; rm -f a.out

for a C++ compiler $CXX, such as clang++ or g++.

The expected output:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@X@@@@XX@@@@@@@@@@X
XXXXX@@XXX@@@@@@@II@@@X@X@@@@@
XXXXX@@XX@@X@@@XO+XXX@XX@@@X@@
XXXXXXXX@XX@X@XI=I@@XXI+OXX@XX
XXXXXXXXXXXXXXX+=+OXO+=::OXX@X
XXXXXXXXXXXXXXXXXX=+==:::=XXXX
XXXXXXXXO+:::::+OO+===+OI=+XXX
XXXO::=++:::==+++XI+++X@XXO@XX
XXXO=X@X+::=::::+O++=I@XX@XXXX
XXXXX@XXX=:::::::::=+@XXXX@XXX
XXXXXXXX@O::IXO=::::O@@XXXXXXX
XXXXXXXXO=X+X@@XX::O@@XXXXXXXX
XXXXXXXXXOO=X@X@X+OIXXXXXXXXXX
XXXXXXXXXXX+IIXX+X@OX@XXXXXXXX
XXXXXXXXX@XXOI+IIOOOXXXXXXXXXX
XXXXXXXXXXX@XXXXX@XXXXXXXXXXXX
XXXXXXXXXXXXXXXXX@XXXXXXXXXXXX
OOOOXXXXXXXXXXOXXXXXXXXXXXXOOO
=+++IIIIIIIOOOOOOOOOOIIIIIIII+
*/

// Wuffs ships as a "single file C library" or "header file library" as per
// https://github.com/nothings/stb/blob/master/docs/stb_howto.txt
//
// To use that single file as a "foo.c"-like implementation, instead of a
// "foo.h"-like header, #define WUFFS_IMPLEMENTATION before #include'ing or
// compiling it.
#define WUFFS_IMPLEMENTATION

// Defining the WUFFS_CONFIG__STATIC_FUNCTIONS macro is optional, but when
// combined with WUFFS_IMPLEMENTATION, it demonstrates making all of Wuffs'
// functions have static storage.
//
// This can help the compiler ignore or discard unused code, which can produce
// faster compiles and smaller binaries. Other motivations are discussed in the
// "ALLOW STATIC IMPLEMENTATION" section of
// https://raw.githubusercontent.com/nothings/stb/master/docs/stb_howto.txt
#define WUFFS_CONFIG__STATIC_FUNCTIONS

// Defining the WUFFS_CONFIG__MODULE* macros are optional, but it lets users of
// release/c/etc.c choose which parts of Wuffs to build. That file contains the
// entire Wuffs standard library, implementing a variety of codecs and file
// formats. Without this macro definition, an optimizing compiler or linker may
// very well discard Wuffs code for unused codecs, but listing the Wuffs
// modules we use makes that process explicit. Preprocessing means that such
// code simply isn't compiled.
/*
#define WUFFS_CONFIG__MODULES
#define WUFFS_CONFIG__MODULE__AUX__BASE
#define WUFFS_CONFIG__MODULE__AUX__IMAGE
#define WUFFS_CONFIG__MODULE__BASE
#define WUFFS_CONFIG__MODULE__JPEG
*/
#define WUFFS_CONFIG__MODULES
#define WUFFS_CONFIG__MODULE__AUX__BASE
#define WUFFS_CONFIG__MODULE__AUX__IMAGE
#define WUFFS_CONFIG__MODULE__ADLER32
#define WUFFS_CONFIG__MODULE__BASE
#define WUFFS_CONFIG__MODULE__CRC32
#define WUFFS_CONFIG__MODULE__DEFLATE
#define WUFFS_CONFIG__MODULE__PNG
#define WUFFS_CONFIG__MODULE__ZLIB

// Defining the WUFFS_CONFIG__DST_PIXEL_FORMAT__ENABLE_ALLOWLIST (and the
// associated ETC__ALLOW_FOO) macros are optional, but can lead to smaller
// programs (in terms of binary size). By default (without these macros),
// Wuffs' standard library can decode images to a variety of pixel formats,
// such as BGR_565, BGRA_PREMUL or RGBA_NONPREMUL. The destination pixel format
// is selectable at runtime. Using these macros essentially makes the selection
// at compile time, by narrowing the list of supported destination pixel
// formats. The FOO in ETC__ALLOW_FOO should match the pixel format passed (as
// part of the wuffs_base__image_config argument) to the decode_frame method.
//
// If using the wuffs_aux C++ API, without overriding the SelectPixfmt method,
// the implicit destination pixel format is BGRA_PREMUL.
#define WUFFS_CONFIG__DST_PIXEL_FORMAT__ENABLE_ALLOWLIST
#define WUFFS_CONFIG__DST_PIXEL_FORMAT__ALLOW_BGRA_PREMUL

// If building this program in an environment that doesn't easily accommodate
// relative includes, you can use the script/inline-c-relative-includes.go
// program to generate a stand-alone C file.
//#include "wuffs-v0.4.c"
//#include "wuffs-v0.3.c"
#include "wuffs-unsupported-snapshot.c"

//static std::string decode()
cv::Mat ncv::read_png(const std::string filename)
{
  // Call wuffs_aux::DecodeImage, which is the entry point to Wuffs' high-level
  // C++ API for decoding images. This API is easier to use than Wuffs'
  // low-level C API but the low-level one (1) handles animation, (2) handles
  // asynchronous I/O, (3) handles metadata and (4) does no dynamic memory
  // allocation, so it can run under a `SECCOMP_MODE_STRICT` sandbox.
  // Obviously, if you don't need any of those features, then these simple
  // lines of code here suffices.
  //
  // This example program doesn't explicitly use Wuffs' low-level C API but, if
  // you're curious to learn more, the wuffs_aux::DecodeImage implementation in
  // internal/cgen/auxiliary/*.cc uses it, as does the example/convert-to-nia C
  // program. There's also documentation at doc/std/image-decoders.md
  //
  // If you also want metadata like EXIF orientation and ICC color profiles,
  // script/print-image-metadata.cc has some example code. It uses Wuffs'
  // low-level API but it's a C++ program to use Wuffs' shorter convenience
  // methods: `decoder->decode_frame_config(NULL, &src)` instead of C's
  // `wuffs_base__image_decoder__decode_frame_config(decoder, NULL, &src)`.
  std::ifstream file(filename, std::ios::binary | std::ios::ate);
  if (!file.is_open())
  {
    std::cerr << "failed to open file " << filename << "\n";
    return cv::Mat();
  }
  std::streampos filesize = file.tellg();
  file.seekg(0, std::ios::beg);
  std::vector<char> buffer(filesize);
  if (!file.read(buffer.data(), filesize))
  {
    std::cerr << "error: could not read file content.\n";
    return cv::Mat();
  }
  file.close();

  wuffs_aux::DecodeImageCallbacks callbacks;
  wuffs_aux::sync_io::MemoryInput input(buffer.data(), buffer.size());
  wuffs_aux::DecodeImageResult result =
      wuffs_aux::DecodeImage(callbacks, input);
  if (!result.error_message.empty()) {
    std::cerr << "error: " << result.error_message << "\n";
    return cv::Mat();
  }
  // If result.error_message is empty then the DecodeImage call succeeded. The
  // decoded image is held in result.pixbuf, backed by memory that is released
  // when result.pixbuf_mem_owner (a std::unique_ptr) is destroyed. In this
  // example program, this happens at the end of this function.

  wuffs_base__table_u8 table = result.pixbuf.plane(0);
  //printf("table: %p, %zu, %zu, %zu\n", table.ptr, table.width, table.height, table.stride);

  // print result.pixbuf.pixcfg
//   printf("bpp: %d\n", result.pixbuf.pixcfg.pixel_format().bits_per_pixel());
//   printf("human readable: height=%zu, width=%zu, channel=%zu\n", 
//     result.pixbuf.pixcfg.height(),
//     result.pixbuf.pixcfg.width(),
//     result.pixbuf.pixcfg.pixel_format().bits_per_pixel() / 8
//   );

  cv::Size size;
  size.height = result.pixbuf.pixcfg.height();
  size.width = result.pixbuf.pixcfg.width();
  int channels = result.pixbuf.pixcfg.pixel_format().bits_per_pixel() / 8;
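  cv::Mat image(size, CV_8UC(channels));
  // Note: this flat copy assumes that the decoded rows are tightly packed
  // (table.stride == size.width * channels) and that image is continuous.
  std::copy_n(table.ptr, size.width * size.height * channels, image.data);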

  return image;
}

int main()
{
    std::cout << "OpenCV version (runtime): " << cv::getVersionString() << std::endl;

    //const std::string filename = "/Users/zz/data/peppers.png";
    const std::string filename = "/Users/zz/data/ASRDebug_0_7680x4320.png";
    cv::Mat src2;
    {
        birch::AutoTimer timer1("cv::imread");
        src2 = cv::imread(filename);
    }
    printf("src2: rows=%d, cols=%d\n", src2.rows, src2.cols);
    //cv::imwrite("result2.png", src2);

    cv::Mat src1;
    {
        birch::AutoTimer timer1("ncv::read_png");
        src1 = ncv::read_png(filename);
    }
    //cv::imwrite("result1.png", src1);
    printf("src1: rows=%d, cols=%d\n", src1.rows, src1.cols);

    std::cout << cv::getBuildInformation() << std::endl;

    return 0;
}
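The program above times each decoder once, using birch::AutoTimer, whose source is not part of this thread. For anyone who wants to reproduce the measurement without that package, here is a minimal sketch of a scope-based timer built on std::chrono. ScopedTimer and best_of_ms are hypothetical names, not the birch API, and taking the fastest of several runs is just one common way to reduce noise from file caching and warm-up.

```cpp
#include <chrono>
#include <cstdio>
#include <string>
#include <utility>

// Hypothetical stand-in for birch::AutoTimer: prints the elapsed wall-clock
// time in milliseconds when it goes out of scope.
class ScopedTimer {
 public:
  explicit ScopedTimer(std::string label)
      : label_(std::move(label)), start_(std::chrono::steady_clock::now()) {}
  ScopedTimer(const ScopedTimer&) = delete;
  ScopedTimer& operator=(const ScopedTimer&) = delete;
  ~ScopedTimer() { std::printf("%s: %.3f ms\n", label_.c_str(), elapsed_ms()); }

  // Milliseconds elapsed since construction.
  double elapsed_ms() const {
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - start_)
        .count();
  }

 private:
  std::string label_;
  std::chrono::steady_clock::time_point start_;
};

// Runs fn() `runs` times and returns the fastest run in milliseconds.
// Returns a negative value if runs <= 0.
template <typename Fn>
double best_of_ms(int runs, Fn&& fn) {
  double best = -1.0;
  for (int i = 0; i < runs; ++i) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const double ms = std::chrono::duration<double, std::milli>(
                          std::chrono::steady_clock::now() - start)
                          .count();
    if (best < 0.0 || ms < best) best = ms;
  }
  return best;
}
```

With this, a call such as best_of_ms(5, [&] { src1 = ncv::read_png(filename); }) would report the fastest of five decodes, and the same wrapper can be applied to cv::imread.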

zchrissirhcz commented 3 weeks ago

The test image is larger than GitHub's upload limit. For the performance test, an equivalent image can be generated from C++ code:

#include <iostream>
#include <string>

#include <opencv2/opencv.hpp>

int create_test_7680_4320_png_image()
{
    const std::string image_path = "lena.png";
    cv::Mat image = cv::imread(image_path);
    if (image.empty()) {
        std::cerr << "image file not found" << std::endl;
        return -1;
    }

    int originalWidth = image.cols;
    int originalHeight = image.rows;

    int targetWidth = 7680;
    int targetHeight = 4320;

    int rows = targetHeight / originalHeight;
    int cols = targetWidth / originalWidth;

    cv::Mat result = cv::Mat(targetHeight, targetWidth, CV_8UC4, cv::Scalar(0, 0, 0, 0));

    cv::Mat imageWithAlpha;
    cv::cvtColor(image, imageWithAlpha, cv::COLOR_BGR2BGRA);

    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            int x = j * originalWidth;
            int y = i * originalHeight;

            imageWithAlpha.copyTo(result(cv::Rect(x, y, originalWidth, originalHeight)));
        }
    }
    cv::imwrite("result.png", result);

    return 0;
}
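Because the generator uses integer division, only whole tiles are copied and any remainder of the canvas keeps its initial fully transparent value. The sketch below works through that arithmetic. It assumes lena.png is the common 512x512 version, which the post does not state, and the names are illustrative.

```cpp
struct TileCoverage {
  int rows;              // whole tiles stacked vertically
  int cols;              // whole tiles placed side by side
  int uncovered_right;   // pixel columns left untouched at the right edge
  int uncovered_bottom;  // pixel rows left untouched at the bottom edge
};

// Mirrors the integer division used by the generator above.
constexpr TileCoverage tile_coverage(int tile_w, int tile_h, int target_w,
                                     int target_h) {
  return TileCoverage{target_h / tile_h, target_w / tile_w,
                      target_w - (target_w / tile_w) * tile_w,
                      target_h - (target_h / tile_h) * tile_h};
}
```

For a 512x512 tile and a 7680x4320 canvas this gives 8 rows and 15 columns of tiles, with 224 rows at the bottom left transparent. A generated file therefore may not compress or decode exactly like the original test image.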

nigeltao commented 3 weeks ago

> Exact code I use

What's the command (the compiler invocation) to build that code?

zchrissirhcz commented 3 weeks ago

I build with CMake:

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release

The compiler is AppleClang:

Apple clang version 15.0.0 (clang-1500.1.0.2.5)
Target: arm64-apple-darwin23.2.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

nigeltao commented 3 weeks ago

I'm not very familiar with cmake (and I don't have an Apple M1). Do you know if -DCMAKE_BUILD_TYPE=Release passes -O2 or -O3 to clang?

Also, do you know, after the #include "wuffs-unsupported-snapshot.c" line, if the WUFFS_BASE__CPU_ARCH__ARM_CRC32 and WUFFS_BASE__CPU_ARCH__ARM_NEON macros are defined?

Specifically, if you do something like

#ifdef WUFFS_BASE__CPU_ARCH__ARM_CRC32
#error "asdf1"
#else
#error "asdf2"
#endif

do you see asdf1 or asdf2? Ditto for #ifdef WUFFS_BASE__CPU_ARCH__ARM_NEON.
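A variant of the same check that does not abort compilation is to record each macro in a constant and print a report at run time. This is a minimal sketch: the helper names are made up, the __ARM_* macros are the compiler's predefined feature macros, and the WUFFS_BASE__* macros only show as defined if this code is placed after the #include of the Wuffs source.

```cpp
#include <string>

// Each constant records whether the corresponding macro is defined at this
// point in the translation unit.
#if defined(__ARM_FEATURE_UNALIGNED)
constexpr bool kArmUnaligned = true;
#else
constexpr bool kArmUnaligned = false;
#endif

#if defined(__ARM_FEATURE_CRC32)
constexpr bool kArmCrc32 = true;
#else
constexpr bool kArmCrc32 = false;
#endif

#if defined(__ARM_NEON)
constexpr bool kArmNeon = true;
#else
constexpr bool kArmNeon = false;
#endif

#if defined(WUFFS_BASE__CPU_ARCH__ARM_CRC32)
constexpr bool kWuffsArmCrc32 = true;
#else
constexpr bool kWuffsArmCrc32 = false;
#endif

#if defined(WUFFS_BASE__CPU_ARCH__ARM_NEON)
constexpr bool kWuffsArmNeon = true;
#else
constexpr bool kWuffsArmNeon = false;
#endif

// Formats one "NAME: yes" or "NAME: no" line.
inline std::string feature_line(const char* name, bool defined) {
  return std::string(name) + (defined ? ": yes\n" : ": no\n");
}

// Returns a five-line report, one line per macro.
inline std::string arm_feature_report() {
  return feature_line("__ARM_FEATURE_UNALIGNED", kArmUnaligned) +
         feature_line("__ARM_FEATURE_CRC32", kArmCrc32) +
         feature_line("__ARM_NEON", kArmNeon) +
         feature_line("WUFFS_BASE__CPU_ARCH__ARM_CRC32", kWuffsArmCrc32) +
         feature_line("WUFFS_BASE__CPU_ARCH__ARM_NEON", kWuffsArmNeon);
}
```

Printing arm_feature_report() from main shows all five answers in one build, instead of one #error per compile.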

zchrissirhcz commented 3 weeks ago

-O3 is used. I find it in build/compile_commands.json:

{
  "directory": "/Users/zz/work/cppsober/kcv/build",
  "command": "/Library/Developer/CommandLineTools/usr/bin/c++ -DGL_SILENCE_DEPRECATION -isystem /opt/homebrew/Cellar/opencv/4.9.0_8/include/opencv4 -isystem /Users/zz/.arcpkg/birch/autotimer/0.1/mac-arm64-static/inc/birch -O3 -DNDEBUG -std=gnu++17 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX14.2.sdk -o CMakeFiles/test_wuffs.dir/imwrite.cpp.o -c /Users/zz/work/cppsober/kcv/imwrite.cpp",
  "file": "/Users/zz/work/cppsober/kcv/imwrite.cpp",
  "output": "CMakeFiles/test_wuffs.dir/imwrite.cpp.o"
},
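The compile command above shows -O3 and -DNDEBUG being passed. As a complementary check from inside the program, the preprocessor can confirm that optimization and NDEBUG are actually in effect for a given translation unit. This is a small sketch with a made-up function name; it relies on __OPTIMIZE__, which GCC and Clang predefine when optimizing, and it cannot tell -O2 from -O3.

```cpp
#include <string>

// Summarises build settings visible to the preprocessor. __OPTIMIZE__ is
// predefined by GCC and Clang when optimization is enabled; it does not
// distinguish between -O1, -O2 and -O3.
inline std::string build_summary() {
  std::string s;
#if defined(__OPTIMIZE__)
  s += "optimized: yes";
#else
  s += "optimized: no (or a compiler without __OPTIMIZE__)";
#endif
#if defined(NDEBUG)
  s += ", NDEBUG: defined";
#else
  s += ", NDEBUG: not defined";
#endif
  return s;
}
```

For the exact optimization level, the compile_commands.json entry remains the authoritative source.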

zchrissirhcz commented 3 weeks ago

WUFFS_BASE__CPU_ARCH__ARM_CRC32 and WUFFS_BASE__CPU_ARCH__ARM_NEON are enabled.

// To simplify Wuffs code, "cpu_arch >= arm_xxx" requires xxx but also
// unaligned little-endian load/stores.
#if defined(__ARM_FEATURE_UNALIGNED) && !defined(__native_client__) && \
    defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
// Not all gcc versions define __ARM_ACLE, even if they support crc32
// intrinsics. Look for __ARM_FEATURE_CRC32 instead.
#if defined(__ARM_FEATURE_CRC32)
#include <arm_acle.h>
#define WUFFS_BASE__CPU_ARCH__ARM_CRC32
#pragma message "WUFFS_BASE__CPU_ARCH__ARM_CRC32: YES" // newly added
#endif  // defined(__ARM_FEATURE_CRC32)
#if defined(__ARM_NEON)
#include <arm_neon.h>
#define WUFFS_BASE__CPU_ARCH__ARM_NEON
#pragma message "WUFFS_BASE__CPU_ARCH__ARM_NEON: YES" // newly added
#endif  // defined(__ARM_NEON)
#endif  // defined(__ARM_FEATURE_UNALIGNED) etc

The compilation output:

➜  kcv git:(main) ✗ cmake --build build -j8
[ 56%] Built target glfw
[ 76%] Built target imgui
[ 89%] Built target konacv
[ 94%] Built target test
[ 97%] Building CXX object CMakeFiles/test_wuffs.dir/imwrite.cpp.o
In file included from /Users/zz/work/cppsober/kcv/imwrite.cpp:117:
/Users/zz/work/cppsober/kcv/wuffs-unsupported-snapshot.c:120:9: warning: WUFFS_BASE__CPU_ARCH__ARM_CRC32: YES [-W#pragma-messages]
#pragma message "WUFFS_BASE__CPU_ARCH__ARM_CRC32: YES"
        ^
/Users/zz/work/cppsober/kcv/wuffs-unsupported-snapshot.c:125:9: warning: WUFFS_BASE__CPU_ARCH__ARM_NEON: YES [-W#pragma-messages]
#pragma message "WUFFS_BASE__CPU_ARCH__ARM_NEON: YES"
        ^
2 warnings generated.
[100%] Linking CXX executable test_wuffs
[100%] Built target test_wuffs

nigeltao commented 3 weeks ago

OK, I don't think there's an obvious fix. Still, I don't have an Apple M1 so it might take me a while to make progress on this.

Can you e-mail the image file (or a link to it) to nigeltao golang org? Thanks.

zchrissirhcz commented 3 weeks ago

> OK, I don't think there's an obvious fix. Still, I don't have an Apple M1 so it might take me a while to make progress on this.
>
> Can you e-mail the image file (or a link to it) to nigeltao golang org? Thanks.

I've sent it; please check.

nigeltao commented 2 weeks ago

Thanks for sharing your 7680x4320 image. Here are my wuffs bench time-to-decode numbers, measured on x86_64 Intel (i5-10210U Comet Lake) rather than arm64 Apple (M1):

370ms wuffs latest (clang 14)
334ms wuffs latest (gcc 12)
533ms libpng (Debian 12 Bookworm)

Looks like I'm going to have to find an Apple M1 (or similar)...
