wuffs 0.4 significantly slower than 0.3 decoding PNGs

Ono-Sendai commented 1 month ago

On Windows, Visual studio 2022, no AVX, AMD Ryzen 9 5950x.

wuffs 0.3, decoding to WUFFS_BASE__PIXEL_FORMAT__RGB
-----------------------------------------------------
doDecodeFromBufferWithWuffs took 8.4251 ms
doDecodeFromBufferWithWuffs took 8.3683 ms
doDecodeFromBufferWithWuffs took 8.4120 ms

wuffs 0.4, decoding to WUFFS_BASE__PIXEL_FORMAT__RGB
-----------------------------------------------------
doDecodeFromBufferWithWuffs took 12.893 ms
doDecodeFromBufferWithWuffs took 12.906 ms
doDecodeFromBufferWithWuffs took 12.829 ms

nigeltao commented 1 month ago

Thanks for the bug report. I don't run Windows or VS myself. Are you able to bisect a few Wuffs versions to see which commit caused the slow down?

If you're using wuffs-v0.4.c then there's only a handful of relevant commits (and f1698226 is the most recent):

f1698226 (tag: v0.4.0-alpha.4) release: wuffs gen -version=0.4.0-alpha.4
5c693c9d (tag: v0.4.0-alpha.3) release: wuffs gen -version=0.4.0-alpha.3
3eb4e3a6 (tag: v0.4.0-alpha.2) release: wuffs gen -version=0.4.0-alpha.2
6381a325 (tag: v0.4.0-alpha.1) release: wuffs gen -version=0.4.0-alpha.1

That's pretty coarse-grained though. If you have the time, I'd find it more helpful if you can bisect using wuffs-unsupported-snapshot.c instead. The two endpoints (below) are about 343 commits apart, so a bisect should take 9 cuts.

f1698226 (tag: v0.4.0-alpha.4) release: wuffs gen -version=0.4.0-alpha.4
a27a7598 base: disable SIMD on 32-bit (not 64-bit) x86
...
9c03b8c7 wuffsfmt: double-indents hanging lines
d910658b (tag: v0.3.3) wuffs gen -version=0.3.3

Also, what compiler flags are you using?

nigeltao commented 1 month ago

Tangentially, if this is your code:

https://github.com/glaretechnologies/glare-core/blob/2c7174ca95c68bfc4cea982ec12ac60db0bfd61d/graphics/PNGDecoder.cpp#L276

https://github.com/glaretechnologies/glare-core/blob/2c7174ca95c68bfc4cea982ec12ac60db0bfd61d/graphics/PNGDecoder.cpp#L322

Then an explicit "Convert from BGR to RGB" shouldn't be necessary, Instead, I think that you can change line 276 from this:

use_pixelformat = WUFFS_BASE__PIXEL_FORMAT__BGR

to this:

use_pixelformat = WUFFS_BASE__PIXEL_FORMAT__RGB

nigeltao commented 1 month ago

I'd also like to know whether any or these #ifdefs trigger for your VS compiler flags:

#if defined(_M_IX86)
#if defined(_M_X64)
#if defined(__AVX__)

Ono-Sendai commented 1 month ago

Tangentially, if this is your code:

https://github.com/glaretechnologies/glare-core/blob/2c7174ca95c68bfc4cea982ec12ac60db0bfd61d/graphics/PNGDecoder.cpp#L276

https://github.com/glaretechnologies/glare-core/blob/2c7174ca95c68bfc4cea982ec12ac60db0bfd61d/graphics/PNGDecoder.cpp#L322

Then an explicit "Convert from BGR to RGB" shouldn't be necessary, Instead, I think that you can change line 276 from this:
use_pixelformat = WUFFS_BASE__PIXEL_FORMAT__BGR
to this:
use_pixelformat = WUFFS_BASE__PIXEL_FORMAT__RGB
Yeah that's my code. If I use WUFFS_BASE__PIXEL_FORMAT__RGB on line 276 then I get "Error: base: unsupported pixel swizzler option" on PngSuite-2013jan13/tbbn3p08.png

In addition, despite being relatively unoptimised code, my BGRA->RGBA swizzling seems faster than Wuffs:

wuffs 0.3, WUFFS_BASE__PIXEL_FORMAT__RGB:
-------------------------------------
doDecodeFromBufferWithWuffs took 9.1046 ms
doDecodeFromBufferWithWuffs took 8.9896 ms
doDecodeFromBufferWithWuffs took 9.0422 ms

wuffs 0.3, WUFFS_BASE__PIXEL_FORMAT__BGR and BGRA->RGBA swizzling myself:
-------------------------------------
doDecodeFromBufferWithWuffs took 8.9411 ms
doDecodeFromBufferWithWuffs took 8.8994 ms
doDecodeFromBufferWithWuffs took 8.9056 ms
doDecodeFromBufferWithWuffs took 8.9171 ms

Ono-Sendai commented 1 month ago

I'd also like to know whether any or these #ifdefs trigger for your VS compiler flags:
#if defined(_M_IX86)
#if defined(_M_X64)
#if defined(__AVX__)

Just _M_X64. I'm not building with AVX.

Ono-Sendai commented 1 month ago

wuffs 0.4, alpha 1
------------------
doDecodeFromBufferWithWuffs took 8.8656 ms
doDecodeFromBufferWithWuffs took 8.7829 ms
doDecodeFromBufferWithWuffs took 8.8434 ms

wuffs 0.4, alpha 2
------------------
doDecodeFromBufferWithWuffs took 8.8657 ms
doDecodeFromBufferWithWuffs took 8.8763 ms
doDecodeFromBufferWithWuffs took 8.7356 ms

wuffs 0.4, alpha 3
------------------
doDecodeFromBufferWithWuffs took 8.6924 ms
doDecodeFromBufferWithWuffs took 8.6936 ms
doDecodeFromBufferWithWuffs took 8.7011 ms

wuffs 0.4, alpha 4
------------------
doDecodeFromBufferWithWuffs took 12.211 ms
doDecodeFromBufferWithWuffs took 12.057 ms
doDecodeFromBufferWithWuffs took 12.018 ms

nigeltao commented 1 month ago

I suspect that the slow-down is due to SIMD code no longer being used.

Comparing v0.4.0-alpha.3 and v0.4.0-alpha.4, do any of these #ifdefs trigger?

#if defined (WUFFS_BASE__CPU_ARCH__X86_FAMILY)
#if defined (WUFFS_BASE__CPU_ARCH__X86_64)

Also, there's a wuffs_base__cpu_arch__have_x86_sse42 function. What does it return for you (again, comparing alpha.3 and alpha.4):

static inline bool  //
wuffs_base__cpu_arch__have_x86_sse42(void) {
  etc;
}

I'm not building with AVX.

Out of curiosity (I'm not familiar with MSVC / Visual Studio), both wuffs-v0.3.c and wuffs-v0.4.c have this line:

#pragma message("Wuffs with MSVC+IX86/X64 needs /arch:AVX for best performance")

Did you notice that at all, when building Wuffs?

nigeltao commented 1 month ago

I suspect that the slow-down is due to SIMD code no longer being used.

That might not be true, though. Fortunately, there's not that many commits between v0.4.0-alpha.3 and v0.4.0-alpha.4. Are you able to switch over from wuffs-v0.4.c to wuffs-unsupported-snapshot.c and bisect these commits:

f1698226 (tag: v0.4.0-alpha.4) release: wuffs gen -version=0.4.0-alpha.4
a27a7598 base: disable SIMD on 32-bit (not 64-bit) x86
4209264a cgen: disable SIMD on 32-bit (not 64-bit) x86
fdd08200 test/3pdata: add note re XZ backdoor test files
d8c3deb0 aux: remove DynIOBuffer::allocated
00a01abc base: avoid some (ptr + len) when ptr is NULL
d7f6b81d base: have empty_foo() functions use NULL ptr
2142aa65 cgen: fix iterate's "ptr + len" when ptr is NULL
8ed5d132 example/gifplayer: tweak try_allocate workbuf len
a9a6d8ac base: make malloc_slice_u8(etc, 0) return empty
f32bfe95 std/crc64: optimize for x86+SSE4.2
5c693c9d (tag: v0.4.0-alpha.3) release: wuffs gen -version=0.4.0-alpha.3

Ono-Sendai commented 1 month ago

I suspect that the slow-down is due to SIMD code no longer being used.

Comparing v0.4.0-alpha.3 and v0.4.0-alpha.4, do any of these #ifdefs trigger?
#if defined (WUFFS_BASE__CPU_ARCH__X86_FAMILY)
#if defined (WUFFS_BASE__CPU_ARCH__X86_64)

WUFFS_BASE__CPU_ARCH__X86_64 is not defined for me.

Also, there's a `wuffs_base__cpu_arch__have_x86_sse42` function. What does it return for you (again, comparing alpha.3 and alpha.4):
static inline bool // wuffs_base__cpu_arch__have_x86_sse42(void) { etc; }
> I'm not building with AVX.

Out of curiosity (I'm not familiar with MSVC / Visual Studio), both `wuffs-v0.3.c` and `wuffs-v0.4.c` have this line:
pragma message("Wuffs with MSVC+IX86/X64 needs /arch:AVX for best performance")
Did you notice that at all, when building Wuffs?

Yes, and it's kind of annoying :)

Looking at the code, the issue seems to be

#if defined(__AVX__) || defined(__clang__)

preventing SSE from being used.

As mentioned before I'm not building with AVX.

nigeltao commented 1 month ago

WUFFS_BASE__CPU_ARCH__X86_64 is not defined for me.

To be clear, are you saying "WUFFS_BASE__CPU_ARCH__X86_64 is not defined for me" for just v0.4.0-alpha.4 or for both v0.4.0-alpha.3 and v0.4.0-alpha.4?

Again, what does wuffs_base__cpu_arch__have_x86_sse42 return, for both alpha.3 and alpha.4?

pragma message("Wuffs with MSVC+IX86/X64 needs /arch:AVX for best performance")

Did you notice that at all, when building Wuffs?

Yes, and it's kind of annoying :)

Yeah, it's not ideal, but I don't know how to make it better.

Wuffs ships as a "single file C library". And in gcc or clang, code can opt-in to "compile me with SIMD enabled" via an __attribute__ on a function by function basis, by the library writer (me). But if I understand correctly, for Visual Studio, enabling SIMD is on a file by file basis (e.g. the /arch:AVX compile-time switch), by the library user (you). (Yeah, technically, by "file" I mean "translation unit".)

So, for VS, I'd like the single file C library to work out of the box (even if, by default, it's leaving significant performance on the table), and it does, but the wuffs-v0.4.c file itself can't self-configure VS to SIMD-enable the SIMD-using functions, and the #pragma message was the least worst mechanism I could think of to try to raise awareness for VS-using programmers.

Sort of tangential to the original post, but as you're concerned about PNG decode performance: if your CPUs are less than 10 years old, then I'm curious how that "8.4 milliseconds" number changes if you do pass /arch:AVX, on either Wuffs alpha.3 and alpha.4.

Ono-Sendai commented 1 month ago

So, for VS, I'd like the single file C library to work out of the box (even if, by default, it's leaving significant performance on the table), and it does, but the wuffs-v0.4.c file itself can't self-configure VS to SIMD-enable the SIMD-using functions, and the #pragma message was the least worst mechanism I could think of to try to raise awareness for VS-using programmers.

Maybe you could add a preprocessor option to suppress the warning? something like WUFFS_NO_MISSING_AVX_WARNING

Ono-Sendai commented 1 month ago

Sort of tangential to the original post, but as you're concerned about PNG decode performance: if your CPUs are less than 10 years old, then I'm curious how that "8.4 milliseconds" number changes if you do pass /arch:AVX, on either Wuffs alpha.3 and alpha.4.
AVX not set
---------------------------------------------
doDecodeFromBufferWithWuffs took 8.8381 ms
doDecodeFromBufferWithWuffs took 8.7475 ms
doDecodeFromBufferWithWuffs took 8.6410 ms
doDecodeFromBufferWithWuffs took 8.6766 ms

With Advanced Vector Extensions 2 (X86/X64) (/arch:AVX2)

doDecodeFromBufferWithWuffs took 8.6368 ms doDecodeFromBufferWithWuffs took 8.6676 ms doDecodeFromBufferWithWuffs took 8.5720 ms doDecodeFromBufferWithWuffs took 8.6433 ms

Ono-Sendai commented 1 month ago

WUFFS_BASE__CPU_ARCH__X86_64 is not defined for me.

To be clear, are you saying "WUFFS_BASE__CPU_ARCH__X86_64 is not defined for me" for just v0.4.0-alpha.4 or for both v0.4.0-alpha.3 and v0.4.0-alpha.4?

It's not defined for me using v0.4.0-alpha.4. Not sure about alpha 3.

Ono-Sendai commented 1 month ago

Again, what does wuffs_base__cpu_arch__have_x86_sse42 return, for both alpha.3 and alpha.4?

I think I have done enough remote debugging for now sorry. I think you need a windows build machine :)

nigeltao commented 1 month ago

I think I have done enough remote debugging for now sorry. I think you need a windows build machine :)

OK, but in that case, I don't expect this bug to be fixed any time soon. Sorry.

Ono-Sendai commented 1 month ago

Isn't it clear what the issue is? The SSE code is only being used when AVX is defined.

nigeltao commented 1 month ago

For Wuffs + Visual Studio, "SSE code is only being used when AVX is defined" was true for Wuffs v0.3, v0.4.0-alpha.3 and v0.4.0-alpha.4. All three versions have the same #if defined(__AVX__) || defined(__clang__) guard.

At least, I think it's the same across all three versions, and that's why I asked you previously if you could confirm that (for older versions not just v0.4.0-alpha.4).

If so, "SSE only when AVX defined" doesn't explain why performance regressed between v0.3 and v0.4, or between v0.4.0-alpha.3 and v0.4.0-alpha.4.

I don't think it's clear yet what the issue is.

nigeltao commented 1 month ago

But also, I don't think "SSE code is only being used when AVX is defined" has an obvious fix. SSE isn't a single thing, it's at least six different things: SSE, SSE2, SSE3, SSSE3 and SSE4.1 and SSE4.2. If all you have is _M_X64 then, IIUC (if I understand correctly), you automatically have the first two extensions enabled, SSE and SSE2, but the later four extensions are not mandatory.

On the other hand, Wuffs "SSE code" (in both version 0.3 and 0.4) uses intrinsics like _mm_shuffle_epi8 that are part of SSSE3 (and therefore IIUC not automatically enabled by _M_X64 alone). If you're on Visual Studio and you want Wuffs' SIMD code enabled then you (the library user, not me the library writer) are going to have to configure or define something CPU-arch specific, along the lines of /arch:AVX.

Ono-Sendai commented 1 month ago

There's a few things going on here I think. The first is an issue with wuffs (all versions):

WUFFS_BASECPU_ARCH__X86_FAMILY and are only defined when `AVXorclang__` is defined (under MSVC). This is incorrect.

The second thing that is going on is that I was working around this (I guess you could call it) by defining WUFFS_BASECPU_ARCHX86_FAMILY myself before including the wuffs c file. This workaround stopped working well in wuffs 0. 4 alpha 4 when WUFFS_BASECPU_ARCHX86_FAMILY became used less. (in fact not used at all)

Ono-Sendai commented 1 month ago

But also, I don't think "SSE code is only being used when AVX is defined" has an obvious fix. SSE isn't a single thing, it's at least six different things: SSE, SSE2, SSE3, SSSE3 and SSE4.1 and SSE4.2. If all you have is _M_X64 then, IIUC (if I understand correctly), you automatically have the first two extensions enabled, SSE and SSE2, but the later four extensions are not mandatory.

On the other hand, Wuffs "SSE code" (in both version 0.3 and 0.4) uses intrinsics like _mm_shuffle_epi8 that are part of SSSE3 (and therefore IIUC not automatically enabled by _M_X64 alone). If you're on Visual Studio and you want Wuffs' SIMD code enabled then you (the library user, not me the library writer) are going to have to configure or define something CPU-arch specific, along the lines of /arch:AVX.

Well currently just including wuffs by default on windows x64 wouldn't even use SSE2.

Solution I think is to detect building on x64, then allow SSE1 and SSE2. Then there's the question of supporting SSE3, 4.1, 4.2 etc. If wuffs can choose code at runtime then you can use CPUID based switching. Otherwise you could detect a preprocessor definition like __SSE4_1__. MSVC doesn't have options for targetting SSE4.1, SSE4.2 in particular, so these are borrowed from the GCC definitions.

Or you could call it something like WUFFS_USE_SSE_4_1

nigeltao commented 1 month ago

WUFFS_BASECPU_ARCH__X86_FAMILY and are only defined when `AVXorclang__` is defined (under MSVC). This is incorrect.

I'd say more "poorly named" than "incorrect". WUFFS_BASE__APPLY_X86_SIMD_OPTIMIZATIONS_AND_YOU_CAN_ASSUME_ALL_OF_SSE_NOT_JUST_SSE1_AND_SSE2 is more accurate, but maybe not a better name.

In terms of "SIMD capability granularities" and not "which WUFFS macros trigger which SIMD code paths", I'm only looking at MSVC documentation (I'm not running MSVC myself) but for e.g. SSE4.1 code, not SSE1 or SSE2, https://learn.microsoft.com/en-us/cpp/build/reference/arch-x86?view=msvc-170 says you're going to need /arch:AVX. There is no /arch:SSE41 option.

nigeltao commented 1 month ago

I was working around this (I guess you could call it) by defining WUFFS_BASE__CPU_ARCH__X86_FAMILY myself before including the wuffs c file.

Ah, "defining it yourself" is a crucial bit of information that wasn't obvious from the earlier conversation. Having a second look at your graphics/PNGDecoder.cpp, I see where you're doing that.

That macro is a private implementation detail and it's not designed for library users to configure. Wuffs' documentation could certainly be better, but only the macros starting with WUFFS_CONFIG__ are the user-configuration knobs.

If you need an immediate workaround for v0.4.0-alpha.4 then add #define WUFFS_BASE__CPU_ARCH__X86_64 right next to where your graphics/PNGDecoder.cpp file says #define WUFFS_BASE__CPU_ARCH__X86_FAMILY. However, in an upcoming v0.4.0-alpha.5 the macros will probably be renamed to something like WUFFS_PRIVATE_IMPL__CPU_ARCH__BLAH_BLAH_BLAH to emphasize that you (the library user) are not supposed to #define it yourself.

I'll think about whether to introduce a WUFFS_CONFIG__MSVC__ASSUME_MINIMUM_CPU_ARCH__X86_64_V2 or similarly-named knob.

I wouldn't have done so in the past as I didn't know how MSVC would behave when you try to compile SSE4.2 intrinsics without also passing the /arch:AVX option. But from this bug report, it sounds like you're been running with that just fine.

nigeltao commented 1 month ago

If you're curious about why Wuffs' "SSE-family enabled" behavior changed, from depending on WUFFS_BASE__CPU_ARCH__X86_FAMILY to WUFFS_BASE__CPU_ARCH__X86_64, this fixed issue #145. Intel's SIMD documentation said that the _mm_cvtsi64_si128 intrinsic function was part of SSE2. And, as far as I could tell, part of SSE2 unconditionally. The gcc and clang compiler implementations would accept that when targeting 64-bit x86 but had compile errors when targeting 32-bit x86, even though 32-bit x86 can support both SSE2 and a uint64_t type.

Rather than dealing with any similar-to-#145 issues in the future, it seemed simpler for Wuffs to enable its SIMD code (both SSE-family and AVX) only on 64-bit x86, not both 32-bit and 64-bit x86. That involved the WUFFS_BASE__CPU_ARCH__X86_FAMILY macro but, as I said, that macro is supposed to be a private implementation detail.

nigeltao commented 1 month ago

Solution I think is to detect building on x64, then allow SSE1 and SSE2.

That won't help Wuffs' PNG performance. Here's a code snippet from wuffs_png__decoder__filter_4_distance_4_x86_sse42:

while (v_curr.ptr < i_end1_curr) {
  v_b128 = _mm_cvtsi32_si128((int32_t)(wuffs_base__peek_u32le__no_bounds_check(v_prev.ptr)));
  v_b128 = _mm_unpacklo_epi8(v_b128, v_z128);
  v_pa128 = _mm_sub_epi16(v_b128, v_c128);
  v_pb128 = _mm_sub_epi16(v_a128, v_c128);
  v_pc128 = _mm_add_epi16(v_pa128, v_pb128);
  v_pa128 = _mm_abs_epi16(v_pa128);
  v_pb128 = _mm_abs_epi16(v_pb128);
  v_pc128 = _mm_abs_epi16(v_pc128);
  v_smallest128 = _mm_min_epi16(v_pc128, _mm_min_epi16(v_pb128, v_pa128));
  v_p128 = _mm_blendv_epi8(_mm_blendv_epi8(v_c128, v_b128, _mm_cmpeq_epi16(v_smallest128, v_pb128)), v_a128, _mm_cmpeq_epi16(v_smallest128, v_pa128));
  v_x128 = _mm_cvtsi32_si128((int32_t)(wuffs_base__peek_u32le__no_bounds_check(v_curr.ptr)));
  v_x128 = _mm_unpacklo_epi8(v_x128, v_z128);
  v_x128 = _mm_add_epi8(v_x128, v_p128);
  v_a128 = v_x128;
  v_c128 = v_b128;
  v_x128 = _mm_packus_epi16(v_x128, v_x128);
  wuffs_base__poke_u32le__no_bounds_check(v_curr.ptr, ((uint32_t)(_mm_cvtsi128_si32(v_x128))));
  v_curr.ptr += 4;
  v_prev.ptr += 4;
}

That all works fine if you assume SSE4.2 (and earlier SSEs). But what if you only have SSE1 and SSE2? It turns out that e.g. the _mm_blendv_epi8 function in the middle of all that needs SSE4.1 so we can't use wuffs_png__decoder__filter_4_distance_4_x86_sse42 as is, if you only have SSE1 and SSE2.

Then there's the question of supporting SSE3, 4.1, 4.2 etc. If wuffs can choose code at runtime then you can use CPUID based switching.

Wuffs does choose code at run time. Furthermore, Wuffs' compiler (it takes in *.wuffs code and generates a C file) ensures at compile time that you can't call SIMD intrinsics (at run time) unless you've previously confirmed (at run time, via CPU ID) that your CPU is capable.

Currently, Wuffs' "can I call this __m128i-using SIMD intrinsic function" compile-time check is simple. Either all or none of the SSE extensions are allowed. Changing that from "all or none" to distinguishing a third "SSE1 and SSE2 only: _mm_unpacklo_epi8 (and hundreds of others) are allowed but _mm_blendv_epi8 (and hundreds of others) are not" is technically possible but more complicated. It has an opportunity cost and I'm not sure if it's even going to have much benefit since, as above, things like wuffs_png__decoder__filter_4_distance_4_x86_sse42 won't run as is in a "SSE1 and SSE2 only" mode and, on the other hand, the fraction of CPUs these days that are SSE2 but not SSE4.2 capable is surely very small.

Ono-Sendai commented 1 month ago

Yeah, I'm not interested in just targeting SSE2. Targeting SSE 4.2 is much more reasonable these days.

google / wuffs

wuffs 0.4 significantly slower than 0.3 decoding PNGs #148

pragma message("Wuffs with MSVC+IX86/X64 needs /arch:AVX for best performance")

pragma message("Wuffs with MSVC+IX86/X64 needs /arch:AVX for best performance")

With Advanced Vector Extensions 2 (X86/X64) (/arch:AVX2)