Closed kugelrund closed 2 years ago
Is there an inherent reason for that?
The reason is that I don't use x86 (32-bit) or MSVC day-to-day, so I was conservative. We could possibly change it to
#if defined(_M_X64) || defined(_M_IX86)
although some care might be needed, if other code (PNG-related or otherwise) is depending on the 64-ness that WUFFS_BASE__CPU_ARCH__X86_64
currently guarantees.
Out of curiousity, and if profiling is not too much work for you, do you know which Wuffs functions get greatly improved performance once you define WUFFS_BASE__CPU_ARCH__X86_64 for your 32-bit MSVC-compiled program?
Sure, the main differences for me are basically
wuffs_png__decoder__filter_4_distance_4_x86_sse42
being called instead of wuffs_png__decoder__filter_4_distance_4_fallback
wuffs_png__decoder__filter_4_distance_3_x86_sse42
being called instead of wuffs_png__decoder__filter_4_distance_3_fallback
wuffs_adler32__hasher__up_x86_sse42
being called instead of wuffs_adler32__hasher__up__choosy_default
wuffs_crc32__ieee_hasher__up_x86_avx2
being called instead of wuffs_crc32__ieee_hasher__up__choosy_default
The filter_4_distance_4
function takes the majority of the time for me, so that being faster is the main win. Perhaps I should have mentioned that I need RGBA output, I suppose that determines which filter function is necessary.
Here is an example profile (with only wuffs functions listed), before adding || defined(_M_IX86)
:
Function Name Total CPU [unit, %] Self CPU [unit, %]
wuffs_png__decoder__decode_frame 31681 (75,78 %) 1 ( 0,00 %)
wuffs_png__decoder__filter_and_swizzle 19626 (46,95 %) 0 ( 0,00 %)
wuffs_png__decoder__filter_and_swizzle__choosy_default 19619 (46,93 %) 30 ( 0,07 %)
wuffs_png__decoder__filter_4 18324 (43,83 %) 9 ( 0,02 %)
wuffs_png__decoder__filter_4_distance_4_fallback 16389 (39,20 %) 16387 (39,20 %)
wuffs_png__decoder__decode_pass 12053 (28,83 %) 1 ( 0,00 %)
wuffs_zlib__decoder__transform_io 11598 (27,74 %) 3 ( 0,01 %)
wuffs_deflate__decoder__transform_io 10147 (24,27 %) 0 ( 0,00 %)
wuffs_deflate__decoder__decode_blocks 10109 (24,18 %) 2 ( 0,00 %)
wuffs_deflate__decoder__decode_huffman_fast32 9793 (23,43 %) 9515 (22,76 %)
wuffs_png__decoder__filter_4_distance_3_fallback 1928 ( 4,61 %) 1928 ( 4,61 %)
wuffs_adler32__hasher__update_u32 1447 ( 3,46 %) 1 ( 0,00 %)
wuffs_adler32__hasher__up 1446 ( 3,46 %) 0 ( 0,00 %)
wuffs_adler32__hasher__up__choosy_default 1446 ( 3,46 %) 1446 ( 3,46 %)
wuffs_base__pixel_swizzler__swizzle_interleaved_from_slice 1263 ( 3,02 %) 5 ( 0,01 %)
wuffs_crc32__ieee_hasher__update_u32 566 ( 1,35 %) 4 ( 0,01 %)
wuffs_crc32__ieee_hasher__up 562 ( 1,34 %) 0 ( 0,00 %)
wuffs_crc32__ieee_hasher__up__choosy_default 562 ( 1,34 %) 562 ( 1,34 %)
wuffs_base__pixel_swizzler__bgrw__bgr 510 ( 1,22 %) 451 ( 1,08 %)
wuffs_deflate__decoder__init_dynamic_huffman 306 ( 0,73 %) 86 ( 0,21 %)
wuffs_deflate__decoder__init_huff 257 ( 0,61 %) 257 ( 0,61 %)
wuffs_deflate__decoder__init_fixed_huffman 40 ( 0,10 %) 2 ( 0,00 %)
And after adding || defined(_M_IX86)
:
Function Name Total CPU [unit, %] Self CPU [unit, %]
wuffs_png__decoder__decode_frame 16978 (63,48 %) 0 ( 0,00 %)
wuffs_png__decoder__decode_pass 10508 (39,29 %) 1 ( 0,00 %)
wuffs_zlib__decoder__transform_io 10437 (39,02 %) 1 ( 0,00 %)
wuffs_deflate__decoder__transform_io 10087 (37,71 %) 1 ( 0,00 %)
wuffs_deflate__decoder__decode_blocks 10036 (37,52 %) 2 ( 0,01 %)
wuffs_deflate__decoder__decode_huffman_fast32 9712 (36,31 %) 9389 (35,10 %)
wuffs_png__decoder__filter_and_swizzle 6468 (24,18 %) 0 ( 0,00 %)
wuffs_png__decoder__filter_and_swizzle__choosy_default 6466 (24,18 %) 23 ( 0,09 %)
wuffs_png__decoder__filter_4 5203 (19,45 %) 7 ( 0,03 %)
wuffs_png__decoder__filter_4_distance_4_x86_sse42 4232 (15,82 %) 4231 (15,82 %)
wuffs_base__pixel_swizzler__swizzle_interleaved_from_slice 1237 ( 4,62 %) 8 ( 0,03 %)
wuffs_png__decoder__filter_4_distance_3_x86_sse42 964 ( 3,60 %) 964 ( 3,60 %)
wuffs_base__pixel_swizzler__bgrw__bgr 515 ( 1,93 %) 442 ( 1,65 %)
wuffs_adler32__hasher__up 349 ( 1,30 %) 0 ( 0,00 %)
wuffs_adler32__hasher__update_u32 349 ( 1,30 %) 0 ( 0,00 %)
wuffs_adler32__hasher__up_x86_sse42 349 ( 1,30 %) 349 ( 1,30 %)
wuffs_deflate__decoder__init_dynamic_huffman 316 ( 1,18 %) 78 ( 0,29 %)
wuffs_deflate__decoder__init_huff 283 ( 1,06 %) 283 ( 1,06 %)
wuffs_crc32__ieee_hasher__update_u32 92 ( 0,34 %) 7 ( 0,03 %)
wuffs_crc32__ieee_hasher__up 85 ( 0,32 %) 0 ( 0,00 %)
wuffs_crc32__ieee_hasher__up_x86_avx2 85 ( 0,32 %) 85 ( 0,32 %)
wuffs_deflate__decoder__init_fixed_huffman 50 ( 0,19 %) 1 ( 0,00 %)
It's fixed for release/c/wuffs-unsupported-snapshot.c
. It'll be fixed in release/c/wuffs-v0.3.c
whenever the next beta (version 0.3.0-beta.10) gets minted.
In case anyone else from the future is trying something similar, I presume you are configuring MSVC with /arch:AVX
.
Works great, thank you very much.
In case anyone else from the future is trying something similar, I presume you are configuring MSVC with
/arch:AVX
.
Yes indeed.
I'm using Wuffs-the-library in an application that is restricted to x86 (32-bit) and MSVC. I noticed that the definition of
WUFFS_BASE__CPU_ARCH__X86_64
is guarded by a check for_M_X64
for MSVC (and__x86_64__
if not MSVC) in this place:https://github.com/google/wuffs/blob/786fc74b923ac259dfceb57904457a3b4797bfbd/internal/cgen/base/fundamental-public.h#L89
Is there an inherent reason for that? Or would it be possible to change those checks to e.g.
(and similarly
__i386__
if not MSVC) to allow these extensions on x86?At least for my application this change works without any problems and greatly improves performance for decoding PNGs which is what I use Wuffs for.