Support for SSE/AVX when targeting x86?

kugelrund commented 2 years ago

I'm using Wuffs-the-library in an application that is restricted to x86 (32-bit) and MSVC. I noticed that the definition of WUFFS_BASE__CPU_ARCH__X86_64 is guarded by a check for _M_X64 for MSVC (and __x86_64__ if not MSVC) in this place:

https://github.com/google/wuffs/blob/786fc74b923ac259dfceb57904457a3b4797bfbd/internal/cgen/base/fundamental-public.h#L89

Is there an inherent reason for that? Or would it be possible to change those checks to e.g.

#if defined(_M_X64) || defined(_M_IX86)

(and similarly __i386__ if not MSVC) to allow these extensions on x86?

At least for my application this change works without any problems and greatly improves performance for decoding PNGs which is what I use Wuffs for.

nigeltao commented 2 years ago

Is there an inherent reason for that?

The reason is that I don't use x86 (32-bit) or MSVC day-to-day, so I was conservative. We could possibly change it to

#if defined(_M_X64) || defined(_M_IX86)

although some care might be needed, if other code (PNG-related or otherwise) is depending on the 64-ness that WUFFS_BASE__CPU_ARCH__X86_64 currently guarantees.

Out of curiousity, and if profiling is not too much work for you, do you know which Wuffs functions get greatly improved performance once you define WUFFS_BASE__CPU_ARCH__X86_64 for your 32-bit MSVC-compiled program?

kugelrund commented 2 years ago

Sure, the main differences for me are basically

wuffs_png__decoder__filter_4_distance_4_x86_sse42 being called instead of wuffs_png__decoder__filter_4_distance_4_fallback
wuffs_png__decoder__filter_4_distance_3_x86_sse42 being called instead of wuffs_png__decoder__filter_4_distance_3_fallback
wuffs_adler32__hasher__up_x86_sse42 being called instead of wuffs_adler32__hasher__up__choosy_default
wuffs_crc32__ieee_hasher__up_x86_avx2 being called instead of wuffs_crc32__ieee_hasher__up__choosy_default

The filter_4_distance_4 function takes the majority of the time for me, so that being faster is the main win. Perhaps I should have mentioned that I need RGBA output, I suppose that determines which filter function is necessary.

Here is an example profile (with only wuffs functions listed), before adding || defined(_M_IX86):

Function Name                                          Total CPU [unit, %]  Self CPU [unit, %]
wuffs_png__decoder__decode_frame                           31681 (75,78 %)         1 ( 0,00 %)
wuffs_png__decoder__filter_and_swizzle                     19626 (46,95 %)         0 ( 0,00 %)
wuffs_png__decoder__filter_and_swizzle__choosy_default     19619 (46,93 %)        30 ( 0,07 %)
wuffs_png__decoder__filter_4                               18324 (43,83 %)         9 ( 0,02 %)
wuffs_png__decoder__filter_4_distance_4_fallback           16389 (39,20 %)     16387 (39,20 %)
wuffs_png__decoder__decode_pass                            12053 (28,83 %)         1 ( 0,00 %)
wuffs_zlib__decoder__transform_io                          11598 (27,74 %)         3 ( 0,01 %)
wuffs_deflate__decoder__transform_io                       10147 (24,27 %)         0 ( 0,00 %)
wuffs_deflate__decoder__decode_blocks                      10109 (24,18 %)         2 ( 0,00 %)
wuffs_deflate__decoder__decode_huffman_fast32               9793 (23,43 %)      9515 (22,76 %)
wuffs_png__decoder__filter_4_distance_3_fallback            1928 ( 4,61 %)      1928 ( 4,61 %)
wuffs_adler32__hasher__update_u32                           1447 ( 3,46 %)         1 ( 0,00 %)
wuffs_adler32__hasher__up                                   1446 ( 3,46 %)         0 ( 0,00 %)
wuffs_adler32__hasher__up__choosy_default                   1446 ( 3,46 %)      1446 ( 3,46 %)
wuffs_base__pixel_swizzler__swizzle_interleaved_from_slice  1263 ( 3,02 %)         5 ( 0,01 %)
wuffs_crc32__ieee_hasher__update_u32                         566 ( 1,35 %)         4 ( 0,01 %)
wuffs_crc32__ieee_hasher__up                                 562 ( 1,34 %)         0 ( 0,00 %)
wuffs_crc32__ieee_hasher__up__choosy_default                 562 ( 1,34 %)       562 ( 1,34 %)
wuffs_base__pixel_swizzler__bgrw__bgr                        510 ( 1,22 %)       451 ( 1,08 %)
wuffs_deflate__decoder__init_dynamic_huffman                 306 ( 0,73 %)        86 ( 0,21 %)
wuffs_deflate__decoder__init_huff                            257 ( 0,61 %)       257 ( 0,61 %)
wuffs_deflate__decoder__init_fixed_huffman                    40 ( 0,10 %)         2 ( 0,00 %)

And after adding || defined(_M_IX86):

Function Name                                          Total CPU [unit, %]  Self CPU [unit, %]
wuffs_png__decoder__decode_frame                           16978 (63,48 %)         0 ( 0,00 %)
wuffs_png__decoder__decode_pass                            10508 (39,29 %)         1 ( 0,00 %)
wuffs_zlib__decoder__transform_io                          10437 (39,02 %)         1 ( 0,00 %)
wuffs_deflate__decoder__transform_io                       10087 (37,71 %)         1 ( 0,00 %)
wuffs_deflate__decoder__decode_blocks                      10036 (37,52 %)         2 ( 0,01 %)
wuffs_deflate__decoder__decode_huffman_fast32               9712 (36,31 %)      9389 (35,10 %)
wuffs_png__decoder__filter_and_swizzle                      6468 (24,18 %)         0 ( 0,00 %)
wuffs_png__decoder__filter_and_swizzle__choosy_default      6466 (24,18 %)        23 ( 0,09 %)
wuffs_png__decoder__filter_4                                5203 (19,45 %)         7 ( 0,03 %)
wuffs_png__decoder__filter_4_distance_4_x86_sse42           4232 (15,82 %)      4231 (15,82 %)
wuffs_base__pixel_swizzler__swizzle_interleaved_from_slice  1237 ( 4,62 %)         8 ( 0,03 %)
wuffs_png__decoder__filter_4_distance_3_x86_sse42            964 ( 3,60 %)       964 ( 3,60 %)
wuffs_base__pixel_swizzler__bgrw__bgr                        515 ( 1,93 %)       442 ( 1,65 %)
wuffs_adler32__hasher__up                                    349 ( 1,30 %)         0 ( 0,00 %)
wuffs_adler32__hasher__update_u32                            349 ( 1,30 %)         0 ( 0,00 %)
wuffs_adler32__hasher__up_x86_sse42                          349 ( 1,30 %)       349 ( 1,30 %)
wuffs_deflate__decoder__init_dynamic_huffman                 316 ( 1,18 %)        78 ( 0,29 %)
wuffs_deflate__decoder__init_huff                            283 ( 1,06 %)       283 ( 1,06 %)
wuffs_crc32__ieee_hasher__update_u32                          92 ( 0,34 %)         7 ( 0,03 %)
wuffs_crc32__ieee_hasher__up                                  85 ( 0,32 %)         0 ( 0,00 %)
wuffs_crc32__ieee_hasher__up_x86_avx2                         85 ( 0,32 %)        85 ( 0,32 %)
wuffs_deflate__decoder__init_fixed_huffman                    50 ( 0,19 %)         1 ( 0,00 %)

nigeltao commented 2 years ago

It's fixed for release/c/wuffs-unsupported-snapshot.c. It'll be fixed in release/c/wuffs-v0.3.c whenever the next beta (version 0.3.0-beta.10) gets minted.

In case anyone else from the future is trying something similar, I presume you are configuring MSVC with /arch:AVX.

kugelrund commented 2 years ago

Works great, thank you very much.

In case anyone else from the future is trying something similar, I presume you are configuring MSVC with /arch:AVX.

Yes indeed.

google / wuffs

Support for SSE/AVX when targeting x86? #53