nothings / stb

stb single-file public domain libraries for C/C++

stb resize 2.08 #1649

Open jeffrbig2 opened 3 weeks ago

jeffrbig2 commented 3 weeks ago

fix for RGB->BGR three channel flips and add SIMD (thanks to Ryan Salsbury)
fix for sub-rect resizes
use pragmas to control unrolling when they are available

ryanrsrs commented 3 weeks ago

I tested this change on my Raspberry Pi 4B running Raspberry Pi OS in 32-bit mode:

$ uname -a
Linux raspberrypi 6.6.31+rpt-rpi-v7l #1 SMP Raspbian 1:6.6.31-1+rpt1 (2024-05-29) armv7l GNU/Linux

The color bug I noticed in stbir__simple_flip_3ch() is fixed, in both scalar and SIMD paths. On my platform, stbir__simdf_swiz2 is not defined and it selects the second SIMD code block, using stbir__simdf_swiz().
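For readers unfamiliar with the operation under discussion, here is a minimal scalar sketch of what a three-channel flip does: it swaps the first and third channel of every pixel (RGB to BGR or back) and leaves the middle one alone. This is an illustration only, not the library's code; the function name `flip_3ch_u8` and the choice of 8-bit samples are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of stb): swap channel 0 and channel 2
   of each 3-channel pixel in place. The middle channel is untouched,
   so applying it twice restores the original data. */
static void flip_3ch_u8(uint8_t *pixels, size_t pixel_count)
{
    for (size_t i = 0; i < pixel_count; ++i) {
        uint8_t *p = pixels + 3 * i;  /* start of pixel i */
        uint8_t tmp = p[0];
        p[0] = p[2];
        p[2] = tmp;
    }
}
```

A bug in this kind of routine shows up as swapped or smeared colors, which is why it is easy to confirm by eye in both the scalar and SIMD builds.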

The change in speed from enabling SIMD is slight (but consistent). I have verified which code paths are executing using printfs.

With gcc, SIMD gave a 15% speedup. With clang, SIMD gave a 3% slowdown.

GCC build options:
cc -std=gnu11 -Wall -I/usr/include/libdrm -Os -march=native -DSTBIR_USE_FMA -mfpu=neon-vfpv4 -mfp16-format=ieee -Wno-unused-function -c stb_impl.c

Clang build options:
clang -std=gnu11 -Wall -I/usr/include/libdrm -Os -march=native -DSTBIR_USE_FMA -mfpu=neon-vfpv4 -Wno-unused-function -c stb_impl.c

The fastest version, Clang with -DSTBIR_NO_SIMD (lol), performs as follows:
src: 6048 x 8064
dst: 900 x 1200
time: 1.003 seconds

I'm not sure why it's so slow since it's only 150 MB of pixels. Maybe the long scanlines are thrashing the cache in a maximally-bad way?

The speed is fine for my application, and matches the 2.07 non-SIMD speed, so I dunno if there's a problem. But if you expected a bigger difference on this platform, I can poke at it some more.

e: All times mentioned above are for the call to stbir_resize_extended(), which does much more work than just flip_3ch(). But even the core resizer math doesn't speed up with SIMD, really? Maybe I am doing something wrong here.

e2: just rechecked 2.08 times against 2.07, both scalar and SIMD. They're the same. So this does not seem like a regression, just something I noticed now, since I am comparing simd and not-simd back-to-back to see that the color was fixed in both.

jeffrbig2 commented 3 weeks ago

That's a reasonably big downsample (depending on your filter) - 1 second doesn't seem nuts for a 32-bit platform that is reading 150 MB of input with a sample window of 27x20 (each output pixel has to read 27x20 of the input). 32-bit vs 64-bit is a huge hit here, btw. There are a couple things you can do:
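The sizes quoted in the thread can be checked with a little arithmetic. The sketch below assumes 3 channels of 1 byte each (8-bit RGB), which the thread implies but never states outright; the helper names are made up for the example. It does not try to derive the 27x20 sample window, which depends on the filter in use.

```c
#include <stdint.h>

/* Hypothetical helper: size in bytes of a tightly packed image. */
static uint64_t image_bytes(uint32_t w, uint32_t h,
                            uint32_t channels, uint32_t bytes_per_channel)
{
    return (uint64_t)w * h * channels * bytes_per_channel;
}

/* Hypothetical helper: how many source pixels map onto one
   destination pixel along a single axis. */
static double scale_ratio(uint32_t src, uint32_t dst)
{
    return (double)src / (double)dst;
}
```

For the 6048 x 8064 source, `image_bytes(6048, 8064, 3, 1)` is 146,313,216 bytes, which matches the "150 MB" figure, and `scale_ratio` gives 6.72 on both axes for the 900 x 1200 target. So each output pixel covers roughly 45 input pixels before any filter support is added on top.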

1) throw threads at it - this is a linear speed up - 2x cores, half the time.
2) use linear pixel format - STBIR_TYPE_UINT8 instead of STBIR_TYPE_UINT8_SRGB
3) don't use wrap edge mode
4) use a simpler filter, STBIR_FILTER_BOX or STBIR_FILTER_TRIANGLE.
5) to make better cache use, break the resize into vertical stripes (use the stbir_set_pixel_subrect function to do 128 vertical output pixels at a time). This will usually save 25% to 50%.

For option 5, you can also wait for 2.09 which will internally do the cache striping for you.
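As a rough sketch of the striping idea in option 5, the helper below only plans the output rectangles; it does not call the resizer. It assumes "128 vertical output pixels at a time" means full-height stripes that are 128 output pixels wide, which is one reading of the comment. `stripe_rect` and `plan_vertical_stripes` are hypothetical names, and the exact call sequence around stbir_set_pixel_subrect should be checked against the header.

```c
/* Hypothetical type: one output sub-rectangle in pixels. */
typedef struct { int x, y, w, h; } stripe_rect;

/* Hypothetical helper: split a dst_w x dst_h output into full-height
   stripes that are at most stripe_w pixels wide. Writes up to max_out
   rectangles into out and returns the number of stripes required.
   Each rectangle would then be handed to stbir_set_pixel_subrect()
   before running the resize for that stripe. */
static int plan_vertical_stripes(int dst_w, int dst_h, int stripe_w,
                                 stripe_rect *out, int max_out)
{
    int count = 0;
    if (dst_w <= 0 || dst_h <= 0 || stripe_w <= 0)
        return 0;
    for (int x = 0; x < dst_w; x += stripe_w) {
        if (count < max_out) {
            int remaining = dst_w - x;
            out[count].x = x;
            out[count].y = 0;
            out[count].w = remaining < stripe_w ? remaining : stripe_w;
            out[count].h = dst_h;
        }
        ++count;  /* keep counting so the caller can detect overflow */
    }
    return count;
}
```

For the 900 x 1200 output in this thread, a stripe width of 128 yields eight stripes, the last one only 4 pixels wide. The same list of rectangles could also be divided among threads for option 1.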

But yeah, 32-bit arm is just pretty darn pokey in general.

ryanrsrs commented 3 weeks ago

Yep, I'm not complaining about the performance, I just wanted to check that the numbers seemed sensible.

The application I'm testing is decode and display of 48MP iPhone 15 HEIC files on a Raspberry Pi Zero 2 W 512MB. (It works!)

jeffrbig2 commented 3 weeks ago

There's probably some more wins if you want to get fancy. Instead of decoding the HEIC into RGB and then resizing that, decode into YUV (where the U and V planes are smaller), resize those planes, and THEN convert to RGB in the smaller space.
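A rough sketch of why this helps, under two assumptions the thread does not confirm: that the decoder outputs 4:2:0 planes (U and V at half resolution on each axis), and that the color conversion is full-range BT.601. The real matrix and range depend on the file. All names below are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: total samples in a 4:2:0 image, i.e. one
   full-size Y plane plus two half-width, half-height chroma planes. */
static uint64_t yuv420_samples(uint32_t w, uint32_t h)
{
    uint64_t cw = (w + 1u) / 2u;  /* chroma width, rounded up */
    uint64_t ch = (h + 1u) / 2u;  /* chroma height, rounded up */
    return (uint64_t)w * h + 2u * cw * ch;
}

static uint8_t clamp_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Hypothetical helper: convert one pixel with the full-range BT.601
   matrix (an assumption). Run this on the already-resized planes so
   it touches 900 x 1200 pixels instead of 6048 x 8064. */
static void yuv_to_rgb(uint8_t y, uint8_t u, uint8_t v, uint8_t rgb[3])
{
    float cb = (float)u - 128.0f;
    float cr = (float)v - 128.0f;
    rgb[0] = clamp_u8((int)((float)y + 1.402f * cr + 0.5f));
    rgb[1] = clamp_u8((int)((float)y - 0.344136f * cb - 0.714136f * cr + 0.5f));
    rgb[2] = clamp_u8((int)((float)y + 1.772f * cb + 0.5f));
}
```

Under the 4:2:0 assumption, the 6048 x 8064 source has 73,156,608 samples across its three planes, exactly half of the 146,313,216 in packed RGB, so the resizer reads half as much input and the conversion runs only on the small output.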