veluca93 / fpnge

Demo of a fast PNG encoder.
Apache License 2.0

Encode speed optimizations (2) #9

Status: Closed (animetosho closed this 2 years ago)

animetosho commented 2 years ago

Second batch of optimizations. These shouldn't affect the output in any way.

Most of this is an implementation of WriteBits using the PEXT instruction. The PEXT path is disabled on CPUs where that instruction is slow (AMD CPUs before Zen3).
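For readers unfamiliar with PEXT, here is a minimal sketch of what the instruction computes and why it suits bit packing. It is not the code from this pull request: `PextReference` is a portable stand-in for the BMI2 intrinsic `_pext_u64`, and `PackTwoCodes` is a hypothetical helper that assumes codes sit in fixed 16-bit lanes with only their low bits valid.

```cpp
#include <cstdint>

// Portable reference for the semantics of the BMI2 PEXT instruction
// (exposed as _pext_u64 in <immintrin.h>): the bits of `src` selected by
// `mask` are gathered and packed contiguously into the low bits of the result.
inline uint64_t PextReference(uint64_t src, uint64_t mask) {
  uint64_t out = 0;
  int out_pos = 0;
  while (mask != 0) {
    uint64_t lowest = mask & (~mask + 1);  // isolate the lowest set mask bit
    if (src & lowest) out |= uint64_t{1} << out_pos;
    ++out_pos;
    mask &= mask - 1;  // clear that mask bit and continue
  }
  return out;
}

// Hypothetical helper (an assumption for illustration): two variable-length
// codes are stored in fixed 16-bit lanes, and only the low nbits0 / nbits1
// bits of each lane are meaningful. One PEXT squeezes out the unused bits,
// leaving code0 in the low bits followed directly by code1.
inline uint64_t PackTwoCodes(uint16_t code0, int nbits0,
                             uint16_t code1, int nbits1) {
  uint64_t lanes = uint64_t{code0} | (uint64_t{code1} << 16);
  uint64_t mask = ((uint64_t{1} << nbits0) - 1) |
                  (((uint64_t{1} << nbits1) - 1) << 16);
  return PextReference(lanes, mask);
}
```

On a CPU with fast BMI2, the loop in `PextReference` would be replaced by a single `_pext_u64(src, mask)` call. On CPUs where PEXT is microcoded, a shift-and-OR path is likely the better choice.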

Comparison on a 12700K:

Old code - image 1
   295.585 MP/s
    10.787 bits/pixel
Old code - image 2
   384.460 MP/s
    16.240 bits/pixel

New code - image 1
   302.302 MP/s
    10.787 bits/pixel
New code - image 2
   397.900 MP/s
    16.240 bits/pixel

CLA response: I release these changes to the public domain subject to the CC0 license (https://creativecommons.org/publicdomain/zero/1.0/).

google-cla[bot] commented 2 years ago

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

veluca93 commented 2 years ago

Looks good to me! I wonder if at this point we should be using a vtable-like approach and doing dynamic dispatch depending on the CPU architecture...
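A vtable-like dispatch of the kind suggested here could look roughly like the sketch below. Everything in it is hypothetical: `EncoderVTable`, `SumRowBaseline`, `SumRowWide`, and `GetEncoderVTable` are illustrative names, and the two kernels are plain C++ stand-ins for what would be SSE and AVX2+BMI2 builds of the same routine. The feature check uses the GCC/Clang builtin `__builtin_cpu_supports` and falls back to the baseline on other compilers or architectures.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical table of kernels; the encoder's real entry points differ.
struct EncoderVTable {
  const char* name;
  uint32_t (*sum_row)(const uint8_t* row, size_t n);
};

// Stand-in for a baseline (e.g. SSE) kernel.
inline uint32_t SumRowBaseline(const uint8_t* row, size_t n) {
  uint32_t sum = 0;
  for (size_t i = 0; i < n; ++i) sum += row[i];
  return sum;
}

// Stand-in for a kernel that would be compiled with AVX2+BMI2 enabled.
// It must produce exactly the same result as the baseline.
inline uint32_t SumRowWide(const uint8_t* row, size_t n) {
  uint32_t s0 = 0, s1 = 0;
  size_t i = 0;
  for (; i + 1 < n; i += 2) {
    s0 += row[i];
    s1 += row[i + 1];
  }
  if (i < n) s0 += row[i];
  return s0 + s1;
}

// Runtime feature check; returns false where the builtin is unavailable.
inline bool CpuHasAvx2AndBmi2() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  __builtin_cpu_init();
  return __builtin_cpu_supports("avx2") && __builtin_cpu_supports("bmi2");
#else
  return false;
#endif
}

// Selects the table once, on first use, and reuses it afterwards.
inline const EncoderVTable& GetEncoderVTable() {
  static const EncoderVTable kBaseline = {"baseline", SumRowBaseline};
  static const EncoderVTable kWide = {"avx2+bmi2", SumRowWide};
  static const EncoderVTable& chosen =
      CpuHasAvx2AndBmi2() ? kWide : kBaseline;
  return chosen;
}
```

Callers would go through `GetEncoderVTable().sum_row(row, n)`, so the choice of implementation costs one indirect call per row rather than a branch per pixel.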

(as a side note, I optimized this on a Threadripper CPU, and you certainly can feel the slowness of BMI2 there...)

animetosho commented 2 years ago

Dynamic CPU dispatch would indeed be nice. It's mostly SSE vs AVX+BMI2, with the PEXT exclusion being specific to some AMD CPUs.

BMI2 is generally fine on AMD CPUs - it's just that PDEP/PEXT instructions were microcoded before Zen3, so you need to be careful with those. BMI2 does provide better variable shifts, though AMD CPUs have generally been good at that, even without BMI2.
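The PEXT exclusion described above can be sketched as a runtime check. This is an assumption about how such a check might be written, not the code in this pull request: `CpuFamily`, `PextIsFast`, and `HostPextIsFast` are hypothetical names, the rule "AMD family below 0x19 (Zen3) counts as slow" is one reading of the comment above, and the CPUID access relies on `<cpuid.h>` from GCC/Clang.

```cpp
#include <cstdint>
#include <cstring>
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define SKETCH_HAVE_CPUID 1
#endif

// Decodes the display family from CPUID leaf 1 EAX: the extended family
// field is added only when the base family field is 0xF.
inline uint32_t CpuFamily(uint32_t leaf1_eax) {
  uint32_t base = (leaf1_eax >> 8) & 0xF;
  uint32_t ext = (leaf1_eax >> 20) & 0xFF;
  return base == 0xF ? base + ext : base;
}

// Assumed policy: PEXT is worth using only if BMI2 is present and the CPU
// is not an AMD part older than family 0x19 (Zen3). Other vendors are
// assumed to have a fast PEXT.
inline bool PextIsFast(const char* vendor, uint32_t family, bool has_bmi2) {
  if (!has_bmi2) return false;
  bool amd = std::strcmp(vendor, "AuthenticAMD") == 0;
  return !(amd && family < 0x19);
}

// Applies the policy to the machine the code is running on.
inline bool HostPextIsFast() {
#ifdef SKETCH_HAVE_CPUID
  unsigned eax, ebx, ecx, edx;
  if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx)) return false;
  char vendor[13];
  std::memcpy(vendor, &ebx, 4);  // vendor string is stored as EBX, EDX, ECX
  std::memcpy(vendor + 4, &edx, 4);
  std::memcpy(vendor + 8, &ecx, 4);
  vendor[12] = '\0';
  if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return false;
  uint32_t family = CpuFamily(eax);
  if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return false;
  bool has_bmi2 = ((ebx >> 8) & 1) != 0;  // CPUID.(EAX=7,ECX=0):EBX bit 8
  return PextIsFast(vendor, family, has_bmi2);
#else
  return false;  // non-x86 or unsupported compiler: take the non-PEXT path
#endif
}
```

Keeping the policy in `PextIsFast` separate from the CPUID plumbing makes it possible to test the rule with synthetic vendor and family values, without needing the hardware in question.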