veluca93 / fpnge

Demo of a fast PNG encoder.
Apache License 2.0

Remove PCLMUL requirement for CRC computation #12

Closed: animetosho closed this 2 years ago

animetosho commented 2 years ago

This reduces the base ISA requirement to just SSE4.1 (reducing it further to SSSE3 shouldn't be too hard, but it's not something I need).
CRC32 is computed with the slice-by-8 algorithm when PCLMUL isn't available.
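
For readers unfamiliar with slice-by-8, here is a minimal, portable sketch of the technique. It is not the code from this pull request: the names `MakeCrcTables`, `kCrcTables`, `Load32LE` and `Crc32SliceBy8` are hypothetical, and the only assumption is the standard reflected CRC-32 polynomial `0xEDB88320` that PNG uses. The idea is to consume 8 input bytes per iteration with 8 table lookups instead of one byte per iteration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Build 8 lookup tables for the reflected CRC-32 polynomial 0xEDB88320.
// Table 0 is the classic byte-at-a-time table; table s gives the CRC
// contribution of a byte followed by s zero bytes.
static std::array<std::array<uint32_t, 256>, 8> MakeCrcTables() {
  std::array<std::array<uint32_t, 256>, 8> t{};
  for (uint32_t i = 0; i < 256; i++) {
    uint32_t c = i;
    for (int k = 0; k < 8; k++) c = (c >> 1) ^ ((c & 1) ? 0xEDB88320u : 0u);
    t[0][i] = c;
  }
  for (uint32_t i = 0; i < 256; i++) {
    for (int s = 1; s < 8; s++) {
      t[s][i] = (t[s - 1][i] >> 8) ^ t[0][t[s - 1][i] & 0xFF];
    }
  }
  return t;
}

static const auto kCrcTables = MakeCrcTables();

// Little-endian 32-bit load done byte by byte, so the sketch does not
// depend on host endianness or alignment.
static uint32_t Load32LE(const uint8_t* p) {
  return static_cast<uint32_t>(p[0]) | (static_cast<uint32_t>(p[1]) << 8) |
         (static_cast<uint32_t>(p[2]) << 16) |
         (static_cast<uint32_t>(p[3]) << 24);
}

// Hypothetical helper: CRC-32 of `data`, optionally continuing from a
// previous result `crc` (pass 0 to start a new checksum).
uint32_t Crc32SliceBy8(const uint8_t* data, size_t len, uint32_t crc = 0) {
  const auto& T = kCrcTables;
  crc = ~crc;
  while (len >= 8) {
    // Fold the running CRC into the first 4 bytes, then look up all 8 bytes.
    uint32_t lo = crc ^ Load32LE(data);
    uint32_t hi = Load32LE(data + 4);
    crc = T[7][lo & 0xFF] ^ T[6][(lo >> 8) & 0xFF] ^
          T[5][(lo >> 16) & 0xFF] ^ T[4][lo >> 24] ^
          T[3][hi & 0xFF] ^ T[2][(hi >> 8) & 0xFF] ^
          T[1][(hi >> 16) & 0xFF] ^ T[0][hi >> 24];
    data += 8;
    len -= 8;
  }
  // Remaining 0..7 bytes use the plain byte-at-a-time update.
  while (len--) crc = (crc >> 8) ^ T[0][(crc ^ *data++) & 0xFF];
  return ~crc;
}
```

The eight lookups per iteration are independent of each other, which is why this tends to be faster than the byte-wise loop on CPUs without a carry-less multiply instruction.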

This also changes the CLMUL implementation so that the final reduction is deferred until the hash is actually needed.

CLA response: I release these changes to the public domain subject to the CC0 license (https://creativecommons.org/publicdomain/zero/1.0/).

google-cla[bot] commented 2 years ago

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

veluca93 commented 2 years ago

Given that, according to the Steam hardware survey, only ~1% of users lack SSE4.1, I'd consider that a reasonable target.

Can we have a benchmark of old/new/sse41?

animetosho commented 2 years ago

The 12700K I'm testing on has a fast PCLMUL implementation, so it's probably not the most representative machine, but here you go:

Old code - image 1
   270.227 MP/s
    10.770 bits/pixel
Old code - image 2
   328.211 MP/s
    16.239 bits/pixel

New code (SSE4+PCLMUL) - image 1
   272.260 MP/s
    10.770 bits/pixel
New code (SSE4+PCLMUL) - image 2
   327.647 MP/s
    16.239 bits/pixel

New code (SSE4) - image 1
   244.609 MP/s
    10.770 bits/pixel
New code (SSE4) - image 2
   270.416 MP/s
    16.239 bits/pixel
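
To put the numbers above in relative terms, the small sketch below computes the throughput loss of the SSE4-only build against the SSE4+PCLMUL build. The function name `SlowdownPercent` is hypothetical; the inputs are the MP/s figures quoted above. On this machine the loss works out to roughly 10% for image 1 and roughly 17% for image 2, while the old and new PCLMUL builds are within 1% of each other.

```cpp
#include <cstdio>

// Relative throughput loss of `candidate_mps` versus `baseline_mps`, in percent.
double SlowdownPercent(double baseline_mps, double candidate_mps) {
  return 100.0 * (1.0 - candidate_mps / baseline_mps);
}

// Prints the SSE4-only slowdown for both test images, using the figures
// reported in the benchmark above.
void PrintSlowdowns() {
  std::printf("image 1: %.1f%% slower\n", SlowdownPercent(272.260, 244.609));
  std::printf("image 2: %.1f%% slower\n", SlowdownPercent(327.647, 270.416));
}
```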