PowerPC Power8 VSX SIMD optimized filter functions

edelsohn commented 8 years ago

Implement SIMD vector support for Power8 VSX SIMD equivalent to libpng support for SIMD optimizations for Intel SSE, ARM Neon, and MIPS MSA. Achieve speedup appropriate for PowerPC VSX vector width and processor pipeline.

significant-bit commented 7 years ago

I'm taking a whack at this. @glennrp what is the recommended path to get external contributions into libpng?

glennrp commented 7 years ago

You may submit a GIT pull request or just email me a patch.

dmiller423 commented 7 years ago

Is code in process of being accepted? Or still open?

significant-bit commented 7 years ago

The issue is still open, but I'm pretty far along on local hardware. Will publish early next week & start the review process.

barkovv commented 7 years ago

@edelsohn Should I implement runtime detection of VSX support by the CPU? If so, please tell me which platforms must be supported (currently Linux is the only one supported).

edelsohn commented 7 years ago

Does libpng support runtime detection of SSE4 vs AVX vs AVX2 vs AVX512?

The intended target for the issue and bounty is PPC64 Linux support, not other OSes.

barkovv commented 7 years ago

@edelsohn There is no runtime detection support for Intel, but there is for MIPS and ARM. Now linux support for ppc64 has been made, so I guess it is okay for this issue.

edelsohn commented 7 years ago

Does runtime support work for PPC64 on Linux? I'm unsure how to interpret your comment about "linux support for ppc64 has been made"

barkovv commented 7 years ago

@edelsohn I mean runtime detection of "is the current CPU able to run Altivec && VSX code". Yes, it is done for PPC64 Linux.
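As an illustration of the kind of check being discussed, here is a minimal sketch of Altivec and VSX detection on Linux. It is not the code from the pull request. The constants are the values I believe the Linux kernel headers define for `AT_HWCAP`, `PPC_FEATURE_HAS_ALTIVEC` and `PPC_FEATURE_HAS_VSX`; the function names `parse_auxv` and `cpu_supports_vsx` are hypothetical, and the idea of reading `/proc/self/auxv` is an assumption chosen so the logic can be shown without C.

```python
import struct

# Assumed values from the Linux kernel headers; not taken from the libpng patch.
AT_NULL = 0
AT_HWCAP = 16
PPC_FEATURE_HAS_ALTIVEC = 0x10000000
PPC_FEATURE_HAS_VSX = 0x00000080


def parse_auxv(data, word_size=8, byteorder="little"):
    """Decode an auxiliary vector (e.g. the bytes of /proc/self/auxv) into a dict."""
    prefix = "<" if byteorder == "little" else ">"
    fmt = prefix + ("QQ" if word_size == 8 else "II")
    entry_size = struct.calcsize(fmt)
    usable = len(data) - len(data) % entry_size  # ignore a trailing partial entry
    entries = {}
    for key, value in struct.iter_unpack(fmt, data[:usable]):
        if key == AT_NULL:  # AT_NULL terminates the vector
            break
        entries[key] = value
    return entries


def cpu_supports_vsx(hwcap):
    """True only when both the Altivec and the VSX bits are set in AT_HWCAP."""
    required = PPC_FEATURE_HAS_ALTIVEC | PPC_FEATURE_HAS_VSX
    return (hwcap & required) == required
```

On a ppc64le machine one would pass the contents of `/proc/self/auxv` to `parse_auxv` and feed the `AT_HWCAP` entry to `cpu_supports_vsx`; on other architectures the bits mean something else, so the result is only meaningful on PowerPC.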

edelsohn commented 7 years ago

@barkovv @glennrp What is the final speed up and how does this compare with other SIMD architectures of similar width, such as ARM Neon and Intel SSE?

barkovv commented 7 years ago

@edelsohn According to John Bowler's words:

Analysis of the results suggests a 1/74% speed up from using the Altivec code; this is suspiciously large.
Final results: about 8% improvement in (just) decode time of typical PNGs

edelsohn commented 7 years ago

I saw that in the pull request.

What is the speedup for other SIMD architectures for the same measurements? Is the POWER VSX code achieving equivalent or better speedup?

8% improvement for decode seems small, but I don't know what to compare it against.

barkovv commented 7 years ago

@edelsohn I've just run tests on Intel. Here are some quick results:

libpng-noopt $ ./timepng ../Earth10k.png 
1.509859932
libpng-opt $ ./timepng ../Earth10k.png 
1.192934654

Calculations:

>>> 1.192934654/1.509859932
0.79009623920532

So, according to timepng benchmarking, the PowerPC VSX optimisation is on par with Intel SSE (74% and 79%).
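The arithmetic above can be restated as a small sketch. The two timings are the ones used in the calculation; the function names are hypothetical. Note that a ratio of about 0.79 means the optimized build takes about 79% of the baseline time, which is not the same thing as a 79% speedup.

```python
def time_ratio(optimized_seconds, baseline_seconds):
    """Optimized time as a fraction of baseline time (lower is better)."""
    return optimized_seconds / baseline_seconds


def percent_time_saved(optimized_seconds, baseline_seconds):
    """Share of the baseline time that the optimization removes, in percent."""
    return (1.0 - optimized_seconds / baseline_seconds) * 100.0


# Intel timings from the timepng runs quoted in this comment.
intel_ratio = time_ratio(1.192934654, 1.509859932)
intel_saved = percent_time_saved(1.192934654, 1.509859932)
print(f"ratio {intel_ratio:.4f}, time saved {intel_saved:.1f}%")
```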

barkovv commented 7 years ago

I can measure this results deeper and with more accuracy if @jbowler will provide some more information about his measure methods and scripts.

glennrp commented 7 years ago

I'm running Earth10k.png through pngcrush for timing. It's not surprising that optimization will make a significant improvement because the image has mostly AVG and PAETH filtering.
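For readers unfamiliar with the two filters mentioned, here is a minimal scalar sketch of AVG and PAETH reconstruction as described in the PNG specification. It is plain Python for illustration, not libpng's code, and the function names are hypothetical. Each output byte depends on the previously reconstructed byte one pixel to the left, which is why these two filters cost more per byte than the others and are the usual targets for SIMD work.

```python
def paeth_predictor(a, b, c):
    """Paeth predictor: a = left, b = above, c = upper-left (PNG specification)."""
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc:
        return a
    if pb <= pc:
        return b
    return c


def unfilter_row(filter_type, filtered, previous, bpp):
    """Reconstruct one row filtered with AVG (type 3) or PAETH (type 4).

    filtered: filtered bytes of this row; previous: reconstructed bytes of the
    row above (all zeros for the first row); bpp: bytes per complete pixel.
    """
    out = bytearray(len(filtered))
    for i, x in enumerate(filtered):
        a = out[i - bpp] if i >= bpp else 0        # left, already reconstructed
        b = previous[i]                            # above
        c = previous[i - bpp] if i >= bpp else 0   # upper-left
        if filter_type == 3:
            predictor = (a + b) // 2
        elif filter_type == 4:
            predictor = paeth_predictor(a, b, c)
        else:
            raise ValueError("this sketch only handles AVG (3) and PAETH (4)")
        out[i] = (x + predictor) & 0xFF            # arithmetic is modulo 256
    return bytes(out)
```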

edelsohn commented 7 years ago

I'm not questioning the result. I greatly appreciate the excellent work on the patch to implement POWER8 VSX optimizations. The milestone for this issue is not some perfect, unrealistic, magical improvement in performance nor some super-human effort to achieve an order of magnitude advantage. The goal for the issue is the implementation of architecture-specific SIMD optimizations for POWER8 VSX that are equivalent in form to Intel SSE, with the hope of achieving equivalent speedup.

I want to reach out to the experts for this important library to get their assessment of the benefit produced by the patch. Is the improvement produced by this patch about right?

jbowler commented 7 years ago

On Wed, Feb 22, 2017 at 8:27 AM, Vadim Barkov notifications@github.com wrote:

I can measure this results deeper and with more accuracy if @jbowler will provide some more information about his measure methods and scripts.

No scripts; contrib/libtests/timepng.c run on a suitable test assembly. I described the approach in the pull request. The sample images were described years ago and reflect the PNG files to be found on the web but excluding ones under 9x9 pixels. The test probably over-emphasises large PNG files for web pages; I don't think most web pages use more than a couple of PNGs bigger than 16x16, but most of my test set ended up being around 64x64.

A quick way of getting a test set, one we've used before, is to grab all the PNG files in your browser cache; it works just so long as you haven't loaded a page of PNG test images recently ;-)
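A test set of this kind could be gathered with something like the sketch below. It is only an illustration of the approach described, not a script used by anyone in this thread. The function names are hypothetical, and "excluding ones under 9x9 pixels" is interpreted here as requiring both width and height to be at least 9, which is an assumption. Files are recognized by the PNG signature rather than by extension, since cache files often have no extension.

```python
import struct
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(header):
    """Return (width, height) from the first 24 bytes of a PNG, or None."""
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        return None
    # IHDR starts with width and height as big-endian 32-bit integers.
    return struct.unpack(">II", header[16:24])


def keep_for_test_set(header, min_side=9):
    """Assumed rule: keep PNGs whose width and height are both >= min_side."""
    dims = png_dimensions(header)
    return dims is not None and dims[0] >= min_side and dims[1] >= min_side


def collect_test_set(directory, min_side=9):
    """Walk a directory (e.g. a browser cache) and list the PNGs worth timing."""
    selected = []
    for path in Path(directory).rglob("*"):
        if path.is_file():
            with path.open("rb") as fh:
                if keep_for_test_set(fh.read(24), min_side):
                    selected.append(path)
    return sorted(selected)
```

Each selected file could then be passed to timepng in turn.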

-- John Bowler john.cunningham.bowler@gmail.com +1 (541) 450-9885 PO BOX 3151 KERBY OR 97531-3151 USA

barkovv commented 7 years ago

@edelsohn I don't understand you. My point is that the PowerPC VSX optimization is equivalent to the Intel SSE one. Do I need to prove it by profiling? Or do you want @glennrp and @jbowler to approve it? What actions must be performed to approve this work?

edelsohn commented 7 years ago

@barkovv I want to know if @glennrp agrees with the test methodology. Again, I am not trying to create new hurdles. I simply want to be able to point at an objective test recommended by the libpng community that says, "yes, this is good enough."

jbowler commented 7 years ago

On Wed, Feb 22, 2017 at 9:44 AM, David Edelsohn notifications@github.com wrote:

The goal for the issue is the implementation of architecture-specific SIMD optimizations for POWER8 VSX that are equivalent in form to Intel SSE, with the hope of achieving equivalent speedup.

Find the test set Intel uses, test it; if your implementation isn't any faster develop a new test set. It is much less effort to change the test than it is to beat a test designed to produce a specific result.

You can prove anything with statistics; you just have to phrase the question in the correct way. Google "Darrell Huff" and "statistics"; you will find a book my father gave me to read around 1976 when I started doing PM+Stats (A level) at school. I suspect he didn't understand it, but I did and it has been far more useful to me than anything I learnt at school.


glennrp commented 7 years ago

Here are my results for Earth10k.png with an instrumented pngcrush. Note that there is some speedup, but a much larger payoff is obtained by running "pngcrush -speed" on the file to avoid using the AVG and PAETH filters.

There is nothing surprising here.

Glenn

Timing tests for firefox linux earth10k (not optimized)
Linux dlib-debian-le-1 4.9.0-1-powerpc64le #1 SMP Debian 4.9.6-3 (2017-01-28) ppc64le GNU/Linux
gcc (Debian 4.9.2-10) 4.9.2
Earth10k.png CPU time decode 2.633772, total 2.801144 sec
Earth10k_fast.png CPU time decode 1.898273, total 2.006459 sec
Earth10k_slow.png CPU time decode 2.601996, total 2.712496 sec

Timing tests for firefox linux earth10k (powerpc-vsx optimized)
Linux dlib-debian-le-1 4.9.0-1-powerpc64le #1 SMP Debian 4.9.6-3 (2017-01-28) ppc64le GNU/Linux
gcc (Debian 4.9.2-10) 4.9.2
Earth10k.png CPU time decode 2.458045, total 2.623693 sec
Earth10k_fast.png CPU time decode 1.841627, total 1.947361 sec
Earth10k_slow.png CPU time decode 2.434269, total 2.544447 sec

-rw-r--r-- 1 debian debian 72360235 Feb 22 16:53 Earth10k.png
-rw-r--r-- 1 debian debian 63285602 Feb 22 18:03 Earth10k_fast.png
-rw-r--r-- 1 debian debian 66843433 Feb 22 18:13 Earth10k_slow.png

Earth10k.png was provided by debian
Earth10k_fast is from "pngcrush -brute -speed -force Earth10k.png"
Earth10k_slow is from "pngcrush -force Earth10k.png"
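To make the comparison in these numbers explicit, here is a small sketch that computes, from the decode times reported above, how much time the VSX build saves on each file and how much is saved by re-filtering the image with "pngcrush -brute -speed -force" on the non-optimized build. The variable and function names are hypothetical; only the timings come from this comment.

```python
# Decode CPU times in seconds: (not optimized, powerpc-vsx optimized).
decode_seconds = {
    "Earth10k.png": (2.633772, 2.458045),
    "Earth10k_fast.png": (1.898273, 1.841627),
    "Earth10k_slow.png": (2.601996, 2.434269),
}


def percent_saved(before, after):
    """Percentage of the 'before' time that is no longer spent."""
    return (before - after) / before * 100.0


# Gain from the VSX code on each file.
simd_gain = {
    name: percent_saved(plain, vsx) for name, (plain, vsx) in decode_seconds.items()
}

# Gain from avoiding AVG and PAETH filters, measured on the non-optimized build.
refilter_gain = percent_saved(
    decode_seconds["Earth10k.png"][0], decode_seconds["Earth10k_fast.png"][0]
)

for name, gain in simd_gain.items():
    print(f"{name}: VSX saves {gain:.1f}% of decode time")
print(f"re-filtering saves {refilter_gain:.1f}% of decode time")
```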


pnggroup / libpng

PowerPC Power8 VSX SIMD optimized filter functions #136