I've got that almost working. There are some nasty endianness issues in there,
and the code depends on a vector division function provided by Apple's vecLib
framework, so it isn't portable to other OSes.
It works well in many cases, but in some the bottom part of the blur isn't
correct.
Original comment by Tobias.N...@gmail.com
on 4 Dec 2012 at 1:15
Interestingly, the currently used blur function wasn't meant to be used on big
endian architectures at all; there's a fallback in Blur.cpp that was meant to
be used unconditionally on big endian machines. That doesn't work anymore,
because IS_BIG_ENDIAN and WORDS_BIGENDIAN are no longer defined since the NSPR
dependencies were removed from many files. Deleting the includes for the NSPR
headers that used to define those macros has already caused quite a few
endianness regressions (see the AuroraFox patches; for at least one of those
issues the fix will not land in Aurora until 20).
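A guard built on the compiler's own builtins could stand in for the lost NSPR macros; a minimal sketch (the macro name here is only illustrative):

```cpp
// Minimal stand-in for the lost NSPR macros, derived from GCC builtins
// rather than NSPR headers. The macro name is illustrative only.
#if defined(__BIG_ENDIAN__) || \
    (defined(__BYTE_ORDER__) && defined(__ORDER_BIG_ENDIAN__) && \
     __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
#define BLUR_WORDS_BIGENDIAN 1
#endif
```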
In this case I think I've already found some nearly invisible artifacts in the
blurring; I noticed them when comparing the output (zoomed to the maximum) of
the vectorized code with that of the non-vectorized one.
Original comment by Tobias.N...@gmail.com
on 4 Dec 2012 at 1:28
I filed a bug with Mozilla (M818004) and a patch that makes the C code work
correctly on big endian machines.
The vectorized version is working now; some optimizations are still possible.
Here is the first working version. I don't feel a difference in speed though;
it might even be slower.
Original comment by Tobias.N...@gmail.com
on 4 Dec 2012 at 1:53
Attachments:
The actual code was missing. So here it comes.
Original comment by Tobias.N...@gmail.com
on 4 Dec 2012 at 2:12
Attachments:
We're doing a lot of unaligned loads to get the image data into vectors, and we
know this is worse on AltiVec than SSE, so maybe that's part of the problem.
(I'm not sure if that's avoidable, given the SSE version.)
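For context, the textbook AltiVec unaligned load costs two aligned loads plus a permute per vector, where SSE gets by with a single unaligned load instruction; roughly (a sketch, helper name is mine):

```cpp
#include <altivec.h>

// Classic AltiVec unaligned load: two aligned loads bracketing the
// address, merged through a vec_lvsl rotation mask. SSE does the same
// job with one movdqu, hence the relative cost on AltiVec.
static vector unsigned char
LoadUnaligned(const unsigned char *p)
{
    vector unsigned char lo   = vec_ld(0, p);    // block containing p
    vector unsigned char hi   = vec_ld(15, p);   // block containing p+15
    vector unsigned char mask = vec_lvsl(0, p);  // rotation by p & 15
    return vec_perm(lo, hi, mask);
}
```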
Original comment by classi...@floodgap.com
on 4 Dec 2012 at 2:28
That's true. And looking at that part of the code, I already see how to
significantly accelerate the unaligned loads the usual way (overhead of only
one initial extra load per sequence; four in this case, as there are four data
sequences to process). A sketch follows.
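Roughly this shape, under my own naming (a sketch, not the actual patch):

```cpp
#include <altivec.h>

// Streaming unaligned loads: realign nvec 16-byte vectors starting at
// src. One extra vec_ld up front; afterwards each iteration costs a
// single aligned load plus one vec_perm instead of two loads.
static void
RealignStream(const unsigned char *src, vector unsigned char *dst, int nvec)
{
    vector unsigned char perm = vec_lvsl(0, src); // rotation mask (src & 15)
    vector unsigned char prev = vec_ld(0, src);   // the one initial extra load
    for (int i = 0; i < nvec; i++) {
        vector unsigned char next = vec_ld(16 * (i + 1), src);
        dst[i] = vec_perm(prev, next, perm);      // realigned 16 bytes
        prev = next;                              // reuse next iteration
    }
}
```

Note that when the source happens to be 16-byte aligned, this always reads one vector past the end of the data, which is related to the aligned-location trouble described further down.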
Original comment by Tobias.N...@gmail.com
on 4 Dec 2012 at 2:37
Btw, on your Mozilla bug, you should set the review back to Tim Terriberry if
you want him to take one more pass, but since he already gave you an r+ I would
just ask someone to push it to try.
Original comment by classi...@floodgap.com
on 4 Dec 2012 at 2:37
There's something going wrong when I try to optimize the unaligned load
sequences...
Original comment by Tobias.N...@gmail.com
on 4 Dec 2012 at 6:22
I fear we'll mark this "won't fix".
Speeding up the unaligned loads turned out to be too branchy. The actual
problem is that the fast unaligned loading doesn't work for aligned locations,
and alignment often changes during the loop. Testing for alignment at the
points where it can change and doing aligned or unaligned loads as needed
(deciding four times per loop iteration) made the function crawl, probably
because of the branchiness. We're processing four sequences of data, and a
part of each has to be loaded every loop iteration; each of those sequences
may change alignment individually at certain points of the loop.
A compromise solution that does one additional load in case the data turns out
to be aligned (most often it's indeed unaligned) takes at least twice as long
as the non-vectorized version.
Another version, which takes that whole problematic part from the
non-vectorized version (leaving the generation of the integral image
vectorized), is nearly as fast as, but consistently slower than, the
completely non-vectorized code.
I think the code does too little work with the data, so the overhead for
loading and storing dominates. Basically it's just loading, adding and storing
the data, plus some merging, packing and shifting. And last but maybe not
least there's that call to Apple's vecLib vector division function...
I attached a patch that is to be applied on top of the one from comment 5. It
contains the code for the most successful of the versions described above.
Original comment by Tobias.N...@gmail.com
on 5 Dec 2012 at 1:02
Attachments:
Does throwing a register declaration on any of the stuff in the loops help any?
I would think we could get more registers in play.
Original comment by classi...@floodgap.com
on 5 Dec 2012 at 2:33
I left that out this time to see whether gcc does that right by itself (it
should). From what I've seen so far, it does it well. I'll have a look at the
assembler code...
I finally found a fast way to make the accelerated unaligned load version
compatible with aligned locations. It's still at best half as fast as the
non-vectorized version. Maybe the division is too slow.
Original comment by Tobias.N...@gmail.com
on 5 Dec 2012 at 7:58
Attachments:
After some hacking it's now sometimes faster than the non-vectorized version.
I've implemented a rather poor but fast division taken from Apple's sample
code. It isn't very accurate and sometimes returns 0x0 instead of 0xFF, but
after all it's just for a visual effect. I hope it can be optimized further
(in speed and maybe in quality)!
The problem with the division function from the vecLib framework was that the
call to it made gcc use 8-10 global vector registers that had to be saved to
the stack before the call and restored afterwards. I didn't find a way to
avoid that other than by implementing a different (and faster) division.
Original comment by Tobias.N...@gmail.com
on 7 Dec 2012 at 10:06
When using the vecLib division function, gcc has to use the global registers
to preserve all vectors that have to survive the external function call; with
the optimized unaligned loads that's at least 8 vectors. With the
non-optimized unaligned loads no vectors have to survive the call, because
they don't reuse the results of previous unaligned loads but always
generate/load all necessary vectors from scratch.
The remaining division problem (0x0 instead of 0xFF) is most probably caused
by the conversion from unsigned to signed: the intrinsic used for the fast
division is only implemented for signed short vectors.
Original comment by Tobias.N...@gmail.com
on 8 Dec 2012 at 12:02
I'm thinking of handling the unsigned-to-signed conversion by right-shifting
the unsigned value one bit before the division and left-shifting the result
afterwards. We'd lose the least significant bit but preserve the most
significant one. That sounds better than clamping the unsigned value to the
highest signed value. Roughly as sketched below.
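A sketch of the idea; SignedDivide16 is a hypothetical stand-in for the signed-short division routine, and the divisor is assumed to fit in 15 bits (true for a blur box size):

```cpp
#include <altivec.h>

// Hypothetical stand-in for the signed-short division routine.
extern vector signed short SignedDivide16(vector signed short,
                                          vector signed short);

static vector unsigned short
DivPreservingMSB(vector unsigned short num, vector unsigned short den)
{
    vector unsigned short one = vec_splat_u16(1);
    // Halve the dividend so bit 15 can't leak into the sign bit.
    vector unsigned short half = vec_sr(num, one);
    vector signed short q = SignedDivide16((vector signed short)half,
                                           (vector signed short)den);
    // Double the quotient to restore the scale; its LSB is lost.
    return vec_sl((vector unsigned short)q, one);
}
```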
Original comment by Tobias.N...@gmail.com
on 8 Dec 2012 at 5:50
It seems 15-bit integers are not accurate enough for this case. I'll have to
resort to floating point division then.
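That fallback would look roughly like this per four float lanes (a sketch; the unpacking of the 16-bit lanes to floats and back is omitted, and the helper name is mine):

```cpp
#include <altivec.h>

// Approximate a / b for float vectors: vec_re gives a ~12-bit reciprocal
// estimate; one Newton-Raphson step brings it close to full precision.
static vector float
DivApprox(vector float a, vector float b)
{
    vector float y0   = vec_re(b);                    // estimate of 1/b
    vector float one  = vec_ctf(vec_splat_u32(1), 0); // 1.0f per lane
    vector float zero = (vector float)vec_splat_u32(0);
    vector float err  = vec_nmsub(b, y0, one);        // 1 - b*y0
    vector float y1   = vec_madd(y0, err, y0);        // refined 1/b
    return vec_madd(a, y1, zero);                     // a * (1/b)
}
```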
Original comment by Tobias.N...@gmail.com
on 9 Dec 2012 at 1:01
I can't get it faster than the scalar version.
I've now implemented both optimized and non-optimized unaligned loads, and
they don't show much difference, although the optimized version needs far
fewer vector loads.
There is also the option of doing the division in integer or in floating
point; that doesn't show much difference either.
There are also some inlining issues: when inlined, the optimized unaligned
loads use global vectors that have to be saved to the stack and restored, and
when not inlined the code doesn't work correctly (although it did at some
point).
So here are my final results.
Original comment by Tobias.N...@gmail.com
on 15 Dec 2012 at 11:29
Attachments:
Are we concluding that this isn't worth it then?
Original comment by classi...@floodgap.com
on 15 Dec 2012 at 6:52
Loop unrolling is the only thing left to try.
Original comment by Tobias.N...@gmail.com
on 15 Dec 2012 at 7:21
By aggressively unrolling one particular loop (there are LOTS of them in
there, but this is the only one where you can actually save some steps when
unrolled) I managed to get it faster than the unvectorized version by quite a
few percent!
Now I wonder whether unvectorizing the other loops could make it even faster...
There's still the final division loop left to unroll. That part is currently
done non-vectorized because the vectorized version was too slow. I don't
expect much there, though. The general shape of the unroll is sketched below.
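For illustration, the generic shape of such an unroll (a sketch, not the actual patch): independent accumulators let the adds pipeline instead of serializing on a single register.

```cpp
#include <altivec.h>

// 4x-unrolled vector sum with independent accumulators, so the vec_adds
// overlap in the pipeline rather than forming one long dependency chain.
static vector unsigned int
SumVectors(const vector unsigned int *v, int n) // n assumed divisible by 4
{
    vector unsigned int a0 = vec_splat_u32(0), a1 = a0, a2 = a0, a3 = a0;
    for (int i = 0; i < n; i += 4) {
        a0 = vec_add(a0, v[i]);
        a1 = vec_add(a1, v[i + 1]);
        a2 = vec_add(a2, v[i + 2]);
        a3 = vec_add(a3, v[i + 3]);
    }
    return vec_add(vec_add(a0, a1), vec_add(a2, a3));
}
```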
Original comment by Tobias.N...@gmail.com
on 19 Dec 2012 at 11:00
Attachments:
The division loop is now unrolled as well, and that final part is now slightly
faster than the unvectorized version.
The performance win is 5-40% in numbers, though I don't really feel any big
difference on the testing page http://blog.seanmartell.com/brands/ .
For very small areas to be blurred, like the blinking cursor (does this really
need blurring?), the vectorized version is 50% slower, but that's also true
for the vectorized character conversions. The absolute amount of time lost is
very small in those cases, so it shouldn't matter.
The attached patch still contains the benchmark code (ifdefed out), but
otherwise I consider it mature and ready for the unstable branch.
There's still room for improvement: while generating the integral image, which
is the part where we really gain speed, the potentially unaligned data is
first copied to aligned locations and then loaded into a vector. That might
gain some more speed by using the optimized unaligned loads the same way they
are done in the final division loop.
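The staging pattern in question looks roughly like this (a sketch, my naming), which is what the streaming vec_lvsl/vec_perm loads would replace:

```cpp
#include <altivec.h>
#include <string.h>

// Current integral-image approach as described above: stage the
// potentially unaligned bytes in an aligned scratch buffer, then do an
// aligned vec_ld. The memcpy is the overhead the optimized loads avoid.
static vector unsigned char
LoadViaScratch(const unsigned char *src)
{
    unsigned char scratch[16] __attribute__((aligned(16)));
    memcpy(scratch, src, 16);
    return vec_ld(0, scratch);
}
```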
Original comment by Tobias.N...@gmail.com
on 19 Dec 2012 at 11:03
Attachments:
I'm going to put this in my tree once I'm done with all the base stuff.
Original comment by classi...@floodgap.com
on 21 Dec 2012 at 4:47
Changing Factory::HasVMX to be hardcoded based on TENFOURFOX_VMX, since I
crossbuild for G3. Obviously not a concern for AuroraFox.
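Presumably along these lines (a sketch; the actual change may differ):

```cpp
// Report VMX availability from the build-time flag alone, so a G3
// cross-build never selects the vectorized paths at runtime.
bool Factory::HasVMX() {
#ifdef TENFOURFOX_VMX
  return true;
#else
  return false;
#endif
}
```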
Original comment by classi...@floodgap.com
on 12 Jan 2013 at 8:35
Actually, scrolling performance is smoother than it was before. Not massively,
but it does feel nicer. So I'd still call it a win.
Original comment by classi...@floodgap.com
on 12 Jan 2013 at 8:52
I still want to optimize those unaligned loads that are currently done by
copying to aligned locations first. As that's the most important part of the
whole thing, I expect some more speed from that.
Original comment by Tobias.N...@gmail.com
on 12 Jan 2013 at 9:06
Let's do that in a followup issue and close this one.
Original comment by classi...@floodgap.com
on 13 Jan 2013 at 5:30
Original issue reported on code.google.com by
classi...@floodgap.com
on 6 Nov 2012 at 9:33