I've got that almost working. There are some nasty endianness issues in there,
and the code depends on a vector division function provided by Apple's vecLib
framework, so it isn't portable to other OSes.
It works well in many cases, but in some the bottom part of the blur isn't
correct.
Original comment by Tobias.N...@gmail.com
on 4 Dec 2012 at 1:15
Interestingly, the currently used blur function wasn't meant to be used on big
endian architectures at all; there's a fallback in Blur.cpp that was meant to
be used unconditionally on big endian machines. That doesn't work anymore,
because IS_BIG_ENDIAN and WORDS_BIGENDIAN are no longer defined since the NSPR
dependencies were removed from many files. Deleting the includes for the NSPR
headers that used to define those macros has already caused quite a few
endianness regressions (see the AuroraFox patches; for at least one of those
issues the fix will not land in Aurora until 20).
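A guard built on the compiler's own builtins could stand in for the lost NSPR macros; a minimal sketch (the macro name here is only illustrative):

```cpp
// Minimal stand-in for the lost NSPR macros, derived from GCC builtins
// rather than NSPR headers. The macro name is illustrative only.
#if defined(__BIG_ENDIAN__) || \
    (defined(__BYTE_ORDER__) && defined(__ORDER_BIG_ENDIAN__) && \
     __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
#define BLUR_WORDS_BIGENDIAN 1
#endif
```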
In this case I think I've already found some nearly invisible artifacts in the
blurring; I noticed them when comparing the output (zoomed to the maximum) of
the vectorized code with that of the non-vectorized one.
Original comment by Tobias.N...@gmail.com
on 4 Dec 2012 at 1:28
I filed a bug with Mozilla (M818004) and a patch that makes the C code work
correctly on big endian machines.
The vectorized version is working now; some optimizations are still possible.
Here is the first working version. I don't feel a difference in speed though;
it might even be slower.
Original comment by Tobias.N...@gmail.com
on 4 Dec 2012 at 1:53
Attachments:
The actual code was missing. So here it comes.
Original comment by Tobias.N...@gmail.com
on 4 Dec 2012 at 2:12
Attachments:
We're doing a lot of unaligned loads to get the image data into vectors, and we
know this is worse on AltiVec than SSE, so maybe that's part of the problem.
(I'm not sure if that's avoidable, given the SSE version.)
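For context, the textbook AltiVec unaligned load costs two aligned loads plus a permute per vector, where SSE gets by with a single unaligned load instruction; roughly (a sketch, helper name is mine):

```cpp
#include <altivec.h>

// Classic AltiVec unaligned load: two aligned loads bracketing the
// address, merged through a vec_lvsl rotation mask. SSE does the same
// job with one movdqu, hence the relative cost on AltiVec.
static vector unsigned char
LoadUnaligned(const unsigned char *p)
{
    vector unsigned char lo   = vec_ld(0, p);    // block containing p
    vector unsigned char hi   = vec_ld(15, p);   // block containing p+15
    vector unsigned char mask = vec_lvsl(0, p);  // rotation by p & 15
    return vec_perm(lo, hi, mask);
}
```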
Original comment by classi...@floodgap.com
on 4 Dec 2012 at 2:28
That's true. And looking at that part of the code, I already see how to
significantly accelerate the unaligned loads the usual way (overhead of only
one initial extra load per sequence; four in this case, as there are four data
sequences to process). A sketch follows.
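Roughly this shape, under my own naming (a sketch, not the actual patch):

```cpp
#include <altivec.h>

// Streaming unaligned loads: realign nvec 16-byte vectors starting at
// src. One extra vec_ld up front; afterwards each iteration costs a
// single aligned load plus one vec_perm instead of two loads.
static void
RealignStream(const unsigned char *src, vector unsigned char *dst, int nvec)
{
    vector unsigned char perm = vec_lvsl(0, src); // rotation mask (src & 15)
    vector unsigned char prev = vec_ld(0, src);   // the one initial extra load
    for (int i = 0; i < nvec; i++) {
        vector unsigned char next = vec_ld(16 * (i + 1), src);
        dst[i] = vec_perm(prev, next, perm);      // realigned 16 bytes
        prev = next;                              // reuse next iteration
    }
}
```

Note that when the source happens to be 16-byte aligned, this always reads one vector past the end of the data, which is related to the aligned-location trouble described further down.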
Original comment by Tobias.N...@gmail.com
on 4 Dec 2012 at 2:37
Btw, on your Mozilla bug, you should set the review back to Tim Terriberry if
you want him to take one more pass, but since he already gave you an r+ I would
just ask someone to push it to try.
Original comment by classi...@floodgap.com
on 4 Dec 2012 at 2:37
There's something going wrong when I try to optimize the unaligned load
sequences...
Original comment by Tobias.N...@gmail.com
on 4 Dec 2012 at 6:22
I fear we'll mark this "won't fix".
Speeding up the unaligned loads turned out to be too branchy. The actual
problem is that the fast unaligned loading doesn't work for aligned locations,
and alignment often changes during the loop. Testing for alignment at the
points where it can change and doing aligned or unaligned loads as needed
(deciding four times per loop iteration) made the function crawl, probably
because of the branchiness. We're processing four sequences of data, and a
part of each has to be loaded every loop iteration; each of those sequences
may change alignment individually at certain points of the loop.
A compromise solution that does one additional load in case the data turns out
to be aligned (most often it's indeed unaligned) takes at least twice as long
as the non-vectorized version.
Another version, which takes that whole problematic part from the
non-vectorized version (leaving the generation of the integral image
vectorized), is nearly as fast as, but consistently slower than, the
completely non-vectorized code.
I think the code does too little work with the data, so the overhead for
loading and storing dominates. Basically it's just loading, adding and storing
the data, plus some merging, packing and shifting. And last but maybe not
least there's that call to Apple's vecLib vector division function...
I attached a patch that is to be applied on top of the one from comment 5. It
contains the code for the most successful of the versions described above.
Original comment by Tobias.N...@gmail.com
on 5 Dec 2012 at 1:02
Attachments:
Does throwing a register declaration on any of the stuff in the loops help any?
I would think we could get more registers in play.
Original comment by classi...@floodgap.com
on 5 Dec 2012 at 2:33
I left that out this time to see whether gcc does that right by itself (it
should). From what I've seen so far, it does it well. I'll have a look at the
assembler code...
I finally found a fast way to make the accelerated unaligned load version
compatible with aligned locations. It's still at best half as fast as the
non-vectorized version. Maybe the division is too slow.
Original comment by Tobias.N...@gmail.com
on 5 Dec 2012 at 7:58
Attachments:
After some hacking it's now sometimes faster than the non-vectorized version.
I've implemented a rather poor but fast division taken from Apple's sample
code. It isn't very accurate and sometimes returns 0x0 instead of 0xFF, but
after all it's just for a visual effect. I hope it can be optimized further
(in speed and maybe in quality)!
The problem with the division function from the vecLib framework was that the
call to it made gcc use 8-10 global vector registers that had to be saved to
the stack before the call and restored afterwards. I didn't find a way to
avoid that other than by implementing a different (and faster) division.
Original comment by Tobias.N...@gmail.com
on 7 Dec 2012 at 10:06
When using the vecLib division function, gcc has to use the global registers
to preserve all vectors that have to survive the external function call; with
the optimized unaligned loads that's at least 8 vectors. With the
non-optimized unaligned loads no vectors have to survive the call, because
they don't reuse the results of previous unaligned loads but always
generate/load all necessary vectors from scratch.
The remaining division problem (0x0 instead of 0xFF) is most probably caused
by the conversion from unsigned to signed: the intrinsic used for the fast
division is only implemented for signed short vectors.
Original comment by Tobias.N...@gmail.com
on 8 Dec 2012 at 12:02
I'm thinking of handling the unsigned-to-signed conversion by right-shifting
the unsigned value one bit before the division and left-shifting the result
afterwards. We'd lose the least significant bit but preserve the most
significant one. That sounds better than clamping the unsigned value to the
highest signed value. Roughly as sketched below.
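A sketch of the idea; SignedDivide16 is a hypothetical stand-in for the signed-short division routine, and the divisor is assumed to fit in 15 bits (true for a blur box size):

```cpp
#include <altivec.h>

// Hypothetical stand-in for the signed-short division routine.
extern vector signed short SignedDivide16(vector signed short,
                                          vector signed short);

static vector unsigned short
DivPreservingMSB(vector unsigned short num, vector unsigned short den)
{
    vector unsigned short one = vec_splat_u16(1);
    // Halve the dividend so bit 15 can't leak into the sign bit.
    vector unsigned short half = vec_sr(num, one);
    vector signed short q = SignedDivide16((vector signed short)half,
                                           (vector signed short)den);
    // Double the quotient to restore the scale; its LSB is lost.
    return vec_sl((vector unsigned short)q, one);
}
```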
Original comment by Tobias.N...@gmail.com
on 8 Dec 2012 at 5:50
It seems 15-bit integers are not accurate enough for this case. I'll have to
resort to floating point division then.
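That fallback would look roughly like this per four float lanes (a sketch; the unpacking of the 16-bit lanes to floats and back is omitted, and the helper name is mine):

```cpp
#include <altivec.h>

// Approximate a / b for float vectors: vec_re gives a ~12-bit reciprocal
// estimate; one Newton-Raphson step brings it close to full precision.
static vector float
DivApprox(vector float a, vector float b)
{
    vector float y0   = vec_re(b);                    // estimate of 1/b
    vector float one  = vec_ctf(vec_splat_u32(1), 0); // 1.0f per lane
    vector float zero = (vector float)vec_splat_u32(0);
    vector float err  = vec_nmsub(b, y0, one);        // 1 - b*y0
    vector float y1   = vec_madd(y0, err, y0);        // refined 1/b
    return vec_madd(a, y1, zero);                     // a * (1/b)
}
```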
Original comment by Tobias.N...@gmail.com
on 9 Dec 2012 at 1:01
I can't get it faster than the scalar version.
I've now implemented both optimized and non-optimized unaligned loads, and
they don't show much difference, although the optimized version needs far
fewer vector loads.
There is also the option of doing the division in integer or in floating
point; that doesn't show much difference either.
There are also some inlining issues: when inlined, the optimized unaligned
loads use global vectors that have to be saved to the stack and restored, and
when not inlined the code doesn't work correctly (although it did at some
point).
So here are my final results.
Original comment by Tobias.N...@gmail.com
on 15 Dec 2012 at 11:29
Attachments:
Are we concluding that this isn't worth it then?
Original comment by classi...@floodgap.com
on 15 Dec 2012 at 6:52
Loop unrolling is the only thing left to try.
Original comment by Tobias.N...@gmail.com
on 15 Dec 2012 at 7:21
By aggressively unrolling one particular loop (there are LOTS of them in
there, but this is the only one where you can actually save some steps when
unrolled) I managed to get it faster than the unvectorized version by quite a
few percent!
Now I wonder whether unvectorizing the other loops could make it even faster...
There's still the final division loop left to unroll. That part is currently
done non-vectorized because the vectorized version was too slow. I don't
expect much there, though. The general shape of the unroll is sketched below.
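For illustration, the generic shape of such an unroll (a sketch, not the actual patch): independent accumulators let the adds pipeline instead of serializing on a single register.

```cpp
#include <altivec.h>

// 4x-unrolled vector sum with independent accumulators, so the vec_adds
// overlap in the pipeline rather than forming one long dependency chain.
static vector unsigned int
SumVectors(const vector unsigned int *v, int n) // n assumed divisible by 4
{
    vector unsigned int a0 = vec_splat_u32(0), a1 = a0, a2 = a0, a3 = a0;
    for (int i = 0; i < n; i += 4) {
        a0 = vec_add(a0, v[i]);
        a1 = vec_add(a1, v[i + 1]);
        a2 = vec_add(a2, v[i + 2]);
        a3 = vec_add(a3, v[i + 3]);
    }
    return vec_add(vec_add(a0, a1), vec_add(a2, a3));
}
```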
Original comment by Tobias.N...@gmail.com
on 19 Dec 2012 at 11:00
Attachments:
The division loop is now unrolled as well, and that final part is now slightly
faster than the unvectorized version.
The performance win is 5-40% in numbers, though I don't really feel any big
difference on the testing page http://blog.seanmartell.com/brands/ .
For very small areas to be blurred, like the blinking cursor (does this really
need blurring?), the vectorized version is 50% slower, but that's also true
for the vectorized character conversions. The absolute amount of time lost is
very small in those cases, so it shouldn't matter.
The attached patch still contains the benchmark code (ifdefed out), but
otherwise I consider it mature and ready for the unstable branch.
There's still room for improvement: while generating the integral image, which
is the part where we really gain speed, the potentially unaligned data is
first copied to aligned locations and then loaded into a vector. That might
gain some more speed by using the optimized unaligned loads the same way they
are done in the final division loop.
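The staging pattern in question looks roughly like this (a sketch, my naming), which is what the streaming vec_lvsl/vec_perm loads would replace:

```cpp
#include <altivec.h>
#include <string.h>

// Current integral-image approach as described above: stage the
// potentially unaligned bytes in an aligned scratch buffer, then do an
// aligned vec_ld. The memcpy is the overhead the optimized loads avoid.
static vector unsigned char
LoadViaScratch(const unsigned char *src)
{
    unsigned char scratch[16] __attribute__((aligned(16)));
    memcpy(scratch, src, 16);
    return vec_ld(0, scratch);
}
```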
Original comment by Tobias.N...@gmail.com
on 19 Dec 2012 at 11:03
Attachments:
I'm going to put this in my tree once I'm done with all the base stuff.
Original comment by classi...@floodgap.com
on 21 Dec 2012 at 4:47
Changing Factory::HasVMX to be hardcoded based on TENFOURFOX_VMX, since I
crossbuild for G3. Obviously not a concern for AuroraFox.
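Presumably along these lines (a sketch; the actual change may differ):

```cpp
// Report VMX availability from the build-time flag alone, so a G3
// cross-build never selects the vectorized paths at runtime.
bool Factory::HasVMX() {
#ifdef TENFOURFOX_VMX
  return true;
#else
  return false;
#endif
}
```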
Original comment by classi...@floodgap.com
on 12 Jan 2013 at 8:35
Actually, scrolling performance is smoother than it was before. Not massively,
but it does feel nicer. So I'd still call it a win.
Original comment by classi...@floodgap.com
on 12 Jan 2013 at 8:52
I still want to optimize those unaligned loads that are currently done by
copying to aligned locations first. As that's the most important part of the
whole thing, I expect some more speed from that.
Original comment by Tobias.N...@gmail.com
on 12 Jan 2013 at 9:06
Let's do that in a followup issue and close this one.
Original comment by classi...@floodgap.com
on 13 Jan 2013 at 5:30
Original issue reported on code.google.com by
classi...@floodgap.com
on 6 Nov 2012 at 9:33