[meta] identify, look up and triage targets for SIMD AltiVec conversion

GoogleCodeExporter commented 9 years ago

Spinoff of issue 51 and issue 64. This will be a master bug from which others 
will be descended for converting already extant SIMD code into AltiVec 
(preferably SSE2 since it is also 128-bit).

1. qcms SSE2: 512865
Don't know why Mozilla has so many colour space converters; here's another one. 
Guts in gfx/qcms/transform-sse2.c (sse1 is for straight SSE). This is used a 
lot, so it would be a nice win, but is moderately complex. Fortunately it is 
already written as intrinsics.

2. gfx/thebes/gfxAlphaRecoverySSE2.cpp: 587936
Mostly for plugins. Probably not worth it to us.

3. content/base/src/nsTextFragmentSSE2.cpp; is-ascii in nsTextFragment could be 
AltiVec'd: 585978
        essentially looks to see if each short is > 255. need to
        verify there isn't an endian problem here. vec_any_ge is the
        intrinsic we want, as it returns 1 for anything out of range.
        this is SSE2 so the same algorithm will work (same bit width).
        PRUnichar is currently defined as an unsigned short (not wchar_t).

4. intl/uconv/src/nsUTF8ToUnicodeSSE2.cpp: 506430
This pretty much speeds up anything that does this conversion, which is pretty 
much everything. It already prealigns to 16 bytes, which should favour AltiVec 
heavily.

5. LossyConvertEncoding could be AltiVec'd: 586698
        this could be problematic with unaligned stores; the SSE2
        version explicitly stores unaligned. A first pass from Apple:
// Mostly safe to use with aligned and unaligned addresses
void StoreUnaligned( vector unsigned char src, unsigned char *target )
{
    vector unsigned char MSQ, LSQ, edges;
    vector unsigned char edgeAlign, align;

    MSQ = vec_ld(0, target); // most significant quadword
    LSQ = vec_ld(15, target); // least significant quadword
    edgeAlign = vec_lvsl(0, target); // permute map to extract edges
    edges=vec_perm(LSQ,MSQ,edgeAlign); // extract the edges
    align = vec_lvsr( 0, target ); // permute map to misalign data
    MSQ = vec_perm(edges,src,align); // misalign the data (MSQ)
    LSQ = vec_perm(src,edges,align); // misalign the data (LSQ)
    vec_st( LSQ, 15, target ); // Store the LSQ part first
    vec_st( MSQ, 0, target ); // Store the MSQ part
}
However, since they are contiguously misaligned, this is not helpful. We
need to get the first misaligned store fixed, then align thereafter.
Might be too big for 6, but fortunately if it breaks anything it would
pretty much break everything. Also not thread-safe, but unlikely to be a
big issue for this code.

3 and 4 seem good to start with.

Original issue reported on code.google.com by classi...@floodgap.com on 21 Jun 2011 at 10:24

Blocking: #208
Blocked on: #51, #64, #75, #76, #87, #98, #109, #163, #166, #189, #250, #251, #268

GoogleCodeExporter commented 9 years ago

Original comment by classi...@floodgap.com on 10 Jul 2011 at 11:30

GoogleCodeExporter commented 9 years ago

#3 and #4 are now issue 75 and issue 76.

Original comment by classi...@floodgap.com on 11 Jul 2011 at 1:01

GoogleCodeExporter commented 9 years ago

Original comment by classi...@floodgap.com on 30 Aug 2011 at 3:32

Changed state: Started

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

Can you spin that off to a separate issue? I'll look at it when I get a chance, 
but let's keep this as a meta issue only.

Original comment by classi...@floodgap.com on 12 Sep 2011 at 6:54

GoogleCodeExporter commented 9 years ago

#1 is now issue 87.

Original comment by Tobias.N...@gmail.com on 12 Sep 2011 at 9:35

GoogleCodeExporter commented 9 years ago

Original comment by classi...@floodgap.com on 14 Sep 2011 at 6:51

GoogleCodeExporter commented 9 years ago

I think I'll give LossyConvertEncoding a try. It's converted easily and we 
simply might do everythin unvectorized when the destination buffer happens to 
be unaligned.

Original comment by Tobias.N...@gmail.com on 13 Oct 2011 at 10:20

GoogleCodeExporter commented 9 years ago

Feel free -- my main project right now is methodjit, so it's all yours if you 
have an algorithm in mind. Spin off another issue and block this one on it.

Original comment by classi...@floodgap.com on 14 Oct 2011 at 4:17

GoogleCodeExporter commented 9 years ago

Original comment by Tobias.N...@gmail.com on 16 Oct 2011 at 12:37

GoogleCodeExporter commented 9 years ago

Original comment by Tobias.N...@gmail.com on 16 Oct 2011 at 12:43

GoogleCodeExporter commented 9 years ago

Original comment by Tobias.N...@gmail.com on 16 Oct 2011 at 1:14

GoogleCodeExporter commented 9 years ago

LossyConvertEncoding is now in issue 98

Original comment by Tobias.N...@gmail.com on 16 Oct 2011 at 1:15

GoogleCodeExporter commented 9 years ago

I've started to port all that Altivec acceleration routines to linux. Obviously 
accelereated JPEG decoding cannot be ported.

To do runtime Altivec detection based on CPU capabilities I use an Altivec 
detection function borrowed from freevec.org .

If we'd do the same for Mac OS X there would no longer be a need to 
differentiate between G3 and G4-7400/7410 builds since apart from Altivec those 
two are treated identically by gcc. Would that be worth it?

Original comment by Tobias.N...@gmail.com on 5 Nov 2011 at 10:00

GoogleCodeExporter commented 9 years ago

Doesn't 7400 have different scheduling?

Original comment by classi...@floodgap.com on 5 Nov 2011 at 11:27

GoogleCodeExporter commented 9 years ago

There's one sole file in gcc describing scheduling for 750 and 7400: 
"config/rs6000/7xx.md"

Apart from Altivec only the dmul opcode (should be multiplication of two 
doubles) has different scheduling.
That difference shouldn't be noticable.

Original comment by Tobias.N...@gmail.com on 6 Nov 2011 at 12:11

GoogleCodeExporter commented 9 years ago

Okay. I'm going to do a little research on it.

I'm not opposed to it in principle, and certainly having one less build to test 
and generate is a plus, but I also want to make sure it won't trip up our less 
techy users. The 7400 vs 7450 issue is already confusing to beginners, though I 
think most people have figured it out by now.

For the moment let's continue with compile-time determination and then revisit 
this.

Original comment by classi...@floodgap.com on 6 Nov 2011 at 12:15

GoogleCodeExporter commented 9 years ago

The linux port is now working well - linux runtime Altivec detection also 
working fine.
I didn't yet port our changes to libvpx but I'm quite sure it will need to be 
PIC compatible for real so I'll delay that. And debian doesn't use the in-tree 
version of libvpx anyway.

I also added the Apple recommended sysctl-based Altivec runtime detection for 
Mac OS X to the code but didn't test it yet. The way it's implemented at the 
moment Altivec support is always compiled for PowerPC builds but activated only 
based on the result of the runtime Altivec support detection. The detection 
itself is only run once. That's the same behaviour as for x86 CPUs.

Now should I file the changeset to mozilla and/or the debian package maintainer?

Original comment by Tobias.N...@gmail.com on 6 Nov 2011 at 2:59

GoogleCodeExporter commented 9 years ago

I'd file it with Mozilla; Debian will pick up the changes downstream. I'm not 
sure who the reviewer would be, though, since our stuff touches a large number 
of modules. I'd go one at a time for each one, possibly see who reviewed the 
SSE2 version and ask if they would be willing.

Original comment by classi...@floodgap.com on 6 Nov 2011 at 3:03

GoogleCodeExporter commented 9 years ago

(If you want, I can "pre-review" them to make sure all the nitpicky stuff is 
there, like license boilerplate and correct author attributions.)

Original comment by classi...@floodgap.com on 6 Nov 2011 at 3:08

GoogleCodeExporter commented 9 years ago

Yes, pre-reviewing would be nice!

And once the changes are accepted in mozilla-central it needs three releases 
until they reach the end users?

As you see in the attached patch file I modified the debian iceweasel-7.0.1 
packages and don't even know if the patch would apply to mozilla-central 
without problems.

Original comment by Tobias.N...@gmail.com on 6 Nov 2011 at 6:29

Attachments:

Altivec-enhancements.patch

GoogleCodeExporter commented 9 years ago

First, on the auto-select for us: it dawned on me there's a good reason to keep 
it as it is, and that is because the G3 build is guaranteed to have no AltiVec 
code in it even on an AltiVec-enabled system. This gives us a reference build 
that will always run AltiVec free. So temporarily I'm going to short circuit 
the code. However, I think you *should* submit them to Mozilla with that in 
them because they like single builds and their SSE2 and NEON code has runtime 
detection. The only thing I'd make sure of is that the build doesn't crap out 
on a G3 if at all possible (i.e., a G3 should still be able to build it).

On the new files in your patch: I haven't observed house style in the ones I 
originated because I hadn't intended to send them to Mozilla at that time. On 
those, make sure of the following:

- I use CTRL-I tabs, but they generally like 4-space tabs
- Make sure the MPL boilerplate and vim sequence is at the beginning
- Proper attribution. The original code is not Mozilla Foundation code, it's 
either you, me or both, depending on which one of us worked on it.
- I put in funny comments but Mozilla doesn't share my sense of humour 
typically, so you might consider making these more "bland" :P

Original comment by classi...@floodgap.com on 7 Nov 2011 at 3:58

GoogleCodeExporter commented 9 years ago

Ooops, hit save too soon. When we pick up the runtime code, I'm going to 
replace it with #ifdefs in our local changesets, but we don't need to have 
TENFOURFOX_* specific stuff in the Moz tree, of course.

I don't know Debian's schedule. If they only follow mozilla-release, yes, it 
will take three cycles to get from mozilla-inbound to mozilla-release. We track 
mozilla-beta, so it typically takes two.

Original comment by classi...@floodgap.com on 7 Nov 2011 at 3:59

GoogleCodeExporter commented 9 years ago

By modifying xpcom/glue/VMX.h it should be very easy to short cut the 
detection. And as qcms has it's own pure C detection (for x86 it has as well) 
it'd have to be done there as well (gfx/qmcs/transform.c)

I did never test the Mac OS X detection code. Did you do so?

Unfortunately my external hard drive enclosure decided to die minutes after I 
uploaded the changeset. So now I have to get a new one.

And as the PB Wallstreet G4 upgrade board started to show some soldering 
problems it now has to run either with L2 cache switched off (very slow) or 
with the original G3 upgrade card (very slow as well), and therefore I cannot 
really use it as a build platform any more...

Original comment by Tobias.N...@gmail.com on 7 Nov 2011 at 8:30

GoogleCodeExporter commented 9 years ago

Wow, that sucks! :(

I have not, but it looks correct. Should be straightforward enough to wire that 
into a simple little .c test file. I'll do that a bit later, since it seems 
like your system is out of commission for right now.

Original comment by classi...@floodgap.com on 8 Nov 2011 at 2:48

GoogleCodeExporter commented 9 years ago

Another target which appeared in Fx9. Not as big a win but seems to be 
straightforward to implement.

Original comment by classi...@floodgap.com on 20 Nov 2011 at 3:04

GoogleCodeExporter commented 9 years ago

All five extant issues are landed in 9.

Original comment by classi...@floodgap.com on 23 Nov 2011 at 9:56

GoogleCodeExporter commented 9 years ago

IMO this is a more correct place for the problems with AltiVec text conversions 
in TFF 9.

What I implemented in a different way than the SSE versions of UTF8Utils and 
nsUTF8ToUnicode is that the pointers are only updated once upon exiting the 
functions. Changing that back to update them every loop iteration might be 
worth a try and might already be threadsafe - but I don't know much about 
threadsafety.

Original comment by Tobias.N...@gmail.com on 30 Nov 2011 at 5:38

GoogleCodeExporter commented 9 years ago

http://developer.apple.com/hardwaredrivers/ve/alignment.html implies that 
unaligned stores, being non-atomic, are not thread-safe. I guess we could mutex 
it, but that would be technically disgusting and may not be faster. I think we 
just have to hope that the OS gives us aligned storage in most cases unless you 
can think of another way.

Original comment by classi...@floodgap.com on 30 Nov 2011 at 5:47

GoogleCodeExporter commented 9 years ago

Because of that warning about unaligned stores there aren't any in our code (or 
am I wrong?) but instead we load from unaligned locations which is thread safe 
AFAIK,

Original comment by Tobias.N...@gmail.com on 30 Nov 2011 at 6:10

GoogleCodeExporter commented 9 years ago

Hmmm... that Apple doc also says the following in the section about unaligned 
stores:
"There is no way to do an atomic misaligned AltiVec load or store without 
falling back on other atomic synchronization primitives like a mutex or the 
operations in OSAtomic.h."
But in the section about unaligned loads this isn't mentioned and reading it 
you get the impression that unaligned reads are completely safe.

Original comment by Tobias.N...@gmail.com on 30 Nov 2011 at 6:24

GoogleCodeExporter commented 9 years ago

Found this fun way of detecting AltiVec without sysctl. A little scary, but 
clever. (Ignore the Gestalt code, it's just for comparison.)

+#if idppc_altivec && !MACOS_X
+/* This is the brute force way of detecting instruction sets...
+   the code is borrowed from SDL, which got the idea from the libmpeg2
+   library - thanks!
+ */
+#include <signal.h>
+#include <setjmp.h>
+static jmp_buf jmpbuf;
+static void illegal_instruction(int sig)
+{
+    longjmp(jmpbuf, 1);
+}
+#endif
+
 static void Sys_DetectAltivec(void)
 {
   // Only detect if user hasn't forcibly disabled it.
+  qboolean altivec = qfalse;
   if (com_altivec->integer) {
 #if idppc_altivec
+    #ifdef MACOS_X
     long feat = 0;
     OSErr err = Gestalt(gestaltPowerPCProcessorFeatures, &feat);
     if ((err=noErr) && ((1 << gestaltPowerPCHasVectorInstructions) & feat)) {
+        altivec = qtrue;
     }
+    #else
+    void (*handler)(int sig);
+    handler = signal(SIGILL, illegal_instruction);
+    if ( setjmp(jmpbuf) = 0 ) {
+        asm volatile ("mtspr 256, %0\n\t"
+                      "vand %%v0, %%v0, %%v0"
+                        :
+                        : "r" (-1));
+        altivec = qtrue;
+    }
+    signal(SIGILL, handler);
+    #endif
+
+    if (!altivec) {
+      Cvar_Set( "com_altivec", "0" );  // we don't have it! Disable support!
+    }
 #endif

Original comment by classi...@floodgap.com on 18 Jan 2012 at 12:32

GoogleCodeExporter commented 9 years ago

*from http://web.archiveorange.com/archive/v/QCRpizEPgO8aYzq2LPcG

Original comment by classi...@floodgap.com on 18 Jan 2012 at 12:33

GoogleCodeExporter commented 9 years ago

I had found this code as well but I also found out that that way of detecting 
Altivec can cause strange problems.

Original comment by Tobias.N...@gmail.com on 18 Jan 2012 at 4:05

GoogleCodeExporter commented 9 years ago

I agree it's a bit freaky, but I like the illegal instruction concept in 
theory. Probably something about the longjmp or improperly handling SIGILL.

Original comment by classi...@floodgap.com on 18 Jan 2012 at 4:16

GoogleCodeExporter commented 9 years ago

That's what warns about using SIGILL based Altivec detection:
https://bugzilla.redhat.com/show_bug.cgi?id=435771

Nevertheless the SIGILL version might work perfectly in TFF.

Original comment by Tobias.N...@gmail.com on 18 Jan 2012 at 4:46

GoogleCodeExporter commented 9 years ago

It's time to make libpixman even more awesome.

Original comment by classi...@floodgap.com on 5 Jun 2012 at 9:51

GoogleCodeExporter commented 9 years ago

libvpx IDCT has to be extended (or the plain C version would have to be used).

Original comment by Tobias.N...@gmail.com on 12 Jun 2012 at 10:22

Now blocked on: #166

GoogleCodeExporter commented 9 years ago

Original comment by Tobias.N...@gmail.com on 12 Jun 2012 at 10:23

GoogleCodeExporter commented 9 years ago

We might want to think about gfxAlphaRecovery due to bug 741682.

Original comment by classi...@floodgap.com on 23 Jul 2012 at 1:47

Now blocked on: #51, #64, #75, #76, #87, #98, #109, #163, #166

GoogleCodeExporter commented 9 years ago

I don't know why that happened (I guess Google Code changing blocking flags), 
sorry about the spam.

Original comment by classi...@floodgap.com on 23 Jul 2012 at 3:20

GoogleCodeExporter commented 9 years ago

Original comment by classi...@floodgap.com on 6 Nov 2012 at 9:33

Now blocked on: #189

GoogleCodeExporter commented 9 years ago

I began contributing our patches back to mozilla.
M817033 (https://bugzilla.mozilla.org/show_bug.cgi?id=817033) is the meta bug 
tracking all the others.

Original comment by Tobias.N...@gmail.com on 30 Nov 2012 at 9:13

GoogleCodeExporter commented 9 years ago

files that contain SSE SIMD fast paths we didn't port yet:
=== code that IS (probably) in use ===
 * gfx/2d/convolver.cpp (SSE2 code currently disabled; see comment)
 * gfx/2d/BlurSSE2.cpp (that's our issue 189, seems to be important)
 * gfx/2d/ImageScalingSSE2.cpp (seems to be important as well)
 * gfx/thebes/gfxAlphaRecoverySSE2.cpp (probably also important)

=== code that may get used in the future ===
 * media/libsoundtouch/src/sse_optimized.cpp (an upcoming sound processing library; might become important)

=== code that is not used for general browsing ===
 * media/libvorbis/lib/os.h (used to convert a double to an int)
 * media/libopus/celt/float_cast.h (float to int conversion once more)
 * media/webrtc/trunk/src/modules/audio_processing/aec/aec_core_sse2.c (webrtc is not important for general browser speed)
 * media/webrtc/trunk/src/modules/audio_processing/aec/aec_rdft*
 * media/webrtc/trunk/src/modules/video_processing/main/source/content_analysis_sse2.cc

=== code that isn't in use actually ===
 * gfx/angle/src/libGLESv2/TextureSSE2.cpp (even for WebGL I don't think ANGLE is used on OS X)
 * gfx/skia/src/opts/SkBitmapProcState_opts_SSE* (skia is not normally used on OS X)
 * gfx/skia/src/opts/SkBlitRect_opts_SSE*
 * gfx/skia/src/opts/SkBlitRow_opts_SSE*
 * gfx/skia/src/opts/SkUtils_opts_SSE*

Original comment by Tobias.N...@gmail.com on 30 Nov 2012 at 11:38

GoogleCodeExporter commented 9 years ago

Just noticed that gfx/thebes/gfxAlphaRecoverySSE2.cpp was already listed and 
considered less important; might that also be used for video rendering using 
GStreamer?

Original comment by Tobias.N...@gmail.com on 30 Nov 2012 at 11:54

GoogleCodeExporter commented 9 years ago

It's possible it's been plugged into other things; I haven't really checked 
recently. ImageScalingSSE2.cpp could be helpful, but I thought we used a 
different algorithm on OS X (I could be wrong about this).

Original comment by classi...@floodgap.com on 1 Dec 2012 at 12:01

GoogleCodeExporter commented 9 years ago

And now that I look at bug 741682, we *should* support AlphaRecovery since it's 
going to be part of DLBI going forward. So maybe that should be our next goal.

Original comment by classi...@floodgap.com on 1 Dec 2012 at 12:04

GoogleCodeExporter commented 9 years ago

AlphaRecovery is not currently used. The source file is compiled but the XUL 
shared library doesn't contain its symbols. It is indeed used in 
gfxWindowsNativeDrawing (Windows) and gfxXlibNativeRenderer (X11) only.
ImageScaling similarly used by DrawTargetD2D (hence Windows) only.
convolver is compiled and linked into XUL, at least on 10.5, but I couldn't 
find any call to it (it ends up being used indirectly by gfx/2d/Scale but that 
itself isn't called anywhere).
Blur is indeed used a lot.

Original comment by Tobias.N...@gmail.com on 1 Dec 2012 at 10:24

keeleysam / tenfourfox

[meta] identify, look up and triage targets for SIMD AltiVec conversion #73