keeleysam / tenfourfox

Automatically exported from code.google.com/p/tenfourfox

Convert NSPR and Chromium mutexes to kernel atomic operations #191

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Between 10 and 17, certain sites regressed mightily in performance on MP 
systems. The most egregious was ifixit, which renders fine on a 1GHz iMac G4 
and horribly slowly on a quad G5. Shark shows almost 47% of runtime spent 
waiting for semaphore locks. Mozilla changed JS from its own lock to NSPR locks 
in this timeframe; JS used to use libkern ATOMIC locks.

Stubbing the NSPR locks in JS to call libkern ATOMIC locks again, plus adding a 
small customized atomic set routine (ATOMIC in 10.4 does not have an atomic 
set, just an atomic compare-and-swap), restores almost all of the performance. 
This is being shipped in 17.0.

We should simply replace NSPR locks with libkern locks in the whole system, not 
just JavaScript, which should overcome this bottleneck particularly in NSS 
which uses them heavily. In particular, we should replace PR_ATOMIC_INCREMENT, 
_DECREMENT, _ADD and _SET with OSAtomicIncrement32Barrier, 
OSAtomicDecrement32Barrier, OSAtomicAdd32Barrier and TenFourFoxAtomicSet, and 
move TenFourFoxAtomicSet out of JS and into NSPR. This should land on 17.0.1 
and 18. It will probably marginally help uniprocessor systems, but it will 
definitely help multiprocessor machines.

Original issue reported on code.google.com by classi...@floodgap.com on 16 Nov 2012 at 3:40

GoogleCodeExporter commented 9 years ago
Mozilla is trying to purge locks in JS, period, in Fx19: 
https://bugzilla.mozilla.org/show_bug.cgi?id=772722

Still, we want this for the rest of the browser.

Original comment by classi...@floodgap.com on 16 Nov 2012 at 4:50

GoogleCodeExporter commented 9 years ago
And, looks like bug 675078 was to blame, which didn't reland until earlier this 
year. https://bugzilla.mozilla.org/show_bug.cgi?id=675078

Wonder why we didn't notice this in Fx12. Anyway.

Original comment by classi...@floodgap.com on 16 Nov 2012 at 4:54

GoogleCodeExporter commented 9 years ago
Undoubtedly some of this is also due to the poor performance of PR_Lock, which 
is implemented in terms of pthread_mutex, a known poor performer on OS X. So 
let's limit this to NSPR atomic operations for now.

Original comment by classi...@floodgap.com on 17 Nov 2012 at 2:36

GoogleCodeExporter commented 9 years ago
And, for that matter, the Atomics are also internally implemented in terms of 
pthread_mutex_lock, so simply removing Atomics from the rest of the source 
should improve things. There's a lot of PR_Lock in hashtables though.

Original comment by classi...@floodgap.com on 17 Nov 2012 at 2:40

GoogleCodeExporter commented 9 years ago
Hmm. In pratom.h,

#elif ((__GNUC__ > 4) || (__GNUC__ == 4 && __GNUC_MINOR__ >= 1)) && \

So it's only because we're still on gcc 4.0.1. This should "just work" on gcc 
4.6+. Thus this only needs to be done for 17.

Original comment by classi...@floodgap.com on 17 Nov 2012 at 3:10

GoogleCodeExporter commented 9 years ago
Here's another site that threads crash into each other on: www.kfiam640.com

Original comment by classi...@floodgap.com on 18 Nov 2012 at 2:03

GoogleCodeExporter commented 9 years ago
After patching this out, overhead drops a bit, but we're still spending ~40% of 
our time in semaphore lock on the affected sites. Shark looks like this:

    0.0%    95.2%   libSystem.B.dylib   _pthread_body   
    0.0%    90.5%   libnspr4.dylib   PR_Select  
    0.0%    61.9%   XUL   XRE_AddStaticComponent    
    0.0%    57.1%   XUL    0x3eef784 [176B] 
    0.0%    57.1%   XUL     XRE_AddStaticComponent  
    0.0%    47.6%   XUL      XRE_AddStaticComponent 
    0.0%    42.9%   libnspr4.dylib        PR_Wait   
    0.0%    42.9%   libnspr4.dylib         PR_WaitCondVar   
    0.0%    42.9%   libSystem.B.dylib           pthread_cond_wait   
    42.9%   42.9%   libSystem.B.dylib            semaphore_wait_signal_trap 
    0.0%    4.8%    libnspr4.dylib        PR_WaitCondVar    
    0.0%    4.8%    XUL      NS_CycleCollectorSuspect_P 
    0.0%    4.8%    XUL      0x299ec58 [788B]   
    0.0%    4.8%    libnspr4.dylib     PR_WaitCondVar   
    0.0%    9.5%    XUL   0x29b2afc [884B]  
    0.0%    4.8%    XUL   XRE_StartupTimelineRecord 
    0.0%    4.8%    XUL   js_GetScriptLineExtent(JSScript*) 
    0.0%    4.8%    XUL   js::IterateChunks(JSRuntime*, void*, void (*)(JSRuntime*, void*, js::gc::Chunk*)) 
    0.0%    4.8%    XUL   js::IndirectWrapper::toWrapper()  
    0.0%    4.8%    XUL  std::deque<FilePath, std::allocator<FilePath> >::_M_push_back_aux(FilePath const&) 
    0.0%    4.8%    firefox start   

Original comment by classi...@floodgap.com on 19 Nov 2012 at 3:37

GoogleCodeExporter commented 9 years ago
After review, we actually do want to keep using OS X internal atomics even 
after the compiler switch, because OS X self-optimizes the atomics on 
uniprocessor Macs and the gcc intrinsics don't. We will also have G3/G4 and G5 
versions of the atomic set (G3/G4 using eieio/isync, G5 using lwsync/isync).

Original comment by classi...@floodgap.com on 20 Nov 2012 at 6:17

GoogleCodeExporter commented 9 years ago
The fully barriered AtomicSet reduces V8 by nearly 10%. That's bad, and we can 
actually test JavaScript. So let's create a non-barrier AtomicSet and use that, 
and give JavaScript non-barriered OS X atomics also, and keep the memory 
barrier version for the rest of the browser.

Original comment by classi...@floodgap.com on 20 Nov 2012 at 2:53

GoogleCodeExporter commented 9 years ago
Initial version done for 17.0.1pre

Original comment by classi...@floodgap.com on 24 Nov 2012 at 10:48

GoogleCodeExporter commented 9 years ago
Did you notice that the existing but (currently) unused implementations of the 
NSPR atomics are non-barrier versions? (See 
nsprpub/pr/src/md/unix/os_Darwin_ppc.s.)
There is also an implementation wrapped around the OSAtomic functions, likewise 
without barriers. It's ifdef'ed to be used on ARM CPUs only. That one is 
actually in use -- only ppc (32-bit only), i386 and x86_64 use the gcc 
intrinsics instead of the existing machine-dependent implementations.

I really can't see (in the Aurora 19 sources) how, even with gcc 4.0.1 on 10.4, 
NSPR would build the PR_Lock code path. There must be something going wrong in 
nsprpub/pr/include/md/_darwin.h, where _PR_HAVE_ATOMIC_OPS gets defined, which 
should cause the atomics in nsprpub/pr/src/md/unix/os_Darwin_ppc.s to be used.

The attached patch makes NSPR use the OSAtomic functions on all OS X versions 
from 10.4 on, independent of the architecture, the way it was already done for 
ARM CPUs.
Furthermore it provides implementations of PR_StackPush and PR_StackPop, copied 
and adapted from their AIX counterparts, getting rid of a nasty dependency on 
PR_Lock.
I wonder though whether the non-barrier version is really the appropriate one. 
I made PR_StackPush and PR_StackPop use the barrier versions because the AIX 
version explicitly uses them even though non-barrier versions are also 
available there.

So far the browser works as always and performance is good. PR_StackPush and 
PR_StackPop are used quite frequently from what I see when setting breakpoints 
in them.

Original comment by Tobias.N...@gmail.com on 26 Nov 2012 at 10:37

Attachments:

GoogleCodeExporter commented 9 years ago
I'll definitely take the PR_Stack* ops, anything to reduce the use of PR_Lock. 
I'll have to study them later to see how they work, but I'm all for anything 
that gets us out of mutex hell.

I'm not as happy with the _MD_ATOMIC_SET: I'd rather use our assembler ones 
because we know exactly what is getting emitted. It doesn't get a lot simpler 
than lwarx/stwcx. in a loop. But if the browser seems okay with a non-barrier 
atomic_set, we could just use that everywhere. (It seems to be fine, but 
there's a lot of atomic usage in NSS and I really don't want to get that wrong 
-- maybe the best solution is to try to get NSPR and NSS testing working at the 
same time.)

Original comment by classi...@floodgap.com on 26 Nov 2012 at 10:48

GoogleCodeExporter commented 9 years ago
The advantage of _MD_ATOMIC_SET as implemented upstream is, however, that the 
kernel does its multi-/uniprocessor optimization. I knew you wouldn't be too 
happy with it.
The optimizations to AtomicSet should definitely go into 
nsprpub/pr/src/md/unix/os_Darwin_ppc.s:__PR_DarwinPPC_AtomicSet.
My patch also gets the NSPR tests working here. Just "(g)make runtests" in 
{build root}/nsprpub/pr/tests.

Original comment by Tobias.N...@gmail.com on 26 Nov 2012 at 11:01

GoogleCodeExporter commented 9 years ago
That's true (about the optimizations) but if we can get away with no barriers 
everywhere, then we don't need it because we'll already be working barrier-free.

I know about os_Darwin_ppc.s, but for some reason the build system was not 
picking up my changes, maybe a bug, which is why I implemented it in 
os_Darwin.s instead. I'll have to mess with that.

Looks like the PR_Stack* usage comes from NSPR's file descriptor cache, so that 
makes sense they get used a lot.

Original comment by classi...@floodgap.com on 26 Nov 2012 at 11:06

GoogleCodeExporter commented 9 years ago
Since we're already dealing with functions in the so-called commpage: the 
Objective-C/C++ direct-dispatch flag isn't fully implemented in gcc 4.6 and 
later. It generates calls to objc_msgSend_Fast and objc_assign_ivar_Fast, which 
are just stubs that Apple's gcc replaces with local branches to 0xFFFEFF00 or 
0xFFFEFEC0. I already looked into that, and it seems this can be implemented as 
builtins.

Original comment by Tobias.N...@gmail.com on 26 Nov 2012 at 11:28

GoogleCodeExporter commented 9 years ago
Is that needed to compile? I'm still ripping out the 4.0.1 stuff from the 17 
changesets, so I'm at least a week or two away from even trying to build 19.

Original comment by classi...@floodgap.com on 26 Nov 2012 at 11:32

GoogleCodeExporter commented 9 years ago
No, it's not needed as long as you don't pass the flag -fobjc-direct-dispatch 
to gcc. Direct dispatching was only used in the 32-bit Objective-C runtime, so 
that might be the reason it wasn't implemented in current gcc. It's just a 
matter of performance once again.

Original comment by Tobias.N...@gmail.com on 26 Nov 2012 at 11:36

GoogleCodeExporter commented 9 years ago
Ah, ok. We should spin that off into a separate issue.

Original comment by classi...@floodgap.com on 26 Nov 2012 at 11:38

GoogleCodeExporter commented 9 years ago
I'll open an issue once I know more - maybe I'll first try to get the WebKit 
JavaScript interpreter working...
The browser is working really well. It almost feels full speed, although the 
machine is building Aurora in the background. I haven't tested yet without the 
build running.

Original comment by Tobias.N...@gmail.com on 26 Nov 2012 at 11:44

GoogleCodeExporter commented 9 years ago
Failing lots of tests, though I don't know what the state was prior to this.

Original comment by classi...@floodgap.com on 27 Nov 2012 at 4:18

GoogleCodeExporter commented 9 years ago
(so far failing bigfile3, dlltest {even with your patch}, getai, libfilename, 
mbcs, and hangs up on nameshm1)

Original comment by classi...@floodgap.com on 27 Nov 2012 at 4:25

GoogleCodeExporter commented 9 years ago
Semaphores are failing bigtime; they just hang. I need to back this out and see 
if it passed to begin with.

Original comment by classi...@floodgap.com on 27 Nov 2012 at 4:38

GoogleCodeExporter commented 9 years ago
Nope, they don't. In fact, they just hang until they time out. This might have 
been the whole problem!

Hacking pr/src/pthreads/ptsynch.c, it seems that sem_otime never gets set to a 
non-zero value (but sem_ctime does). From bug 469744 
https://bugzilla.mozilla.org/show_bug.cgi?id=469744, it doesn't look like 
semaphores were *ever* properly tested on 10.4. Changing it to sem_ctime causes 
it to succeed, and all the semaphore tests now pass.

https://bugzilla.mozilla.org/show_bug.cgi?id=370420 implies that failure of 
some tests is a normal phenomenon, but that really sucks.

Original comment by classi...@floodgap.com on 27 Nov 2012 at 5:19

GoogleCodeExporter commented 9 years ago
Apparently I'm not going insane: 
http://lists.apple.com/archives/darwin-dev/2005/Mar/msg00147.html

Original comment by classi...@floodgap.com on 27 Nov 2012 at 5:25

GoogleCodeExporter commented 9 years ago
After thinking it over, the ctime hack is going to lead to a race condition 
sooner or later. Since we have the choice of POSIX or SysV semaphores on 10.4, 
I changed md/_darwin.h to POSIX semaphores, and the semaphore tests all pass. 
I'm going to leave the rest alone for now. I restored the PR_Stack* ops and 
we'll see how the browser behaves when it's done building.

Original comment by classi...@floodgap.com on 27 Nov 2012 at 5:47

GoogleCodeExporter commented 9 years ago
I forgot to mention the failing tests - I didn't mind the failures because the 
browser is obviously working well.
And I have to "make clean" after any change to the NSPR sources, since 
dependency tracking doesn't seem to work there.

Original comment by Tobias.N...@gmail.com on 27 Nov 2012 at 9:47

GoogleCodeExporter commented 9 years ago
Initial tests look good (with the change to PR_OpenSemaphore and your PR_Stack* 
ops, and removing all memory barriers). Obviously a lot less hanging in 
semaphores too. KFI is now bearable; I need to throw it through Shark again to 
see what's left to shake out, but this does look better. I'm going to do a more 
thorough test on it after work.

Original comment by classi...@floodgap.com on 27 Nov 2012 at 2:33

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
Looking into the xnu sources shows that the sem_otime bug was fixed in 10.5 
(http://fxr.watson.org/fxr/source/bsd/kern/sysv_sem.c?v=xnu-792.6.70 vs 
http://fxr.watson.org/fxr/source/bsd/kern/sysv_sem.c?v=xnu-1228).
Though there's a difference between using POSIX and SysV semaphores in the test 
results: with SysV semaphores, ranfile waits infinitely in pthread_cond_wait, 
and with POSIX semaphores, nonblock fails.

Original comment by Tobias.N...@gmail.com on 27 Nov 2012 at 9:49

GoogleCodeExporter commented 9 years ago
No, those two tests sometimes fail and sometimes pass, independently of SysV or 
POSIX semaphores. So both do work on 10.5.

Original comment by Tobias.N...@gmail.com on 27 Nov 2012 at 9:52

GoogleCodeExporter commented 9 years ago
If you use POSIX semaphores, does the browser still work on 10.5 or do you get 
stalls? It's going to be really hard to ship a browser that's dynamically 
selectable.

Original comment by classi...@floodgap.com on 27 Nov 2012 at 9:52

GoogleCodeExporter commented 9 years ago
(nvm, you just answered my question)

Original comment by classi...@floodgap.com on 27 Nov 2012 at 9:52

GoogleCodeExporter commented 9 years ago
When you search the whole source code for OSAtomic you will find quite a few 
uses of those functions - and all of them (except the upcoming, unused 
jemalloc 3) use the barrier versions.
That said, on an architecture with out-of-order instruction scheduling, a lock 
is a true lock only when it deals with that issue. For locking purposes you 
have to pay the price for using multiple CPUs and do the syncing and memory 
barriers.

I also found out that Chromium IPC uses OSAtomic only on Intel OS X - I changed 
that to apply to OS X independent of the CPU architecture.
I was surprised to find out that cairo apparently uses pthread mutexes: there 
is code to use system atomics, but the necessary defines aren't made anywhere. 
Is cairo used for anything more than an Azure backend fallback?

Original comment by Tobias.N...@gmail.com on 27 Nov 2012 at 11:59

GoogleCodeExporter commented 9 years ago
I'll fix the Chromium code. Cairo is used heavily for graphics still -- we 
don't have Azure content yet. Maybe we need to add atomic support there too.

However, assuming this quad G5 is the worst case for CPU on Power Mac, I 
haven't had any problems banging on it this afternoon. Web workers and multiple 
tabs loading are all rock solid. However, PR_Wait is still showing up a lot on 
kfi, so we haven't fixed its core issue though we've made a lot of other things 
a lot better.

Original comment by classi...@floodgap.com on 28 Nov 2012 at 1:23

GoogleCodeExporter commented 9 years ago
Actually, let's use barrier for Chromium IPC. If anything would break with a 
non-barrier sequence, it would be IPC.

Original comment by classi...@floodgap.com on 28 Nov 2012 at 1:37

GoogleCodeExporter commented 9 years ago
Well, here is what I did to the Chromium IPC.

Original comment by Tobias.N...@gmail.com on 28 Nov 2012 at 1:44

Attachments:

GoogleCodeExporter commented 9 years ago
Why does the x86.h file need to be deleted? 

Original comment by classi...@floodgap.com on 28 Nov 2012 at 1:47

GoogleCodeExporter commented 9 years ago
No, I renamed it (I just removed x86 from its name, because it's nonsense 
there) but that's not obvious in diffs.

Original comment by Tobias.N...@gmail.com on 28 Nov 2012 at 1:49

GoogleCodeExporter commented 9 years ago
I finally found out that CAIRO_NO_MUTEX is defined in 
gfx/cairo/cairo/src/cairo-platform.h, which short-circuits all mutexes in 
cairo. So there shouldn't be any problem in cairo at all.

Original comment by Tobias.N...@gmail.com on 28 Nov 2012 at 1:51

GoogleCodeExporter commented 9 years ago
Too bad about Cairo. Did you make any changes to the x86 file? I might just 
simplify it for testing by just including it, and we can tidy it up later (not 
that Chromium changes a lot, but just to reduce merge burden).

Original comment by classi...@floodgap.com on 28 Nov 2012 at 1:54

GoogleCodeExporter commented 9 years ago
No changes; just including the x86 header on ppc machines will work in that 
case.

Original comment by Tobias.N...@gmail.com on 28 Nov 2012 at 1:59

GoogleCodeExporter commented 9 years ago
I reimplemented PR_StackPush/Pop using OSSpinLock after examining the kernel 
code for it - it was made exactly for this purpose, and it applies memory 
barriers and does syncing on SMP systems.

#include <libkern/OSAtomic.h>

static OSSpinLock stackLock = OS_SPINLOCK_INIT;

PR_IMPLEMENT(void)
PR_StackPush(PRStack *stack, PRStackElem *stack_elem)
{
    OSSpinLockLock(&stackLock);
    stack_elem->prstk_elem_next = stack->prstk_head.prstk_elem_next;
    stack->prstk_head.prstk_elem_next = stack_elem;
    OSSpinLockUnlock(&stackLock);
    return;
}

PR_IMPLEMENT(PRStackElem *)
PR_StackPop(PRStack *stack)
{
    PRStackElem *element;

    OSSpinLockLock(&stackLock);
    element = stack->prstk_head.prstk_elem_next;
    if (element != NULL) {
        stack->prstk_head.prstk_elem_next = element->prstk_elem_next;
        element->prstk_elem_next = NULL;        /* debugging aid */
    }
    OSSpinLockUnlock(&stackLock);
    return element;
}

Original comment by Tobias.N...@gmail.com on 28 Nov 2012 at 2:20

GoogleCodeExporter commented 9 years ago
Landed spinlock change and Chromium atomics. Looks good so far.

Original comment by classi...@floodgap.com on 28 Nov 2012 at 4:04

GoogleCodeExporter commented 9 years ago
Okay, I'm pretty satisfied with how it's acting. QTE works fine, SSL and SPDY 
look good. Flipping over to build 7450. If that's good too, we'll build 7400 
and G3 and make this a second beta.

Original comment by classi...@floodgap.com on 28 Nov 2012 at 4:56

GoogleCodeExporter commented 9 years ago
I don't know if it's caused by this bug, but 
https://productforums.google.com/forum/#!categories/websearch loads veeeeery 
slowly (ca. 50 s.) in TFF 17.0 on my Mac 7450.

Original comment by jerzyglowacki on 30 Nov 2012 at 2:27

GoogleCodeExporter commented 9 years ago
No. The new Google Groups (which includes the product forums) is just extremely 
JavaScript heavy. In fact, the changes above do improve its performance, though 
it is still slow.

Original comment by classi...@floodgap.com on 30 Nov 2012 at 2:40

GoogleCodeExporter commented 9 years ago
No problems with that site here on Aurora 19 on a G4 7450 @ 1.5GHz, even when 
it's building Mozilla in the background.

Original comment by Tobias.N...@gmail.com on 30 Nov 2012 at 8:33

GoogleCodeExporter commented 9 years ago
Interesting, this ancient Mozilla bug: 
https://bugzilla.mozilla.org/show_bug.cgi?id=194339
Though our solution is far more portable - and NSPR itself is rather poorly 
maintained.

Original comment by Tobias.N...@gmail.com on 30 Nov 2012 at 5:32

GoogleCodeExporter commented 9 years ago

Original comment by classi...@floodgap.com on 30 Nov 2012 at 6:56

GoogleCodeExporter commented 9 years ago
Yeah, their code is actually a very poor performer because it has a sync right 
in the middle of it. Yours performs much better. Anyway, I'm going to close 
this since we're shipping the code, and look in more detail at the semaphore 
wait lag in issue 193.

Original comment by classi...@floodgap.com on 30 Nov 2012 at 6:57