Closed GoogleCodeExporter closed 9 years ago
Mozilla is trying to purge locks in JS, period, in Fx19:
https://bugzilla.mozilla.org/show_bug.cgi?id=772722
Still, we still want this for the rest of the browser.
Original comment by classi...@floodgap.com
on 16 Nov 2012 at 4:50
And, looks like bug 675078 was to blame, which didn't reland until earlier this
year. https://bugzilla.mozilla.org/show_bug.cgi?id=675078
Wonder why we didn't notice this in Fx12. Anyway.
Original comment by classi...@floodgap.com
on 16 Nov 2012 at 4:54
Undoubtedly some of this is also due to the poor performance of PR_Lock, which
is implemented in terms of pthread_mutex and this is a known bad operator on OS
X. So let's limit this to NSPR atomic operations for now.
Original comment by classi...@floodgap.com
on 17 Nov 2012 at 2:36
And, for that matter, the Atomics are also internally implemented in terms of
pthread_mutex_lock, so simply removing Atomics from the rest of the source
should improve things. There's a lot of PR_Lock in hashtables though.
Original comment by classi...@floodgap.com
on 17 Nov 2012 at 2:40
Hmm. In pratom.h,
#elif ((__GNUC__ > 4) || (__GNUC__ == 4 && __GNUC_MINOR__ >= 1)) && \
So it's only because we're still on gcc 4.0.1. This should "just work" on gcc
4.6+. Thus this needs to only be done to 17.
Original comment by classi...@floodgap.com
on 17 Nov 2012 at 3:10
Here's another site that threads crash into each other on: www.kfiam640.com
Original comment by classi...@floodgap.com
on 18 Nov 2012 at 2:03
After patching this out, overhead drops a bit, but we're still spending ~40% of
our time in semaphore lock on the affected sites. Shark looks like this:
0.0% 95.2% libSystem.B.dylib _pthread_body
0.0% 90.5% libnspr4.dylib PR_Select
0.0% 61.9% XUL XRE_AddStaticComponent
0.0% 57.1% XUL 0x3eef784 [176B]
0.0% 57.1% XUL XRE_AddStaticComponent
0.0% 47.6% XUL XRE_AddStaticComponent
0.0% 42.9% libnspr4.dylib PR_Wait
0.0% 42.9% libnspr4.dylib PR_WaitCondVar
0.0% 42.9% libSystem.B.dylib pthread_cond_wait
42.9% 42.9% libSystem.B.dylib semaphore_wait_signal_trap
0.0% 4.8% libnspr4.dylib PR_WaitCondVar
0.0% 4.8% XUL NS_CycleCollectorSuspect_P
0.0% 4.8% XUL 0x299ec58 [788B]
0.0% 4.8% libnspr4.dylib PR_WaitCondVar
0.0% 9.5% XUL 0x29b2afc [884B]
0.0% 4.8% XUL XRE_StartupTimelineRecord
0.0% 4.8% XUL js_GetScriptLineExtent(JSScript*)
0.0% 4.8% XUL js::IterateChunks(JSRuntime*, void*, void (*)(JSRuntime*, void*, js::gc::Chunk*))
0.0% 4.8% XUL js::IndirectWrapper::toWrapper()
0.0% 4.8% XUL std::deque<FilePath, std::allocator<FilePath> >::_M_push_back_aux(FilePath const&)
0.0% 4.8% firefox start
Original comment by classi...@floodgap.com
on 19 Nov 2012 at 3:37
After review, we actually do want to keep using OS X internal atomics even
after the compiler switch because OS X self-optimizes the atomics on
uniprocessor macs and the gcc intrinsics don't. We will also have G3/G4 and G5
versions of the atomic set (G3/G4 using eieio/isync, G5 using lwsync/isync).
Original comment by classi...@floodgap.com
on 20 Nov 2012 at 6:17
The fully barriered AtomicSet reduces V8 by nearly 10%. That's bad, and we can
actually test JavaScript. So let's create a non-barrier AtomicSet and use that,
and give JavaScript non-barriered OS X atomics also, and keep the memory
barrier version for the rest of the browser.
Original comment by classi...@floodgap.com
on 20 Nov 2012 at 2:53
Initial version done for 17.0.1pre
Original comment by classi...@floodgap.com
on 24 Nov 2012 at 10:48
Did you notice that the existing but (currently) unused implementations of the
NSPR atomics are non barrier versions? (see
nsprpub/pr/src/md/unix/os_Darwin_ppc.s)
And there is also an implementation wrapped around OSAtomic functions, without
barriers as well. It's ifdef'ed to be used on ARM CPUs only. This one is
actually in use - only ppc (32 bit only), i386 and x86_64 use the gcc
intrinsics instead of the existing machine dependent implementations.
I really can't see (int the Aurora 19 sources) how even with gcc 4.0.1 on 10.4
NSPR would build the PR_Lock code path. There must be something going wrong in
nsprpub/pr/include/md/_darwin.h where _PR_HAVE_ATOMIC_OPS gets defined which
should cause the Atomics in nsprpub/pr/src/md/unix/os_Darwin_ppc.s to be used .
The attached patch makes NSPR use OSAtomic functions for all OS X versions from
10.4 on independent of the architecture, the way it was already implemented for
ARM CPUs.
Furthermore it provides implementations of PR_StackPush and PR_StackPop copied
and adapted from their AIX counterpart, getting rid of a nasty dependency on
PR_Lock.
I wonder though whether the non barrier version is really the appropriate one.
I made PR_StackPush and PR_StackPop use the barrier version because the AIX
version does explicitely use the barrier versions while non barrier versions
are also available there.
So far the browser works as always and performance is good. PR_StackPush and
PR_StackPop are used quite frequently from what I see when setting breakpoints
in them.
Original comment by Tobias.N...@gmail.com
on 26 Nov 2012 at 10:37
Attachments:
I'll definitely take the PR_Stack* ops, anything to reduce the use of PR_Lock.
I'll have to study them later to see how they work, but I'm all for anything
that gets us out of mutex hell.
I'm not as happy with the _MD_ATOMIC_SET: I'd rather use our assembler ones
because we know exactly what is getting emitted. It doesn't get a lot simpler
than lwarx/stwcx. in a loop. But if the browser seems okay with a non-barrier
atomic_set, we could just use that everywhere. (It seems to be fine, but
there's a lot of atomic usage in NSS and I really don't want to get that wrong
-- maybe the best solution is to try to get NSPR and NSS testing working at the
same time.)
Original comment by classi...@floodgap.com
on 26 Nov 2012 at 10:48
The advantage of _MD_ATOMIC_SET as it is implemented upstream is however that
the kernel does its multi-/uniprocessor optimization. I knew you won't be too
happy with it.
The optimizations to AtomicSet should definitely go into
nsprpub/pr/src/md/unix/os_Darwin_ppc.s:__PR_DarwinPPC_AtomicSet .
My patch also gets the NSPR tests working here. Just "(g)make runtests" in
{build root}/nsprpub/pr/tests .
Original comment by Tobias.N...@gmail.com
on 26 Nov 2012 at 11:01
That's true (about the optimizations) but if we can get away with no barriers
everywhere, then we don't need it because we'll already be working barrier-free.
I know about os_Darwin_ppc.s, but for some reason the build system was not
picking up my changes, maybe a bug, which is why I implemented it in
os_Darwin.s instead. I'll have to mess with that.
Looks like the PR_Stack* usage comes from NSPR's file descriptor cache, so that
makes sense they get used a lot.
Original comment by classi...@floodgap.com
on 26 Nov 2012 at 11:06
As we're already dealing with functions in the so called comm page. The
objective-C/C++ direct dispatch flag isn't fully implemented in gcc-4.6 and
later. It generates calls to objc_msgSend_Fast and objc_assign_ivar_Fast which
are just stubs that Apple gcc replaces with local branches to 0xFFFEFF00 or
0xFFFEFEC0. I already looked into that and it seems that can be implemented as
builtins.
Original comment by Tobias.N...@gmail.com
on 26 Nov 2012 at 11:28
Is that needed to compile? I'm still ripping out the 4.0.1 stuff from the 17
changesets, so I'm at least a week or two away from even trying to build 19.
Original comment by classi...@floodgap.com
on 26 Nov 2012 at 11:32
No it's not needed as long as you don't pass the flag -fobjc-direct-dispatch to
gcc. Direct dispatching was only used in the 32 bit Objective-C runtime so that
might be the reason it wasn't implemented in current gcc. It's just a matter of
performance once again.
Original comment by Tobias.N...@gmail.com
on 26 Nov 2012 at 11:36
Ah, ok. We should spin that off into a separate issue.
Original comment by classi...@floodgap.com
on 26 Nov 2012 at 11:38
I'll open an issue once I know more - maybe I'll first try to get the WebKit
java script interpreter working...
The browser is working really well. Almost feels full speed although the
machine is building Aurora in the background. Haven't tested yet without the
build running.
Original comment by Tobias.N...@gmail.com
on 26 Nov 2012 at 11:44
Failing lots of tests, though I don't know what the state was prior to this.
Original comment by classi...@floodgap.com
on 27 Nov 2012 at 4:18
(so far failing bigfile3, dlltest {even with your patch}, getai, libfilename,
mbcs, and hangs up on nameshm1)
Original comment by classi...@floodgap.com
on 27 Nov 2012 at 4:25
Semaphores are failing bigtime; they just hang. I need to back this out and see
if it passed to begin with.
Original comment by classi...@floodgap.com
on 27 Nov 2012 at 4:38
Nope, they don't. In fact, they just hang until they time out. This might have
been the whole problem!
Hacking pr/src/pthreads/ptsynch.c, it seems that sem_otime never gets set to a
non-zero value (but sem_ctime does). From bug 469744
https://bugzilla.mozilla.org/show_bug.cgi?id=469744, it doesn't look like
semaphores were *ever* properly tested on 10.4. Changing it to sem_ctime causes
it to succeed, and all the semaphore tests now pass.
https://bugzilla.mozilla.org/show_bug.cgi?id=370420 implies that failure of
some tests is a normal phenomenon, but that really sucks.
Original comment by classi...@floodgap.com
on 27 Nov 2012 at 5:19
Apparently I'm not going insane:
http://lists.apple.com/archives/darwin-dev/2005/Mar/msg00147.html
Original comment by classi...@floodgap.com
on 27 Nov 2012 at 5:25
After thinking it over, the ctime hack is going to lead to a race condition
sooner or later. Since we have the choice of POSIX or SysV semaphores in 10.4,
changed md/_darwin.h to POSIX semaphores, and the semaphore tests all pass. I'm
going to leave the rest alone for now. I restored the PR_Stack* ops and we'll
see how the browser behaves when it's done building.
Original comment by classi...@floodgap.com
on 27 Nov 2012 at 5:47
I forgot to mention the failing tests - I didn't mind the failures because the
browser is working well obviously.
And I have to "mske clean" after doing any change to NSPR sources since
dependency tracking doesn't seem to work there.
Original comment by Tobias.N...@gmail.com
on 27 Nov 2012 at 9:47
Initial tests look good (with the change to PR_OpenSemaphore and your PR_Stack*
ops, and removing all memory barriers). Obviously a lot less hanging in
semaphores too. KFI is now bearable; I need to throw it through Shark again to
see what's left to shake out, but this does look better. I'm going to do a more
thorough test on it after work.
Original comment by classi...@floodgap.com
on 27 Nov 2012 at 2:33
[deleted comment]
Looking into the xnu sources shows that the sem_otime bug was fixed in 10.5
(http://fxr.watson.org/fxr/source/bsd/kern/sysv_sem.c?v=xnu-792.6.70 vs
http://fxr.watson.org/fxr/source/bsd/kern/sysv_sem.c?v=xnu-1228).
Though there's a difference between using POSIX and SYSV semaphores in the test
results. With SYSV semaphores ranfile waits infinitely in pthread_wait_cond and
with POSIX semaphores nonblock fails.
Original comment by Tobias.N...@gmail.com
on 27 Nov 2012 at 9:49
No, those two tests sometimes fail and sometimes pass independently of SYSV or
POSIX semaphores. So both do work on 10.5 .
Original comment by Tobias.N...@gmail.com
on 27 Nov 2012 at 9:52
If you use POSIX semaphores, does the browser still work on 10.5 or do you get
stalls? It's going to be really hard to ship a browser that's dynamically
selectable.
Original comment by classi...@floodgap.com
on 27 Nov 2012 at 9:52
(nvm, you just answered my question)
Original comment by classi...@floodgap.com
on 27 Nov 2012 at 9:52
When you search the whole source code for OSAtomic you will find quite some
usages of those functions - and all of them (except the upcoming unused
jemalloc 3) use the barrier versions.
And that said, on an architecture with out of order instruction scheduling a
lock is a true lock only when it deals with that issue. For locking purposes
you have to pay the price for using multiple CPUs and do the syncing and memory
barriers.
I also found out that chromium IPC uses OSAtomic only on Intel OS X - I changed
that to apply to OS X systems independent of the CPU architecture.
I was surprised to find find out that cairo apparently uses pthread mutexes
although there is code to use system atomics but the necessary defines aren't
done anywhere. Is cairo used for anything more than a azure backend fallback?
Original comment by Tobias.N...@gmail.com
on 27 Nov 2012 at 11:59
I'll fix the Chromium code. Cairo is used heavily for graphics still -- we
don't have Azure content yet. Maybe we need to add atomic support there too.
However, assuming this quad G5 is the worst case for CPU on Power Mac, I
haven't had any problems banging on it this afternoon. Web workers and multiple
tabs loading are all rock solid. However, PR_Wait is still showing up a lot on
kfi, so we haven't fixed its core issue though we've made a lot of other things
a lot better.
Original comment by classi...@floodgap.com
on 28 Nov 2012 at 1:23
Actually, let's use barrier for Chromium IPC. If anything would break with a
non-barrier sequence, it would be IPC.
Original comment by classi...@floodgap.com
on 28 Nov 2012 at 1:37
Well, here is what I did to the Chromium IPC.
Original comment by Tobias.N...@gmail.com
on 28 Nov 2012 at 1:44
Attachments:
Why does the x86.h file need to be deleted?
Original comment by classi...@floodgap.com
on 28 Nov 2012 at 1:47
No, I renamed it (just removed x86 from its name because it's nonsense) but
that's not obvious in diffs.
Original comment by Tobias.N...@gmail.com
on 28 Nov 2012 at 1:49
I finally found out that in gfx/cairo/cairo/src/cairo-platform.h CAIRO_NO_MUTEX
is defined which short circuits all mutexes in cairo. So there should be any
problem in cairo at all.
Original comment by Tobias.N...@gmail.com
on 28 Nov 2012 at 1:51
Too bad about Cairo. Did you make any changes to the x86 file? I might just
simplify it for testing by just including it, and we can tidy it up later (not
that Chromium changes a lot, but just to reduce merge burden).
Original comment by classi...@floodgap.com
on 28 Nov 2012 at 1:54
No changes; just including the x86 header on ppc machines will work in that
case.
Original comment by Tobias.N...@gmail.com
on 28 Nov 2012 at 1:59
I reimplemented PR_StackPush/Pop using OSSpinLock after having examined the
kernel code for that - it was made exactly for that purpose, but it applies
memory barriers and does syncing on SMP systems.
#include <libkern/OSAtomic.h>
static OSSpinLock stackLock = OS_SPINLOCK_INIT;
PR_IMPLEMENT(void)
PR_StackPush(PRStack *stack, PRStackElem *stack_elem)
{
OSSpinLockLock(&stackLock);
stack_elem->prstk_elem_next = stack->prstk_head.prstk_elem_next;
stack->prstk_head.prstk_elem_next = stack_elem;
OSSpinLockUnlock(&stackLock);
return;
}
PR_IMPLEMENT(PRStackElem *)
PR_StackPop(PRStack *stack)
{
PRStackElem *element;
OSSpinLockLock(&stackLock);
element = stack->prstk_head.prstk_elem_next;
if (element != NULL) {
stack->prstk_head.prstk_elem_next = element->prstk_elem_next;
element->prstk_elem_next = NULL; /* debugging aid */
}
OSSpinLockUnlock(&stackLock);
return element;
}
Original comment by Tobias.N...@gmail.com
on 28 Nov 2012 at 2:20
Landed spinlock change and Chromium atomics. Looks good so far.
Original comment by classi...@floodgap.com
on 28 Nov 2012 at 4:04
Okay, I'm pretty satisfied with how it's acting. QTE works fine, SSL and SPDY
look good. Flipping over to build 7450. If that's good too, we'll build 7400
and G3 and make this a second beta.
Original comment by classi...@floodgap.com
on 28 Nov 2012 at 4:56
I don't know if it's caused by this bug, but
https://productforums.google.com/forum/#!categories/websearch loads veeeeery
slowly (ca. 50 s.) in TFF 17.0 on my Mac 7450.
Original comment by jerzyglowacki
on 30 Nov 2012 at 2:27
No. The new Google Groups (which includes the product forums) is just extremely
JavaScript heavy. In fact, the changes above do improve its performance, though
it is still slow.
Original comment by classi...@floodgap.com
on 30 Nov 2012 at 2:40
No problems with that site here on Aurora 19 on a G4-7450@1.5GHz , even when
it's building mozilla in the background.
Original comment by Tobias.N...@gmail.com
on 30 Nov 2012 at 8:33
Interesting this ancient mozilla bug:
https://bugzilla.mozilla.org/show_bug.cgi?id=194339
Though our solution is far more portable - and NSPR itself is rather poorly
maintained.
Original comment by Tobias.N...@gmail.com
on 30 Nov 2012 at 5:32
Original comment by classi...@floodgap.com
on 30 Nov 2012 at 6:56
Yeah, their code is actually a very poor performer because it has a sync right
in the middle of it. Yours is much higher performance. Anyway, I'm going to
close this since we're shipping the code, and look in more detail at the
semaphore wait lag in issue 193.
Original comment by classi...@floodgap.com
on 30 Nov 2012 at 6:57
Original issue reported on code.google.com by
classi...@floodgap.com
on 16 Nov 2012 at 3:40