It occurs to me that IPC layers could account for this.
Original comment by classi...@floodgap.com
on 3 Dec 2012 at 9:00
https://bugzilla.mozilla.org/show_bug.cgi?id=353962 shows a similar bug and how
it was tracked down. This isn't actually our issue (they fixed it in the same
way we fixed issue 191) but some of the analysis might give us some tips.
Original comment by classi...@floodgap.com
on 12 Jan 2013 at 3:54
On meethue.com,
Program received signal SIGINT, Interrupt.
0x904c0d00 in decode_meshed_mx8bpc_x_3s ()
(gdb) info threads
39 process 12175 thread 0x78c3 0x90053f88 in semaphore_timedwait_signal_trap ()
38 process 12175 thread 0xa027 0x90053f88 in semaphore_timedwait_signal_trap ()
37 process 12175 thread 0xc10f 0x90053f88 in semaphore_timedwait_signal_trap ()
33 process 12175 thread 0xb8d7 0x90053f88 in semaphore_timedwait_signal_trap ()
32 process 12175 thread 0x77ff 0x90053f88 in semaphore_timedwait_signal_trap ()
31 process 12175 thread 0xc203 0x90053f88 in semaphore_timedwait_signal_trap ()
30 process 12175 thread 0xb90b 0x9002bfc8 in semaphore_wait_signal_trap ()
29 process 12175 thread 0x9e0b 0x9002bfc8 in semaphore_wait_signal_trap ()
28 process 12175 thread 0xa317 0x9002bfc8 in semaphore_wait_signal_trap ()
27 process 12175 thread 0xa217 0x9002bfc8 in semaphore_wait_signal_trap ()
25 process 12175 thread 0x5043 0x9002bfc8 in semaphore_wait_signal_trap ()
24 process 12175 thread 0x9907 0x9002bfc8 in semaphore_wait_signal_trap ()
23 process 12175 thread 0x9d07 0x9002bfc8 in semaphore_wait_signal_trap ()
22 process 12175 thread 0x9c07 0x9002bfc8 in semaphore_wait_signal_trap ()
21 process 12175 thread 0x9607 0x9002bfc8 in semaphore_wait_signal_trap ()
20 process 12175 thread 0x8003 0x9002bfc8 in semaphore_wait_signal_trap ()
19 process 12175 thread 0x7b57 0x9002bfc8 in semaphore_wait_signal_trap ()
17 process 12175 thread 0x7903 0x9002bfc8 in semaphore_wait_signal_trap ()
15 process 12175 thread 0x7503 0x9002bfc8 in semaphore_wait_signal_trap ()
14 process 12175 thread 0x725f 0x9002bfc8 in semaphore_wait_signal_trap ()
13 process 12175 thread 0x7007 0x9002bfc8 in semaphore_wait_signal_trap ()
12 process 12175 thread 0x5c03 0x9002bfc8 in semaphore_wait_signal_trap ()
10 process 12175 thread 0x4703 0x9002bfc8 in semaphore_wait_signal_trap ()
9 process 12175 thread 0x4603 0x90053f88 in semaphore_timedwait_signal_trap ()
8 process 12175 thread 0x4403 0x90053f88 in semaphore_timedwait_signal_trap ()
7 process 12175 thread 0x4303 0x9002bfc8 in semaphore_wait_signal_trap ()
6 process 12175 thread 0x4203 0x9002bfc8 in semaphore_wait_signal_trap ()
5 process 12175 thread 0x3703 0x9002bfc8 in semaphore_wait_signal_trap ()
4 process 12175 thread 0x3603 0x9001f48c in select ()
3 process 12175 thread 0x3203 0x9002bfc8 in semaphore_wait_signal_trap ()
2 process 12175 thread 0x3103 0x9002ec8c in kevent ()
* 1 process 12175 local thread 0xf03 0x904c0d00 in decode_meshed_mx8bpc_x_3s ()
Good grief.
Original comment by classi...@floodgap.com
on 12 Jan 2013 at 4:25
It's behaving better now. I don't know what the difference is.
Original comment by classi...@floodgap.com
on 12 Jan 2013 at 4:45
Original comment by classi...@floodgap.com
on 29 Jan 2013 at 9:30
Despite benchmarks suggesting that pthread mutexes are very slow on OS X,
there are quite a few mailing list threads and benchmarks revealing that
Apple's pthread mutexes aren't fast in raw throughput but are instead heavily
tuned for real-world multithreading situations.
Furthermore, Apple makes extensive use of pthread mutexes in the system
libraries, especially where speed is important, at least in 10.5.
Here are two benchmarks (the second being an improved version of the first),
BUT more important than the numbers is the description(!):
http://cocoasamurai.blogspot.de/2008/04/osspinlock-lock-showdown-posix-locks-vs.html
http://perpendiculo.us/?p=133
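For comparison, here is a minimal single-threaded timing sketch in the spirit of those benchmarks, assuming OSSpinLock from libkern/OSAtomic.h and mach_absolute_time for timing. It only measures the uncontended fast path, which is exactly the limitation those posts warn about:

/* Sketch only: times ITERATIONS uncontended lock/unlock pairs for a
   pthread mutex and an OSSpinLock. */
#include <libkern/OSAtomic.h>
#include <pthread.h>
#include <mach/mach_time.h>
#include <stdint.h>
#include <stdio.h>

#define ITERATIONS 10000000

static uint64_t to_ns(uint64_t ticks) {
    mach_timebase_info_data_t tb;
    mach_timebase_info(&tb);
    return ticks * tb.numer / tb.denom;
}

int main(void) {
    pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    OSSpinLock spin = OS_SPINLOCK_INIT;
    uint64_t start;
    int i;

    start = mach_absolute_time();
    for (i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&mutex);
        pthread_mutex_unlock(&mutex);
    }
    printf("pthread_mutex: %llu ns total\n",
           (unsigned long long)to_ns(mach_absolute_time() - start));

    start = mach_absolute_time();
    for (i = 0; i < ITERATIONS; i++) {
        OSSpinLockLock(&spin);
        OSSpinLockUnlock(&spin);
    }
    printf("OSSpinLock:    %llu ns total\n",
           (unsigned long long)to_ns(mach_absolute_time() - start));
    return 0;
}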
Original comment by Tobias.N...@gmail.com
on 15 Feb 2013 at 7:44
I think it's pretty obvious we're hitting an edge case with www.kfiam640.com
since we don't hit it all the time. But do you see similar time stuck in
semaphore wait on 10.5 with AuroraFox when you run it against Shark?
Original comment by classi...@floodgap.com
on 15 Feb 2013 at 4:15
A standard time profile of the firefox binary (Aurora 20) looks very
different here.
I clicked through images and links while the profile was being created. The
browser was as responsive as always.
17.4% 17.4% mach_kernel vm_map_lookup_entry
10.7% 10.7% mach_kernel vm_map_enter
4.4% 4.4% mach_kernel ml_set_interrupts_enabled
2.6% 2.6% libmozjs.dylib JS_IterateCompartments(JSRuntime*, void*, void (*)(JSRuntime*, void*, JSCompartment*))
2.5% 2.5% mach_kernel hw_rem_map
1.8% 1.8% libobjc.A.dylib __bzero
1.8% 1.8% libmozjs.dylib JS_EnumerateDiagnosticMemoryRegions(int (*)(void*, unsigned long))
1.8% 1.8% mach_kernel hw_add_map
1.8% 1.8% libmozjs.dylib JS::PerfMeasurement::canMeasureSomething()
1.7% 1.7% mach_kernel hw_protect
1.7% 1.7% CoreGraphics vecCGSBlendXXXX8888
1.3% 1.3% libobjc.A.dylib __memcpy
1.3% 1.3% XUL _ZNSt6vectorIsSaIsEE19_M_emplace_back_auxIIsEEEvDpOT_
1.1% 1.1% libmozglue.dylib 0x22450 [21.9KB]
1.1% 1.1% libmozjs.dylib js_GetSrcNoteOffset(unsigned char*, unsigned int)
0.9% 0.9% XUL _ZNSt8_Rb_treeIjSt4pairIKjjESt10_Select1stIS2_ESt4lessIjESaIS2_EE17_M_insert_unique_IS2_EESt17_Rb_tree_iteratorIS2_ESt23_Rb_tree_const_iteratorIS2_EOT_
0.8% 0.8% CoreGraphics CGSFillDRAM8by1
0.7% 0.7% libobjc.A.dylib objc_msgSend_rtp
0.7% 0.7% libobjc.A.dylib __sys_icache_invalidate
0.6% 0.6% XUL xpc_LocalizeContext(JSContext*)
0.6% 0.6% XUL void std::__uninitialized_default_n_1<true>::__uninit_default_n<kinfo_proc*, unsigned long>(kinfo_proc*, unsigned long)
0.6% 0.6% libmozglue.dylib 0x28950 [12.2KB]
0.6% 0.6% libmozjs.dylib js::StringIsArrayIndex(JSLinearString*, unsigned int*)
Original comment by Tobias.N...@gmail.com
on 15 Feb 2013 at 9:28
Here's a profile of TFF 19 running on 10.5 while browsing www.kfiam640.com:
14.6% 14.6% mach_kernel vm_map_lookup_entry
7.6% 7.6% mach_kernel vm_map_enter
4.5% 4.5% mach_kernel ml_set_interrupts_enabled
3.9% 3.9% XUL js::UnmarkGrayGCThingRecursively(void*, JSGCTraceKind)
2.6% 2.6% XUL JS_EnumerateDiagnosticMemoryRegions(int (*)(void*, unsigned long))
2.3% 2.3% mach_kernel hw_rem_map
2.0% 2.0% XUL 0x42a005c [1.5KB]
2.0% 2.0% CoreGraphics vecCGSBlendXXXX8888
2.0% 2.0% XUL js_GetSrcNoteOffset(unsigned char*, unsigned int)
1.8% 1.8% mach_kernel hw_add_map
1.7% 1.7% libobjc.A.dylib __memcpy
1.5% 1.5% libobjc.A.dylib __bzero
1.4% 1.4% mach_kernel hw_protect
1.3% 1.3% XUL std::vector<float, std::allocator<float> >::_M_default_append(unsigned long)
1.1% 1.1% XUL js::StringIsArrayIndex(JSLinearString*, unsigned int*)
1.0% 1.0% libobjc.A.dylib __sys_icache_invalidate
1.0% 1.0% XUL xpc_LocalizeContext(JSContext*)
0.9% 0.9% XUL JS_InitReflect(JSContext*, JSObject*)
0.8% 0.8% libSystem.B.dylib tiny_malloc_from_free_list
0.7% 0.7% libSystem.B.dylib szone_size
0.7% 0.7% libSystem.B.dylib szone_free
0.7% 0.7% XUL js::StackSpace::sizeOf()
0.6% 0.6% libobjc.A.dylib __spin_lock
0.6% 0.6% XUL SharedStub
Original comment by Tobias.N...@gmail.com
on 15 Feb 2013 at 9:37
Does this also happen on single core systems?
Original comment by Tobias.N...@gmail.com
on 15 Feb 2013 at 9:39
Ah, I see you activated tree view in Shark, while I didn't. That explains the
values in the first column.
semaphore_wait_signal_trap had 0% in both profiles.
Original comment by Tobias.N...@gmail.com
on 15 Feb 2013 at 9:53
Yes, I see it on my iMac G4. I think something changed in the pthread
implementation between 10.4 and 10.5.
Original comment by classi...@floodgap.com
on 15 Feb 2013 at 11:42
So here is pthread_cond.c from 10.4.11's libc:
http://www.opensource.apple.com/source/Libc/Libc-391.2.10/pthreads/pthread_cond.c
and here is pthread_cond.c from 10.5.8's libc:
http://www.opensource.apple.com/source/Libc/Libc-498.1.7/pthreads/pthread_cond.c
Not only are there some differences already, but 10.5.8 is UNIX 03 conforming,
while 10.4.11 is not, so they will take different code paths. I don't know if
this is the culprit, but they are definitely not the same. See
http://www.opensource.apple.com/source/xnu/xnu-792.24.17/bsd/sys/cdefs.h
http://www.opensource.apple.com/source/xnu/xnu-1228.15.4/bsd/sys/cdefs.h
Original comment by classi...@floodgap.com
on 16 Feb 2013 at 12:03
Looking through libc, it appears possible to get at the $UNIX2003 semantics
from 10.4. This does link (obviously it doesn't work).
#include <stdlib.h>
#include <pthread.h>
#include <sys/time.h>

int main(int argc, char **argv) {
    pthread_cond_t *x;
    pthread_mutex_t *m;
    struct timespec *y;
    return pthread_cond_wait$UNIX2003(x, m, y, 0);
}
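For reference, a hedged alternative is to bind a plain C name to the same symbol with a GCC-style asm label instead of spelling the '$' at the call site; my_cond_wait_unix03 is a made-up name, and the prototype below is the standard two-argument pthread_cond_wait:

#include <pthread.h>

/* Sketch only: alias the UNIX03 variant (note the leading underscore in
   the Mach-O symbol name), then call my_cond_wait_unix03(&cond, &mutex)
   like a normal pthread_cond_wait. */
int my_cond_wait_unix03(pthread_cond_t *, pthread_mutex_t *)
    __asm("_pthread_cond_wait$UNIX2003");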
Original comment by classi...@floodgap.com
on 16 Feb 2013 at 12:11
Can you do nm /usr/lib/libc.dylib | grep pthread_cond and make sure 10.5 has
the $UNIX2003 versions also?
/usr/lib/libc.dylib(pthread_cond-UNIX03.So):
:
:
90094e20 T _pthread_cond_wait$UNIX2003
:
:
Original comment by classi...@floodgap.com
on 16 Feb 2013 at 12:13
There is also a $NOCANCEL version.
nm /usr/lib/libc.dylib | grep pthread_cond_wait
0010fba8 t ___pthread_cond_wait
00043e40 t __pthread_cond_wait
0008c848 T _pthread_cond_wait
0007b77c T _pthread_cond_wait$NOCANCEL$UNIX2003
00043e04 T _pthread_cond_wait$UNIX2003
Original comment by Tobias.N...@gmail.com
on 16 Feb 2013 at 4:56
I think for 20 I'm going to try the $UNIX2003 versions for 10.4, where
supported. They shouldn't make any difference on 10.5 since it should run the
UNIX 03 versions by default.
Original comment by classi...@floodgap.com
on 16 Feb 2013 at 11:13
NSPR does work and passes all tests it passed previously. cvar2 in nsprpub
shows no time difference between the two, but we presume there is little thread
contention in the test case. Leaving it to rebuild overnight.
Original comment by classi...@floodgap.com
on 18 Feb 2013 at 6:24
No difference in the test case; it still spends nearly as much time in the
routine as it did before. Just to check wall-times, on this quad G5 in Reduced
power mode, with nothing else open, KFI renders in about 45-50 seconds from
start to complete. How fast does AuroraFox do it?
Original comment by classi...@floodgap.com
on 18 Feb 2013 at 6:17
On an iMac G5 2.1 in automatic performance mode with KFI already in the
browser's cache while running Mail and Safari (with several tabs open),
completely loading and rendering KFI takes 13-14 seconds both in TFF 19 and
Aurora 19.
Original comment by Tobias.N...@gmail.com
on 19 Feb 2013 at 7:00
This takes seven seconds in Safari using current Leopard WebKit.
Original comment by Tobias.N...@gmail.com
on 19 Feb 2013 at 7:01
Actually more like 10-13 seconds in Aurora and TenFourFox under conditions
similar to Safari using Leopard WebKit.
Original comment by Tobias.N...@gmail.com
on 19 Feb 2013 at 7:08
Aurora and TenFourFox had AdBlock Plus enabled.
Original comment by Tobias.N...@gmail.com
on 19 Feb 2013 at 7:48
Mine also has Adblock on. So this clearly indicates something is really wrong
on 10.4. At least it's not the browser itself, apparently.
I need to look more at the kernel semaphore wait code.
Original comment by classi...@floodgap.com
on 19 Feb 2013 at 8:04
So probably the reason they didn't make a difference is that both conforming
and non-conforming versions end up at the same place in the kernel through
different routes. The non-conforming version does a system call into
semaphore_wait_signal_trap, which then becomes
semaphore_wait_signal_trap_internal in kernelspace. The conforming version
calls __semwait_signal, which directly calls
semaphore_wait_signal_trap_internal.
http://www.opensource.apple.com/source/xnu/xnu-792.24.17/bsd/kern/kern_sig.c
http://www.opensource.apple.com/source/xnu/xnu-792.24.17/osfmk/kern/sync_sema.c
Original comment by classi...@floodgap.com
on 19 Feb 2013 at 9:09
One extremely evil thought is to try to rewrite as condition variables under
MPTasks. That should still work in 10.5, but I have to make sure it doesn't
ultimately wind up in the same code.
Original comment by classi...@floodgap.com
on 19 Feb 2013 at 9:20
The MPWaitOnSemaphore function ends up at the same code. I'm out of bright
ideas for the moment. http://forums.codeblocks.org/index.php?topic=15263.15;wap2
Original comment by classi...@floodgap.com
on 19 Feb 2013 at 9:28
Or ... we create a new implementation of NSPR condvars that uses POSIX
semaphores. http://research.microsoft.com/pubs/64242/ImplementingCVs.pdf
Original comment by classi...@floodgap.com
on 19 Feb 2013 at 11:01
(and more importantly spinlocks to grab the mutexes, which will presumably be
lighter weight)
Original comment by classi...@floodgap.com
on 19 Feb 2013 at 11:02
... and, in fact, the algorithm described in the Microsoft paper is indeed how
it was implemented for Bthreads. I might play with this later. There is
probably some good way to use spinlocks to avoid unnecessary context switches.
Original comment by classi...@floodgap.com
on 19 Feb 2013 at 11:30
That doesn't seem to scale too well.
Looking at consumers, the biggest user of condvars is the XPCOM
ReentrantMonitor implementation, which is implemented in terms of PRMonitor.
It seems like we could do better than the current approach by using spinlocks
and semaphores:
PR_NewMonitor creates a new monitor object. We give it an OSSpinLock
(malloc(sizeof(OSSpinLock))), compute a "temporary file" name as a key using
mktemp(3) on a stack-allocated buffer, and hand that to sem_open to make a new
semaphore. PR_DestroyMonitor destroys all of these.
PR_EnterMonitor does an OSSpinLockLock on the allocated spinlock, blocking
until the lock is acquired. Once it has it, it increments entry_count.
PR_LeaveMonitor decrements entry_count. If entry_count reaches zero, it does an
OSSpinLockUnlock on the spinlock.
PR_Wait saves the entry count, sets it to zero, unlocks the spinlock, and waits
on the semaphore. When it gets awakened by the semaphore, it locks the spinlock
(spinning until it does) and restores the entry count.
In this design, there is no difference between PR_Notify and PR_NotifyAll; they
will both do a sem_post(). Any of the threads waiting on the semaphore will
wake up and one of them will successfully get the lock again and the others
will spin. Of note, however, is that most consumers do a PR_Notify and not a
PR_NotifyAll.
The potential problem is that if the critical section is long, we could have a
lot of threads doing spinlocks, which would chug.
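A minimal sketch of that design, assuming OSSpinLock from libkern/OSAtomic.h and named POSIX semaphores via sem_open; the Mini* names are made up for illustration and this is not the actual NSPR patch:

/* Sketch of the monitor design described above: one OSSpinLock plus one
   named POSIX semaphore per monitor. Error handling is minimal. */
#include <libkern/OSAtomic.h>
#include <semaphore.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

typedef struct {
    OSSpinLock lock;   /* guards the monitor */
    sem_t *sem;        /* Wait blocks here */
    int entry_count;   /* re-entry depth of the owning thread */
    char name[40];     /* key handed to sem_open */
} MiniMonitor;

MiniMonitor *Mini_NewMonitor(void) {
    MiniMonitor *mon = calloc(1, sizeof(*mon));
    if (!mon) return NULL;
    /* Unique key for sem_open (the comment above suggests deriving one with
       mktemp(3) on a stack buffer; this is just an illustration). */
    snprintf(mon->name, sizeof(mon->name), "/minimon.%d.%p",
             getpid(), (void *)mon);
    mon->lock = OS_SPINLOCK_INIT;
    mon->sem = sem_open(mon->name, O_CREAT | O_EXCL, 0600, 0);
    if (mon->sem == SEM_FAILED) { free(mon); return NULL; }
    return mon;
}

void Mini_DestroyMonitor(MiniMonitor *mon) {
    sem_close(mon->sem);
    sem_unlink(mon->name);
    free(mon);
}

void Mini_Enter(MiniMonitor *mon) {
    /* Spin until the monitor is ours, then bump the entry count. (A real
       ReentrantMonitor would also need an owner check so a re-entering
       owner doesn't spin against itself.) */
    OSSpinLockLock(&mon->lock);
    mon->entry_count++;
}

void Mini_Leave(MiniMonitor *mon) {
    if (--mon->entry_count == 0)
        OSSpinLockUnlock(&mon->lock);
}

void Mini_Wait(MiniMonitor *mon) {
    int saved = mon->entry_count;   /* save and clear the re-entry depth */
    mon->entry_count = 0;
    OSSpinLockUnlock(&mon->lock);
    sem_wait(mon->sem);             /* block until a Notify posts */
    OSSpinLockLock(&mon->lock);     /* spin to re-acquire the monitor */
    mon->entry_count = saved;
}

void Mini_Notify(MiniMonitor *mon) {
    sem_post(mon->sem);             /* Notify and NotifyAll are identical here */
}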
Original comment by classi...@floodgap.com
on 20 Feb 2013 at 4:47
Spinlocks do have backoff, however (see the relinquish code at the bottom):
http://opensource.apple.com/source/xnu/xnu-792/osfmk/ppc/commpage/spinlocks.s?txt
Original comment by classi...@floodgap.com
on 20 Feb 2013 at 4:57
Okay, I have a miniature implementation running. However,
nsprpub/pr/tests/suspend.c hangs in PR_Wait. It seems to do this with the
regular pthreads implementation too. Does this work for you in 10.5?
Original comment by classi...@floodgap.com
on 20 Feb 2013 at 6:39
We need to dump sem_wait; it doesn't support a timeout. But MP semaphores do.
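A hedged sketch of a timed wait using Multiprocessing Services, assuming the MPCreateSemaphore/MPWaitOnSemaphore calls from CoreServices behave as documented; this is an illustration, not the actual change:

#include <CoreServices/CoreServices.h>

/* Sketch only: wait up to 500 ms on an MP semaphore, which is the timeout
   behaviour plain sem_wait lacks. */
static OSStatus timed_wait_sketch(void) {
    MPSemaphoreID sem;
    OSStatus err = MPCreateSemaphore(1, 0, &sem);   /* max 1, initially 0 */
    if (err != noErr)
        return err;
    err = MPWaitOnSemaphore(sem, 500 * kDurationMillisecond);
    MPDeleteSemaphore(sem);
    return err;   /* kMPTimeoutErr if nothing signalled it in time */
}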
Original comment by classi...@floodgap.com
on 21 Feb 2013 at 2:36
This doesn't seem like the right way to go. NSPR perf testing shows this
increases the context switching time from 17usec to 51usec. I think we're stuck
with the condvar implementation.
Original comment by classi...@floodgap.com
on 21 Feb 2013 at 5:14
Posting the diff here in case I get a brainwave about it later.
Original comment by classi...@floodgap.com
on 21 Feb 2013 at 5:16
Attachments:
Hi guys, I had posted a couple of comments in the help area before I found this.
I think this site may exhibit the problem you're talking about:
http://uploaded.net/file/3bzxf4e0
This site is very slow on 10.4.11 (powerbook) and tff 17.0.2 with this shark
trace:
0.0% 95.5% mach_kernel thandler
0.1% 95.5% mach_kernel trap
0.2% 93.5% mach_kernel vm_fault
0.1% 90.8% mach_kernel vm_map_lookup_locked
90.7% 90.7% mach_kernel vm_map_lookup_entry
0.0% 0.0% mach_kernel _mutex_lock
0.0% 0.0% mach_kernel _mutex_lock
0.0% 0.0% mach_kernel IOService::systemWake()
This same URL on 10.5.8 (g4 mac mini) and tff 17.0.1 is *much* faster.
Let me know if/how I can help with testing etc. The "sample" function from
activity monitor shows a very long call list that seems to be related to
graphics. I guess that site is pulsing the logo with a bunch of graphics calls
or something. I can post that output also if it would be useful.
Original comment by buckskin...@gmail.com
on 21 Feb 2013 at 9:28
I certainly see the slowness you refer to, but realistically, more testing
isn't going to tell me a whole lot, and I'm not sure this isn't another,
separate kernel performance issue on 10.4 (which the semaphore wait problem
seems to be).
If you have them handy, though, certainly post any other traces you have. My
suspicion that this is related to IPC layers stands, and other than trying to
make them lighter-weight, I don't think I can do anything about the kernel
performance issue.
Original comment by classi...@floodgap.com
on 22 Feb 2013 at 5:21
Here is the activity monitor sample for that URL. So you were able to
reproduce what I'm seeing on 10.4?
Original comment by buckskin...@gmail.com
on 22 Feb 2013 at 7:34
Attachments:
There's not actually a lot interesting in that sample, tbh. Those calls are
expected.
Original comment by classi...@floodgap.com
on 22 Feb 2013 at 10:36
The summary at the bottom shows most of the calls are semaphore waits, which is
what you're talking about, right?
semaphore_wait_signal_trap 5529
semaphore_timedwait_signal_trap 975
kevent 296
select 295
__memcpy 66
mach_msg_trap 60
__bzero 22
mergeQuadratic 16
Original comment by buckskin...@gmail.com
on 22 Feb 2013 at 10:46
Yes. What I mean is, it doesn't tell me a whole lot else about what's going on,
even though I appreciate the attempt.
Original comment by classi...@floodgap.com
on 22 Feb 2013 at 11:18
Original comment by classi...@floodgap.com
on 13 Apr 2013 at 12:16
Original comment by classi...@floodgap.com
on 13 Apr 2013 at 12:18
Original comment by classi...@floodgap.com
on 1 Jul 2013 at 3:42
I observed that a Safari session that has been used for several days gets less
and less responsive and spends a lot of time in blocked threads. This seems to
be related to the total number of timers/events. DOM events are supposed to be
coalesced in WebKit, but there seems to have been a bug in that implementation
which was discovered and fixed only recently (I haven't verified yet whether
that fix addresses the situation I observed).
Original comment by Tobias.N...@gmail.com
on 14 Jul 2013 at 10:37
What does this have to do with TenFourFox? Well, the profiles I got were
similar to the ones posted here. I think it's worth mentioning here since this
is on 10.5.
Original comment by Tobias.N...@gmail.com
on 14 Jul 2013 at 10:39
I'll try TFF on my PowerBook G4 running the hacked Intel-Tiger - maybe it runs
better there than on PowerPC-Tiger (multithreaded OpenGL was Intel-Tiger only
as well).
Original comment by Tobias.N...@gmail.com
on 14 Jul 2013 at 10:42
Hmmm... I was really considering putting more work into AuroraFox/TFF, but
after testing parallel browsing of kfi640.com, ifixit.com and Google Groups I'm
somewhat disappointed.
In 10.4 the browser is unresponsive most of the time and in 10.5 it's barely
tolerable (both AuroraFox and TFF).
Trying the same using a current WebKit build is an acceptable experience -
although there's still no JavaScript JIT in WebKit at all.
WebKit doesn't do much multithreading; the JavaScript garbage collector runs in
a separate thread by default (but that can be disabled at build time), and
database work like the favicon database is done in a separate thread as well.
HTML parsing can optionally be done in a separate thread but that is not
enabled by default in any port (now that Chrom[e|ium] has forked).
It seems the Google idea of having one process per browser tab/window provides
the fastest multi processing experience on OS X - and Stainless brings a touch
of that to 10.5, for those who care.
Original comment by Tobias.N...@gmail.com
on 16 Jul 2013 at 6:59
So, are you saying that issue 231 should apply to 10.5 as well (i.e., fool
Firefox into thinking there is only one CPU so it won't try to schedule lots of
threads)?
Is there a difference between 10.4Fx and AuroraFox on 10.5? That would at least
tell me whether jemalloc makes much of a difference, though it seemed to on
10.4. I have a new jemalloc build made for 22 which I have yet to test, so I
can figure out how to solve issue 218.
Original comment by classi...@floodgap.com
on 16 Jul 2013 at 8:20
Original issue reported on code.google.com by
classi...@floodgap.com
on 30 Nov 2012 at 6:56