It occurs to me that IPC layers could account for this.
Original comment by classi...@floodgap.com
on 3 Dec 2012 at 9:00
https://bugzilla.mozilla.org/show_bug.cgi?id=353962 shows a similar bug and how
it was tracked down. This isn't actually our issue (they fixed it in the same
way we fixed issue 191) but some of the analysis might give us some tips.
Original comment by classi...@floodgap.com
on 12 Jan 2013 at 3:54
On meethue.com,
Program received signal SIGINT, Interrupt.
0x904c0d00 in decode_meshed_mx8bpc_x_3s ()
(gdb) info threads
39 process 12175 thread 0x78c3 0x90053f88 in semaphore_timedwait_signal_trap ()
38 process 12175 thread 0xa027 0x90053f88 in semaphore_timedwait_signal_trap ()
37 process 12175 thread 0xc10f 0x90053f88 in semaphore_timedwait_signal_trap ()
33 process 12175 thread 0xb8d7 0x90053f88 in semaphore_timedwait_signal_trap ()
32 process 12175 thread 0x77ff 0x90053f88 in semaphore_timedwait_signal_trap ()
31 process 12175 thread 0xc203 0x90053f88 in semaphore_timedwait_signal_trap ()
30 process 12175 thread 0xb90b 0x9002bfc8 in semaphore_wait_signal_trap ()
29 process 12175 thread 0x9e0b 0x9002bfc8 in semaphore_wait_signal_trap ()
28 process 12175 thread 0xa317 0x9002bfc8 in semaphore_wait_signal_trap ()
27 process 12175 thread 0xa217 0x9002bfc8 in semaphore_wait_signal_trap ()
25 process 12175 thread 0x5043 0x9002bfc8 in semaphore_wait_signal_trap ()
24 process 12175 thread 0x9907 0x9002bfc8 in semaphore_wait_signal_trap ()
23 process 12175 thread 0x9d07 0x9002bfc8 in semaphore_wait_signal_trap ()
22 process 12175 thread 0x9c07 0x9002bfc8 in semaphore_wait_signal_trap ()
21 process 12175 thread 0x9607 0x9002bfc8 in semaphore_wait_signal_trap ()
20 process 12175 thread 0x8003 0x9002bfc8 in semaphore_wait_signal_trap ()
19 process 12175 thread 0x7b57 0x9002bfc8 in semaphore_wait_signal_trap ()
17 process 12175 thread 0x7903 0x9002bfc8 in semaphore_wait_signal_trap ()
15 process 12175 thread 0x7503 0x9002bfc8 in semaphore_wait_signal_trap ()
14 process 12175 thread 0x725f 0x9002bfc8 in semaphore_wait_signal_trap ()
13 process 12175 thread 0x7007 0x9002bfc8 in semaphore_wait_signal_trap ()
12 process 12175 thread 0x5c03 0x9002bfc8 in semaphore_wait_signal_trap ()
10 process 12175 thread 0x4703 0x9002bfc8 in semaphore_wait_signal_trap ()
9 process 12175 thread 0x4603 0x90053f88 in semaphore_timedwait_signal_trap ()
8 process 12175 thread 0x4403 0x90053f88 in semaphore_timedwait_signal_trap ()
7 process 12175 thread 0x4303 0x9002bfc8 in semaphore_wait_signal_trap ()
6 process 12175 thread 0x4203 0x9002bfc8 in semaphore_wait_signal_trap ()
5 process 12175 thread 0x3703 0x9002bfc8 in semaphore_wait_signal_trap ()
4 process 12175 thread 0x3603 0x9001f48c in select ()
3 process 12175 thread 0x3203 0x9002bfc8 in semaphore_wait_signal_trap ()
2 process 12175 thread 0x3103 0x9002ec8c in kevent ()
* 1 process 12175 local thread 0xf03 0x904c0d00 in decode_meshed_mx8bpc_x_3s ()
Good grief.
Original comment by classi...@floodgap.com
on 12 Jan 2013 at 4:25
It's behaving better now. I don't know what the difference is.
Original comment by classi...@floodgap.com
on 12 Jan 2013 at 4:45
Original comment by classi...@floodgap.com
on 29 Jan 2013 at 9:30
Despite benchmarks suggesting that pthread mutexes are very slow on OS X,
there are quite a few mailing list threads and benchmarks revealing that
Apple's pthread mutexes aren't fast in raw throughput but are instead heavily
tuned for real-world multithreading situations.
Furthermore, Apple makes extensive use of pthread mutexes in the system
libraries, especially where speed is important, at least in 10.5.
Here are two benchmarks (the second being an improved version of the first),
BUT more important than the numbers is the description(!):
http://cocoasamurai.blogspot.de/2008/04/osspinlock-lock-showdown-posix-locks-vs.html
http://perpendiculo.us/?p=133
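For comparison, here is a minimal single-threaded timing sketch in the spirit of those benchmarks, assuming OSSpinLock from libkern/OSAtomic.h and mach_absolute_time for timing. It only measures the uncontended fast path, which is exactly the limitation those posts warn about:

/* Sketch only: times ITERATIONS uncontended lock/unlock pairs for a
   pthread mutex and an OSSpinLock. */
#include <libkern/OSAtomic.h>
#include <pthread.h>
#include <mach/mach_time.h>
#include <stdint.h>
#include <stdio.h>

#define ITERATIONS 10000000

static uint64_t to_ns(uint64_t ticks) {
    mach_timebase_info_data_t tb;
    mach_timebase_info(&tb);
    return ticks * tb.numer / tb.denom;
}

int main(void) {
    pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    OSSpinLock spin = OS_SPINLOCK_INIT;
    uint64_t start;
    int i;

    start = mach_absolute_time();
    for (i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&mutex);
        pthread_mutex_unlock(&mutex);
    }
    printf("pthread_mutex: %llu ns total\n",
           (unsigned long long)to_ns(mach_absolute_time() - start));

    start = mach_absolute_time();
    for (i = 0; i < ITERATIONS; i++) {
        OSSpinLockLock(&spin);
        OSSpinLockUnlock(&spin);
    }
    printf("OSSpinLock:    %llu ns total\n",
           (unsigned long long)to_ns(mach_absolute_time() - start));
    return 0;
}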
Original comment by Tobias.N...@gmail.com
on 15 Feb 2013 at 7:44
I think it's pretty obvious we're hitting an edge case with www.kfiam640.com
since we don't hit it all the time. But do you see similar time stuck in
semaphore wait on 10.5 with AuroraFox when you run it against Shark?
Original comment by classi...@floodgap.com
on 15 Feb 2013 at 4:15
A standard time profile of the firefox binary (Aurora 20) looks very
different here.
I clicked through images and links while the profile was being created. The
browser was as responsive as always.
17.4% 17.4% mach_kernel vm_map_lookup_entry
10.7% 10.7% mach_kernel vm_map_enter
4.4% 4.4% mach_kernel ml_set_interrupts_enabled
2.6% 2.6% libmozjs.dylib JS_IterateCompartments(JSRuntime*, void*, void (*)(JSRuntime*, void*, JSCompartment*))
2.5% 2.5% mach_kernel hw_rem_map
1.8% 1.8% libobjc.A.dylib __bzero
1.8% 1.8% libmozjs.dylib JS_EnumerateDiagnosticMemoryRegions(int (*)(void*, unsigned long))
1.8% 1.8% mach_kernel hw_add_map
1.8% 1.8% libmozjs.dylib JS::PerfMeasurement::canMeasureSomething()
1.7% 1.7% mach_kernel hw_protect
1.7% 1.7% CoreGraphics vecCGSBlendXXXX8888
1.3% 1.3% libobjc.A.dylib __memcpy
1.3% 1.3% XUL _ZNSt6vectorIsSaIsEE19_M_emplace_back_auxIIsEEEvDpOT_
1.1% 1.1% libmozglue.dylib 0x22450 [21.9KB]
1.1% 1.1% libmozjs.dylib js_GetSrcNoteOffset(unsigned char*, unsigned int)
0.9% 0.9% XUL _ZNSt8_Rb_treeIjSt4pairIKjjESt10_Select1stIS2_ESt4lessIjESaIS2_EE17_M_insert_unique_IS2_EESt17_Rb_tree_iteratorIS2_ESt23_Rb_tree_const_iteratorIS2_EOT_
0.8% 0.8% CoreGraphics CGSFillDRAM8by1
0.7% 0.7% libobjc.A.dylib objc_msgSend_rtp
0.7% 0.7% libobjc.A.dylib __sys_icache_invalidate
0.6% 0.6% XUL xpc_LocalizeContext(JSContext*)
0.6% 0.6% XUL void std::__uninitialized_default_n_1<true>::__uninit_default_n<kinfo_proc*, unsigned long>(kinfo_proc*, unsigned long)
0.6% 0.6% libmozglue.dylib 0x28950 [12.2KB]
0.6% 0.6% libmozjs.dylib js::StringIsArrayIndex(JSLinearString*, unsigned int*)
Original comment by Tobias.N...@gmail.com
on 15 Feb 2013 at 9:28
Here's a profile of TFF 19 running on 10.5 while browsing www.kfiam640.com:
14.6% 14.6% mach_kernel vm_map_lookup_entry
7.6% 7.6% mach_kernel vm_map_enter
4.5% 4.5% mach_kernel ml_set_interrupts_enabled
3.9% 3.9% XUL js::UnmarkGrayGCThingRecursively(void*, JSGCTraceKind)
2.6% 2.6% XUL JS_EnumerateDiagnosticMemoryRegions(int (*)(void*, unsigned long))
2.3% 2.3% mach_kernel hw_rem_map
2.0% 2.0% XUL 0x42a005c [1.5KB]
2.0% 2.0% CoreGraphics vecCGSBlendXXXX8888
2.0% 2.0% XUL js_GetSrcNoteOffset(unsigned char*, unsigned int)
1.8% 1.8% mach_kernel hw_add_map
1.7% 1.7% libobjc.A.dylib __memcpy
1.5% 1.5% libobjc.A.dylib __bzero
1.4% 1.4% mach_kernel hw_protect
1.3% 1.3% XUL std::vector<float, std::allocator<float> >::_M_default_append(unsigned long)
1.1% 1.1% XUL js::StringIsArrayIndex(JSLinearString*, unsigned int*)
1.0% 1.0% libobjc.A.dylib __sys_icache_invalidate
1.0% 1.0% XUL xpc_LocalizeContext(JSContext*)
0.9% 0.9% XUL JS_InitReflect(JSContext*, JSObject*)
0.8% 0.8% libSystem.B.dylib tiny_malloc_from_free_list
0.7% 0.7% libSystem.B.dylib szone_size
0.7% 0.7% libSystem.B.dylib szone_free
0.7% 0.7% XUL js::StackSpace::sizeOf()
0.6% 0.6% libobjc.A.dylib __spin_lock
0.6% 0.6% XUL SharedStub
Original comment by Tobias.N...@gmail.com
on 15 Feb 2013 at 9:37
Does this also happen on single core systems?
Original comment by Tobias.N...@gmail.com
on 15 Feb 2013 at 9:39
Ah, I see you activated tree view in Shark, while I didn't. That explains the
values in the first column.
semaphore_wait_signal_trap had 0% in both profiles.
Original comment by Tobias.N...@gmail.com
on 15 Feb 2013 at 9:53
Yes, I see it on my iMac G4. I think something changed in the pthread
implementation between 10.4 and 10.5.
Original comment by classi...@floodgap.com
on 15 Feb 2013 at 11:42
So here is pthread_cond.c from 10.4.11's libc:
http://www.opensource.apple.com/source/Libc/Libc-391.2.10/pthreads/pthread_cond.c
and here is pthread_cond.c from 10.5.8's libc:
http://www.opensource.apple.com/source/Libc/Libc-498.1.7/pthreads/pthread_cond.c
Not only are there some differences already, but 10.5.8 is UNIX 03 conforming,
while 10.4.11 is not, so they will take different code paths. I don't know if
this is the culprit, but they are definitely not the same. See
http://www.opensource.apple.com/source/xnu/xnu-792.24.17/bsd/sys/cdefs.h
http://www.opensource.apple.com/source/xnu/xnu-1228.15.4/bsd/sys/cdefs.h
Original comment by classi...@floodgap.com
on 16 Feb 2013 at 12:03
Looking through libc, it appears possible to get at the $UNIX2003 semantics
from 10.4. This does link (obviously it doesn't work).
#include <stdlib.h>
#include <pthread.h>
#include <sys/time.h>

int main(int argc, char **argv) {
    pthread_cond_t *x;
    pthread_mutex_t *m;
    struct timespec *y;
    return pthread_cond_wait$UNIX2003(x, m, y, 0);
}
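For reference, a hedged alternative is to bind a plain C name to the same symbol with a GCC-style asm label instead of spelling the '$' at the call site; my_cond_wait_unix03 is a made-up name, and the prototype below is the standard two-argument pthread_cond_wait:

#include <pthread.h>

/* Sketch only: alias the UNIX03 variant (note the leading underscore in
   the Mach-O symbol name), then call my_cond_wait_unix03(&cond, &mutex)
   like a normal pthread_cond_wait. */
int my_cond_wait_unix03(pthread_cond_t *, pthread_mutex_t *)
    __asm("_pthread_cond_wait$UNIX2003");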
Original comment by classi...@floodgap.com
on 16 Feb 2013 at 12:11
Can you do nm /usr/lib/libc.dylib | grep pthread_cond and make sure 10.5 has
the $UNIX2003 versions also?
/usr/lib/libc.dylib(pthread_cond-UNIX03.So):
:
:
90094e20 T _pthread_cond_wait$UNIX2003
:
:
Original comment by classi...@floodgap.com
on 16 Feb 2013 at 12:13
There is also a $NOCANCEL version.
nm /usr/lib/libc.dylib | grep pthread_cond_wait
0010fba8 t ___pthread_cond_wait
00043e40 t __pthread_cond_wait
0008c848 T _pthread_cond_wait
0007b77c T _pthread_cond_wait$NOCANCEL$UNIX2003
00043e04 T _pthread_cond_wait$UNIX2003
Original comment by Tobias.N...@gmail.com
on 16 Feb 2013 at 4:56
I think for 20 I'm going to try the $UNIX2003 versions for 10.4, where
supported. They shouldn't make any difference on 10.5 since it should run the
UNIX 03 versions by default.
Original comment by classi...@floodgap.com
on 16 Feb 2013 at 11:13
NSPR does work and passes all tests it passed previously. cvar2 in nsprpub
shows no time difference between the two, but we presume there is little thread
contention in the test case. Leaving it to rebuild overnight.
Original comment by classi...@floodgap.com
on 18 Feb 2013 at 6:24
No difference in the test case; it still spends nearly as much time in the
routine as it did before. Just to check wall-times, on this quad G5 in Reduced
power mode, with nothing else open, KFI renders in about 45-50 seconds from
start to complete. How fast does AuroraFox do it?
Original comment by classi...@floodgap.com
on 18 Feb 2013 at 6:17
On an iMac G5 2.1 in automatic performance mode with KFI already in the
browser's cache while running Mail and Safari (with several tabs open),
completely loading and rendering KFI takes 13-14 seconds both in TFF 19 and
Aurora 19.
Original comment by Tobias.N...@gmail.com
on 19 Feb 2013 at 7:00
This takes seven seconds in Safari using current Leopard WebKit.
Original comment by Tobias.N...@gmail.com
on 19 Feb 2013 at 7:01
Actually more like 10-13 seconds in Aurora and TenFourFox under conditions
similar to Safari using Leopard WebKit.
Original comment by Tobias.N...@gmail.com
on 19 Feb 2013 at 7:08
Aurora and TenFourFox had AdBlock Plus enabled.
Original comment by Tobias.N...@gmail.com
on 19 Feb 2013 at 7:48
Mine also has Adblock on. So this clearly indicates something is really wrong
on 10.4. At least it's not the browser itself, apparently.
I need to look more at the kernel semaphore wait code.
Original comment by classi...@floodgap.com
on 19 Feb 2013 at 8:04
So probably the reason they didn't make a difference is that both conforming
and non-conforming versions end up at the same place in the kernel through
different routes. The non-conforming version does a system call into
semaphore_wait_signal_trap, which then becomes
semaphore_wait_signal_trap_internal in kernelspace. The conforming version
calls __semwait_signal, which directly calls
semaphore_wait_signal_trap_internal.
http://www.opensource.apple.com/source/xnu/xnu-792.24.17/bsd/kern/kern_sig.c
http://www.opensource.apple.com/source/xnu/xnu-792.24.17/osfmk/kern/sync_sema.c
Original comment by classi...@floodgap.com
on 19 Feb 2013 at 9:09
One extremely evil thought is to try to rewrite as condition variables under
MPTasks. That should still work in 10.5, but I have to make sure it doesn't
ultimately wind up in the same code.
Original comment by classi...@floodgap.com
on 19 Feb 2013 at 9:20
The MPWaitOnSemaphore function ends up at the same code. I'm out of bright
ideas for the moment. http://forums.codeblocks.org/index.php?topic=15263.15;wap2
Original comment by classi...@floodgap.com
on 19 Feb 2013 at 9:28
Or ... we create a new implementation of NSPR condvars that uses POSIX
semaphores. http://research.microsoft.com/pubs/64242/ImplementingCVs.pdf
Original comment by classi...@floodgap.com
on 19 Feb 2013 at 11:01
(and more importantly spinlocks to grab the mutexes, which will presumably be
lighter weight)
Original comment by classi...@floodgap.com
on 19 Feb 2013 at 11:02
... and, in fact, the algorithm described in the Microsoft paper is indeed how
it was implemented for Bthreads. I might play with this later. There is
probably some good way to use spinlocks to avoid unnecessary context switches.
Original comment by classi...@floodgap.com
on 19 Feb 2013 at 11:30
That doesn't seem to scale too well.
Looking at consumers, the biggest user of condvars is the XPCOM
ReentrantMonitor implementation, which is implemented in terms of PRMonitor.
It seems like we could do better than the current approach by using spinlocks
and semaphores:
PR_NewMonitor creates a new monitor object. We give it an OSSpinLock
(malloc(sizeof(OSSpinLock))), compute a "temporary file" name as a key using
mktemp(3) on a stack-allocated buffer, and hand that to sem_open to make a new
semaphore. PR_DestroyMonitor destroys all of these.
PR_EnterMonitor does an OSSpinLockLock on the allocated spinlock, blocking
until the lock is acquired. Once it has it, it increments entry_count.
PR_LeaveMonitor decrements entry_count. If entry_count reaches zero, it does an
OSSpinLockUnlock on the spinlock.
PR_Wait saves the entry count, sets it to zero, unlocks the spinlock, and waits
on the semaphore. When it gets awakened by the semaphore, it locks the spinlock
(spinning until it does) and restores the entry count.
In this design, there is no difference between PR_Notify and PR_NotifyAll; they
will both do a sem_post(). Any of the threads waiting on the semaphore will
wake up and one of them will successfully get the lock again and the others
will spin. Of note, however, is that most consumers do a PR_Notify and not a
PR_NotifyAll.
The potential problem is that if the critical section is long, we could have a
lot of threads doing spinlocks, which would chug.
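A minimal sketch of that design, assuming OSSpinLock from libkern/OSAtomic.h and named POSIX semaphores via sem_open; the Mini* names are made up for illustration and this is not the actual NSPR patch:

/* Sketch of the monitor design described above: one OSSpinLock plus one
   named POSIX semaphore per monitor. Error handling is minimal. */
#include <libkern/OSAtomic.h>
#include <semaphore.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

typedef struct {
    OSSpinLock lock;   /* guards the monitor */
    sem_t *sem;        /* Wait blocks here */
    int entry_count;   /* re-entry depth of the owning thread */
    char name[40];     /* key handed to sem_open */
} MiniMonitor;

MiniMonitor *Mini_NewMonitor(void) {
    MiniMonitor *mon = calloc(1, sizeof(*mon));
    if (!mon) return NULL;
    /* Unique key for sem_open (the comment above suggests deriving one with
       mktemp(3) on a stack buffer; this is just an illustration). */
    snprintf(mon->name, sizeof(mon->name), "/minimon.%d.%p",
             getpid(), (void *)mon);
    mon->lock = OS_SPINLOCK_INIT;
    mon->sem = sem_open(mon->name, O_CREAT | O_EXCL, 0600, 0);
    if (mon->sem == SEM_FAILED) { free(mon); return NULL; }
    return mon;
}

void Mini_DestroyMonitor(MiniMonitor *mon) {
    sem_close(mon->sem);
    sem_unlink(mon->name);
    free(mon);
}

void Mini_Enter(MiniMonitor *mon) {
    /* Spin until the monitor is ours, then bump the entry count. (A real
       ReentrantMonitor would also need an owner check so a re-entering
       owner doesn't spin against itself.) */
    OSSpinLockLock(&mon->lock);
    mon->entry_count++;
}

void Mini_Leave(MiniMonitor *mon) {
    if (--mon->entry_count == 0)
        OSSpinLockUnlock(&mon->lock);
}

void Mini_Wait(MiniMonitor *mon) {
    int saved = mon->entry_count;   /* save and clear the re-entry depth */
    mon->entry_count = 0;
    OSSpinLockUnlock(&mon->lock);
    sem_wait(mon->sem);             /* block until a Notify posts */
    OSSpinLockLock(&mon->lock);     /* spin to re-acquire the monitor */
    mon->entry_count = saved;
}

void Mini_Notify(MiniMonitor *mon) {
    sem_post(mon->sem);             /* Notify and NotifyAll are identical here */
}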
Original comment by classi...@floodgap.com
on 20 Feb 2013 at 4:47
Spinlocks do have backoff, however (see the relinquish code at the bottom):
http://opensource.apple.com/source/xnu/xnu-792/osfmk/ppc/commpage/spinlocks.s?txt
Original comment by classi...@floodgap.com
on 20 Feb 2013 at 4:57
Okay, I have a miniature implementation running. However,
nsprpub/pr/tests/suspend.c hangs in PR_Wait. It seems to do this with the
regular pthreads implementation too. Does this work for you in 10.5?
Original comment by classi...@floodgap.com
on 20 Feb 2013 at 6:39
We need to dump sem_wait; it doesn't support a timeout. But MP semaphores do.
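A hedged sketch of a timed wait using Multiprocessing Services, assuming the MPCreateSemaphore/MPWaitOnSemaphore calls from CoreServices behave as documented; this is an illustration, not the actual change:

#include <CoreServices/CoreServices.h>

/* Sketch only: wait up to 500 ms on an MP semaphore, which is the timeout
   behaviour plain sem_wait lacks. */
static OSStatus timed_wait_sketch(void) {
    MPSemaphoreID sem;
    OSStatus err = MPCreateSemaphore(1, 0, &sem);   /* max 1, initially 0 */
    if (err != noErr)
        return err;
    err = MPWaitOnSemaphore(sem, 500 * kDurationMillisecond);
    MPDeleteSemaphore(sem);
    return err;   /* kMPTimeoutErr if nothing signalled it in time */
}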
Original comment by classi...@floodgap.com
on 21 Feb 2013 at 2:36
This doesn't seem like the right way to go. NSPR perf testing shows this
increases the context switching time from 17usec to 51usec. I think we're stuck
with the condvar implementation.
Original comment by classi...@floodgap.com
on 21 Feb 2013 at 5:14
Posting the diff here in case I get a brainwave about it later.
Original comment by classi...@floodgap.com
on 21 Feb 2013 at 5:16
Attachments:
Hi guys, I had posted a couple of comments in the help area before I found this.
I think this site may exhibit the problem you're talking about:
http://uploaded.net/file/3bzxf4e0
This site is very slow on 10.4.11 (powerbook) and tff 17.0.2 with this shark
trace:
0.0% 95.5% mach_kernel thandler
0.1% 95.5% mach_kernel trap
0.2% 93.5% mach_kernel vm_fault
0.1% 90.8% mach_kernel vm_map_lookup_locked
90.7% 90.7% mach_kernel vm_map_lookup_entry
0.0% 0.0% mach_kernel _mutex_lock
0.0% 0.0% mach_kernel _mutex_lock
0.0% 0.0% mach_kernel IOService::systemWake()
This same URL on 10.5.8 (g4 mac mini) and tff 17.0.1 is *much* faster.
Let me know if/how I can help with testing etc. The "sample" function from
activity monitor shows a very long call list that seems to be related to
graphics. I guess that site is pulsing the logo with a bunch of graphics calls
or something. I can post that output also if it would be useful.
Original comment by buckskin...@gmail.com
on 21 Feb 2013 at 9:28
I certainly see the slowness you refer to, but realistically, more testing
isn't going to tell me a whole lot, and I'm not sure this isn't another,
separate kernel performance issue on 10.4 (which the semaphore wait problem
seems to be).
If you have them handy, though, certainly post any other traces you have. My
suspicion that this is related to IPC layers stands, and other than trying to
make them lighter-weight, I don't think I can do anything about the kernel
performance issue.
Original comment by classi...@floodgap.com
on 22 Feb 2013 at 5:21
Here is the activity monitor sample for that URL. So you were able to
reproduce what I'm seeing on 10.4?
Original comment by buckskin...@gmail.com
on 22 Feb 2013 at 7:34
Attachments:
There's not actually a lot interesting in that sample, tbh. Those calls are
expected.
Original comment by classi...@floodgap.com
on 22 Feb 2013 at 10:36
The summary at the bottom shows most of the calls are semaphore waits, which is
what you're talking about, right?
semaphore_wait_signal_trap 5529
semaphore_timedwait_signal_trap 975
kevent 296
select 295
__memcpy 66
mach_msg_trap 60
__bzero 22
mergeQuadratic 16
Original comment by buckskin...@gmail.com
on 22 Feb 2013 at 10:46
Yes. What I mean is, it doesn't tell me a whole lot else about what's going on,
even though I appreciate the attempt.
Original comment by classi...@floodgap.com
on 22 Feb 2013 at 11:18
Original comment by classi...@floodgap.com
on 13 Apr 2013 at 12:16
Original comment by classi...@floodgap.com
on 13 Apr 2013 at 12:18
Original comment by classi...@floodgap.com
on 1 Jul 2013 at 3:42
I observed that a Safari session that has been used for several days gets less
and less responsive and spends a lot of time in blocked threads. This seems to
be related to the total number of timers/events. DOM events are supposed to be
coalesced in WebKit, but there seems to have been a bug in that implementation
which was discovered and fixed only recently (I haven't verified yet whether
that fix addresses the situation I observed).
Original comment by Tobias.N...@gmail.com
on 14 Jul 2013 at 10:37
What does this have to do with TenFourFox? Well, the profiles I got were
similar to the ones posted here. I think it's worth mentioning here since this
is on 10.5.
Original comment by Tobias.N...@gmail.com
on 14 Jul 2013 at 10:39
I'll try TFF on my PowerBook G4 running the hacked Intel-Tiger - maybe it runs
better there than on PowerPC-Tiger (multithreaded OpenGL was Intel-Tiger only
as well).
Original comment by Tobias.N...@gmail.com
on 14 Jul 2013 at 10:42
Hmmm... I was really considering putting more work into AuroraFox/TFF, but
after testing parallel browsing of kfi640.com, ifixit.com and Google Groups I'm
somewhat disappointed.
In 10.4 the browser is unresponsive most of the time and in 10.5 it's barely
tolerable (both AuroraFox and TFF).
Trying the same using a current WebKit build is an acceptable experience -
although there's still no JavaScript JIT in WebKit at all.
WebKit doesn't do much multithreading; the JavaScript garbage collector runs in
a separate thread by default (but that can be disabled at build time), and
database work like the favicon database is done in a separate thread as well.
HTML parsing can optionally be done in a separate thread but that is not
enabled by default in any port (now that Chrom[e|ium] has forked).
It seems the Google idea of having one process per browser tab/window provides
the fastest multi processing experience on OS X - and Stainless brings a touch
of that to 10.5, for those who care.
Original comment by Tobias.N...@gmail.com
on 16 Jul 2013 at 6:59
So, are you saying that issue 231 should apply to 10.5 as well (i.e., fool
Firefox into thinking there is only one CPU so it won't try to schedule lots of
threads)?
Is there a difference between 10.4Fx and AuroraFox on 10.5? That would at least
tell me whether jemalloc makes much of a difference, though it seemed to on
10.4. I have a new jemalloc build made for 22 which I have yet to test, so I
can figure out how to solve issue 218.
Original comment by classi...@floodgap.com
on 16 Jul 2013 at 8:20
Original issue reported on code.google.com by
classi...@floodgap.com
on 30 Nov 2012 at 6:56