I've compiled a small program that just returns 0.
$ time ./t
real 0m8.253s
user 0m1.065s
sys 0m7.165s
Then I've injected lines printing time(NULL) into __asan_init:
$ ./t
line: 649, time: 1325147635
line: 654, time: 1325147635
line: 658, time: 1325147635
line: 692, time: 1325147635
line: 700, time: 1325147635
line: 703, time: 1325147640
line: 706, time: 1325147640
line: 733, time: 1325147643
line: 738, time: 1325147643
line: 763, time: 1325147643
line: 772, time: 1325147643
line: 786, time: 1325147643
line: 797, time: 1325147643
These results are quite rough, but it looks like ~5 seconds are spent in
InitializeAsanInterceptors() (which calls INTERCEPT_FUNCTION many times) and
another 3 seconds in further calls to INTERCEPT_FUNCTION.
So, yes, the hypothesis that mach_override is slow is correct.
I haven't got any valuable information from Shark yet. The top line (~18-20%)
is usually ml_set_interrupts_enabled (which means many profiler ticks occurred
while the program was in the kernel); most of the other lines relate to kernel
code, too.
The most interesting part is vm_allocate(), which is called a number of times
for each interceptor -- this is most likely the culprit.
Original comment by ramosian.glider@gmail.com
on 29 Dec 2011 at 8:48
Another interesting experiment was to count the number of vm_allocate calls in
mach_override_ptr().
================================
Index: projects/compiler-rt/lib/asan/mach_override/mach_override.c
===================================================================
--- projects/compiler-rt/lib/asan/mach_override/mach_override.c (revision 147308)
+++ projects/compiler-rt/lib/asan/mach_override/mach_override.c (working copy)
@@ -451,9 +451,11 @@
int allocated = 0;
vm_map_t task_self = mach_task_self();
+ fprintf(stderr, "vm_allocates follow\n");
while( !err && !allocated && page != last ) {
err = vm_allocate( task_self, &page, pageSize, 0 );
+ fprintf(stderr, "vm_allocate\n");
if( err == err_none )
allocated = 1;
else if( err == KERN_NO_SPACE ) {
================================
$ ./t > log 2>&1
$ cat log | grep "vm_allocates follow" | wc
48 96 960
$ cat log | grep "vm_allocate$" | wc
3146952 3146952 37763424
Original comment by ramosian.glider@gmail.com
on 29 Dec 2011 at 9:45
For 32 bits there are only 1176 calls to vm_allocate() -- no surprise that
everything is OK there.
Original comment by ramosian.glider@gmail.com
on 29 Dec 2011 at 9:47
Loop perforation in action: we can easily speed this code up 4x by probing
every eighth page (that brings it down to 427,414 calls to vm_allocate, so it
is no longer the bottleneck):
Index: projects/compiler-rt/lib/asan/mach_override/mach_override.c
===================================================================
--- projects/compiler-rt/lib/asan/mach_override/mach_override.c (revision 147338)
+++ projects/compiler-rt/lib/asan/mach_override/mach_override.c (working copy)
@@ -451,16 +451,18 @@
int allocated = 0;
vm_map_t task_self = mach_task_self();
+ fprintf(stderr, "vm_allocates follow\n");
while( !err && !allocated && page != last ) {
err = vm_allocate( task_self, &page, pageSize, 0 );
+ fprintf(stderr, "vm_allocate\n");
if( err == err_none )
allocated = 1;
else if( err == KERN_NO_SPACE ) {
#if defined(__x86_64__)
- page -= pageSize;
+ page -= pageSize * 8;
#else
- page += pageSize;
+ page += pageSize * 8;
#endif
err = err_none;
=========================================
$ time ./t 2>/dev/null
real 0m2.129s
user 0m0.322s
sys 0m1.800s
Of course the fix should involve calling vm_allocate less often by grouping
several allocations together and/or caching the probe results for subsequent
calls to mach_override_ptr().
Original comment by ramosian.glider@gmail.com
on 29 Dec 2011 at 10:30
I've made ASan pre-allocate memory for mach_override_ptr using mmap, but it
still takes 1.3 seconds to run an empty program (versus 13 milliseconds on
32-bit Mac OS).
I've instrumented the code with profiling printfs and here's what I got:
sec: 1326380400, msec: 319867 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/asan_rtl.cc:394
sec: 1326380401, msec: 42812 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/asan_rtl.cc:510
-- that's __asan_init(), which takes 723 milliseconds to run (I've also seen
450 ms sometimes)
Some 560 ms are spent in InitializeAsanInterceptors():
sec: 1326380400, msec: 354748 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/asan_rtl.cc:451
sec: 1326380400, msec: 911536 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/asan_rtl.cc:456
That function calls mach_override_ptr 26 times, i.e. about 21 ms per call:
sec: 1326380400, msec: 366326 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/mach_override/mach_override.c:214
sec: 1326380400, msec: 483865 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/mach_override/mach_override.c:214
sec: 1326380400, msec: 507547 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/mach_override/mach_override.c:214
sec: 1326380400, msec: 531243 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/mach_override/mach_override.c:214
Each time some 12 ms are spent on something that looks like a COW in
atomic_mov64:
908 void atomic_mov64(
909         uint64_t *targetAddress,
910         uint64_t value )
911 {
912     PROFILE_TIME();
913     *targetAddress = value;
914     PROFILE_TIME();
915     *targetAddress = value;
916     PROFILE_TIME();
917 }
(I've inserted the second access to make sure it's faster than the first one)
sec: 1326380400, msec: 495752 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/mach_override/mach_override.c:913
sec: 1326380400, msec: 507510 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/mach_override/mach_override.c:915
sec: 1326380400, msec: 507542 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/mach_override/mach_override.c:917
Some other write accesses to the library code may also take up to 20 ms, as do
system calls like vm_protect() (the total depends on which library functions
are intercepted: further accesses to the same code pages may be faster).
It's still not evident why the empty program takes an additional 0.6 seconds
after __asan_init() has finished.
Dima suspects this may be caused by delayed effects of copying or caching.
Original comment by gli...@chromium.org
on 12 Jan 2012 at 3:14
Attached is the Shark profile for this program.
Most of the time is spent in vm_map_lookup_locked, which is invoked by
user_trap() (50.9%) and exit() (23.9%).
Original comment by ramosian.glider@gmail.com
on 13 Jan 2012 at 9:35
Okay, we have two problems here.
First, mach_override_ptr is slow because of the free memory lookups that do too
many vm_allocate calls. This used to take up to 8 seconds on our machine. My
solution is to externalize the branch island allocator so that it can pre-map
some memory and minimize the allocation cost. The draft implementation has sped
up an empty asan_test64 run to some 0.8 seconds.
Second, allocating the shadow memory bloats the virtual page table and slows
down the lookups and the shutdown process. For example, the following program:
==============
#include <sys/mman.h>
int main() {
  void *t = mmap(0, 0x00000fffffffffffUL, PROT_READ | PROT_WRITE,
                 MAP_ANON | MAP_PRIVATE | MAP_NORESERVE, -1, 0);
  (void)t;
  return 0;
}
==============
, which maps ~16 TB of address space, runs for 0.55 seconds on our machine
without AddressSanitizer.
Most of this time is spent in the virtual page table lookups on shutdown.
We do not yet know how to get rid of this lookup overhead (it is in fact
greater than measured here, because the lookups are also performed as the
program runs).
Mapping the shadow memory before mach_override_ptr() makes the performance
worse:
real 0m1.300s
user 0m0.012s
sys 0m1.277s
versus
real 0m0.842s
user 0m0.011s
sys 0m0.828s
if the shadow is mapped after mach_override_ptr() calls.
Original comment by ramosian.glider@gmail.com
on 13 Jan 2012 at 11:31
The last thing to mention is that my measurements of mach_override_ptr
performance were done with the shadow memory mapped at the beginning of
__asan_init, so they are a bit off. With my allocator patch, overriding the
functions takes only 1 millisecond:
sec: 1326454369, msec: 633272 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/asan_rtl.cc:414
sec: 1326454369, msec: 634223 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/asan_rtl.cc:416
vs. 8-9 seconds without it:
sec: 1326454567, msec: 742921 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/asan_rtl.cc:414
sec: 1326454576, msec: 246927 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/asan_rtl.cc:416
Original comment by ramosian.glider@gmail.com
on 13 Jan 2012 at 11:37
As of r148116, the whole asan_test64 takes finite time (18 minutes) to pass
(the 32-bit tests run for 1 minute).
Further possible speed improvements will require mapping less virtual memory
(e.g. using half the memory should roughly halve the shutdown time). This can
be accomplished in the following ways:
-- use a SEGV handler instead of pre-allocating all the shadow memory;
-- omit some of the shadow memory which is guaranteed not to be used by the tests;
-- use a greater shadow memory scale factor.
Original comment by ramosian.glider@gmail.com
on 13 Jan 2012 at 4:47
>> asan_test64 takes finite time (18 minutes)
Good!
>> -- use a SEGV handler instead of pre-allocating all the shadow memory;
>> -- omit some of the shadow memory which is guaranteed to be not used by the
tests;
>> -- use a greater shadow memory scale factor.
Any of the suggested solutions will end up testing something different from
what we ship to users.
Original comment by konstant...@gmail.com
on 13 Jan 2012 at 6:46
Okay, since this is a test-only problem, let's fix the tests.
I'll make the heavy death tests run in parallel -- hope that helps.
Original comment by ramosian.glider@gmail.com
on 15 Jan 2012 at 6:43
Looks like EXPECT_DEATH can't be called from multiple threads, because it
shares the |g_captured_stdout| and |g_captured_stderr| global variables.
Putting each EXPECT_DEATH call under a lock will effectively kill the
performance gain :(
Original comment by gli...@chromium.org
on 16 Jan 2012 at 11:47
I've also tried to use multiple processes to run death tests in parallel, but
it seems to slow down the execution even more.
Original comment by ramosian.glider@gmail.com
on 16 Jan 2012 at 12:46
If nothing else works, we can try this...
But we will have to make sure that at least some tests (e.g. output tests) run
in regular mode.
>> -- omit some of the shadow memory which is guaranteed to be not used by the
tests;
Original comment by konstant...@gmail.com
on 17 Jan 2012 at 7:30
Does asan on 64-bit Mac always have to run with ASLR off?
If yes, we can actually reduce the size of the shadow significantly.
I tried the patch below (not for commit!) and the 64-bit tests ran ~5x faster.
@@ -457,11 +458,22 @@
{
if (kLowShadowBeg != kLowShadowEnd) {
+ // 0x100000000000
+ // 0x11ffffffffff
// mmap the low shadow plus one page.
- ReserveShadowMemoryRange(kLowShadowBeg - kPageSize, kLowShadowEnd);
+ uintptr_t low_end = kLowShadowEnd;
+ if (1 && __WORDSIZE == 64) {
+ low_end = 0x101fffffffffULL;
+ }
+
+ ReserveShadowMemoryRange(kLowShadowBeg - kPageSize, low_end);
}
// mmap the high shadow.
- ReserveShadowMemoryRange(kHighShadowBeg, kHighShadowEnd);
+ uintptr_t high_shadow = kHighShadowBeg;
+ if (1 && __WORDSIZE == 64) {
+ high_shadow = 0x1f8000000000ULL;
+ }
+ ReserveShadowMemoryRange(high_shadow, kHighShadowEnd);
// protect the gap
void *prot = AsanMprotect(kShadowGapBeg, kShadowGapEnd - kShadowGapBeg + 1);
CHECK(prot == (void*)kShadowGapBeg);
Original comment by konstant...@gmail.com
on 31 Jan 2012 at 2:51
Yes, we strictly need ASLR off. Otherwise it's possible to have the code
segment overwritten.
Are you going to use this just for the tests?
Original comment by ramosian.glider@gmail.com
on 31 Jan 2012 at 8:09
I would highly prefer to have no difference between tests and non-tests.
Original comment by konstant...@gmail.com
on 31 Jan 2012 at 5:47
The solution in #c15 is actually risky.
The ideal situation (which we have now) is when all memory is one of
- legal application memory
- legal shadow memory
- forbidden memory (mapped with PROT_NONE)
#c15 violates this.
As I have just experimented, mmap with PROT_NONE is as expensive as mmap with
PROT_READ|PROT_WRITE.
Original comment by konstant...@gmail.com
on 24 Feb 2012 at 9:06
Yes, you're right. If any of the shadow memory pages is unmapped, the client
code may accidentally mmap it.
We can hardly prevent that: the only solution I can think of is to wrap mmap
and manage the virtual memory table ourselves, which will probably be slower
than doing it in the kernel.
Original comment by ramosian.glider@gmail.com
on 27 Feb 2012 at 8:11
Original comment by konstant...@gmail.com
on 22 May 2012 at 8:47
btw, http://openradar.appspot.com/radar?id=1634406
Original comment by konstant...@gmail.com
on 27 Jun 2012 at 7:02
The remaining performance issues are minor on 10.7 and 10.8, so reducing the
priority.
Original comment by ramosian.glider@gmail.com
on 29 Nov 2012 at 1:45
The current ASan startup/shutdown time on Mac OS 10.7 and later is ~0.3 seconds.
This is much worse than on Linux, but still tolerable.
I think we can close this issue.
Original comment by konstant...@gmail.com
on 18 Feb 2013 at 6:49
Original issue reported on code.google.com by
konstant...@gmail.com
on 28 Dec 2011 at 11:59