I've compiled a small program that just returns 0.
$ time ./t
real 0m8.253s
user 0m1.065s
sys 0m7.165s
Then I've injected lines printing time(NULL) into __asan_init:
$ ./t
line: 649, time: 1325147635
line: 654, time: 1325147635
line: 658, time: 1325147635
line: 692, time: 1325147635
line: 700, time: 1325147635
line: 703, time: 1325147640
line: 706, time: 1325147640
line: 733, time: 1325147643
line: 738, time: 1325147643
line: 763, time: 1325147643
line: 772, time: 1325147643
line: 786, time: 1325147643
line: 797, time: 1325147643
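The injected lines were simple timestamp printfs; an illustrative sketch (not the exact code) that would produce the records above:
==============
/* Requires <stdio.h> and <time.h>; dropped at various points inside
   __asan_init() to bracket the slow spots. */
fprintf(stderr, "line: %d, time: %ld\n", __LINE__, (long)time(NULL));
==============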
These results are quite rough, but it looks like ~5 seconds are spent in InitializeAsanInterceptors()
(which calls INTERCEPT_FUNCTION many times) and another 3 seconds in subsequent calls
to INTERCEPT_FUNCTION.
So, yes, the hypothesis about slow mach_override is correct.
I haven't got any valuable information from Shark yet. The top line (~18-20%) is usually
ml_set_interrupts_enabled (which means many profiler ticks occurred while the program
was in the kernel); most of the other lines relate to kernel code, too.
The most interesting part is vm_allocate(), which is called a number of times for each
interceptor -- this is most likely the culprit.
Reported by ramosian.glider
on 2011-12-29 08:48:21
Started
Another interesting experiment was to count the number of vm_allocate calls in mach_override_ptr().
================================
Index: projects/compiler-rt/lib/asan/mach_override/mach_override.c
===================================================================
--- projects/compiler-rt/lib/asan/mach_override/mach_override.c (revision 147308)
+++ projects/compiler-rt/lib/asan/mach_override/mach_override.c (working copy)
@@ -451,9 +451,11 @@
     int allocated = 0;
     vm_map_t task_self = mach_task_self();
+    fprintf(stderr, "vm_allocates follow\n");
     while( !err && !allocated && page != last ) {
         err = vm_allocate( task_self, &page, pageSize, 0 );
+        fprintf(stderr, "vm_allocate\n");
         if( err == err_none )
             allocated = 1;
         else if( err == KERN_NO_SPACE ) {
================================
$ ./t > log 2>&1
$ cat log | grep "vm_allocates follow" | wc
48 96 960
$ cat log | grep "vm_allocate$" | wc
3146952 3146952 37763424
Reported by ramosian.glider
on 2011-12-29 09:45:07
For 32 bits that's only 1176 calls to vm_allocate() -- no surprise that everything is fine there.
Reported by ramosian.glider
on 2011-12-29 09:47:14
Loop perforation in action: we can easily speed this code up 4x (which leaves 427414 calls
to vm_allocate, so it is no longer the bottleneck):
Index: projects/compiler-rt/lib/asan/mach_override/mach_override.c
===================================================================
--- projects/compiler-rt/lib/asan/mach_override/mach_override.c (revision 147338)
+++ projects/compiler-rt/lib/asan/mach_override/mach_override.c (working copy)
@@ -451,16 +451,18 @@
     int allocated = 0;
     vm_map_t task_self = mach_task_self();
+    fprintf(stderr, "vm_allocates follow\n");
     while( !err && !allocated && page != last ) {
         err = vm_allocate( task_self, &page, pageSize, 0 );
+        fprintf(stderr, "vm_allocate\n");
         if( err == err_none )
             allocated = 1;
         else if( err == KERN_NO_SPACE ) {
 #if defined(__x86_64__)
-            page -= pageSize;
+            page -= pageSize * 8;
 #else
-            page += pageSize;
+            page += pageSize * 8;
 #endif
             err = err_none;
=========================================
$ time ./t 2>/dev/null
real 0m2.129s
user 0m0.322s
sys 0m1.800s
Of course the fix should involve calling vm_allocate less often by grouping several
allocations together and/or caching the probe results for subsequent calls to mach_override_ptr().
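A minimal sketch of the grouping idea, with illustrative names (island_pool_alloc and
kIslandSize are not from the actual code), and ignoring the constraint that branch
islands must stay within jump range of the overridden function:
==============
#include <mach/mach.h>
#include <unistd.h>

enum { kIslandSize = 64, kPoolPages = 32 };

static vm_address_t pool_end = 0, pool_cursor = 0;

/* Hand out island-sized chunks from one big region: a single
   vm_allocate serves many islands instead of one probe loop each. */
static void *island_pool_alloc(void) {
  if (pool_cursor + kIslandSize > pool_end) {
    vm_address_t region = 0;
    vm_size_t size = kPoolPages * (vm_size_t)getpagesize();
    if (vm_allocate(mach_task_self(), &region, size,
                    VM_FLAGS_ANYWHERE) != KERN_SUCCESS)
      return NULL;
    pool_cursor = region;
    pool_end = region + size;
  }
  void *result = (void *)pool_cursor;
  pool_cursor += kIslandSize;
  return result;
}
==============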
Reported by ramosian.glider
on 2011-12-29 10:30:46
I've made ASan pre-allocate memory for mach_override_ptr using mmap, but it still takes
1.3 seconds to run an empty program (versus 13 milliseconds on 32-bit Mac OS).
I've instrumented the code with profiling printfs and here's what I got:
sec: 1326380400, msec: 319867 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/asan_rtl.cc:394
sec: 1326380401, msec: 42812 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/asan_rtl.cc:510
-- that's __asan_init(), which takes 723 milliseconds to run (I've sometimes also seen
450 ms).
Some 560 ms are spent in InitializeAsanInterceptors():
sec: 1326380400, msec: 354748 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/asan_rtl.cc:451
sec: 1326380400, msec: 911536 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/asan_rtl.cc:456
, which calls mach_override_ptr 26 times, i.e. about 21 ms per call:
sec: 1326380400, msec: 366326 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/mach_override/mach_override.c:214
sec: 1326380400, msec: 483865 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/mach_override/mach_override.c:214
sec: 1326380400, msec: 507547 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/mach_override/mach_override.c:214
sec: 1326380400, msec: 531243 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/mach_override/mach_override.c:214
Each time, some 12 ms are spent on something that looks like a copy-on-write (COW) fault in atomic_mov64:
908 void atomic_mov64(
909 uint64_t *targetAddress,
910 uint64_t value )
911 {
912 PROFILE_TIME();
913 *targetAddress = value;
914 PROFILE_TIME();
915 *targetAddress = value;
916 PROFILE_TIME();
917 }
(I've inserted the second store to check that it is faster than the first one, i.e. that the cost is a one-time fault.)
sec: 1326380400, msec: 495752 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/mach_override/mach_override.c:913
sec: 1326380400, msec: 507510 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/mach_override/mach_override.c:915
sec: 1326380400, msec: 507542 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/mach_override/mach_override.c:917
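For reference, PROFILE_TIME() is presumably just a timestamp printf; a hypothetical
reconstruction matching the output format (note that the "msec" field actually carries
tv_usec, i.e. microseconds):
==============
#include <stdio.h>
#include <sys/time.h>

/* Prints a "sec: S, msec: U at file:line" record at the point of use. */
#define PROFILE_TIME() do { \
    struct timeval tv; \
    gettimeofday(&tv, NULL); \
    fprintf(stderr, "sec: %ld, msec: %ld at %s:%d\n", \
            (long)tv.tv_sec, (long)tv.tv_usec, __FILE__, __LINE__); \
  } while (0)
==============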
Some other write accesses to the library code may also take up to 20 ms, as do system
calls like vm_protect(). (The total result depends on which library functions are intercepted:
further accesses to the same code pages may be faster.)
It's still not evident why the empty program takes an additional 0.6 seconds after __asan_init()
has finished.
Dima suspects that this may be caused by delayed effects of copying or caching.
Reported by glider@chromium.org
on 2012-01-12 15:14:49
Attached is the Shark profile for this program.
Most of the time is spent in vm_map_lookup_locked, which is invoked by user_trap()
(50.9%) and exit() (23.9%).
Reported by ramosian.glider
on 2012-01-13 09:35:01
Okay, we have two problems here.
First, mach_override_ptr is slow because its free-memory lookups make too many
vm_allocate calls. This used to take up to 8 seconds on our machine. My solution is
to externalize the branch island allocator so that it can pre-map some memory and minimize
the allocation cost. The draft implementation has sped up an empty asan_test64 run
to about 0.8 seconds.
Second, allocating the shadow memory bloats the virtual page table, which slows down
both the lookups and the shutdown process. For example, the following program:
==============
#include <sys/mman.h>

int main() {
  // Map a ~16 TB anonymous region, comparable to ASan's 64-bit shadow.
  void *t = mmap(0, 0x00000fffffffffffUL, PROT_READ | PROT_WRITE,
                 MAP_ANON | MAP_PRIVATE | MAP_NORESERVE, -1, 0);
  (void)t;  // The mapping itself is the point; the result is unused.
  return 0;
}
==============
, which only performs the mmap, runs for 0.55 seconds on our machine without
AddressSanitizer. Most of this time is spent in virtual page table lookups on shutdown.
We do not know how to get rid of this lookup overhead right now (it is in fact greater,
because the lookups are also performed as the program runs).
Mapping the shadow memory before mach_override_ptr() makes the performance worse:
real 0m1.300s
user 0m0.012s
sys 0m1.277s
versus
real 0m0.842s
user 0m0.011s
sys 0m0.828s
if the shadow is mapped after mach_override_ptr() calls.
Reported by ramosian.glider
on 2012-01-13 11:31:20
The last thing to mention: my measurements of mach_override_ptr performance were
done with the shadow memory mapped at the beginning of __asan_init, so they
are a bit off. With my allocator patch, overriding the functions takes only 1 millisecond:
sec: 1326454369, msec: 633272 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/asan_rtl.cc:414
sec: 1326454369, msec: 634223 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/asan_rtl.cc:416
vs. 8-9 seconds without it:
sec: 1326454567, msec: 742921 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/asan_rtl.cc:414
sec: 1326454576, msec: 246927 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/asan_rtl.cc:416
Reported by ramosian.glider
on 2012-01-13 11:37:03
As of r148116, the whole asan_test64 takes finite time (18 minutes) to pass (the 32-bit
tests run for 1 minute).
Further possible speed improvements will require mapping less virtual memory
(e.g. using half as much memory should roughly halve the shutdown time). This can be
accomplished in the following ways:
-- use a SEGV handler instead of pre-allocating all the shadow memory (see the sketch
after this list);
-- omit some of the shadow memory which is guaranteed not to be used by the tests;
-- use a greater shadow memory scale factor.
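A minimal sketch of the SEGV-handler approach (hypothetical shadow bounds; a real
handler would need to tell genuine application crashes apart more carefully, and on
Mac OS some protection faults arrive as SIGBUS):
==============
#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical bounds; the real ones come from the shadow layout. */
static const uintptr_t kShadowBeg = 0x100000000000ULL;
static const uintptr_t kShadowEnd = 0x1fffffffffffULL;

static void shadow_fault(int sig, siginfo_t *info, void *ctx) {
  (void)sig; (void)ctx;
  uintptr_t addr = (uintptr_t)info->si_addr;
  if (addr < kShadowBeg || addr > kShadowEnd)
    _exit(1);  /* A genuine fault in application code. */
  uintptr_t page_size = (uintptr_t)getpagesize();
  void *page = (void *)(addr & ~(page_size - 1));
  /* Map the faulting shadow page on first touch instead of up front. */
  if (mmap(page, page_size, PROT_READ | PROT_WRITE,
           MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0) == MAP_FAILED)
    _exit(1);
}

static void install_lazy_shadow(void) {
  struct sigaction sa;
  sa.sa_sigaction = shadow_fault;
  sa.sa_flags = SA_SIGINFO;
  sigemptyset(&sa.sa_mask);
  sigaction(SIGSEGV, &sa, NULL);
  sigaction(SIGBUS, &sa, NULL);
}
==============
The obvious catch is that the shadow range is then not reserved, so a client mmap may
land inside it (see the discussion of #c15 below).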
Reported by ramosian.glider
on 2012-01-13 16:47:23
>> asan_test64 takes finite time (18 minutes)
Good!
>> -- use a SEGV handler instead of pre-allocating all the shadow memory;
>> -- omit some of the shadow memory which is guaranteed to be not used by the tests;
>> -- use a greater shadow memory scale factor.
Any of the suggested solutions will end up testing something different from what we
ship to users.
Reported by konstantin.s.serebryany
on 2012-01-13 18:46:44
Okay, since this is a test-only problem, let's fix the tests.
I'll make the heavy death tests run in parallel -- hope that helps.
Reported by ramosian.glider
on 2012-01-15 18:43:14
Looks like EXPECT_DEATH can't be called from multiple threads, because it shares the
|g_captured_stdout| and |g_captured_stderr| global variables. Putting each EXPECT_DEATH
call under a lock will effectively kill the performance gain :(
Reported by glider@chromium.org
on 2012-01-16 11:47:35
I've also tried to use multiple processes to run death tests in parallel, but it seems
to slow down the execution even more.
Reported by ramosian.glider
on 2012-01-16 12:46:42
>> -- omit some of the shadow memory which is guaranteed to be not used by the tests;
If nothing else works, we can try this...
But we will have to make sure that at least some tests (e.g. output tests) run in regular
mode.
Reported by konstantin.s.serebryany
on 2012-01-17 19:30:51
Does asan on 64-bit Mac always have to run with ASLR off?
If yes, we can actually reduce the size of the shadow significantly.
I tried the patch below (not for commit!) and the 64-bit tests ran ~5x faster.
@@ -457,11 +458,22 @@
 {
   if (kLowShadowBeg != kLowShadowEnd) {
+    // 0x100000000000
+    // 0x11ffffffffff
     // mmap the low shadow plus one page.
-    ReserveShadowMemoryRange(kLowShadowBeg - kPageSize, kLowShadowEnd);
+    uintptr_t low_end = kLowShadowEnd;
+    if (1 && __WORDSIZE == 64) {
+      low_end = 0x101fffffffffULL;
+    }
+
+    ReserveShadowMemoryRange(kLowShadowBeg - kPageSize, low_end);
   }
   // mmap the high shadow.
-  ReserveShadowMemoryRange(kHighShadowBeg, kHighShadowEnd);
+  uintptr_t high_shadow = kHighShadowBeg;
+  if (1 && __WORDSIZE == 64) {
+    high_shadow = 0x1f8000000000ULL;
+  }
+  ReserveShadowMemoryRange(high_shadow, kHighShadowEnd);
   // protect the gap
   void *prot = AsanMprotect(kShadowGapBeg, kShadowGapEnd - kShadowGapBeg + 1);
   CHECK(prot == (void*)kShadowGapBeg);
Reported by konstantin.s.serebryany
on 2012-01-31 02:51:07
Yes, we strictly need ASLR off. Otherwise it's possible to have the code segment overwritten.
Are you going to use this just for the tests?
Reported by ramosian.glider
on 2012-01-31 08:09:48
I would highly prefer to have no difference between tests and non-tests.
Reported by konstantin.s.serebryany
on 2012-01-31 17:47:12
The solution in #c15 is actually risky.
The ideal situation (which we have now) is when all memory is one of:
- legal application memory
- legal shadow memory
- forbidden memory (mapped with PROT_NONE)
#c15 violates this.
As I just experimented, mmap with PROT_NONE is as expensive as mmap with PROT_READ|PROT_WRITE.
Reported by konstantin.s.serebryany
on 2012-02-24 21:06:36
Yes, you're right. If any of the shadow memory pages is left unmapped, client code may
accidentally mmap it.
We can hardly prevent that: the only solution I can think of is to wrap mmap and manage
the virtual memory table ourselves, which would probably be slower than doing it in
the kernel.
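For illustration, a sketch of such a wrapper (hypothetical shadow bounds; a real
version would hook mmap through the same interception machinery as the other ASan
interceptors rather than a plain wrapper function):
==============
#include <errno.h>
#include <stdint.h>
#include <sys/mman.h>

static const uintptr_t kShadowBeg = 0x100000000000ULL;  /* hypothetical */
static const uintptr_t kShadowEnd = 0x1fffffffffffULL;

/* Refuse or redirect client mappings that would land in the shadow. */
void *wrapped_mmap(void *addr, size_t len, int prot, int flags,
                   int fd, off_t offset) {
  uintptr_t beg = (uintptr_t)addr;
  if (addr && beg <= kShadowEnd && beg + len > kShadowBeg) {
    if (flags & MAP_FIXED) {  /* Cannot move a fixed request: fail it. */
      errno = ENOMEM;
      return MAP_FAILED;
    }
    addr = NULL;  /* Drop the hint and let the kernel pick elsewhere. */
  }
  return mmap(addr, len, prot, flags, fd, offset);
}
==============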
Reported by ramosian.glider
on 2012-02-27 08:11:49
btw, http://openradar.appspot.com/radar?id=1634406
Reported by konstantin.s.serebryany
on 2012-06-27 07:02:13
The remaining performance issues are minor on 10.7 and 10.8, so I'm reducing the priority.
Reported by ramosian.glider
on 2012-11-29 13:45:12
The current asan startup/shutdown time on Mac OS X > 10.7 is ~0.3 seconds.
This is much worse than on Linux, but still tolerable.
I think we can close this issue.
Reported by konstantin.s.serebryany
on 2013-02-18 06:49:57
Fixed
Adding Project:AddressSanitizer as part of GitHub migration.
Reported by ramosian.glider
on 2015-07-30 09:12:58
Originally reported on Google Code with ID 24
Reported by konstantin.s.serebryany
on 2011-12-28 23:59:31