viralcode / address-sanitizer

Automatically exported from code.google.com/p/address-sanitizer

slow asan start-up on Mac 64-bit #24

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
On Mac 64-bit asan start-up is insanely slow. 
I suspect this is caused by mach_override, please investigate.

Original issue reported on code.google.com by konstant...@gmail.com on 28 Dec 2011 at 11:59

GoogleCodeExporter commented 9 years ago
I've compiled a small program that just returns 0.

$ time ./t

real    0m8.253s
user    0m1.065s
sys 0m7.165s

Then I've injected lines printing time(NULL) into __asan_init:

$ ./t
line: 649, time: 1325147635
line: 654, time: 1325147635
line: 658, time: 1325147635
line: 692, time: 1325147635
line: 700, time: 1325147635
line: 703, time: 1325147640
line: 706, time: 1325147640
line: 733, time: 1325147643
line: 738, time: 1325147643
line: 763, time: 1325147643
line: 772, time: 1325147643
line: 786, time: 1325147643
line: 797, time: 1325147643
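
The instrumentation itself isn't shown in the issue; a macro along these lines (a hypothetical reconstruction, not the actual patch) would produce output in that format:
==============
#include <stdio.h>
#include <time.h>

/* Print the current source line and time(NULL), matching the log above. */
#define PRINT_LINE_TIME() \
  fprintf(stderr, "line: %d, time: %ld\n", __LINE__, (long)time(NULL))
==============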

These results are quite rough, but it looks like ~5 seconds are spent in 
InitializeAsanInterceptors() (which calls INTERCEPT_FUNCTION many times) and 
another ~3 seconds in further calls to INTERCEPT_FUNCTION.

So, yes, the hypothesis about slow mach_override is correct.

I haven't got any valuable information from Shark yet. The top line (~18-20%) 
is usually ml_set_interrupts_enabled (which means many profiler ticks occurred 
while the program was in the kernel), and most of the other lines relate to 
kernel code, too.
The most interesting part is vm_allocate(), which is called a number of times 
for each interceptor -- this is most likely the culprit.

Original comment by ramosian.glider@gmail.com on 29 Dec 2011 at 8:48

GoogleCodeExporter commented 9 years ago
Another interesting experiment was to count the number of vm_allocate calls in 
mach_override_ptr().

================================
Index: projects/compiler-rt/lib/asan/mach_override/mach_override.c
===================================================================
--- projects/compiler-rt/lib/asan/mach_override/mach_override.c (revision 147308)
+++ projects/compiler-rt/lib/asan/mach_override/mach_override.c (working copy)
@@ -451,9 +451,11 @@
            int allocated = 0;
            vm_map_t task_self = mach_task_self();

+      fprintf(stderr, "vm_allocates follow\n");
            while( !err && !allocated && page != last ) {

                err = vm_allocate( task_self, &page, pageSize, 0 );
+        fprintf(stderr, "vm_allocate\n");
                if( err == err_none )
                    allocated = 1;
                else if( err == KERN_NO_SPACE ) {
================================

$ ./t > log 2>&1
$ cat log | grep "vm_allocates follow" | wc
      48      96     960
$ cat log | grep "vm_allocate$" | wc
 3146952 3146952 37763424

Original comment by ramosian.glider@gmail.com on 29 Dec 2011 at 9:45

GoogleCodeExporter commented 9 years ago
On 32 bits that's only 1176 calls to vm_allocate() -- no surprise everything 
is fine there.

Original comment by ramosian.glider@gmail.com on 29 Dec 2011 at 9:47

GoogleCodeExporter commented 9 years ago
Loop perforation in action: we can easily speed this code up by ~4x (that 
brings it down to 427414 calls to vm_allocate, so it is no longer the 
bottleneck):

Index: projects/compiler-rt/lib/asan/mach_override/mach_override.c
===================================================================
--- projects/compiler-rt/lib/asan/mach_override/mach_override.c (revision 147338)
+++ projects/compiler-rt/lib/asan/mach_override/mach_override.c (working copy)
@@ -451,16 +451,18 @@
            int allocated = 0;
            vm_map_t task_self = mach_task_self();

+      fprintf(stderr, "vm_allocates follow\n");
            while( !err && !allocated && page != last ) {

                err = vm_allocate( task_self, &page, pageSize, 0 );
+        fprintf(stderr, "vm_allocate\n");
                if( err == err_none )
                    allocated = 1;
                else if( err == KERN_NO_SPACE ) {
 #if defined(__x86_64__)
-                   page -= pageSize;
+                   page -= pageSize * 8;
 #else
-                   page += pageSize;
+                   page += pageSize * 8;
 #endif
                    err = err_none;

=========================================
$ time ./t 2>/dev/null

real    0m2.129s
user    0m0.322s
sys 0m1.800s

Of course, the real fix should call vm_allocate less often, by grouping 
several allocations together and/or caching the probe results for subsequent 
calls to mach_override_ptr().
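
A sketch of the caching idea (a hypothetical helper, not the committed fix; it assumes the same fixed-address vm_allocate probing protocol that mach_override uses):
==============
#include <mach/mach.h>

/* Remember where the last probe succeeded so that subsequent
 * mach_override_ptr() calls skip the range already known to be taken. */
static vm_address_t last_good_page = 0;

static kern_return_t allocate_island_page(vm_address_t *page,
                                          vm_size_t page_size,
                                          vm_address_t lowest) {
  vm_map_t task = mach_task_self();
  if (last_good_page)
    *page = last_good_page;                       /* resume below the last hit */
  kern_return_t err = KERN_NO_SPACE;
  while (err == KERN_NO_SPACE && *page > lowest) {
    err = vm_allocate(task, page, page_size, 0);  /* probe this exact address */
    if (err == KERN_NO_SPACE)
      *page -= page_size;                         /* occupied: step downwards */
  }
  if (err == KERN_SUCCESS)
    last_good_page = *page - page_size;           /* next probe starts lower */
  return err;
}
==============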

Original comment by ramosian.glider@gmail.com on 29 Dec 2011 at 10:30

GoogleCodeExporter commented 9 years ago
I've made ASan pre-allocate memory for mach_override_ptr using mmap, but it 
still takes 1.3 seconds to run an empty program (versus 13 milliseconds on 
32-bit Mac OS).

I've instrumented the code with profiling printfs and here's what I got:

sec: 1326380400, msec: 319867 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/asan_rtl.cc:394
sec: 1326380401, msec: 42812 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/asan_rtl.cc:510

-- that's __asan_init(), which takes 723 milliseconds to run (though I've also 
seen 450 ms sometimes).

Some 560 ms are spent in InitializeAsanInterceptors():
sec: 1326380400, msec: 354748 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/asan_rtl.cc:451
sec: 1326380400, msec: 911536 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/asan_rtl.cc:456
which calls mach_override_ptr 26 times, i.e. roughly 21 ms per call:

sec: 1326380400, msec: 366326 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/mach_override/mach_override.c:214
sec: 1326380400, msec: 483865 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/mach_override/mach_override.c:214
sec: 1326380400, msec: 507547 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/mach_override/mach_override.c:214
sec: 1326380400, msec: 531243 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/mach_override/mach_override.c:214

Each time, some 12 ms are spent on something that looks like a copy-on-write 
fault in atomic_mov64:

908 void atomic_mov64(
909     uint64_t *targetAddress,
910     uint64_t value )
911 {   
912   PROFILE_TIME();
913     *targetAddress = value;
914   PROFILE_TIME();
915     *targetAddress = value;
916   PROFILE_TIME();
917 } 
(I've inserted the second access to make sure it's faster than the first one)

sec: 1326380400, msec: 495752 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/mach_override/mach_override.c:913
sec: 1326380400, msec: 507510 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/mach_override/mach_override.c:915
sec: 1326380400, msec: 507542 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/mach_override/mach_override.c:917

Some other write accesses to the library code may also take up to 20 ms, as do 
system calls like vm_protect() (the total result depends on which library 
functions are intercepted: further accesses to the same code pages may be 
faster).

It's still not evident why the empty program takes an additional 0.6 seconds 
after __asan_init() has finished.
Dima suspects this may be caused by delayed effects of copying or caching.

Original comment by gli...@chromium.org on 12 Jan 2012 at 3:14

GoogleCodeExporter commented 9 years ago
Attached is the Shark profile for this program.
Most of the time is spent in vm_map_lookup_locked, which is invoked by 
user_trap() (50.9%) and exit() (23.9%).

Original comment by ramosian.glider@gmail.com on 13 Jan 2012 at 9:35

GoogleCodeExporter commented 9 years ago
Okay, we have two problems here.

First, mach_override_ptr is slow because its free-memory probing makes far too 
many vm_allocate calls. This used to take up to 8 seconds on our machine. My 
solution is to externalize the branch island allocator so that it can pre-map 
some memory and minimize the allocation cost. The draft implementation has sped 
up an empty asan_test64 run to some 0.8 seconds.
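
A sketch of what "externalizing" the allocator might look like (names and sizes are illustrative, not the actual ASan code; the real allocator must also keep each island within branch range of the overridden function, which is omitted here):
==============
#include <stddef.h>
#include <sys/mman.h>

static char *island_pool = NULL;            /* pre-mapped region */
static size_t island_used = 0;
static const size_t kPoolSize = 0x100000;   /* 1 MB of islands */
static const size_t kIslandSize = 0x1000;   /* one page per island */

void *allocate_branch_island(void) {
  if (!island_pool) {
    /* One mmap instead of thousands of vm_allocate probes. */
    island_pool = mmap(NULL, kPoolSize, PROT_READ | PROT_WRITE | PROT_EXEC,
                       MAP_ANON | MAP_PRIVATE, -1, 0);
    if (island_pool == MAP_FAILED) {
      island_pool = NULL;
      return NULL;
    }
  }
  if (island_used + kIslandSize > kPoolSize)
    return NULL;                            /* pool exhausted */
  void *island = island_pool + island_used;
  island_used += kIslandSize;
  return island;
}
==============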

Second, allocating the shadow memory bloats the virtual page table and slows 
down the lookups and the shutdown process. For example, the following program:
==============
#include <sys/mman.h>
int main() {
  // Map ~16 TB of address space without ever touching it.
  void *t = mmap(0, 0x00000fffffffffffUL, PROT_READ | PROT_WRITE,
                 MAP_ANON | MAP_PRIVATE | MAP_NORESERVE, -1, 0);
  return t == MAP_FAILED;
}
==============
This program, which maps ~16 TB of address space, runs for 0.55 seconds on our 
machine without AddressSanitizer.
Most of this time is spent in the virtual page table lookups on shutdown.
We do not know how to get rid of this lookup overhead right now (it is in fact 
greater, because the lookups are also performed as the program runs).

Mapping the shadow memory before mach_override_ptr() makes the performance 
worse:

real    0m1.300s
user    0m0.012s
sys 0m1.277s

versus 

real    0m0.842s
user    0m0.011s
sys 0m0.828s

if the shadow is mapped after mach_override_ptr() calls.

Original comment by ramosian.glider@gmail.com on 13 Jan 2012 at 11:31

GoogleCodeExporter commented 9 years ago
The last thing to mention: my measurements of mach_override_ptr performance 
were done with the shadow memory mapped at the beginning of __asan_init, so 
they are off a bit. With my allocator patch, overriding the functions takes 
only 1 millisecond:

sec: 1326454369, msec: 633272 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/asan_rtl.cc:414
sec: 1326454369, msec: 634223 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/asan_rtl.cc:416

vs. 8-9 seconds without it:
sec: 1326454567, msec: 742921 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/asan_rtl.cc:414
sec: 1326454576, msec: 246927 at /Users/glider/src/asan/asan-llvm-trunk/llvm/projects/compiler-rt/lib/asan/asan_rtl.cc:416

Original comment by ramosian.glider@gmail.com on 13 Jan 2012 at 11:37

GoogleCodeExporter commented 9 years ago
As of r148116, the whole asan_test64 takes finite time (18 minutes) to pass 
(the 32-bit tests run for 1 minute).

Further speed improvements will require mapping less virtual memory 
(e.g. mapping half as much memory should make the shutdown roughly twice as 
fast). This can be accomplished in the following ways:

 -- use a SEGV handler instead of pre-allocating all the shadow memory (see the sketch after this list);
 -- omit some of the shadow memory that is guaranteed not to be used by the tests;
 -- use a greater shadow memory scale factor.
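
A minimal sketch of the first option, assuming hypothetical kShadowBeg/kShadowEnd bounds and page size (placeholders for the real constants in asan_mapping.h); note that Mac OS may deliver the fault as SIGBUS rather than SIGSEGV:
==============
#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical shadow range and page size -- illustrative placeholders. */
static const uintptr_t kShadowBeg = 0x100000000000ULL;
static const uintptr_t kShadowEnd = 0x1fffffffffffULL;
static const size_t kPageSize = 4096;

static void ShadowFaultHandler(int sig, siginfo_t *info, void *ctx) {
  (void)sig; (void)ctx;
  uintptr_t page = (uintptr_t)info->si_addr & ~(uintptr_t)(kPageSize - 1);
  if (page < kShadowBeg || page > kShadowEnd)
    _exit(1);  /* a genuine crash, not a first touch of a shadow page */
  /* Materialize just the faulting shadow page; the kernel then restarts
   * the faulting instruction. */
  if (mmap((void *)page, kPageSize, PROT_READ | PROT_WRITE,
           MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0) == MAP_FAILED)
    _exit(1);
}

static void InstallLazyShadow(void) {
  struct sigaction sa;
  sigemptyset(&sa.sa_mask);
  sa.sa_sigaction = ShadowFaultHandler;
  sa.sa_flags = SA_SIGINFO;
  sigaction(SIGSEGV, &sa, NULL);
  sigaction(SIGBUS, &sa, NULL);  /* Mac may report this fault as SIGBUS */
}
==============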

Original comment by ramosian.glider@gmail.com on 13 Jan 2012 at 4:47

GoogleCodeExporter commented 9 years ago
>> asan_test64 takes finite time (18 minutes)
Good! 

>> -- use a SEGV handler instead of pre-allocating all the shadow memory;
>> -- omit some of the shadow memory that is guaranteed not to be used by the tests;
>> -- use a greater shadow memory scale factor.

Any of the suggested solutions will end up testing something different from 
what we ship to users. 

Original comment by konstant...@gmail.com on 13 Jan 2012 at 6:46

GoogleCodeExporter commented 9 years ago
Okay, since this is a test-only problem, let's fix the tests.
I'll make the heavy death tests run in parallel -- hope that helps.

Original comment by ramosian.glider@gmail.com on 15 Jan 2012 at 6:43

GoogleCodeExporter commented 9 years ago
Looks like EXPECT_DEATH can't be called from multiple threads, because it 
shares the |g_captured_stdout| and |g_captured_stderr| global variables. 
Putting each EXPECT_DEATH call under a lock would effectively kill the 
performance gain :(

Original comment by gli...@chromium.org on 16 Jan 2012 at 11:47

GoogleCodeExporter commented 9 years ago
I've also tried to use multiple processes to run death tests in parallel, but 
it seems to slow down the execution even more.

Original comment by ramosian.glider@gmail.com on 16 Jan 2012 at 12:46

GoogleCodeExporter commented 9 years ago
If nothing else works, we can try this... 
But we will have to make sure that at least some tests (e.g. output tests) run 
in regular mode.
>> -- omit some of the shadow memory that is guaranteed not to be used by the tests;

Original comment by konstant...@gmail.com on 17 Jan 2012 at 7:30

GoogleCodeExporter commented 9 years ago
Does asan on 64-bit Mac always have to run with ASLR off? 

If yes, we can actually reduce the size of the shadow significantly. 
I tried the patch below (not for commit!) and the 64-bit tests ran ~5x faster. 

@@ -457,11 +458,22 @@

   {
     if (kLowShadowBeg != kLowShadowEnd) {
+      // 0x100000000000
+      // 0x11ffffffffff
       // mmap the low shadow plus one page.
-      ReserveShadowMemoryRange(kLowShadowBeg - kPageSize, kLowShadowEnd);
+      uintptr_t low_end = kLowShadowEnd;
+      if (1 && __WORDSIZE == 64) {
+        low_end = 0x101fffffffffULL;
+      }
+
+      ReserveShadowMemoryRange(kLowShadowBeg - kPageSize, low_end);
     }
     // mmap the high shadow.
-    ReserveShadowMemoryRange(kHighShadowBeg, kHighShadowEnd);
+    uintptr_t high_shadow = kHighShadowBeg;
+    if (1 && __WORDSIZE == 64) {
+      high_shadow = 0x1f8000000000ULL;
+    }
+    ReserveShadowMemoryRange(high_shadow, kHighShadowEnd);
     // protect the gap
     void *prot = AsanMprotect(kShadowGapBeg, kShadowGapEnd - kShadowGapBeg + 1);
     CHECK(prot == (void*)kShadowGapBeg);

Original comment by konstant...@gmail.com on 31 Jan 2012 at 2:51

GoogleCodeExporter commented 9 years ago
Yes, we strictly need ASLR off. Otherwise the code segment could end up 
overwritten.
Are you going to use this just for the tests?

Original comment by ramosian.glider@gmail.com on 31 Jan 2012 at 8:09

GoogleCodeExporter commented 9 years ago
I would highly prefer to have no difference between tests and non-tests. 

Original comment by konstant...@gmail.com on 31 Jan 2012 at 5:47

GoogleCodeExporter commented 9 years ago
The solution in #c15 is actually risky. 
The ideal situation (which we have now) is when all memory is one of 
 - legal application memory 
 - legal shadow memory
 - forbidden memory (mapped with PROT_NONE)
#c15 violates this. 

I just experimented: mmap with PROT_NONE is as expensive as mmap with 
PROT_READ|PROT_WRITE.
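
A rough way to check this (a hypothetical micro-benchmark, not the author's; timing one mmap/munmap pair is only a proxy for the page-table cost):
==============
#include <stdio.h>
#include <sys/mman.h>
#include <sys/time.h>

/* Time one huge mmap/munmap pair with the given protection flags. */
static double MapSeconds(int prot) {
  struct timeval start, end;
  gettimeofday(&start, NULL);
  void *p = mmap(0, 1ULL << 40, prot,
                 MAP_ANON | MAP_PRIVATE | MAP_NORESERVE, -1, 0);
  if (p != MAP_FAILED)
    munmap(p, 1ULL << 40);
  gettimeofday(&end, NULL);
  return (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) / 1e6;
}

int main() {
  printf("PROT_NONE:            %.3f s\n", MapSeconds(PROT_NONE));
  printf("PROT_READ|PROT_WRITE: %.3f s\n", MapSeconds(PROT_READ | PROT_WRITE));
  return 0;
}
==============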

Original comment by konstant...@gmail.com on 24 Feb 2012 at 9:06

GoogleCodeExporter commented 9 years ago
Yes, you're right. If any of the shadow memory pages is unmapped, client code 
may accidentally mmap it.
We can hardly prevent that: the only solution I can think of is to wrap mmap 
and manage the virtual memory table ourselves, which would probably be slower 
than doing it in the kernel.
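
For what it's worth, the wrapping idea would look roughly like this (a sketch only: real_mmap stands for the intercepted original, its resolution is omitted, and the shadow bounds are placeholders):
==============
#include <errno.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>

/* Placeholder shadow bounds; the real constants live in asan_mapping.h. */
static const uintptr_t kShadowBeg = 0x100000000000ULL;
static const uintptr_t kShadowEnd = 0x1fffffffffffULL;

/* The intercepted original mmap (interception machinery not shown). */
extern void *real_mmap(void *addr, size_t len, int prot, int flags,
                       int fd, off_t off);

void *wrapped_mmap(void *addr, size_t len, int prot, int flags,
                   int fd, off_t off) {
  uintptr_t a = (uintptr_t)addr;
  /* Refuse fixed mappings that would overlap the shadow region. */
  if ((flags & MAP_FIXED) && a <= kShadowEnd && a + len > kShadowBeg) {
    errno = ENOMEM;
    return MAP_FAILED;
  }
  return real_mmap(addr, len, prot, flags, fd, off);
}
==============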

Original comment by ramosian.glider@gmail.com on 27 Feb 2012 at 8:11

GoogleCodeExporter commented 9 years ago

Original comment by konstant...@gmail.com on 22 May 2012 at 8:47

GoogleCodeExporter commented 9 years ago
btw, http://openradar.appspot.com/radar?id=1634406

Original comment by konstant...@gmail.com on 27 Jun 2012 at 7:02

GoogleCodeExporter commented 9 years ago
The remaining performance issues are minor on 10.7 and 10.8, so I'm reducing 
the priority.

Original comment by ramosian.glider@gmail.com on 29 Nov 2012 at 1:45

GoogleCodeExporter commented 9 years ago
Current asan startup/shutdown time on Mac > 10.7 is ~0.3 seconds.
This is much worse than on Linux, but still tolerable. 
I think we can close this issue.

Original comment by konstant...@gmail.com on 18 Feb 2013 at 6:49