google / sanitizers

AddressSanitizer, ThreadSanitizer, MemorySanitizer

AFL Fork Performance #939

Open jonwilson030981 opened 6 years ago

jonwilson030981 commented 6 years ago

Let me start by saying that address sanitiser is great. It is easy to use; I have used it a few times now and found some good results. Apologies for the long message, but I thought it might be helpful to explain my thought process in case I have missed something.

I am now using address sanitiser in combination with AFL to analyse a component, but I am seeing relatively modest performance. The component in question presents a number of challenges for fuzzing: it is quite complex, and its initialisation and teardown time is significant (on the order of a few seconds). Whilst I have been able to identify one of the key areas where state is modified by the test cases, and to save and restore this between tests, there is still other state scattered throughout the component which impacts path stability.

Using the AFL fork server helps this situation, as it lets me bypass the initialisation and teardown by forking the process just before the test case is executed. I have also tried configuring the number of test cases to run consecutively before AFL forks the original process once again, but as the number of consecutive test cases increases, stability decreases rapidly. Whilst forking is much quicker than the initialisation and teardown would be, it is still slower than I would like (tens of executions per second), and I can see that my fuzzing process is spending most of its time in the kernel rather than in user mode executing the code under test.
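For reference, the harness shape I am describing is roughly the following (a sketch built with afl-clang-fast so that the deferred-forkserver and persistent-mode macros exist; `expensive_component_init` and `process_one_input` are stand-ins for my real initialisation and entry point):

```c
/* Sketch of the harness shape described above.  Built with afl-clang-fast so
 * that the deferred-forkserver (__AFL_INIT) and persistent-mode (__AFL_LOOP)
 * macros exist; expensive_component_init() and process_one_input() are
 * stand-ins for the real initialisation and entry point. */
#include <stdint.h>
#include <unistd.h>

extern void expensive_component_init(void);                    /* placeholder */
extern void process_one_input(const uint8_t *buf, size_t len); /* placeholder */

int main(void) {
    expensive_component_init();   /* slow set-up runs once, before the fork point */

#ifdef __AFL_HAVE_MANUAL_CONTROL
    __AFL_INIT();                 /* deferred forkserver: children are forked here */
#endif

    static uint8_t buf[1 << 20];

    /* Persistent mode: AFL re-enters this loop up to 1000 times before forking
     * a fresh child; raising that count is where the stability trade-off shows. */
    while (__AFL_LOOP(1000)) {
        ssize_t len = read(0, buf, sizeof(buf));
        if (len > 0)
            process_one_input(buf, (size_t)len);
    }
    return 0;
}
```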

I am interested to understand the impact of the need to copy the page tables that describe the large virtual address space (around 20TB) reserved by address sanitiser for its shadow memory. This is necessary because, after the fork, the pages must be marked read-only so that a write causes a page fault which can be trapped and the memory content duplicated (copy-on-write). There are a few workarounds for this problem. I could compile for 32-bit so that the virtual address space is smaller, but the component has quite a large memory footprint, and I have had issues in the past with code not behaving the same in 32-bit and 64-bit due to pointer sizes etc. Alternatively, I could run the component without address sanitiser to perform the path discovery, use this to build a corpus, and then replay that corpus later with address sanitiser enabled to find any defects; but I would prefer just to set it up and let it run whilst I move on to other work. I could also just throw more compute at the problem, but that seems like giving up.

Whilst designs that are more efficient with virtual address space would be possible, they would increase the complexity of detecting errors, and so any performance gains from not forking a large address space may well be lost to the increased cost of performing checks on the majority of memory accesses. No doubt the current design has been chosen as a good fit for the majority of use cases. I am considering whether I could further optimise the way in which the virtual address space is allocated for my particular use case.

From reading the source code, I believe that address sanitiser does not explicitly specify the page sizes it uses when calling mmap to reserve the shadow memory. However, it seems a fair amount of thought has gone into the layout, so I would be interested in understanding the design. I have not read the kernel source yet (I am running on a stock Ubuntu 16.04), so I am not sure exactly how the page tables will be constructed to describe this range when the MAP_HUGETLB flag is not passed, particularly as the shadow ranges are not 2MB or 1GB aligned. I have also yet to find a good means of inspecting the page tables of a running process. If the kernel were to use 4k pages to describe these ranges, then the physical memory used to describe them would be quite large.
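Short of reading the page tables themselves, one approximation is the per-mapping Rss and AnonHugePages fields in /proc/self/smaps. A sketch (the 1 TiB MAP_NORESERVE mapping below merely stands in for a large shadow reservation):

```c
/* Sketch: short of reading the page tables, /proc/self/smaps gives a
 * per-mapping view (Rss, AnonHugePages) of how a region is actually backed.
 * The 1 TiB MAP_NORESERVE mapping here merely stands in for a large shadow
 * reservation. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 1ULL << 40;                       /* 1 TiB reservation */
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    memset(p, 1, 64UL << 20);                      /* touch 64 MiB of it */

    FILE *f = fopen("/proc/self/smaps", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[256];
    unsigned long start, end;
    int in_target = 0;
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "%lx-%lx", &start, &end) == 2)
            in_target = (start == (unsigned long)p);    /* new mapping header */
        else if (in_target && (strncmp(line, "Rss:", 4) == 0 ||
                               strncmp(line, "AnonHugePages:", 14) == 0))
            fputs(line, stdout);                        /* resident + THP-backed size */
    }
    fclose(f);
    return 0;
}
```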

I understand that the kernel also implements Transparent Huge Pages. This mechanism means that larger ranges of memory may be allocated using 2MB pages, or that virtually consecutive 4k pages with the same permissions may subsequently be collapsed into 2MB pages by the khugepaged daemon. In its default configuration, however, it does not appear to use 1GB pages. I am interested in whether the page table entries describing the shadow space used by address sanitiser will be optimal in terms of using the maximum page size. Obviously, when bytes in the shadow space are touched, physical pages will need to be allocated to back that memory. It would then likely be necessary either to split the page tables describing that virtual address range into smaller mappings, or to allocate a large amount of physical memory to back it.
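For what it's worth, THP eligibility can at least be hinted per region with madvise(MADV_HUGEPAGE). A sketch, independent of whatever address sanitiser itself does:

```c
/* Sketch: hint a region for Transparent Huge Pages with madvise(MADV_HUGEPAGE)
 * (a hint, not a guarantee), independent of whatever address sanitiser itself
 * does.  With the "madvise" THP policy, only hinted regions are eligible for
 * 2MB pages; THP does not use 1GB pages. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    size_t len = 1ULL << 30;                       /* 1 GiB region */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    if (madvise(p, len, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");

    memset(p, 1, 64UL << 20);                      /* fault in 64 MiB */

    /* AnonHugePages for this mapping in /proc/<pid>/smaps shows how much of
     * the touched range actually landed on 2MB pages. */
    printf("region at %p, pid %d -- inspect /proc/%d/smaps\n",
           p, (int)getpid(), (int)getpid());
    pause();                                       /* keep the process alive */
    return 0;
}
```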

I would be grateful if you have any insight into the above or any suggestions of how I could confirm or tackle the problem.

kcc commented 6 years ago

We haven't studied the effect of asan on the page tables too much -- what we have seen is that the 20TB shadow costs literally nothing on Linux, and I am surprised that you see a cost there. If you are confident this costs so much, I'd like to see a minimal reproducer.

Meanwhile, two orthogonal questions:

If your program is clean enough to run multiple instances of the fuzz target with a single setup/teardown, this might be all you need.

jonwilson030981 commented 6 years ago

Thanks for getting back to me. I have tried a 32-bit build (running on 64-bit Linux) with similar levels of performance. I have also tried a simple test program under AFL which does not use address sanitizer but similarly mmaps a large region; its performance seems to be good. So perhaps (given the reduced address space in the 32-bit case) forking the shadow memory is not the issue after all. I will carry on investigating, simplifying my setup towards a minimal reproducer until something changes.
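For anyone wanting to reproduce that check stand-alone, something along these lines should do (a sketch that times fork()+waitpid with and without a large MAP_NORESERVE reservation standing in for the shadow):

```c
/* Sketch: time fork()+waitpid() in a loop, with and without a ~20TB
 * MAP_NORESERVE reservation standing in for the ASan shadow, to see whether
 * the reservation alone makes forking measurably slower. */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static double time_forks(int n) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == 0) _exit(0);                    /* child does nothing */
        waitpid(pid, NULL, 0);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    int n = 1000;
    printf("baseline:         %d forks in %.3fs\n", n, time_forks(n));

    void *p = mmap(NULL, 20ULL << 40, PROT_NONE,   /* ~20 TB reservation */
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    printf("with reservation: %d forks in %.3fs\n", n, time_forks(n));
    return 0;
}
```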

It seems odd, however, that my testing shows the poor performance to be down to kernel time. My code under test doesn't carry out any syscalls (it is just processing data), so I can only assume it is down to page faults (perhaps as a side effect of some instrumentation).
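One cheap way to confirm the page-fault theory is to bracket a batch of work with getrusage() and compare user vs system time and the fault counters (a sketch; do_work() is a stand-in for running some test cases):

```c
/* Sketch: bracket a batch of test cases with getrusage() to split user vs.
 * system time and count page faults.  do_work() is a stand-in for running the
 * code under test a few hundred times. */
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

extern void do_work(void);                         /* placeholder */

static double secs(struct timeval tv) { return tv.tv_sec + tv.tv_usec / 1e6; }

int main(void) {
    struct rusage a, b;
    getrusage(RUSAGE_SELF, &a);
    do_work();
    getrusage(RUSAGE_SELF, &b);

    printf("user %.3fs  sys %.3fs  minor faults %ld  major faults %ld\n",
           secs(b.ru_utime) - secs(a.ru_utime),
           secs(b.ru_stime) - secs(a.ru_stime),
           b.ru_minflt - a.ru_minflt,
           b.ru_majflt - a.ru_majflt);
    return 0;
}
```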

I have tried AFL persistent mode, setting the fork to occur only after 1000 iterations, which greatly improves performance (thousands of executions per second). But this seemingly comes at the expense of stability: even with the fork set to every 2 iterations, I am seeing only 50% stability. Even when I remove the calls to the code under test, the stability doesn't seem to improve. This is something else I am keen to investigate.

LibFuzzer gives similar levels of performance to AFL in persistent mode, but I am concerned about whether it too is affected by the stability issue (and what the cause of this is). The coverage output, however, seems to indicate that it is coping fairly well.
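(For completeness, the libFuzzer target is just the usual single entry point, roughly as below; process_one_input is the same stand-in entry point as in the AFL sketch above.)

```c
/* Sketch of the libFuzzer target used for the comparison; built with
 * clang -g -O1 -fsanitize=fuzzer,address.  process_one_input() is the same
 * stand-in entry point as in the AFL harness above. */
#include <stddef.h>
#include <stdint.h>

extern void process_one_input(const uint8_t *buf, size_t len); /* placeholder */

int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    process_one_input(data, size);
    return 0;                                      /* always return 0 here */
}
```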

I can only assume that I must have something mis-configured. I will carry on trying to figure it out and let you know if I find the culprit.

a1xndr commented 4 years ago

Hello Jon,

Have you had any luck finding a remedy for this issue? I am similarly fuzzing a target with a large initialization time. Since there is no straightforward way for me to ensure that application state doesn't leak between fuzz runs, I am relying on a deferred fork server. With ASAN there is a ~20x slowdown; although I expect some overhead from normal ASAN functionality, without forking it is much less pronounced. This is considering that the non-ASAN program already has a relatively large 100-200MB RSS.

I have similarly looked at huge pages, without any luck. I have also considered using userfaultfd/write-protected pages to reproduce the COW properties of fork() within the original process, but I'm not sure how this would play with ASAN's shadow memory. By the way, a 32-bit build provided a significant performance increase for me (2-3x).
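A very rough sketch of that write-protect idea, using mprotect() and a SIGSEGV handler rather than userfaultfd write-protection; whether it coexists with ASAN's own shadow and SEGV handling is exactly the open question:

```c
/* Rough sketch of an in-process copy-on-write snapshot using mprotect() and a
 * SIGSEGV handler (a simpler stand-in for the userfaultfd write-protect idea).
 * The component's mutable state is assumed to live in the `state` region; how
 * this interacts with ASan's shadow and its own SEGV handling is untested. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define STATE_SIZE (16UL << 20)          /* 16 MiB of mutable component state  */
#define MAX_DIRTY  4096                  /* sketch: assume this is enough      */

static uint8_t *state, *snapshot;        /* live state and its pristine copy   */
static uint8_t *dirty[MAX_DIRTY];
static int n_dirty;
static long page;

static void on_segv(int sig, siginfo_t *si, void *ctx) {
    uint8_t *p = (uint8_t *)((uintptr_t)si->si_addr & ~(uintptr_t)(page - 1));
    if (p < state || p >= state + STATE_SIZE)
        _exit(128 + sig);                /* a genuine crash, not our protection */
    mprotect(p, page, PROT_READ | PROT_WRITE);      /* let the write proceed     */
    if (n_dirty < MAX_DIRTY) dirty[n_dirty++] = p;  /* remember to restore later */
    (void)ctx;
}

void snapshot_init(void) {
    page     = sysconf(_SC_PAGESIZE);
    state    = mmap(NULL, STATE_SIZE, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    snapshot = mmap(NULL, STATE_SIZE, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa = {0};
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
}

void snapshot_take(void) {               /* call once, after initialisation    */
    memcpy(snapshot, state, STATE_SIZE);
    mprotect(state, STATE_SIZE, PROT_READ);
}

void snapshot_restore(void) {            /* call between fuzz iterations       */
    for (int i = 0; i < n_dirty; i++) {
        memcpy(dirty[i], snapshot + (dirty[i] - state), page);
        mprotect(dirty[i], page, PROT_READ);
    }
    n_dirty = 0;
}
```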