google / kernel-sanitizers

Linux Kernel Sanitizers, fast bug-detectors for the Linux kernel
https://google.github.io/kernel-sanitizers/

[kfence] Proper TLB flushing #57

Closed: ramosian-glider closed this issue 3 years ago

ramosian-glider commented 4 years ago

Every time we allocate/deallocate a KFENCE object, we need to protect/unprotect the corresponding page, so that a newly allocated object is accessible by every CPU, and accesses to a freed object trigger a page fault. Right now this is done with __flush_tlb_one_kernel(), which is incorrect, because it performs an INVLPG instruction, effectively invalidating a single TLB entry for the current CPU. This potentially means that other CPUs may still see stale TLB entries, which may lead to both false positives and false negatives. This function is also x86-specific.

A portable function that flushes a kernel memory range on every CPU is called flush_tlb_kernel_range(). Unfortunately on x86 it requires calling on_each_cpu(), which cannot be done with interrupts disabled, so we cannot use it.
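For context, a rough sketch of what the per-page toggle with a local-only flush looks like on x86 (the function name and the PTE handling here are only illustrative, not the exact <asm/kfence.h> code):

#include <linux/mm.h>           /* lookup_address(), set_pte() */
#include <asm/tlbflush.h>       /* __flush_tlb_one_kernel() */

/* Illustrative only: toggle _PAGE_PRESENT on a KFENCE page and invalidate
 * just the local CPU's TLB entry, which is what this issue is about. */
static bool kfence_toggle_page(unsigned long addr, bool protect)
{
	unsigned int level;
	pte_t *pte = lookup_address(addr, &level);

	if (!pte || level != PG_LEVEL_4K)
		return false;

	if (protect)
		set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT));
	else
		set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT));

	/* INVLPG: invalidates the entry on the current CPU only, so other
	 * CPUs may keep stale translations for this page. */
	__flush_tlb_one_kernel(addr);
	return true;
}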

ramosian-glider commented 4 years ago

@thejh's suggestions from an internal discussion:

On x86, perhaps you could do the TLB flush locally immediately and send an IPI to the other cores, but not wait for the results (so that this has a chance of working in a non-sleepable context). And on ARM64, you can use their architectural TLB flushing stuff to do a global flush immediately. But I think this is fundamentally going to require new arch-specific code; I think there isn't really an existing API for "I want to do a TLB flush relatively quickly, but I can't wait".

and

I think the optimal approach would be to always send an IPI that says "please flush this address", but not wait for the IPI to finish. (If someone on another core is racing with you, it doesn't really matter whether you wait after removing the PTE or not.)
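A hedged sketch of that idea follows (purely illustrative: the helper names are made up, and smp_call_function() itself must not be called with interrupts disabled, which is exactly why new arch-specific code would be needed):

#include <linux/smp.h>
#include <asm/tlbflush.h>

/* IPI callback: invalidate the given page on the receiving CPU. */
static void kfence_remote_flush(void *info)
{
	__flush_tlb_one_kernel((unsigned long)info);
}

static void kfence_flush_no_wait(unsigned long addr)
{
	/* Flush locally right away... */
	__flush_tlb_one_kernel(addr);

	/* ...then ask the other CPUs to flush as well, without waiting for
	 * them to finish (wait = 0). */
	smp_call_function(kfence_remote_flush, (void *)addr, 0);
}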

ramosian-glider commented 4 years ago

@dvyukov pointed out we care most about false positives, so we could still perform CPU-local flushes and do an extra TLB flush in kfence_handle_page_fault() to ensure a particular page is protected before reporting an error on it. This won't help with false negatives, i.e. cases where e.g. CPU1 accesses an object freed by CPU0.
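A sketch of what that extra check could look like in the fault handler (the handler signature is simplified, and kfence_page_is_protected() / kfence_report_error() stand in for the real metadata lookup and reporting code):

#include <asm/tlbflush.h>

bool is_kfence_address(const void *addr);		/* from the KFENCE patches */
bool kfence_page_is_protected(unsigned long addr);	/* hypothetical metadata lookup */
void kfence_report_error(unsigned long addr);		/* simplified reporting hook */

/* Before reporting, make sure this CPU is not acting on a stale TLB entry
 * for a page that another CPU has already unprotected (i.e. allocated). */
bool kfence_handle_page_fault(unsigned long addr)
{
	if (!is_kfence_address((void *)addr))
		return false;

	if (!kfence_page_is_protected(addr)) {
		/* False positive: flush the stale local entry and retry. */
		flush_tlb_one_kernel(addr);
		return true;
	}

	kfence_report_error(addr);	/* genuine OOB/UAF access */
	return true;
}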

ramosian-glider commented 4 years ago

Another possible solution is to maintain a shared array of protected/unprotected bits for every KFENCE page, which every CPU can poll for updates, doing local TLB flushes based on that information. This won't fix false negatives in 100% of cases, but it may guarantee that every state change becomes globally visible within a bounded time.

ramosian-glider commented 4 years ago

An interesting question is how fast page protection changes propagate without explicit global flushes. In theory, a context switch (at least between the userspace and the kernel space) flushes the TLB on the current CPU.

dvyukov commented 4 years ago

Another possible solution is to maintain a shared array of protected/unprotected bits for every KFENCE page, which every CPU can poll for updates, doing local TLB flushes based on that information. This won't fix false negatives in 100% of cases, but it may guarantee that every state change becomes globally visible within a bounded time.

This is an interesting way to deal with false negatives. Do we need an array of bits? It seems that a single epoch counter is enough: each free bumps the counter, each CPU has a local copy, and if the local copy != the global epoch, the CPU flushes its TLB and updates the local copy. A single counter check may be cheap enough to do in interrupts, which should probabilistically eliminate all false negatives.
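A minimal sketch of that epoch scheme, assuming made-up names and that the per-CPU check is hooked into some frequent path such as interrupt entry:

#include <linux/atomic.h>
#include <linux/percpu.h>
#include <asm/tlbflush.h>

/* Global epoch, bumped whenever a KFENCE page is protected on free. */
static atomic_t kfence_tlb_epoch = ATOMIC_INIT(0);
static DEFINE_PER_CPU(int, kfence_tlb_epoch_seen);

/* Call after protecting a freed object's page with a local-only flush. */
static void kfence_epoch_bump(void)
{
	atomic_inc(&kfence_tlb_epoch);
}

/* Call from a frequent per-CPU hook, e.g. interrupt entry. */
static void kfence_epoch_check(void)
{
	int epoch = atomic_read(&kfence_tlb_epoch);

	if (likely(this_cpu_read(kfence_tlb_epoch_seen) == epoch))
		return;

	__flush_tlb_all();	/* full local flush (x86 primitive) */
	this_cpu_write(kfence_tlb_epoch_seen, epoch);
}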

dvyukov commented 4 years ago

An interesting question is how fast page protection changes propagate without explicit global flushes.

My limited understanding of CPUs is that cache entries stay in the cache until they are evicted for some reason. If the workload is such that the DTLB entry is not evicted, then it can stay there for an arbitrary amount of time. However, due to Spectre/Meltdown the kernel may now do explicit flushes on switches between kernel and userspace and/or between different processes, if the corresponding configs are enabled.

ramosian-glider commented 4 years ago

My limited understanding of CPUs is that cache entries stay in the cache until they are evicted for some reason. If the workload is such that the DTLB entry is not evicted, then it can stay there for an arbitrary amount of time. However, due to Spectre/Meltdown the kernel may now do explicit flushes on switches between kernel and userspace and/or between different processes, if the corresponding configs are enabled.

There is some crazy mechanism called lazy TLB flushing, but I might be misunderstanding how it works. Perhaps it is only relevant to userspace mappings.

ramosian-glider commented 4 years ago

This is an interesting way to deal with false negatives. Do we need an array of bits? It seems that a single epoch counter is enough: each free bumps the counter, each CPU has a local copy, and if the local copy != the global epoch, the CPU flushes its TLB and updates the local copy. A single counter check may be cheap enough to do in interrupts, which should probabilistically eliminate all false negatives.

This is certainly possible, although adding checks to every interrupt will slow the whole kernel down even when KFENCE is off.

thejh commented 4 years ago

An interesting question is how fast page protection changes propagate without explicit global flushes. In theory, a context switch (at least between the userspace and the kernel space) flushes the TLB on the current CPU.

Are you sure? As far as I know, context switches between userspace and the kernel normally don't flush at all. Writing to CR3 used to flush the TLB, but nowadays even that TLB flush can often be optimized away thanks to the ASID optimization.

dvyukov commented 4 years ago

This is an interesting way to deal with false negatives. Do we need an array of bits? It seems that a single epoch counter is enough: each free bumps the counter, each CPU has a local copy, and if the local copy != the global epoch, the CPU flushes its TLB and updates the local copy. A single counter check may be cheap enough to do in interrupts, which should probabilistically eliminate all false negatives.

What rate of sampled allocations are we aiming at? I'm afraid that even with a relatively low rate we may deplete all pages with persistent objects. And if the rate is something like 1 per 10 minutes, then we can afford the IPIs? In the end, mmap()s in highly threaded processes do IPIs for this all the time.

Also, it would be good to understand whether such false negatives can really happen without being detected with reasonable probability. We already have an extremely low detection probability, so if for some obscure class of bugs the probability is 3x lower, okay, I am fine with that, especially if fixing it is non-trivial/costly. To stay undetected all the time, the UAF needs some quite special properties.

ramosian-glider commented 4 years ago

Are you sure? As far as I know, context switches between userspace and the kernel normally don't flush at all. Writing to CR3 used to flush the TLB, but nowadays even that TLB flush can often be optimized away thanks to the ASID optimization.

No, I just read about lazy TLB flushing yesterday, and I think I got it wrong.

ramosian-glider commented 4 years ago

Another possible option is to delay the call to flush_tlb_kernel_range() from the current (atomic) context to one where on_each_cpu() is possible.
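A sketch of that deferral, assuming a workqueue as the deferred context and an immediate local-only flush in the meantime (the single pending-address slot is only illustrative; a real version would batch):

#include <linux/mm.h>
#include <linux/workqueue.h>
#include <asm/tlbflush.h>

static unsigned long kfence_pending_flush_addr;	/* illustrative: one pending page */

static void kfence_deferred_flush(struct work_struct *work)
{
	/* Runs with interrupts enabled, so on_each_cpu() inside is allowed. */
	flush_tlb_kernel_range(kfence_pending_flush_addr,
			       kfence_pending_flush_addr + PAGE_SIZE);
}
static DECLARE_WORK(kfence_flush_work, kfence_deferred_flush);

static void kfence_change_protection(unsigned long addr)
{
	/* ... toggle _PAGE_PRESENT on the page as before ... */
	__flush_tlb_one_kernel(addr);		/* immediate, local-only */
	kfence_pending_flush_addr = addr;
	schedule_work(&kfence_flush_work);	/* global flush, eventually */
}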

melver commented 4 years ago

One possible compromise:

We should maintain statistics on the number of allocations that didn't do a proper TLB flush to assess how severe this is. Based on that information we can then refine the implementation. I wouldn't want to prematurely optimize this without data.
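For example, a counter along these lines (the name is made up; it could be exported through KFENCE's debugfs stats) would provide that data:

#include <linux/atomic.h>

/* Hypothetical counter: protection changes that only flushed the local TLB. */
static atomic_long_t kfence_local_only_flushes = ATOMIC_LONG_INIT(0);

/* Bump wherever the global flush had to be skipped. */
static inline void kfence_count_local_only_flush(void)
{
	atomic_long_inc(&kfence_local_only_flushes);
}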

melver commented 4 years ago

We should rely as much as possible on x86's <asm/set_memory.h>, as it does the right thing.

Using set_memory_np() does the "right thing" when protecting a page; however, it seems there is no counterpart to unprotect it. I have implemented such a function (set_memory_p()) in arch/x86/mm/pat/set_memory.c:

int set_memory_p(unsigned long addr, int numpages)
{
	return change_page_attr_set(&addr, numpages, __pgprot(_PAGE_PRESENT), 0);
}

Unfortunately, kfence-test seems to fail when switching to use the set_memory functions instead of set_pte+flush_tlb_one_kernel. (Why?)
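For reference, a sketch of what that switch might look like in the x86 arch hook (not the exact code that was tried; set_memory_p() is the new helper shown above):

#include <asm/set_memory.h>

/* Let the set_memory machinery change the PTE and do the TLB flushing. */
static inline bool kfence_protect_page(unsigned long addr, bool protect)
{
	int err;

	if (protect)
		err = set_memory_np(addr, 1);	/* clear _PAGE_PRESENT */
	else
		err = set_memory_p(addr, 1);	/* restore _PAGE_PRESENT */

	return !err;
}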

Given that our current <asm/kfence.h> for x86 passes all tests, improving on it is probably low priority.

melver commented 3 years ago

This was discussed on the LKML: https://lkml.kernel.org/r/CANpmjNOZtkFcyL8FTRTZ6j2yqCOb2Hgsy8eF8n5zgd7mDYezkw@mail.gmail.com