google / kernel-sanitizers

Linux Kernel Sanitizers, fast bug-detectors for the Linux kernel
https://google.github.io/kernel-sanitizers/

[kfence] Alternative designs for consideration #65

Closed melver closed 4 years ago

melver commented 4 years ago

One of the primary drivers behind KFence's design is to avoid any overhead in the SL[AU]B allocator fast-paths.

The current design therefore "steals" freelists, which causes the allocator to enter the slow-path, where we then force a KFence allocation.

Initially this was fairly simple. However, it is becoming more complex, since there may be unforeseen issues due to the various per-CPU caches and per-cache freelists involved.

This makes it hard to predict when we actually allocate from KFence, because some caches and freelists (e.g. those of one CPU more than another) may be more heavily utilized than others.

Therefore, we should investigate whether there are designs that provide similar performance characteristics while being simpler.

[Below we'll add one reply per potential other design (edit each reply to add new details).]

melver commented 4 years ago

Using tracepoints. CONFIG_TRACEPOINTS=y is the default on all kernels of interest.

There are various allocator tracepoints that provide the address of the allocation, the size, and flags.

We could use a timer to set up tracepoint callbacks; when a callback fires, it enters KFence, disables the tracepoint again, and performs a KFence allocation.

The current unknown is how to turn the allocation into a guarded one, since tracepoints cannot change the return value of the functions of interest.
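A minimal sketch of that timer-driven toggle, assuming the v5.7-era prototype of the kmalloc tracepoint; kfence_mark_candidate() is a hypothetical hook, and, per the open question above, the probe can only observe the allocation rather than redirect it:

#include <linux/atomic.h>
#include <linux/gfp.h>
#include <linux/init.h>
#include <linux/jiffies.h>
#include <linux/timer.h>
#include <trace/events/kmem.h>

/* Hypothetical hook: "do something useful with the sampled allocation". */
void kfence_mark_candidate(const void *ptr, size_t size, gfp_t flags);

static atomic_t kfence_sample_armed;
static struct timer_list kfence_timer;

static void probe_kmalloc(void *data, unsigned long call_site,
                          const void *ptr, size_t bytes_req,
                          size_t bytes_alloc, gfp_t gfp_flags)
{
        /* Unregistering a tracepoint probe can sleep, so instead of attaching
         * and detaching it every period, keep it attached and gate it with an
         * atomic flag that the timer arms once per sample interval. */
        if (!atomic_cmpxchg(&kfence_sample_armed, 1, 0))
                return;

        /* We can observe the allocation here, but cannot change the pointer
         * returned to the caller -- the open question mentioned above. */
        kfence_mark_candidate(ptr, bytes_req, gfp_flags);
}

static void kfence_timer_fn(struct timer_list *t)
{
        atomic_set(&kfence_sample_armed, 1);    /* arm one sample */
        mod_timer(&kfence_timer, jiffies + msecs_to_jiffies(100));
}

static int __init kfence_tracepoint_init(void)
{
        register_trace_kmalloc(probe_kmalloc, NULL);
        timer_setup(&kfence_timer, kfence_timer_fn, 0);
        mod_timer(&kfence_timer, jiffies + msecs_to_jiffies(100));
        return 0;
}
late_initcall(kfence_tracepoint_init);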

dvyukov commented 4 years ago

There may be some other branches that could at least relieve us from dealing with CPU caches, e.g. this one for SLAB: https://elixir.bootlin.com/linux/v5.7-rc4/source/mm/slab.c#L3237. But it is unclear whether SLUB has such a branch.

ramosian-glider commented 4 years ago

> There may be some other branches that could at least relieve us from dealing with CPU caches, e.g. this one for SLAB: https://elixir.bootlin.com/linux/v5.7-rc4/source/mm/slab.c#L3237

IIUC triggering this branch also requires tracking every cache in the kernel.

dvyukov commented 4 years ago

> > There may be some other branches that could at least relieve us from dealing with CPU caches, e.g. this one for SLAB: https://elixir.bootlin.com/linux/v5.7-rc4/source/mm/slab.c#L3237
>
> IIUC triggering this branch also requires tracking every cache in the kernel.

Yes. I mean only CPU caches here. CPU caches are one of the things mentioned as motivation for this issue. If we get rid of the "CPU caches" part, then we need to track all root slabs (memcg slabs are accessible from the root slabs).

ramosian-glider commented 4 years ago

Hijacking a single cache for an extended period of time (suggested by Marco)

Instead of picking a random cache every time, we can increase the probability of allocating a KFENCE object from a single memory cache by only pushing a single object to the freelist on the slow path (or simply not restoring the freelist in guarded_alloc() for several iterations). This may let us get rid of the heartbeat thread, and will probably make it easier to reason about the sampling probabilities.

dvyukov commented 4 years ago

It would be useful to assess whether we fall onto the slow path for every freelist at least once in a while. Because if we do, we can just go with it: if we are on the slow path, do a guarded allocation with some probability, but don't pick a single freelist, don't force anything asynchronously, etc. We could combine this with populating only a fraction of the freelist on refill. Say, we allocate a new page and split it into 100 new objects, but instead of pushing all 100 onto the freelist, we push only 42 and keep the remaining 58 somewhere on the side until the next refill. This should add more randomness and increase the rate of falling onto the slow path.
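A toy sketch of the "partial refill" part of this idea; struct toy_cache, its fields, and partial_refill() are invented for illustration and are not SLUB code:

#include <linux/random.h>

/* A fresh page has been split into n objects (objs[]); link only a random
 * fraction of them into the freelist and park the rest, so that allocations
 * hit the slow path more often. */
struct toy_cache {
        void *freelist;         /* objects available to the fast path */
        void *parked;           /* objects withheld until the next refill */
};

static void push(void **list, void *obj)
{
        *(void **)obj = *list;  /* the next pointer lives inside the object */
        *list = obj;
}

static void partial_refill(struct toy_cache *c, void **objs, unsigned int n)
{
        /* Push somewhere between n/2 and n objects, e.g. 42 out of 100. */
        unsigned int keep = n / 2 + prandom_u32_max(n / 2 + 1);
        unsigned int i;

        for (i = 0; i < keep; i++)
                push(&c->freelist, objs[i]);
        for (; i < n; i++)
                push(&c->parked, objs[i]);      /* released on the next refill */
}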

ramosian-glider commented 4 years ago

Another idea from Kostya Serebryany is to use the entropy of objects returned by SLUB:

void *alloc(long size) {
  void *res = actual_malloc(size);
  /* Piggyback on the randomness of the returned address: take the sampled
     path only if the selected bits of the pointer happen to be all zero. */
  if (((long)res & 0xfff0000) == 0)
    return sampled_malloc(size, res);
  return res;
}

Here, we allocate memory as usual, and afterwards jump to the sampling code if some of the bits of the allocated pointer are zero (i.e. we piggyback on the randomness of the allocated pointers). The check compiles down to something like:

   9:   a9 00 00 ff 0f          test   $0xfff0000,%eax
   e:   74 02                   je     12 <alloc+0x12>

Now the check is still two extra instructions, but there are no memory accesses (and on x86_64 the instructions are shorter).

The downsides of this approach are:

  1. semi-random bit distribution: freelists are populated with objects that have identical upper bits. It's hard to pick a bit pattern which works for both 8-byte and 512-byte objects, but a hash function depending on all the bits might work;
  2. when a single object is allocated and freed too often, it will either always be sampled or always be ignored by the sampling check.

ramosian-glider commented 4 years ago

A potential improvement of this idea is to use the upper bits of the freelist pointers to store sampling counters. When populating the freelist, store truly random numbers in the upper bits and take the sampling path if, e.g., the number is zero. This is still prone to the situation where a single object is allocated and deallocated many times; for that case we can decrement the counter value on every free, so that it hits zero at some point.
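A rough sketch of the bit-packing this would involve; the names and the 16-bit layout are assumptions, not the actual prototype. It relies on x86-64 kernel pointers being sign-extensions of bit 47, so the top 16 bits can be recomputed when the real pointer is needed:

#include <linux/random.h>
#include <linux/types.h>

#define COUNTER_SHIFT   48
#define COUNTER_MASK    (0xffffUL << COUNTER_SHIFT)

/* Pack a counter into the otherwise-redundant top 16 bits of a freelist
 * pointer.  The encoded value must not be dereferenced directly. */
static inline void *encode_counter(void *ptr, u16 counter)
{
        unsigned long p = (unsigned long)ptr & ~COUNTER_MASK;

        return (void *)(p | ((unsigned long)counter << COUNTER_SHIFT));
}

/* Recover a dereferenceable pointer by sign-extending bit 47 again. */
static inline void *decode_pointer(void *encoded)
{
        return (void *)(((long)encoded << 16) >> 16);
}

static inline u16 decode_counter(void *encoded)
{
        return (unsigned long)encoded >> COUNTER_SHIFT;
}

/* When populating the freelist: store a truly random counter. */
static inline void *init_counter(void *ptr)
{
        return encode_counter(ptr, prandom_u32() & 0xffff);
}

/* Allocation path: take the sampled (guarded) path iff the counter is 0. */
static inline bool should_sample(void *encoded)
{
        return decode_counter(encoded) == 0;
}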

ramosian-glider commented 4 years ago

> Another idea from Kostya Serebryany is to use the entropy of objects returned by SLUB:

I've been trying to come up with a hash function that maps pointers to numbers uniformly distributed between 0 and 255, but failed. According to my observations, all pointers have bits 10-13 and 17-20 set in roughly 50% of cases, but simply taking those bits led to very high variance. Another problem is that fixing a range won't let us easily change the sample parameter on the fly.

dvyukov commented 4 years ago

> Another problem is that fixing a range won't let us easily change the sample parameter on the fly.

I think this "looking at the bits of the pointer" should only trigger the slow path, but not necessarily sampling. 1/256 is an extremely high sampling rate. Once we are on the slow path, we apply an additional level of sampling. That second level of sampling can be changed dynamically. We need the first level only to amortize the cost of the real sampling. The real sampling can be as cheap as reading and decrementing a per-CPU counter, so the fast path needs to be faster than that to make sense.
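A sketch of what that second level could look like (names invented, not KFENCE code), with a per-CPU countdown providing the real, runtime-tunable sample rate:

#include <linux/percpu.h>
#include <linux/random.h>

/* Level 1 (not shown): a cheap bit test on the returned pointer in the
 * allocation fast path diverts roughly 1/256 of allocations here.
 * Level 2 (below): a per-CPU countdown implements the actual sample rate,
 * which can be changed at runtime without touching the fast path. */
static DEFINE_PER_CPU(long, kfence_countdown);
static long kfence_sample_interval = 100;       /* runtime-tunable */

static bool kfence_should_sample(void)
{
        if (this_cpu_dec_return(kfence_countdown) > 0)
                return false;

        /* Reset the countdown with a little jitter so CPUs do not sync up. */
        this_cpu_write(kfence_countdown,
                       kfence_sample_interval +
                       prandom_u32_max(kfence_sample_interval / 4 + 1));
        return true;
}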

dvyukov commented 4 years ago

> I've been trying to come up with a hash function that maps pointers to numbers uniformly distributed between 0 and 255

Does this include "When populating the freelist, store truly random numbers in the upper bits"? If we don't mix something into the pointer value, I don't see how this is better than checking whether the low 12 bits are all 0s.

ramosian-glider commented 4 years ago

> Does this include "When populating the freelist, store truly random numbers in the upper bits"?

No, it does not. This is a bit more complicated, as we cannot dereference these mangled pointers. I'm trying that as well, as it will potentially give us more flexibility.

ramosian-glider commented 4 years ago

I've implemented a prototype in which the top two bytes of every SLUB freelist item are ignored. Thus it's now possible to store data in them and use that data in slab_alloc_node() without an extra load. But the question remains: how do we initialize these two bytes?

For a newly populated freelist this can be done in ___slab_alloc() (i.e. in the allocation slow path). The top bytes must be re-initialized every time an object is put back onto the freelist, so something needs to be done in do_slab_free() as well. Moreover, to mitigate the situation in which a single object is allocated/freed multiple times, we need to assign a different value to the same pointer every time it is freed. As we are trying to avoid an extra load on the slab_alloc_node() fast path, I don't think adding a per-CPU counter to do_slab_free() is acceptable.
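On the allocation side, a sketch of how those two bytes could be consumed without an extra load, reusing the encode/decode helpers from the earlier sketch; pop_and_maybe_sample() and guarded_alloc_for() are invented names, and the empty-freelist case is ignored:

/* Hypothetical fast-path pop: the freelist head has to be loaded anyway, so
 * the sampling decision can be taken from its top two bytes for free. */
void *guarded_alloc_for(void *obj);             /* hypothetical KFENCE path */

static inline void *pop_and_maybe_sample(void **freelist)
{
        void *encoded = *freelist;              /* the load we already do */
        void *obj = decode_pointer(encoded);    /* strip the top two bytes */

        /* The next pointer stored inside the object is itself encoded, so
         * writing it back keeps the freelist invariant intact. */
        *freelist = *(void **)obj;

        if (decode_counter(encoded) == 0)       /* "random number is zero" */
                return guarded_alloc_for(obj);
        return obj;
}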

melver commented 4 years ago

Using static keys (jump labels). Prototype: https://github.com/google/kasan/pull/91
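For reference, a minimal sketch of the mechanism (the actual prototype is in the PR linked above; kfence_alloc_guarded() is a placeholder name): while the key is disabled, the check in the allocator fast path is a patched-out jump, so there is no overhead between samples.

#include <linux/jump_label.h>
#include <linux/slab.h>

/* Hypothetical guarded allocation path. */
void *kfence_alloc_guarded(struct kmem_cache *s, size_t size, gfp_t flags);

DEFINE_STATIC_KEY_FALSE(kfence_allocation_key);

/* Hook called from the allocator fast path.  While the key is disabled the
 * branch is patched out; a timer or workqueue enables the key once per
 * sample interval. */
static __always_inline void *kfence_alloc(struct kmem_cache *s, size_t size,
                                          gfp_t flags)
{
        if (static_branch_unlikely(&kfence_allocation_key))
                return kfence_alloc_guarded(s, size, flags);
        return NULL;    /* NULL: caller falls through to the normal path */
}

/* Toggling, e.g. from a delayed work item (static key updates can sleep). */
static void kfence_toggle_allocation(bool enable)
{
        if (enable)
                static_branch_enable(&kfence_allocation_key);
        else
                static_branch_disable(&kfence_allocation_key);
}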

ramosian-glider commented 4 years ago

Shall we close this bug now?

melver commented 4 years ago

Done.