google / kernel-sanitizers

Linux Kernel Sanitizers, fast bug-detectors for the Linux kernel
https://google.github.io/kernel-sanitizers/
437 stars 87 forks source link

[kfence] memcg support #59

Closed ramosian-glider closed 4 years ago

ramosian-glider commented 4 years ago

Right now KFENCE does not support memcg, which (when enabled for a particular process) essentially routes every kernel allocation done via a particular slub cache to a memcg version of that cache (https://elixir.bootlin.com/linux/v5.7-rc1/source/mm/slab.h#L572):

static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
                             gfp_t flags)
{
...
    if (memcg_kmem_enabled() &&
        ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT)))
        return memcg_kmem_get_cache(s);

    return s;
}

Because those memcg cache versions aren't registered with KFENCE, we are unable to steal freelists from them, so we cannot replace memcg allocations with our allocations.

ramosian-glider commented 4 years ago

@dvyukov's proposal:

we could also try to (1) choose a root cache, (2) choose if we want to steal from the root cache, or from one of its memcg subcaches, (3) if we decided to steal from memcg subcache, pick a random one from cachep->memcg_params.memcg_caches->entries and steal from it instead

This gives every memcg cache an equal chance to be picked, although for certain caches allocations may never happen. This also does not require changing mm/memcontrol.c

What I don't like in this idea is that memcg caches may come and go at any time, mm/memcontrol.c uses RCU to keep them, and caches are sometimes created lazily. To be able to access every possible memcg cache, we need something similar to https://elixir.bootlin.com/linux/v5.7-rc1/source/mm/memcontrol.c#L2826, but with current being unavailable.

ramosian-glider commented 4 years ago

My proposal:

What if we just always steal the root cache freelist, and if an allocation goes through memcg, then we check whether the root cache is stolen?

I.e. we can modify memcg_kmem_get_cache(cachep) to check whether cachep->freelist is stolen by KFENCE, and if it is, return a cache that performs a KFENCE allocation. This approach is substantially easier to implement (no need to track memcg caches in KFENCE at all), but it treats every tree of memcg caches as a single cache, which may change the probabilities for individual memcg caches.

On the other hand, memcg seems to be underused for kernel allocations (https://twitter.com/dvyukov/status/1073627872366088193), so this will probably increase the number of successful allocations done via KFENCE.

melver commented 4 years ago

My proposal:

What if we just always steal the root cache freelist, and if an allocation goes through memcg, then we check whether the root cache is stolen?

I.e. we can modify memcg_kmem_get_cache(cachep) to check whether cachep->freelist is stolen by KFENCE, and if it is, return a cache that performs a KFENCE allocation. This approach is substantially easier to implement (no need to track memcg caches in KFENCE at all), but it treats every tree of memcg caches as a single cache, which may change the probabilities for individual memcg caches.

I think, in the interest of simplicity, that we should try this first. Assuming there is no serious performance impact.

ramosian-glider commented 4 years ago

I think, in the interest of simplicity, that we should try this first. Assuming there is no serious performance impact.

Just figured out we cannot simply replace cachep with kfence_slab_cache in memcg_kmem_get_cache(), because we need to somehow pass the allocation size to kfence_alloc_and_fix_freelist().

We could return cachep itself instead, as its freelist is also NULL and will get us onto the slow path. However there's no guarantee other allocation from cachep won't race with the current one, making the freelist non-NULL.

ramosian-glider commented 4 years ago

Overall, in certain cases kfence_alloc_and_fix_freelist() may fail for different reasons (we ran out of pages, or KFENCE got disabled). In the normal case, when we are allocating from the root cache, we'll just proceed with slab allocation. However for memcg we won't be able to do so if we replaced the memcg cache we found with something else.

What I'm thinking of right now is:

We may actually need to keep track of both root and memcg caches, because cache teardown hook does not know anything about memcg caches.

melver commented 4 years ago

Right, I think that will work. We just need to lazily steal the freelist from the memcg cache.

ramosian-glider commented 4 years ago

Now that we use static keys to instrument slab_alloc_node(), this is not a problem anymore.