ramosian-glider closed this issue 4 years ago
@dvyukov's proposal:
we could also try to: (1) choose a root cache; (2) choose whether we want to steal from the root cache or from one of its memcg subcaches; (3) if we decided to steal from a memcg subcache, pick a random one from `cachep->memcg_params.memcg_caches->entries` and steal from it instead.
This gives every memcg cache an equal chance to be picked, although for certain caches allocations may never happen.
This also does not require changing mm/memcontrol.c.
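A minimal userspace sketch of that selection logic (the struct layout and `pick_steal_target()` are hypothetical stand-ins for illustration, not the real mm/ data structures):

```c
#include <stdlib.h>

/* Hypothetical stand-in for the real kmem_cache: memcg_entries models
 * cachep->memcg_params.memcg_caches->entries, heavily simplified. */
struct kmem_cache {
	struct kmem_cache **memcg_entries;
	int nr_memcg_caches;
};

/* (1) the caller has already chosen a root cache;
 * (2) flip a coin to steal from the root or from a memcg subcache;
 * (3) if a subcache, pick one at random, giving each an equal chance. */
static struct kmem_cache *pick_steal_target(struct kmem_cache *root)
{
	if (root->nr_memcg_caches == 0 || (rand() & 1))
		return root;
	return root->memcg_entries[rand() % root->nr_memcg_caches];
}
```

Note that in the real kernel this walk would have to happen under RCU, since the memcg caches can disappear concurrently, which is exactly the concern raised below.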
What I don't like in this idea is that memcg caches may come and go at any time (mm/memcontrol.c uses RCU to protect them), and caches are sometimes created lazily.
To be able to access every possible memcg cache, we need something similar to https://elixir.bootlin.com/linux/v5.7-rc1/source/mm/memcontrol.c#L2826, but with `current` being unavailable.
My proposal:
What if we just always steal the root cache freelist, and if an allocation goes through memcg, then we check whether the root cache is stolen?
I.e. we can modify `memcg_kmem_get_cache(cachep)` to check whether `cachep->freelist` is stolen by KFENCE, and if it is, return a cache that performs a KFENCE allocation.
This approach is substantially easier to implement (no need to track memcg caches in KFENCE at all), but it treats every tree of memcg caches as a single cache, which may change the probabilities for individual memcg caches.
On the other hand, memcg seems to be underused for kernel allocations (https://twitter.com/dvyukov/status/1073627872366088193), so this will probably increase the number of successful allocations done via KFENCE.
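A userspace sketch of the proposed check (the struct fields and the `stolen` flag are hypothetical simplifications; one way to "return a cache that performs a KFENCE allocation" is modeled here as returning the root cache itself, whose stolen freelist forces the slow path):

```c
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified model of a slab cache: "stolen"
 * models "this cache's freelist was stolen by KFENCE". */
struct kmem_cache {
	bool stolen;
	struct kmem_cache *memcg_cache; /* memcg version of this root cache */
};

/* Model of the proposed memcg_kmem_get_cache() change: if KFENCE stole
 * the root cache's freelist, skip the memcg routing entirely so the
 * allocation goes through the KFENCE path; otherwise route to the
 * memcg cache as before. */
static struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
{
	if (cachep->stolen)
		return cachep;
	if (cachep->memcg_cache)
		return cachep->memcg_cache;
	return cachep;
}
```

This models why the whole tree of memcg caches behaves as a single cache from KFENCE's point of view: the decision is made purely on the root cache's state.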
I think, in the interest of simplicity, that we should try this first, assuming there is no serious performance impact.
Just figured out we cannot simply replace `cachep` with `kfence_slab_cache` in `memcg_kmem_get_cache()`, because we need to somehow pass the allocation size to `kfence_alloc_and_fix_freelist()`.
We could return `cachep` itself instead, as its freelist is also NULL and will get us onto the slow path. However, there is no guarantee that another allocation from `cachep` won't race with the current one, making the freelist non-NULL.
Overall, in certain cases `kfence_alloc_and_fix_freelist()` may fail for different reasons (we ran out of pages, or KFENCE got disabled).
In the normal case, when we are allocating from the root cache, we'll just proceed with slab allocation.
However, for memcg we won't be able to do so if we replaced the memcg cache we found with something else.
What I'm thinking of right now is:
- let `memcg_kmem_get_cache(cachep)` find `memcg_cache` for us;
- check whether the freelist of `cachep` is stolen - if so, restore the stolen freelist and steal one from `memcg_cache` (also record `memcg_cache` instead of `cachep` in KFENCE).

We may actually need to keep track of both root and memcg caches, because the cache teardown hook does not know anything about memcg caches.
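The lazy hand-off described above can be sketched in userspace C; every name here (`kfence_steal()`, `kfence_restore()`, `kfence_victim`, the struct fields) is a hypothetical stand-in, not the real KFENCE or slab API:

```c
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical model: a cache whose freelist KFENCE can steal (NULL it
 * out to force the slow path) and later restore. */
struct kmem_cache {
	void *freelist;
	void *saved_freelist;   /* freelist stashed away when stealing */
	bool stolen;
	struct kmem_cache *memcg_cache;
};

static struct kmem_cache *kfence_victim; /* cache KFENCE recorded as stolen */

static void kfence_steal(struct kmem_cache *c)
{
	c->saved_freelist = c->freelist;
	c->freelist = NULL;     /* empty freelist forces the slow path */
	c->stolen = true;
	kfence_victim = c;
}

static void kfence_restore(struct kmem_cache *c)
{
	c->freelist = c->saved_freelist;
	c->stolen = false;
}

/* Lazily move the theft from the root cache to the memcg cache that the
 * allocation is actually routed to, recording the memcg cache in KFENCE. */
static struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
{
	struct kmem_cache *memcg_cache = cachep->memcg_cache;

	if (cachep->stolen && memcg_cache) {
		kfence_restore(cachep);      /* give the root its freelist back */
		kfence_steal(memcg_cache);   /* steal from the memcg cache instead */
	}
	return memcg_cache ? memcg_cache : cachep;
}
```

This also illustrates why both caches may need tracking: after the hand-off, `kfence_victim` points at the memcg cache, while the teardown hook only ever sees the root cache.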
Right, I think that will work. We just need to lazily steal the freelist from the memcg cache.
Now that we use static keys to instrument `slab_alloc_node()`, this is not a problem anymore.
Right now KFENCE does not support memcg, which (when enabled for a particular process) essentially routes every kernel allocation done via a particular SLUB cache to a memcg version of that cache (https://elixir.bootlin.com/linux/v5.7-rc1/source/mm/slab.h#L572).
Because those memcg cache versions aren't registered with KFENCE, we are unable to steal freelists from them, so we cannot replace memcg allocations with our allocations.