avieira-arm opened 2 years ago
Please note that some distributions have backported rseq support into earlier downstream versions. For example, our CentOS/RHEL version based on glibc 2.34 has it, with the glibc 2.35 ABI. The rseq kernel facility is the only way to get a useful sched_getcpu implementation on AArch64, which is why I expect more widespread deployment.
You may want to have a look at how librseq integrates with glibc for rseq registration; this could help solve this at the tcmalloc level more robustly. Reference: https://git.kernel.org/pub/scm/libs/librseq/librseq.git/
The important bits are here: https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/src/rseq.c
Any idea how this can be addressed?
I have read the comment in internal/percpu.h; it does not seem like the current tcmalloc rseq usage allows sharing rseq with others, since it repurposes the cpu_id_start field...
This misuse of cpu_id_start is outside of the ABI contract with the kernel. See include/uapi/linux/rseq.h:
/*
* Restartable sequences cpu_id_start field. Updated by the
* kernel. Read by user-space with single-copy atomicity
* semantics. This field should only be read by the thread which
* registered this data structure. Aligned on 32-bit. Always
* contains a value in the range of possible CPUs, although the
* value may not be the actual current CPU (e.g. if rseq is not
* initialized). This CPU number value should always be compared
* against the value of the cpu_id field before performing a rseq
* commit or returning a value read from a data structure indexed
* using the cpu_id_start value.
*/
__u32 cpu_id_start;
So based on this ABI contract, userspace should never write to it. Writing to it basically makes tcmalloc incompatible with all other rseq users. We should discuss how we can extend rseq with new fields to provide the kind of signal required by tcmalloc. We should also find ways to prevent similar abuse of the rseq ABI in the future.
my humble opinion is that the cpu_id_start bit hack is an optimization but could easily be corrected.
my understanding is that tcmalloc_sampler, tcmalloc_slabs and __rseq_abi are positioned to share the same cacheline...
if __rseq_abi is 20-24 bytes long, there is still room to store an extra field after __rseq_abi. It could be named last_cpu_id_start. The test would become cpu_id_start != last_cpu_id_start instead of testing the bit overwrite.
that being said, something that I suspect tcmalloc will never give up is the ability to perform the rseq registration as it gives the freedom to decide what is around __rseq_abi.
In that regard, it is an interesting use-case... for a peaceful cohabitation with glibc, it would be nice if glibc would allow this... It should be possible to have the option to register rseq and init the glibc rseq variables: __rseq_offset, __rseq_size, __rseq_flags
https://www.gnu.org/software/libc/manual/html_node/Restartable-Sequences.html
at least, glibc RSEQ_SIG and TCMALLOC_PERCPU_RSEQ_SIGNATURE are equal...
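For reference, a minimal sketch (not tcmalloc code) of how a library could locate the rseq area that glibc registered, using the documented __rseq_offset/__rseq_size symbols from the manual linked above; error handling is omitted, and __builtin_thread_pointer() availability varies by compiler/architecture:

#include <sys/rseq.h>  // glibc 2.35+: struct rseq, __rseq_offset, __rseq_size

// Returns the calling thread's glibc-registered rseq area, or nullptr if
// glibc did not register one (e.g. kernel too old, or registration disabled
// via GLIBC_TUNABLES=glibc.pthread.rseq=0, in which case __rseq_size is 0).
static struct rseq *glibc_rseq_area(void) {
  if (__rseq_size == 0) return nullptr;
  return reinterpret_cast<struct rseq *>(
      static_cast<char *>(__builtin_thread_pointer()) + __rseq_offset);
}

// Example use, equivalent to what sched_getcpu() reads internally:
//   unsigned cpu = glibc_rseq_area()->cpu_id;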
I have just spotted:
struct kernel_rseq {
  unsigned cpu_id_start;
  unsigned cpu_id;
  unsigned long long rseq_cs;
  unsigned flags;
  unsigned padding[2];
  // This is a prototype extension to the rseq() syscall. Since a process may
  // run on only a few cores at a time, we can use a dense set of "v(irtual)
  // cpus." This can reduce cache requirements, as we only need N caches for
  // the cores we actually run on simultaneously, rather than a cache for every
  // physical core.
  union {
    struct {
      short numa_node_id;
      short vcpu_id;
    };
    int vcpu_flat;
  };
} __attribute__((aligned(4 * sizeof(unsigned long long))));
Google must internally have a patched custom kernel where rseq updates the vcpu field, so your kernel_rseq is also 32 bytes long.
no problem... In tcmalloc::tcmalloc_internal::Sampler, there is a 7-byte hole:
// Saved copy of the sampling interval from when we actually set
// bytes_until_sample_. This allows us to properly calculate the sample
// weight of the first sample after the sampling interval is changed.
ssize_t sample_interval_;
uint64_t rnd_; // Cheap random number generator
bool initialized_;
// Bytes until we sample next.
//
// More specifically when bytes_until_sample_ is X, we can allocate
// X bytes without triggering sampling; on the (X+1)th allocated
// byte, the containing allocation will be sampled.
//
// Always non-negative with only very brief exceptions (see
// DecrementFast{,Finish}), so casting to size_t is ok.
ssize_t bytes_until_sample_;
so last_cpu_id_start could be stored in that hole... or... do the MSB hack with sample_interval_, rnd_ or bytes_until_sample_ to store initialized_ and get an extra full 8 bytes... Besides, I am not 100% sure why initialization could not have been performed during construction instead of having to test initialized_ every time Sampler::RecordAllocationSlow() is called...
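To make the hole concrete, here is a small standalone sketch (a hypothetical SamplerLayout, not the real class) showing that on an LP64 target the 1-byte initialized_ followed by an 8-byte-aligned member leaves exactly 7 bytes of padding:

#include <cstddef>      // offsetof
#include <cstdint>
#include <sys/types.h>  // ssize_t

// Hypothetical mirror of the members quoted above, just to show the padding.
struct SamplerLayout {
  ssize_t sample_interval_;     // offset 0, 8 bytes
  uint64_t rnd_;                // offset 8, 8 bytes
  bool initialized_;            // offset 16, 1 byte
  ssize_t bytes_until_sample_;  // offset 24 (8-byte aligned), hence 7 bytes of padding before it
};

static_assert(offsetof(SamplerLayout, bytes_until_sample_) -
                  (offsetof(SamplerLayout, initialized_) + sizeof(bool)) == 7,
              "7-byte hole after initialized_");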
another thought that came to my mind... it is super cool to have the tcmalloc_slabs pointer's lower half updated by the kernel... I am impressed by the creativity behind the setup... it is amazing!
but, without being an assembly optimization expert, it seems like there is a penalty from reading a 64-bit pointer that is not aligned on an 8-byte boundary, no?
if it was possible to squeeze 4 bytes from Sampler class (I think it is), tcmalloc_slabs could be aligned and not overlapping with rseq_abi... with a union, you could access the pointer lower half value to compare it efficiently with cpu_id_start
but, without being an assembly optimization expert, it seems like there is a penalty from reading a 64-bit pointer that is not aligned on an 8-byte boundary, no?
There does not seem to be any.
if it was possible to squeeze 4 bytes from Sampler class (I think it is), tcmalloc_slabs could be aligned and not overlapping with rseq_abi
If it's not overlapping, then the whole idea breaks down. We rely on the high part of tcmalloc_slabs being overwritten on preemption.
but, without being an assembly optimization expert, it seems like there is a penalty from reading a 64-bit pointer that is not aligned on an 8-byte boundary, no?
There does not seem to be any.
I guess that it is correct for most processors if the access is within a cacheline but this might NOT be true on every processor.
check the comments below the first answer here: https://stackoverflow.com/questions/12491578/whats-the-actual-effect-of-successful-unaligned-accesses-on-x86
maybe I do not understand the whole idea correctly but my understanding is that you check if you have been preempted and moved by doing:
(slabs & (1ULL << 63)) == 0
otherwise you know you have not been preempted; you clear the MSB and use the slabs ptr as-is.
if you had enough room to store the 4-byte current cpu_id_start in last_cpu_id_start (you do), then in my mind doing
__rseq_abi.cpu_id_start != last_cpu_id_start
would be equivalent...
Update: ok, I think that I get it... having equal CPU ids does not tell you whether you have been preempted, but the bit trick does... assuming the cpu_id_start is updated by the kernel at each context switch...
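To summarize the two checks being compared, here is a rough C++ sketch with illustrative names only (the real tcmalloc fast path is hand-written assembly, and tcmalloc_slabs physically overlaps the rseq area):

#include <cstdint>

struct KernelRseq { uint32_t cpu_id_start; uint32_t cpu_id; };

// Illustrative stand-ins for tcmalloc's real TLS layout.
extern thread_local uintptr_t tcmalloc_slabs;    // bit 63 doubles as "cache still valid"
extern thread_local KernelRseq rseq_area;        // updated by the kernel
extern thread_local uint32_t last_cpu_id_start;  // proposed cached copy

constexpr uintptr_t kCachedBit = uintptr_t{1} << 63;

// Current scheme: the kernel's write into the overlapping rseq area clears the
// high bit on preemption, so a single bit test detects "preempted since the
// pointer was cached".
inline void *SlabsIfStillCached() {
  uintptr_t slabs = tcmalloc_slabs;
  if ((slabs & kCachedBit) == 0) return nullptr;  // preempted (or never cached)
  return reinterpret_cast<void *>(slabs & ~kCachedBit);
}

// Proposed alternative: equal CPU ids only prove we are back on the same CPU,
// not that we were never preempted in between.
inline bool SameCpuAsCached() {
  return rseq_area.cpu_id_start == last_cpu_id_start;
}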
I guess that it is correct for most processors if the access is within a cacheline but this might NOT be true on every processor.
This is all a big hack to begin with, if somebody wants to work on proper kernel support, I am all for it: https://lore.kernel.org/all/CACT4Y+beLh1qnHF9bxhMUcva8KyuvZs7Mg_31SGK5xSoR=3m1A@mail.gmail.com/
__rseq_abi.cpu_id_start != last_cpu_id_start
This is also additional instructions, and every instruction here is a huge pile of $$$ for us.
Btw, with special kernel support, we can make the instruction sequence even better than it is now. If the kernel stores a userspace-provided value into the tcmalloc_slabs pointer on preemption, we can point it at a dedicated, always-empty fake slab. Then we don't need a preemption check at all: we just opportunistically try to allocate, and for the dedicated empty slab the attempt fails, so we go to the slow path through the existing "slab empty" branch.
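A hedged sketch of that idea; the kernel extension that overwrites the cached pointer on preemption does not exist, and all names below are illustrative:

#include <cstdint>

struct Slab {
  uint16_t current;   // index of the next object to pop; current == begin means empty
  uint16_t begin;
  void *objects[64];  // cached object pointers (layout simplified for the sketch)
};

// Dedicated, permanently empty slab. The hypothetical kernel extension would
// store its address into tcmalloc_slabs whenever the thread is preempted.
inline Slab kAlwaysEmptySlab{0, 0, {}};

extern thread_local Slab *tcmalloc_slabs;  // stand-in for the real TLS slot
void *AllocateSlow();                      // existing "slab empty" slow path

// Fast path with no explicit preemption check: just try to pop. After a
// preemption the pointer refers to kAlwaysEmptySlab, the pop fails, and the
// existing "slab empty" branch takes over.
inline void *Allocate() {
  Slab *slab = tcmalloc_slabs;
  if (slab->current == slab->begin) return AllocateSlow();
  return slab->objects[--slab->current];
}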
This is also additional instructions, and every instruction here is a huge pile of $$$ for us.
I know this... this is why I am putting so much effort into integrating tcmalloc into my own project.
I suppose that you made the analysis about the instructions count by testing 1 bit vs comparing 2 integers. Intuitively, I would think that this is equivalent but it probably isn't....
anyway, I really appreciate your explanations and the context where your design decisions have been made.
thx for sharing them
BTW, I am reading your proposal to address the issue discussed here. I find your proposal interesting and reasonable... I am surprised to see that Mathieu did not comment on it when you made it there...
one final comment... I am fascinated to see how the boundaries between userspace and kernel are getting broken down and the possibilities this is opening... I am seeing this phenomenon with rseq... and I am also an occasional io_uring contributor... What Linux is becoming is amazing, imho...
I suppose that you made the analysis about the instructions count by testing 1 bit vs comparing 2 integers.
I did not test actually, but I feel it will be more expensive. Currently we have:
movq 0x1234(%fs), %rax // load tcmalloc_slabs from tls
btrq $63, %rax // test and reset high bit (old bit value goes to CF)
jnc ... // bit was clear: preempted, go to slow path
With comparing 2 integers we will have:
movq 0x2345(%fs), %rax // load cached cpu
cmpq %rax, 0x4567(%fs) // compare with rseq.cpu
jnz ...
movq 0x1234(%fs), %rax // load tcmalloc_slabs from tls
So that's +1 instruction, +2 loads. And we also will need to fit that additional data into the same cache line.
And, yes, being able to detect preemption rather than just "same cpu" gives some additional interesting benefits. E.g. we can do "virtual CPUs" or upstream "concurrency IDs" purely in user-space.
anyway, I really appreciate your explanations and the context where your design decisions have been made.
You are welcome.
I did some more research about glibc 2.40 usage of rseq:
$ find . -name '*.c' -exec grep -l rseq_area {} \;
./nptl/pthread_create.c
./sysdeps/nptl/dl-tls_init_tp.c
./sysdeps/unix/sysv/linux/tst-rseq.c
./sysdeps/unix/sysv/linux/sched_getcpu.c
./sysdeps/unix/sysv/linux/tst-rseq-disable.c
so, aside from the registration done in pthread_create.c and dl-tls_init_tp.c, the only place where the rseq area is actually used is sched_getcpu().
My code is not calling sched_getcpu(), and I did perf record my app for 30 minutes to be absolutely sure that no 3rd-party libs were calling this function without my knowledge...
there is nothing in my program that either calls sched_getcpu() or accesses the glibc rseq area through its exported symbols, so, while not the ideal situation, there is nothing in my context that stops me from granting tcmalloc exclusive access to the rseq area...
I have no idea which software is calling sched_getcpu() or accessing rseq through glibc, but fortunately for tcmalloc, it appears to be not that much for the time being...
sched_getcpu()
should fall back to a somewhat slower implementation if rseq is not available. So it should be fine to disable glibc rseq with GLIBC_TUNABLES in this case.
BTW, I am reading your proposal to address the issue discussed here. I find your proposal interesting and reasonable... I am surprised to see that Mathieu did not comment it when you made it there...
I still have a note to provide feedback in my spare-time to-do list since last year, but unfortunately I've been busy with customer projects since then. I've used a bit of idle time in my vacation in the last 2 weeks to provide feedback, but it would certainly help me prioritize dedicating more time to these matters if we can put some kind of contractual agreement in place.
I guess that it is correct for most processors if the access is within a cacheline but this might NOT be true on every processor.
This is all a big hack to begin with, if somebody wants to work on proper kernel support, I am all for it: https://lore.kernel.org/all/CACT4Y+beLh1qnHF9bxhMUcva8KyuvZs7Mg_31SGK5xSoR=3m1A@mail.gmail.com/
I would like to discuss the possible alternative approaches with you. This starts by having a clear view of the requirements and defining metrics that appropriately measure the fitness of the proposed approaches.
__rseq_abi.cpu_id_start != last_cpu_id_start
This is also additional instructions, and every instruction here is a huge pile of $$$ for us.
Not all instructions are equal. Some take more cycles to execute, others can cause stalls, and so on. We need metrics more relevant than "number of instructions" and "number of instruction bytes", ones that measure the actual overhead. What I am unsure about is whether the code we are talking about is inlined or not ?
Btw, with special kernel support, we can make the instruction sequence even better than it is now. If the kernel stores a userspace-provided value into the tcmalloc_slabs pointer on preemption, we can point it at a dedicated, always-empty fake slab. Then we don't need a preemption check at all: we just opportunistically try to allocate, and for the dedicated empty slab the attempt fails, so we go to the slow path through the existing "slab empty" branch.
For something that would support multiple users, I basically see two approaches to this:
- Have some kind of linked list of 32-bit/64-bit memory area to clear on preemption per process. This could be a new kind of rseq syscall behavior with new flags. The downside here is the amount of work to do on return to userspace, and complexity of kernel code to add/maintain.
- Extend rseq with a new 32-bit field that would contain a counter to be incremented by userspace (either with a single increment instruction, or with a sequence of instructions using rseq), reset to 0 by the kernel on preemption, so many users can check if they are preempted after the increment by reading the field content.
I understand based on this discussion that solution (2) would require a few more instructions: the increment, a load, and a conditional branch. How much measurable performance impact would this really have ?
sched_getcpu()
should fall back to a somewhat slower implementation if rseq is not available. So it should be fine to disable glibc rseq with GLIBC_TUNABLES
in this case.
The current use of rseq by glibc is indeed limited to sched_getcpu(), but as soon as we are done upstreaming the extensible rseq integration and have all the basic pieces in place, you can expect the number of rseq usages to start increasing both in glibc and in other libraries/applications. So disabling rseq glibc integration with the tunable may be OK for specific use-cases in the short-term, but this is not future-proof at all.
All instructions are not equal. Some take more cycles to execute, others can cause stalls, and so on. We need more relevant metrics than "number of instructions" and "number of instruction bytes", which measure actual overhead. What I am unsure about is whether the code we are talking about is inlined or not ?
Yes, it's inlined into the new/delete fast paths. All instructions here are already generally non-stalling (cached loads, predictable branches). A loop doing tcmalloc new/delete takes <4ns per iteration on my machines. Basically, every instruction counts, especially the ones on the critical path.
For something that would support multi-users, I basically see two approaches to this:
- Have some kind of linked list of 32-bit/64-bit memory area to clear on preemption per process. This could be a new kind of rseq syscall behavior with new flags. The downside here is the amount of work to do on return to userspace, and complexity of kernel code to add/maintain.
- Extend rseq with a new 32-bit field that would contain a counter to be incremented by userspace (either with a single increment instruction, or with a sequence of instructions using rseq), reset to 0 by the kernel on preemption, so many users can check if they are preempted after the increment by reading the field content.
I understand based on this discussion that solution (2) would require a few more instructions: the increment, a load, and a conditional branch. How much measurable performance impact would this really have ?
Yes, it's a hard choice. For (2) I think the kernel needs to do the increment, because we again get back to non-composable libraries (which one will increment from 0?).
Overall, anything that adds more instructions/makes the code slower will likely keep us on the current version. cc @ckennelly for this.
Also cc @melver for potential synergy with our scheduling extensions. I wonder if we can have a single kernel interface that would support both cached slab pointer and user-space CID assignment.
For something that would support multi-users, I basically see two approaches to this:
- Have some kind of linked list of 32-bit/64-bit memory area to clear on preemption per process. This could be a new kind of rseq syscall behavior with new flags. The downside here is the amount of work to do on return to userspace, and complexity of kernel code to add/maintain.
- Extend rseq with a new 32-bit field that would contain a counter to be incremented by userspace (either with a single increment instruction, or with a sequence of instructions using rseq), reset to 0 by the kernel on preemption, so many users can check if they are preempted after the increment by reading the field content.
I understand based on this discussion that solution (2) would require a few more instructions: the increment, a load, and a conditional branch. How much measurable performance impact would this really have ?
Yes, it's hard choice. For 2 I think the kernel needs to increment b/c we again get back to non-composable libraries (which one will increment from 0?).
I don't see the issue here. The first time user-space is interested in this notification after a preemption is where the first increment from 0 happens. Then all following increments from any userspace library/program increment further away from 0. Then the kernel resets to 0 on return-to-userspace after preemption. This is all composable. Or am I missing something ?
Overall anything that adds more instruction/makes the code slower will likely keep us on the current version. cc @ckennelly for this.
Actually I'm not entirely sure what is fast-path and what is slow-path in option (2). I suspect that the increment of (thread_pointer() + __rseq_offset) + offsetof(struct rseq_abi, preempt_notif) from user-space only needs to happen when updating the cached value. Which would mean that the actual fast-path would need to load the cached value, then do a cmp on the new preempt_notif field, and conditionally branch if it is zero. This may have to be done within an rseq critical section to make it atomic wrt preemption, so we add a store instruction to set up the critical section there.
I don't see the issue here. The first time user-space is interested in this notification after a preemption is where the first increment from 0 happens. Then all following increments from any userspace library/program increment further away from 0. Then the kernel resets to 0 on return-to-userspace after preemption. This is all composable. Or am I missing something ?
But how do they understand they were rescheduled? When they see what value?
Also cc @melver for potential synergy with our scheduling extensions. I wonder if we can have a single kernel interface that would support both cached slab pointer and user-space CID assignment.
I'm worried about the proliferation of complex ABIs that end up not working or have other shortcomings, but add lots of other complexity to the kernel. For example, while CIDs in general are nice to have (based on the vCPU idea), it turns out that a single CID assignment scheme is suboptimal.
We prototyped a user-space CID assignment scheme with BPF, using a very narrow ABI/protocol that is optimal for that usecase. But coming up with a generic ABI that is generally useful beyond that, and doesn't add unwanted complexity to the kernel is rather difficult. Ultimately different users might want to see more or less detailed scheduling state (was a thread of same process running on CPU, different process, when was the last time a thread from this process was running on this CPU, etc. etc.). It quickly becomes quite complicated.
FWIW - https://lore.kernel.org/all/CANpmjNO3m-f-Yg5EqTL3ktL2CqTw7v0EjHGVth7ssWJRnNz5xQ@mail.gmail.com/T/#u
In general, having the ability to prototype extensions with BPF may help eventually settle on a fixed ABI if that's still desirable.
If we do want a fixed ABI, it might help to do the simplest thing that can mirror the cpu_id_start usage of TCMalloc today, but beyond that it looks rather tricky.
I don't see the issue here. The first time user-space is interested in this notification after a preemption is where the first increment from 0 happens. Then all following increments from any userspace library/program increment further away from 0. Then the kernel resets to 0 on return-to-userspace after preemption. This is all composable. Or am I missing something ?
But how do they understand they were rescheduled? When they see what value?
Here is the algorithm I have in mind (please correct me if there are holes in my understanding):
Populating a tcmalloc cache, AFAIU this is not a fast-path:
- increment rseq area preempt_notif field (e.g. 'inc' instruction),
- store tcmalloc cached value to TLS.
Fast-path accessing the cached value, confirming whether it is usable with the preempt_notif check:
- load tcmalloc cached value from TLS,
- load rseq area preempt_notif field,
- conditional branch if loaded rseq area preempt_notif field is zero.
Overall, user-space only increments the preempt_notif field (with a single instruction or an rseq critical section), or loads it for comparison against 0. The kernel only has to set the preempt_notif field back to 0 when returning to userspace after preemption.
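Purely as illustration, a C-level sketch of that protocol, assuming a hypothetical preempt_notif field that the kernel resets to 0 after a preemption (no such field exists in the upstream struct rseq); the exchange below shows why this exact scheme runs into trouble with multiple independent users:

#include <cstdint>

struct RseqWithPreemptNotif { uint32_t preempt_notif; /* hypothetical field */ };
extern thread_local RseqWithPreemptNotif rseq_area;  // illustrative stand-in
extern thread_local void *tcmalloc_cached_slabs;     // illustrative stand-in

// Populating the cache (not a fast path): arm the notification, then cache.
inline void Populate(void *slabs) {
  ++rseq_area.preempt_notif;  // single increment, or an rseq critical section
  tcmalloc_cached_slabs = slabs;
}

// Fast path: the cached value is only trusted if the kernel has not reset the
// field to 0, i.e. no preemption happened since Populate().
inline void *CachedOrNull() {
  void *slabs = tcmalloc_cached_slabs;
  if (rseq_area.preempt_notif == 0) return nullptr;  // preempted: go to slow path
  return slabs;
}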
I don't see the issue here. The first time user-space is interested in this notification after a preemption is where the first increment from 0 happens. Then all following increments from any userspace library/program increment further away from 0. Then the kernel resets to 0 on return-to-userspace after preemption. This is all composable. Or am I missing something ?
But how do they understand they were rescheduled? When they see what value?
Here is the algorithm I have in mind (please correct me if there are holes in my understanding):
Populating a tcmalloc cache, AFAIU this is not a fast-path:
- increment rseq area preempt_notif field (e.g. 'inc' instruction),
- store tcmalloc cached value to TLS.
Fast-path accessing the cached value, confirming whether it is usable with the preempt_notif check:
- load tcmalloc cached value from TLS,
- load rseq area preempt_notif field
- conditional branch if loaded rseq area preempt_notif field is zero.
Overall, user-space only increments the preempt_notif field (with single-instruction or rseq critical section), or load it for comparison against 0. The kernel only has to set the preempt_notif field back to 0 when returning to userspace after preemption.
Consider we have 2 libraries:
- Lib1 increments 0->1, stores 1.
- Lib2 increments 1->2, stores 2.
- Kernel resets to 0.
- Lib2 notices counter 0, increments 0->1, handles preemption.
- Lib1 sees 1 and assumes no preemption. Bang!
I don't see the issue here. The first time user-space is interested in this notification after a preemption is where the first increment from 0 happens. Then all following increments from any userspace library/program increment further away from 0. Then the kernel resets to 0 on return-to-userspace after preemption. This is all composable. Or am I missing something ?
But how do they understand they were rescheduled? When they see what value?
Here is the algorithm I have in mind (please correct me if there are holes in my understanding): Populating a tcmalloc cache, AFAIU this is not a fast-path:
- increment rseq area preempt_notif field (e.g. 'inc' instruction),
- store tcmalloc cached value to TLS.
Fast-path accessing the cached value, confirming whether it is usable with the preempt_notif check:
- load tcmalloc cached value from TLS,
- load rseq area preempt_notif field
- conditional branch if loaded rseq area preempt_notif field is zero.
Overall, user-space only increments the preempt_notif field (with single-instruction or rseq critical section), or load it for comparison against 0. The kernel only has to set the preempt_notif field back to 0 when returning to userspace after preemption.
Consider we have 2 libraries:
- Lib1 increments 0->1, stores 1.
- Lib2 increments 1->2, stores 2.
- Kernel resets to 0.
- Lib2 notices counter 0, increments 0->1, handles preemption.
- Lib1 sees 1 and assumes no preemption. Bang!
You're right, I was missing that case. My jet-lagged brain is not 100%. :) This requires a bit more thinking then.
Can we instead use two new fields ? One would be incremented by user-space as a free-running, sort-of generation counter (and user-space would remember the value it wrote), and the second field would be updated by the kernel with a snapshot of the generation counter when preemption happens. Assuming no overflow, user-space could check whether the 2nd field >= its own remembered generation counter value. But this would require more than a simple comparison with zero on the fast path.
Just trying to outline option (1) a little more. Let's say we have two new rseq flags in enum rseq_flags: RSEQ_FLAG_REGISTER_CLEAR_ON_PREEMPT and RSEQ_FLAG_UNREGISTER_CLEAR_ON_PREEMPT. Those would take a userspace address to a structure.
Now the linked list used to chain the items could be either done by allocating kernel memory (which makes the memory footprint hard to track), or chained within userspace (which makes memory allocation easy to track, but we need to be mindful about linked list corruption by userspace, e.g. cycles).
I think for your use-case you'd need this registration to be specific to each given thread.
The registration/unregistration would take a pointer to e.g.:
struct rseq_clear_on_preempt {
  __u64 ptr;
  __u32 len; /* 4 or 8 bytes */
};
The kernel would then clear either 4 or 8 bytes on return-to-userspace after a preemption.
Should we limit this to an upper bound ? Else what happens if an application registers 10k of those for a given thread ?
If this is all chained from within userspace, rather than extending rseq with new register/unregister flags, we could also add a new field to struct rseq_abi which would be the top level pointer to that linked list.
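Purely as an illustration of how approach (1) might be exercised from userspace; the flag and the structure below are the hypothetical ones sketched above, not an existing kernel ABI, and reusing the rseq syscall for it is only one possible packaging:

#include <cstdint>
#include <sys/syscall.h>  // SYS_rseq (the syscall itself exists since Linux 4.18)
#include <unistd.h>       // syscall()

// Hypothetical ABI mirroring the structure sketched above.
struct rseq_clear_on_preempt {
  uint64_t ptr;  // userspace address the kernel would clear after preemption
  uint32_t len;  // 4 or 8 bytes
};

// Hypothetical flag value; nothing like this exists upstream today.
constexpr int RSEQ_FLAG_REGISTER_CLEAR_ON_PREEMPT = (1 << 2);

thread_local uint64_t tcmalloc_slabs_slot;  // the word we want cleared on preemption

inline int RegisterClearOnPreempt() {
  static thread_local rseq_clear_on_preempt req = {
      reinterpret_cast<uint64_t>(&tcmalloc_slabs_slot), sizeof(uint64_t)};
  // Per-thread registration, as discussed above; the signature is speculative.
  return static_cast<int>(syscall(SYS_rseq, &req, sizeof(req),
                                  RSEQ_FLAG_REGISTER_CLEAR_ON_PREEMPT, 0));
}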
Also cc @melver for potential synergy with our scheduling extensions. I wonder if we can have a single kernel interface that would support both cached slab pointer and user-space CID assignment.
I'm worried about the proliferation of complex ABIs that end up not working or have other shortcomings, but add lots of other complexity to the kernel. For example, while CIDs in general are nice to have (based on the vCPU idea), it turns out that a single CID assignment scheme is suboptimal.
I'm curious to hear more about the variety of concurrency ID requirements you have observed. Note that I've recently submitted a patch series to make rseq mm_cid numa-aware: https://lore.kernel.org/lkml/20240823185946.418340-1-mathieu.desnoyers@efficios.com/T/#t . I would like to hear your thoughts about this and what is missing there.
I'm curious to hear more about the variety of concurrency ID requirements you have observed. Note that I've recently submitted a patch series to make rseq mm_cid numa-aware: https://lore.kernel.org/lkml/20240823185946.418340-1-mathieu.desnoyers@efficios.com/T/#t . I would like to hear your thoughts about this and what is missing there.
In practice, job schedulers will take care to schedule a process only on a subset of cores (that do not straddle a NUMA domain). For big jobs that own the machine, it might be worthwhile to have the NUMA-aware scheme.
But the much more common problem these days are increasingly complex cache hierarchies. Recent AMD server CPUs are one such example.
A more flexible CID allocation scheme might be worthwhile here, but that needs a way to either ask user space how it wants the CID space partitioned or some other way to tell that to the kernel. Having the ability to re-partition the CID space dynamically would be nice (similar to e.g. scheduler affinity masks).
If you have ideas how something like that could be realized upstream, that would be great. BPF was one idea we had, but it also has limitations vs. a purely user space scheme.
[...]
But the much more common problem these days are increasingly complex cache hierarchies. Recent AMD server CPUs are one such example.
A more flexible CID allocation scheme might be worthwhile here, but that needs a way to either ask user space how it wants the CID space partitioned or some other way to tell that to the kernel. Having the ability to re-partition the CID space dynamically would be nice (similar to e.g. scheduler affinity masks).
If you have ideas how something like that could be realized upstream, that would be great. BPF was one idea we had, but it also has limitations vs. a purely user space scheme.
From what you are saying, the issue that arises with recent AMD server CPUs is due to the large overhead of L3 (LLC) cache misses, am I correct ? If so, then I wonder if we could (optionally) tweak my numa-awareness patchset to make it consider each LLC for node-cpumask indexing instead of using the NUMA node ID. This would increase the number of cpumasks to track per-process (number of L3 domains vs number of NUMA nodes), but would on the other hand eliminate cache line bouncing across L3 domains due to concurrent accesses to the same mm_cid index from different L3 domains.
My objective here is to keep the mm_cid allocation flat (single-level indexing) from a userspace perspective, and handle all the complexity related to locality of the CID allocation within the kernel. This way userspace can benefit from locality gains without even needing to be aware of the topology or of NUMA.
Thoughts ?
[...]
But the much more common problem these days are increasingly complex cache hierarchies. Recent AMD server CPUs are one such example. A more flexible CID allocation scheme might be worthwhile here, but that needs a way to either ask user space how it wants the CID space partitioned or some other way to tell that to the kernel. Having the ability to re-partition the CID space dynamically would be nice (similar to e.g. scheduler affinity masks). If you have ideas how something like that could be realized upstream, that would be great. BPF was one idea we had, but it also has limitations vs. a purely user space scheme.
From what you are saying, the issue that arises with recent AMD server CPUs is due to the large overhead of L3 (LLC) cache misses, am I correct ?
Correct.
If so, then I wonder if we could (optionally) tweak my numa-awareness patchset to make it consider each LLC for node-cpumask indexing instead of using the NUMA node ID. This would increase the number of cpumasks to track per-process (number of L3 domains vs number of NUMA nodes), but would on the other hand eliminate cache line bouncing across L3 domains due to concurrent accesses to the same mm_cid index from different L3 domains.
My objective here is to keep the mm_cid allocation flat (single-level indexing) from a userspace perspective, and handle all the complexity related to locality of the CID allocation within the kernel. This way userspace can benefit from locality gains without even needing to be aware neither of the topology nor of NUMA.
Thoughts ?
Right - that would definitely work. If there were a way to instruct the kernel to switch dynamically per process between allocation schemes, that would be even better (the process should be able to choose whether it wants the denser or more sparse CIDs, because the dense one may give better memory usage trade-offs).
It's still not doing CIDs in user space, but maybe that flexibility is rarely needed. Having a more generic topology (incl. cache hierarchy) aware CID scheme is probably perfectly reasonable if we get it right.
[...]
If so, then I wonder if we could (optionally) tweak my numa-awareness patchset to make it consider each LLC for node-cpumask indexing instead of using the NUMA node ID. This would increase the number of cpumasks to track per-process (number of L3 domains vs number of NUMA nodes), but would on the other hand eliminate cache line bouncing across L3 domains due to concurrent accesses to the same mm_cid index from different L3 domains. My objective here is to keep the mm_cid allocation flat (single-level indexing) from a userspace perspective, and handle all the complexity related to locality of the CID allocation within the kernel. This way userspace can benefit from locality gains without even needing to be aware neither of the topology nor of NUMA. Thoughts ?
Right - that would definitely work. If there's a way to instruct the kernel to switch dynamically per-process between allocation schemes would be even better (the process should be able to choose if it wants the denser or more sparse CIDs because the dense one may give better memory usage trade-offs).
It's still not doing CIDs in user space, but maybe that flexibility is rarely needed. Having a more generic topology (incl. cache hierarchy) aware CID scheme is probably perfectly reasonable if we get it right.
I'm wondering what would be missing from the picture in terms of flexibility with the following:
- The kernel detects, at boot time, based on the topology (basically cost of L3 cache misses), whether it needs to index those "node_cpumask" (to be renamed) by LLC (also called core complex AFAIR) index or NUMA node id,
- For all processes, the mm_cid values get the locality provided by that detection scheme,
- If userspace needs to tweak mm_cid allocations to make it more compact, it can very well do so by making sure its cpu affinity (cpusets, sched_setaffinity) is set taking into account the LLC/NUMA topology of the system: by making the allowed cpus mask local to a given LLC or NUMA node, it can ensure the mm_cid allocation is as compact as a flat/non-local allocation.
- By limiting the number of threads in a process, the range of mm_cid values is limited to that number of threads, which ensures compactness of mm_cid for processes with fewer threads than allowed CPUs mask set bits.
I'm curious to hear about use-cases for which this approach would not work.
[...]
If so, then I wonder if we could (optionally) tweak my numa-awareness patchset to make it consider each LLC for node-cpumask indexing instead of using the NUMA node ID. This would increase the number of cpumasks to track per-process (number of L3 domains vs number of NUMA nodes), but would on the other hand eliminate cache line bouncing across L3 domains due to concurrent accesses to the same mm_cid index from different L3 domains. My objective here is to keep the mm_cid allocation flat (single-level indexing) from a userspace perspective, and handle all the complexity related to locality of the CID allocation within the kernel. This way userspace can benefit from locality gains without even needing to be aware neither of the topology nor of NUMA. Thoughts ?
Right - that would definitely work. If there's a way to instruct the kernel to switch dynamically per-process between allocation schemes would be even better (the process should be able to choose if it wants the denser or more sparse CIDs because the dense one may give better memory usage trade-offs). It's still not doing CIDs in user space, but maybe that flexibility is rarely needed. Having a more generic topology (incl. cache hierarchy) aware CID scheme is probably perfectly reasonable if we get it right.
I'm wondering what would be missing from the picture in terms of flexibility with the following:
- The kernel detects, at boot time, based on the topology (basically cost of L3 cache misses), whether it needs to index those "node_cpumask" (to be renamed) by LLC (also called core complex AFAIR) index or NUMA node id,
Not sure what this detection looks like. One option is to have a bunch of presets for known topologies, but there should also be an option for the user to override that. Also, within a VM guest it's going to be tricky. So I would keep it as simple as possible, and just give the admin the ability to "correct" whatever the kernel thinks it's dealing with.
- For all processes, the mm_cid values get the locality provided by that detection scheme,
- If userspace needs to tweak mm_cid allocations to make it more compact, it can very well do so by making sure its cpu affinity (cpusets, sched_setaffinity) is set taking into account the LLC/NUMA topology of the system: by making the allowed cpus mask local to a given LLC or NUMA node, it can ensure the mm_cid allocation is as compact as a flat/non-local allocation.
- By limiting the number of threads in a process, the range of mm_cid values is limited to that number of threads, which ensures compactness of mm_cid for processes with fewer threads than allowed CPUs mask set bits.
I'm curious to hear about use-cases for which this approach would not work.
This sounds good. I can't say what might be missing, but past experience says that the user should be able to influence this as much as possible so that any shortcomings that are "constant" in the kernel can be corrected without too much hassle.
Thanks, -- Marco
I don't see the issue here. The first time user-space is interested in this notification after a preemption is where the first increment from 0 happens. Then all following increments from any userspace library/program increment further away from 0. Then the kernel resets to 0 on return-to-userspace after preemption. This is all composable. Or am I missing something ?
But how do they understand they were rescheduled? When they see what value?
Here is the algorithm I have in mind (please correct me if there are holes in my understanding): Populating a tcmalloc cache, AFAIU this is not a fast-path:
- increment rseq area preempt_notif field (e.g. 'inc' instruction),
- store tcmalloc cached value to TLS.
Fast-path accessing the cached value, confirming whether it is usable with the preempt_notif check:
- load tcmalloc cached value from TLS,
- load rseq area preempt_notif field
- conditional branch if loaded rseq area preempt_notif field is zero.
Overall, user-space only increments the preempt_notif field (with single-instruction or rseq critical section), or load it for comparison against 0. The kernel only has to set the preempt_notif field back to 0 when returning to userspace after preemption.
Consider we have 2 libraries:
- Lib1 increments 0->1, stores 1.
- Lib2 increments 1->2, stores 2.
- Kernel resets to 0.
- Lib2 notices counter 0, increments 0->1, handles preemption.
- Lib1 sees 1 and assumes no preemption. Bang!
You're right, I was missing that case. My jet-lagged brain is not 100%. :) This requires a bit more thinking then.
Can we instead use two new fields ? One is incremented by user-space and is a free-running counter as a sort-of generation counter (and remember it), and the second field is updated by the kernel with a snapshot of the generation counter when preemption happens. Assuming no overflow, user-space could check if the 2nd field >= its own remembered generation counter value. But this would require more than a simple comparison with zero on the fast path.
This should work. But I would just let the kernel increment a single counter (a preemption count). Then any library can save a copy of this counter on the side, then compare it with the kernel counter; if they are not equal, there was a preemption (handle the preemption and save the current counter value on the side).
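A small sketch of that per-library check, assuming a hypothetical kernel-maintained preempt_count field in struct rseq that only the kernel writes:

#include <cstdint>

struct RseqWithPreemptCount { uint32_t preempt_count; /* hypothetical, kernel-incremented */ };
extern thread_local RseqWithPreemptCount rseq_area;  // illustrative stand-in

// Each library keeps its own private snapshot, so any number of libraries can
// consume the counter independently (nobody writes to the shared field).
thread_local uint32_t my_last_preempt_count = 0;

// Returns true if the thread was preempted since the previous call that returned true.
inline bool WasPreempted() {
  uint32_t now = rseq_area.preempt_count;  // the extra load on the fast path
  if (now == my_last_preempt_count) return false;
  my_last_preempt_count = now;             // resynchronize after handling preemption
  return true;
}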
But as you noted, that's more instructions and memory loads on the fast path. Thus my original proposal of instructing the kernel to store a given value at a given address: https://lore.kernel.org/all/CACT4Y+beLh1qnHF9bxhMUcva8KyuvZs7Mg_31SGK5xSoR=3m1A@mail.gmail.com/
For tcmalloc (or a similar memory allocator) it would even allow reducing the number of instructions on the fast path.
But I can't say I 100% like the interface I proposed, because it looks a bit like an over-engineered solution for a few obscure cases. Perhaps this is a good aspect to discuss in person at LPC.
[...]
Not sure what this detection looks like. One option is to have a bunch of presets for known topologies, but there should also be an option for the user to override that. Also, within a VM guest it's going to be tricky. So I would keep it as simple as possible, and just give the admin the ability to "correct" whatever the kernel thinks it's dealing with.
I'm worried about exposing more boot-time parameters to end-users than is required. Let's say we initially focus on the NUMA-awareness of mm_cid and leave the EPYC CCX indexing as future work (note: the EPYC BIOS can be configured to expose one NUMA node per CCX).
In the case of a user caring more about compactness (minimizing memory use) of the mm_cid than about performance within a given machine or VM, I would say that the user should perhaps just build with CONFIG_NUMA=n or boot the kernel with numa=off.
If the user cares about mm_cid compactness for a given process only, then perhaps they should just use cgroup cpusets or sched_setaffinity to affine the process threads to CPUs belonging to a single NUMA node, and then they are as compact as if the NUMA-awareness feature was disabled.
Considering these knobs are already in place, what is the story for requiring an additional user-exposed override ?
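For completeness, a minimal example of the existing affinity knob mentioned above; the CPU range 0-7 is only an assumed stand-in for "the CPUs of one NUMA node", which a real program would query from sysfs or libnuma, and per the point above this keeps the process's mm_cid range compact:

#include <sched.h>   // sched_setaffinity, cpu_set_t, CPU_ZERO, CPU_SET (glibc extension)
#include <cstdio>

// Affine the calling thread to CPUs 0-7, assumed here to be a single NUMA node.
// Apply this to every thread (or use cgroup cpusets) to cover the whole process.
inline bool PinToAssumedNode() {
  cpu_set_t set;
  CPU_ZERO(&set);
  for (int cpu = 0; cpu < 8; ++cpu) CPU_SET(cpu, &set);
  if (sched_setaffinity(0 /* 0 = calling thread */, sizeof(set), &set) != 0) {
    std::perror("sched_setaffinity");
    return false;
  }
  return true;
}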
[...]
Not sure what this detection looks like. One option is to have a bunch of presets for known topologies, but there should also be an option for the user to override that. Also, within a VM guest it's going to be tricky. So I would keep it as simple as possible, and just give the admin the ability to "correct" whatever the kernel thinks it's dealing with.
I'm worried about exposing more boot-time parameters to the end-users than is required. Let's say we initially focus on the NUMA-awareness of mm_cid and leave the EPYC CCX indexing as future work (note: the EPYC bios can be configured to expose one NUMA node per CCX).
In the case of a user caring more about compactness (minimize memory use) of the mm_cid than performance within a given machine or vm, I would say that the user should perhaps just CONFIG_NUMA=n or boot the kernel with numa=off.
This is unrealistic at scale. The "user" in this case has to manage millions of machines.
If the user cares about mm_cid compactness for a given process only, then perhaps they should just use cgroup cpusets or sched_setaffinity to affine the process threads to CPUs belonging to a single NUMA node, and then they are as compact as if the NUMA-awareness feature was disabled.
Considering these knobs are already in place, what is the story for requiring an additional user-exposed override ?
Not every workload is the same. But different workloads share the same kernel.
Perhaps it's not required to design the final solution right away, but rather something that allows for incremental improvements (with relative ease) as we learn more. What would be worse is adding, say, 90% of the feature, but having to start from scratch to get the last 10%. As long as these requirements are considered and the design provisions for them to exist in the future, iterating later is probably good enough.
FYI, I've posted an improved version of the NUMA-awareness series. I figured after doing a lot of benchmarks with various topologies that what we're really after is cache locality, so I went for a simpler and more general approach:
https://lore.kernel.org/lkml/20240903190650.53644-1-mathieu.desnoyers@efficios.com/T/#t
Feedback is welcome!
From version 2.35 on, GLIBC incorporates rseq and performs a system call to register the thread's ABI struct at the start of each thread. It would be nice to add support to TCMalloc for using GLIBC's rseq ABI. See https://www.gnu.org/software/libc/manual/html_mono/libc.html#Restartable-Sequences for the documentation on how they implemented it.
For the time being, we can enable TCMalloc's rseq when linked with GLIBC 2.35+ by disabling GLIBC's rseq registration with the environment variable 'GLIBC_TUNABLES=glibc.pthread.rseq=0'.