htejun opened this issue 2 days ago
I have looked into this a bit, and this is what I am thinking.
Considering the discussion on the BPF mailing list (https://lore.kernel.org/bpf/CAEf4BzaBNNCYaf9a4oHsB2AzYyc6JCWXpHx6jk22Btv=UAgX4A@mail.gmail.com/), I think we can assume the following two new APIs (or something similar):

- `bpf_get_hw_counter()`, which eventually calls `rdtsc` on x86
- `bpf_hw_counter_to_ns()`, which converts a timestamp counter to nanoseconds (or vice versa)

Based on these two new APIs, we can provide two common utility functions in `common.bpf.h` abstracting the details of `bpf_get_hw_counter()` and `bpf_ktime_get_ns()`:
- `scx_bpf_get_time()`: returns the current time in a certain unit (timestamp counter or ns). It chooses `bpf_get_hw_counter()` over `bpf_ktime_get_ns()` when available, for performance.
- `scx_bpf_time_to_ns()`: converts the time returned from `scx_bpf_get_time()` to nanoseconds.

In my opinion, it would be quite difficult to handle the clock drift of `rdtsc` if its bound is unknown. So rather than handling the clock drift magically, it would be more straightforward to let the scx scheduler handle it. I think in the scx schedulers' use case, the only thing to consider is checking whether a new timestamp is smaller than an old one.
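To make the idea concrete, here is a minimal sketch of what the two wrappers in `common.bpf.h` could look like, assuming the proposed kfuncs land under these names and can be declared as weak ksyms, and assuming the usual `vmlinux.h` + `bpf_helpers.h` includes (none of this is a merged API, just an illustration):

```c
/* Sketch only: these kfunc names come from the mailing-list proposal and may change. */
extern u64 bpf_get_hw_counter(void) __ksym __weak;
extern u64 bpf_hw_counter_to_ns(u64 counter) __ksym __weak;

/* Current time in an opaque unit: hardware counter if available, otherwise ns. */
static __always_inline u64 scx_bpf_get_time(void)
{
	if (bpf_ksym_exists(bpf_get_hw_counter))
		return bpf_get_hw_counter();
	return bpf_ktime_get_ns();
}

/* Convert a value returned by scx_bpf_get_time() to nanoseconds. */
static __always_inline u64 scx_bpf_time_to_ns(u64 t)
{
	if (bpf_ksym_exists(bpf_hw_counter_to_ns))
		return bpf_hw_counter_to_ns(t);
	return t;	/* fallback path already returned ns */
}
```

Callers would then only compare or convert values obtained from `scx_bpf_get_time()` and never mix them with raw `bpf_ktime_get_ns()` readings.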
What do you think, @htejun ?
One challenge is that `rdtsc` is reported to not scale well on large Intel Sapphire Rapids machines, so it looks like we'd likely need caching whether we use `rdtsc` or `rdtscp`.
For reference, Chris Mason implemented a simple benchmark to test TSC performance: https://github.com/masoncl/tscbench. The main problem being observed is `rdtscp` and `rdtsc` (depending on the CPU type) getting progressively slower on larger setups as the CPUs get saturated.
Some thoughts on the problem:
1) Why not make the API monotonic by default if there is no apparent good way of handling `rdtsc` drift apart from a `before > after` check? We could leave it to the scheduler if there were a design choice to be made here, but that doesn't seem to be the case. Making the call monotonic seems more intuitive if the alternative is to always check manually (see the sketch after this list).
2) Is there a good reason to return both timestamp counters and ns? Possibly returning the raw counter forces the scheduler to consider the source of the timestamp when parsing it, which doesn't seem desirable. I think `scx_bpf_get_time` and `scx_bpf_time_to_ns` could get rolled into a single `scx_bpf_get_time_ns` that only returns ns.
3) If `rdtsc` is problematic for some machines but works fine for others, we still need the option to return a cached value. We can add a flag in the call to explicitly specify the source of the value, but would it be worth it? Are there scenarios where reading `rdtsc` is superior to returning a cached value, or can we always use the latter?
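To illustrate 1), a monotonic-by-default read could be approximated on the BPF side with a per-CPU "last value" clamp. This is only a sketch (the map and helper names are made up), and it only prevents time from going backwards on the same CPU, not system-wide:

```c
/* Per-CPU record of the last timestamp handed out (illustrative only). */
struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, u32);
	__type(value, u64);
} last_ts SEC(".maps");

/* Clamp @now so repeated reads on this CPU never go backwards. */
static __always_inline u64 clamp_monotonic_this_cpu(u64 now)
{
	u32 zero = 0;
	u64 *last = bpf_map_lookup_elem(&last_ts, &zero);

	if (!last)
		return now;
	if (now < *last)
		now = *last;	/* drift detected: reuse the previous value */
	else
		*last = now;
	return now;
}
```

If the kernel made the call monotonic itself, this kind of boilerplate would disappear from every scheduler, which is the argument behind 1).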
Thoughts @multics69 @htejun ?
Thank you for the feedback @htejun and @etsal !
I agree with @htejun's point. Making a timestamp monotonic system-wide is a hard problem. I think it would be difficult (if not impossible) to do without introducing a central cache hotspot.
Exposing only absolute time would be less confusing.
I will further develop ideas about the caching, i.e., whether it is possible to cache a timestamp efficiently and scalably without creating cache hotspots.
BTW, @htejun -- do you have some TSC benchmark numbers on Sapphire Rapids? It would be great to understand how bad `rdtsc` is, to know the budget we can use for caching.
I only heard the results second-hand. IIRC, on Sapphire Rapids, `rdtsc` wasn't much better than `rdtscp`.
SCX schedulers tend to use `bpf_ktime_get_ns()` a lot. On x86, this is eventually serviced by `rdtsc_ordered()`, which is the `rdtscp` instruction. The instruction is known to be expensive and has scalability issues on large machines when the CPUs are saturated (the cost of the instruction increases as the machine gets saturated). In most cases, we don't really care about the nanosecond accuracy that we're paying for. There are a couple of approaches to addressing this:
- Cache the `ktime_get_ns()` result and provide a kfunc to access the cached time. The kernel should have reasonable cache invalidation points (e.g. at the start of dispatch, when the rq lock is released during dispatch, and so on).
- Provide raw access to the hardware counter (`rdtsc` on x86) along with helpers to compare and calculate the delta between two timestamps. `rdtsc` is cheaper and doesn't have the scalability issues that `rdtscp` has, but it's unclear how this would map to other archs.
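A rough kernel-side sketch of the first (caching) approach, just to show its shape; the per-CPU variable, function names, and invalidation hook are made up for illustration and are not actual sched_ext code:

```c
#include <linux/percpu.h>
#include <linux/timekeeping.h>
#include <linux/btf.h>

/* Per-CPU cached timestamp; 0 means "stale, refresh on next read". */
static DEFINE_PER_CPU(u64, scx_cached_now);

/* Hypothetical kfunc giving BPF schedulers the cached time. */
__bpf_kfunc u64 scx_bpf_cached_ktime_get_ns(void)
{
	u64 *cached = this_cpu_ptr(&scx_cached_now);

	if (!*cached)
		*cached = ktime_get_ns();
	return *cached;
}

/* Called at invalidation points, e.g. at the start of dispatch or when the rq lock drops. */
static void scx_invalidate_cached_now(void)
{
	this_cpu_write(scx_cached_now, 0);
}
```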
Considerations:

- Plain `rdtsc` reads aren't ordered with respect to surrounding instructions; `rdtscp` guarantees this, but that's why it's expensive.
- Timestamps going backwards (`after - before` underflowing) can cause some headaches. This can probably be alleviated reasonably with a good set of helpers.
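For the last point, the helpers can stay very small. A sketch of what such comparison/delta helpers might look like (names are made up; `u64`/`s64`/`bool` come from `vmlinux.h` on the BPF side or `linux/types.h` in the kernel), using signed subtraction so an `after` that landed slightly behind `before` degrades gracefully instead of underflowing:

```c
/* Signed delta between two raw counter values; negative means "went backwards". */
static inline s64 scx_ts_delta(u64 after, u64 before)
{
	return (s64)(after - before);
}

/* True if @a is later than @b, tolerating small backwards drift. */
static inline bool scx_ts_after(u64 a, u64 b)
{
	return scx_ts_delta(a, b) > 0;
}

/* Elapsed counter ticks, clamped to 0 when the clock appears to go backwards. */
static inline u64 scx_ts_elapsed(u64 after, u64 before)
{
	s64 delta = scx_ts_delta(after, before);

	return delta > 0 ? (u64)delta : 0;
}
```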