nim-lang / RFCs

A repository for your Nim proposals.
135 stars 26 forks source link

std/cputicks: high resolution low overhead strictly monotonic cpu counter #411

Open timotheecour opened 2 years ago

timotheecour commented 2 years ago

proposal

new module std/cputicks providing cpu instruction-level granularity, with much higher precision and lower overhead than either times.cpuTime or std/monotimes (and less module import depdendencies), in particular strict monotonic instead of monotonic counters

This is what should be used for profiling and micro benchmarks (as opposed to looping N times over some code, which can skew results in various ways, affect the cache, register allocation, etc);

note

example 1: benchmarking

example 2: profiling

I'm using this in https://github.com/nim-lang/Nim/pull/15827 in not-yet-pushed commits which adds full profiling capability with low overhead; it advantageously replaces nimproc and other platform specific profilers, it understands nim code and gives meaningful results wrt recursive function, exception handling, etc more details on this later

example 3: correctness for things like initRand, etc

allows fixing things like https://github.com/nim-lang/Nim/issues/17898 with a strict monotonic counter and low overhead (https://github.com/nim-lang/Nim/pull/18729 has more overhead and less resolution, in contrast); see https://github.com/nim-lang/Nim/pull/18149 which uses std/cputicks to fix initRand in #17898

availability

links

PR

implemented in https://github.com/nim-lang/Nim/pull/18743

konsumlamm commented 2 years ago

Is it really necessary to add a new module for this? This is one of the things where I actually agree that they should be in the standard library, but why do we have to make a new module for every other new function? For example, possible candidates would be system (unless we never want to add anything to it again), monotimes, times (or a submodule, like std/times/cputicks which can be reexported by std/times) or cpuinfo (it also currently has only one function, so we could broaden the scope a bit).

IMO, the standard library is already bloated with unnecessary (for a standard library, not in general) modules, multiple modules only export a single function and have so for years (but they can't be deprecated, because people use them). This is not meant as a criticism of this particular suggestion, but rather of the attitude of adding a new stdlib module for every function that seems occasionally useful.

juancarlospaco commented 2 years ago

I agree with the idea and I hope it can be added to StdLib, I am neutral about a new module or add it to existing one.

timotheecour commented 2 years ago

both monotimes and times carry huge dependencies with them as can be seen with --processing:filenames:

Hint: >>> monotimes: import: /Users/timothee/git_clone/nim/Nim_prs/lib/pure/times.nim [Processing]
Hint: >>>> times: import: /Users/timothee/git_clone/nim/Nim_prs/lib/pure/strutils.nim [Processing]
Hint: >>>>> strutils: import: /Users/timothee/git_clone/nim/Nim_prs/lib/pure/parseutils.nim [Processing]
Hint: >>>>> parseutils: include: /Users/timothee/git_clone/nim/Nim_prs/lib/system/inclrtl.nim [Processing]
Hint: >>>>> strutils: import: /Users/timothee/git_clone/nim/Nim_prs/lib/pure/math.nim [Processing]
Hint: >>>>>> math: import: /Users/timothee/git_clone/nim/Nim_prs/lib/pure/bitops.nim [Processing]
Hint: >>>>>>> bitops: import: /Users/timothee/git_clone/nim/Nim_prs/lib/core/macros.nim [Processing]
Hint: >>>>>>> macros: include: /Users/timothee/git_clone/nim/Nim_prs/lib/system/inclrtl.nim [Processing]
Hint: >>>>>> math: import: /Users/timothee/git_clone/nim/Nim_prs/lib/pure/fenv.nim [Processing]
Hint: >>>>> strutils: import: /Users/timothee/git_clone/nim/Nim_prs/lib/pure/algorithm.nim [Processing]
Hint: >>>>> strutils: import: /Users/timothee/git_clone/nim/Nim_prs/lib/std/enumutils.nim [Processing]
Hint: >>>>>> enumutils: import: /Users/timothee/git_clone/nim/Nim_prs/lib/pure/typetraits.nim [Processing]
Hint: >>>>> strutils: import: /Users/timothee/git_clone/nim/Nim_prs/lib/pure/unicode.nim [Processing]
Hint: >>>>> unicode: include: /Users/timothee/git_clone/nim/Nim_prs/lib/system/inclrtl.nim [Processing]
Hint: >>>>> unicode: include: /Users/timothee/git_clone/nim/Nim_prs/lib/pure/includes/unicode_ranges.nim [Processing]
Hint: >>>> strutils: include: /Users/timothee/git_clone/nim/Nim_prs/lib/system/inclrtl.nim [Processing]
Hint: >>>>> strutils: import: /Users/timothee/git_clone/nim/Nim_prs/lib/std/private/strimpl.nim [Processing]
Hint: >>>> times: import: /Users/timothee/git_clone/nim/Nim_prs/lib/pure/options.nim [Processing]
Hint: >>> times: include: /Users/timothee/git_clone/nim/Nim_prs/lib/system/inclrtl.nim [Processing]
Hint: >>>> times: import: /Users/timothee/git_clone/nim/Nim_prs/lib/posix/posix.nim [Processing]
Hint: >>>> posix: include: /Users/timothee/git_clone/nim/Nim_prs/lib/posix/posix_macos_amd64.nim [Processing]
Hint: >>>> posix: include: /Users/timothee/git_clone/nim/Nim_prs/lib/posix/posix_other_consts.nim [Processing]

whereas std/cputicks carries 0 depdendencies.

modules are a cheap abstraction, see benchmarks in https://github.com/nim-lang/Nim/pull/14614; we can always decide later on to import/re-export std/cputicks in std/monotimes if needed as an alternative way to access those APIs, but allowing to use std/cputicks without adding a dependency on all those modules makes sense IMO.

Varriount commented 2 years ago

both monotimes and times carry huge dependencies with them as can be seen with --processing:filenames

Ok, and why should this dictate how modules are organized? Assuming the reason is related to performance, isn't the whole point of the caching and incremental compilation mechanisms to handle large modules with many dependencies?

Dedicating an entire module to such a narrow degree of functionality unnecessarily complicates the standard library's documentation and architecture. This functionality should be incorporated into another module, such as times, os, or some new benchmarking module, (a la Python's timeit module).

I'm also unconvinced that this has any substantial benefit over existing timing methods. Elsewhere, there is the assumption that RDTSC is strictly monotonic (a trait other timing methods lack). Is this true? Even on multiprocessor systems? I'd prefer to see explicit assertions from multiple reliable sources on this.

timotheecour commented 2 years ago

I'm also unconvinced that this has any substantial benefit over existing timing methods.

if you fail to see the point of low overhead and high resolution for measuring performance, I'm not sure what will help. See also [1]. TSC are widely used for this very purpose. See for eg https://software.intel.com/content/www/us/en/develop/articles/best-timing-function-for-measuring-ipp-api-timing.html, https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf, http://btorpey.github.io/blog/2014/02/18/clock-sources-in-linux/, or any other resource on this topic.

In particular, the generic nim profiling instrumentation I implemented on top of https://github.com/nim-lang/Nim/pull/15827 relies on this functionality and would give much less accurate results with existing timing methods.

Elsewhere, there is the assumption that RDTSC is strictly monotonic (a trait other timing methods lack). Is this true? Even on multiprocessor systems? I'd prefer to see explicit assertions from multiple reliable sources on this.

See for example https://stackoverflow.com/questions/65975629/does-rdtscp-increment-monotonically-across-multi-cores, or https://stackoverflow.com/questions/10921210/cpu-tsc-fetch-operation-especially-in-multicore-multi-processor-environment

On newer CPUs (i7 Nehalem+ IIRC) the TSC is synchronzied across all cores and runs a constant rate. So for a single processor, or more than one processor on a single package or mainboard(!) you can rely on a synchronzied TSC.

[1] https://stackoverflow.com/questions/10921210/cpu-tsc-fetch-operation-especially-in-multicore-multi-processor-environment#comment14276739_10921210

You can't count on clock_gettime() for nanosecond precision, by the way; it's only precise to within about a quarter microsecond. I ran into this when I was trying to get super-precise timings and found that gettime() itself cost more than 250ns. stackoverflow.com/questions/7935518/… – Crashworks Jun 7 '12 at 23:05

Varriount commented 2 years ago
timotheecour commented 2 years ago

What is the difference in performance between RDTSC and QueryPerformanceCounter/clock_gettime(CLOCK_MONOTONIC_RAW)?

https://gist.github.com/timotheecour/e5d85be09b141b4bf9f877e1c8731025 nim r -d:case3 -d:danger main results on docker, linux, x86_64

("getCpuTicks()", 0.0006974549999999999, 3043570212101928416)
("cpuTime()", 0.0208491, 1205.134996188012)
("getMonoTime().ticks", 0.368238207, -1255464341632237948)
("getTicks2()", 0.358631642, -1255684171946224548)

RDTSC gives 0.00069 clock_gettime(CLOCK_MONOTONIC_RAW) gives 0.358631

=> 520X less overhead

this is on docker though, you'll likely get different results running on linux directly

on osx, with n = 10_000_000 to get more precise results:

("getCpuTicks()", 0.06743399999999999, 1607536410026022468)
("cpuTime()", 4.383573999999999, 22598207.40499241)
("getMonoTime().ticks", 0.300935, 3469576765264413078)
("getTicks2()", 0.1488739999999993, 16169834729749057)

=> 2.2X less overhead compared to a raw call to mach_absolute_time

Is this difference worth the fact that RDTSC is platform specific?

yes if you care about measure performance, profiling, etc. Any modern CPU will have some TSC available, if RDTSC if not available. getCputTicks can be extended in future work to wrap those on such platforms. See also https://stackoverflow.com/questions/7935518/is-clock-gettime-adequate-for-submicrosecond-timing or the links I gave earlier; TSC's are commonly used for benchmarking/profiling.

So on older CPUs, any program relying on RDTSC being strictly monotonic may break?

within same CPU will still be monotonic.

Is there an alternative instruction (RDTSCP?) that will work?

RDTSCP is newer and adds some overhead compared to RDTSC, depending on use cases may or may not be appropriate. A good API will offer both to accomodate use cases. For a profiling use case with nested calls, overhead is a key concern because the overhead affects timing in callers.

While these arguments support adding in a function that calls RDTSC, none of them support placing it in its own module.

std/cputicks can grow functionality as needed and be imported/re-exported if needed in other modules, whereas moving it in monotimes would carry too many dependencies as shown in https://github.com/nim-lang/RFCs/issues/411#issuecomment-904959558

arnetheduck commented 2 years ago

given the changing nature of hardware support and instructions, this is clearly something that could be implemented in a stand-alone library, with great benefit to the community: it could have a release cycle disentangled from the Nim compiler releases - new hardware support could be added quicker etc.

Varriount commented 2 years ago

Ok, let's start from the top.

Please note that these are not all criticisms against using RDTSC. Some of these are criticisms about how RDTSC is proposed to be used, or regarding the arguments in favor of it.

From comment 1

example 1: benchmarking

... This is what should be used for profiling and micro benchmarks (as opposed to looping N times over some code, which can skew results in various ways, affect the cache, register allocation, etc);

Micro-benchmarks are nearly (if not always) impossible to write correctly without repetition. In modern SMP systems there is simply too much "noise" (processor scaling, OS paging, background processes) to get a clean result in one execution. This is setting aside the fact that the usefulness of micro-benchmarks is often questionable.

example 2: profiling

...

This is the one situation where I think RDTSC is useful. Fine-grained profiling tends to involve lots of timing calls, and unlike benchmarks, there is a need for the program to run as fast as it normally would.

example 3: correctness for things like initRand, etc

...

availability

...

Aside from the previous point, note this rather explicit disclaimer at the top of the linked benchmark:

// NOTE: Not all cpu/platform/kernel combinations guarantee that this
// clock increments at a constant rate or is synchronized across all logical
// cpus in a system.
//
// If you need the above guarantees, please consider using a different
// API. There are efforts to provide an interface which provides a millisecond
// granularity and implemented as a memory read. A memory read is generally
// cheaper than the CycleClock for many architectures.
//
// Also, in some out of order CPU implementations, the CycleClock is not
// serializing. So if you're trying to count at cycles granularity, your
// data might be inaccurate due to out of order instruction execution.


links

...

From comment 4 and comment 8, respectively:

both monotimes and times carry huge dependencies with them

std/cputicks can grow functionality as needed and be imported/re-exported if needed in other modules, whereas moving it in monotimes would carry too many dependencies

Ok, and why are these dependencies a problem? And is this problem significant enough that it can't be solved through improving Nim's implementation? Keep in mind that whatever problem this is must be weighed against the costs of adding a new standard library module. A new standard library module increases the complexity of reading and learning the standard library's documentation, as well as maintaining the standard library.

From comment 6:

In particular, the generic nim profiling instrumentation I implemented on top of nim-lang/Nim#15827 relies on this functionality and would give much less accurate results with existing timing methods.

How much less accurate? Is the loss in accuracy substantial enough to have a noticeable, consequential impact?

From comment 8:

...

General Remarks

Given how the characteristics of high-frequency counter instructions can vary across architectures and operating systems, guaranteeing strict monotonicity for any sort of cpuTicks function would either limit where Nim could reliably run, or be inaccurate. This is further complicated by virtualization, SMP systems, and NUMA architectures. In my opinion, guaranteeing non-strict monotonicity on most architectures is the best that can be done.

That being said, high-frequency counter instructions have their use-cases. They just need to be used judiciously and with caution, where OS-provided functions cannot be reasonably used. They are not a general replacement for said OS-provided functions.

I also do not believe that this should be part of a single module. The functionality provided by this function is small, and std/times/std/monotimes already exist. If someone wants a high-performance timer function, and don't know what it will be called, where are they going to look first? And if the reason for placing this function in its own module is due to related modules having "too many dependencies", that problem needs to be addressed directly, rather than avoided or papered over.

[1] https://software.intel.com/content/dam/develop/external/us/en/documents-tps/325462-sdm-vol-1-2abcd-3abcd.pdf [1a] See [1], PDF page 1735

haxscramper commented 2 years ago

I too, fully agree that addition of new modules should be carefully considered. We already have tons of small modules that declare one-two procs, and while it is hard to argue that "modules are cheap abstraction" ... from implementation point of view, but I think it is easier for people to discover necessary functionality in already existing modules.

And some other, like typetraits (why not std/macros?). I'm pretty sure it is possible to come up with explanation that would allow to brush off all items on this list (because of X, Y and Z), but I don't think it is a good overall trend, as stdlib grows in all directions despite all attempts to make nim core "small".

These modules are added, never expanded (because they were initially created with extremely narrow scope in mind) and people have to either scan through full list of modules, or hope that someone points out necessary module (in forum post, chat discussion etc.).

Araq commented 2 years ago

The small modules are perfectly fine IMHO but we should not add what is already available as a Nimble package. Like benchmarking frameworks.

timotheecour commented 2 years ago

we should not add what is already available as a Nimble package

nimble search bench returns currently 8 packages: url: https://gitlab.com/define-private-public/stopwatch (git) url: https://github.com/ivankoster/nimbench.git (git) url: https://github.com/treeform/benchy (git) url: https://github.com/disruptek/criterion (git) url: https://github.com/jrfondren/getr-nim (git) url: https://github.com/xflywind/timeit (git) url: https://github.com/disruptek/golden (git) url: https://github.com/olliNiinivaara/StashTable (git)

of these, only 2 packages mention rdtsc:

Instead of reinventing the wheel in each benchmarking package, this PR enables having a dedicated module to implement this properly and provide the best possible cross-platform support (wrapping platform specific TSC's as needed for other platforms), that other benchmarking packages can reuse and count on.

Furthermore, getCpuCounters is critical for the profiling functionality I've added to https://github.com/nim-lang/Nim/pull/15827; no other approach would work for reasons I'm happy to explain if needed. Beyond benchmarking/profiling, it provides a reliable way on modern CPUs to have strict monotonic counters with minimum overhead (~10ns) which can advantageously replace alternative approach based on a {.threadvar.} (still has re-entrance issues), a lock, an atomic, or a global (not thread-safe), which can be used to solve issues (eg initRand PR) or improve performance of APIs (eg genOids) without the overkill of relying on std/times or overhead of atomics/threadvar/locks.

Araq commented 2 years ago

So provide a Nimble package as the foundation for these other Nimble packages and use "atlas" to be able to use it in Nim's core. I wrote Atlas for a reason.

timotheecour commented 2 years ago

I wrote Atlas for a reason.

It must be usable inside stdlib.

atlas is far from production ready, see all issues I ran into: https://github.com/nim-lang/Nim/labels/atlas; I plan to release some fixes but it's still far. The end goal would be that sh build_all.sh installs nim_csources_v1 => koch => atlas => dependencies for stdlib => ./koch boot => nim, so that stdlib itself (not just compiler/tools) can have external dependencies. Without this (stdlib can depend on 3rd party packages via atlas), we'll keep running into similar endless arguments. That's the end goal but we're nowhere near there and progress shouldn't stop in the meantime.

Furthermore, a pkg/cputicks package wouldn't help with VM code; see for example https://forum.nim-lang.org/t/5261#52440 and other threads which requests ability to benchmark VM code.

With this PR, getCputicks works on all backends including VM/nimvm and fallbacks for js; future work can add support as needed for other platforms and a global fallback using libc and additional related APIs.

Varriount commented 2 years ago

Except that:

[0] https://documentation-service.arm.com/static/611fa684674a052ae36c7c91 [1a] See [1], PDF page 2852 [2] https://documentation-service.arm.com/static/60e6f8573d73a34b640e0cee [2a] See [2]. PDF page 367

timotheecour commented 2 years ago

Claiming that cpuTicks is strictly monotonic (which the implementation PR still does!) is unwise, given that this cannot be claimed for other systems: As far as I know, the ECMA standard does not have an API for a high-performance strictly-monotonic timer.

good point, I've pushed a commit to clarify this

The ARMv8-A architecture[1] manual explicitly states that two reads to the PMCCNTR_EL0 [...]

good research, I've rephrased things a bit/fixed a reference and added it as a comment, and added some notes regarding future work support for this platform, see commit

Incidentally, this feedback loop justifies the point of adding such functionality to stdlib; the review process can only improve things and lead to better design/APIs/choices, esp for handling cross-platform details; a separate nimble package would be unlikely to get the same attention and improve over time into a comprehensive cross-platform optimized solution.

No concrete reason has been given for why placing this functionality in std/monotimes cannot be reasonably done

the main thing that stops me from placing this in std/monotimes is dependency on std/times, which carries itself tons of dependencies; if that dependency were removed, std/cputicks would be fine to move to std/monotimes; this is not simple however, as this itself requires non-trivial tasks in order not to end in a bad design:

A separate std/cputicks sidesteps all these issues and doesn't prevent an import/re-export of std/cputicks from std/monotimes anyways, which completely addresses the point regarding discoverability.

arnetheduck commented 2 years ago

a separate nimble package would be unlikely to get the same attention and improve over time into a comprehensive cross-platform optimized solution.

I mean, this is quite a ridiculous statement given the large amount of incredible libraries that exist out there - both nim, and other languages. The fact that unmaintained libraries exist out there does not mean that maintained libraries do exist - that's not a logical leap you can take, sorry. Arguably, many libraries out there are written because the stdlib is unmaintained in certain areas.

You could argue that the attempt to add the code to stdlib without doing proper research prompted someone to look at the problem, but that doesn't sound like a sustainable development model - disproving potentially dubious code is quite a drain - this is one of the reasons why it makes sense to develop things outside a common repository first, so that people interested in the problem can refine the solution - when/if it reaches a certain quality level, it can be considered for inclusion. Imagine how many fewer "let's break common stuff because I don't like how the previous author wrote it" discussions we would have then.

c-blake commented 2 years ago

Only marginally touched upon by @Varriount & TC, but important here is that it is important to not confuse the "units of measurement" or maximum resolution with what they measure and full understanding of this really discounts the utility of "clock cycles".

There is both the duration of execution (via serialiazaton) and the "error" (via 12-24 cycle pipeline depths) of flushing pipelines. This creates a situation where you still have quite a bit of measurement error - on the order of at least 3-6 ns on ~5GHz CPUs. You can (maybe) not serialize, too, but then there is a lot of error on the "what" you are measuring (two time points in the middle of a 100s of instruction re-order buffer). Either way, 1 ns "reporting resolution" is really not bad compared to measurement error.

Also, I believe not yet mentioned, the "CPU cycle counter" on Intel only "ticks" at the fixed frequency, even when all the cores are down clocked to 800 MHz or whatever. So, essentially the "cycle counter" has on Intel become a "real time" query since the Nehalem era of the late noughties. See this stack overflow thread which mentions checking out the turbostat source code in the Linux kernel. Given that, you might as well use regular time units like "nanoseconds". Tenths of a nanosecond might induce less rounding error, but see above, 1 ns is not so bad as-is.

Ok - so real time nanoseconds is not bad in terms of either measurement error or semantics, but what about query performance? Don't know about MacOS/Windows, but Linux clock_gettime code and calibration data lives in a read-only virtual memory page that the kernel shares with every user process. There is no "real system call" involved. You can actually cast a function pointer to a magic same-address to call that VDSO code directly without libc. I just did a test running 1e8 clock_gettime(CLOCK_MONOTONIC) in ~1.3 sec. So, throughput of 1 per 13 ns or about 60 cycles on that machine..probably at most 2x the measurement error. Latency is harder to assess, but I expect the latency of a serializing cycle counter read to be much more close to the latency of such calls. Maybe 1.25x off. These 1.25..2x improvements are not huge.

Looping back to measurement error, some might say that it is unreliable/unrealistic to expect the "sub-latency inner time-points" of two distinct calls to "line up" and assign as measurement error in the delta the entire latency of these calls which is >>1ns.

In conclusion, as much as I used to be a fan of using cycle counts myself, the modern CPU world has evolved to the point where there is little to no meaningful advantage on Intel - either "cycle semantic" or "query performance". Unless you are measuring direct time deltas much greater than 100 cycles you are probably just kidding yourself as to meaningfulness of said time deltas. If you are not kidding yourself, 20 ns is still not so big compared to 3-10 ns errors (regardless of "units" being cycles or nanoseconds). I'm not exactly happy about this, but it is how it is. If you are riding the Intel rocket, one might paraphrase the immortal black comedy Dr. Strangelove, "Learn to Stop Worrying About Cycles and Love the Nanosecond". :-( :-)