Options for implementing a Profiling API

fatcerberus commented 7 years ago

See https://github.com/svaarala/duktape/pull/1303#issuecomment-273466783.

Duktape already has an interrupt counter mechanism. This issue to discuss what would be needed to extend that to support comprehensive Ecmascript profiling.

svaarala commented 7 years ago

There are two interfaces where profiling could be accessible:

Debugger protocol
C API

The debugger protocol would be the easiest because the interrupt mechanism and the debugger are already integrated. For example, simply providing the ability to control the Status notification interval (say, provide a status for every N bytecode opcodes executed) would already profile a barebones profiling model.

If profiling performance matters a lot, Duktape would probably need to maintain the profiling counters internally - e.g. associate a counter for every function/PC combination and increment the profile counters every N opcodes. This would be much more complicated internally, and API-wise (especially versioning). Versioning concerns might be ignored in practice though, because making the profiling data (exported through the debugger protocol) version dependent might be an OK compromise.

The C API approach is also viable and extending the interrupt to a user callback is not difficult - for example, user code could register a callback and provide an interrupt frequency (1 call per N opcodes) and then implement profiling based on that. The user callback would need C APIs to query the current function/file and pc/line for the profiling data gathering to work.

fatcerberus commented 7 years ago

The nice thing about sample-based profiling is that raw performance while profiling isn't too important because as long as the same code paths are followed and the samples are taken at discrete intervals (i.e. every X bytecode instructions) then that is enough to show where your code is spending the most time, regardless of how much actual time it takes. That said, the situation is a bit different for an interpreted engine compared to native code, since the sampling is part of bytecode dispatch. Thus it's important that the profiler not slow down bytecode dispatch too much relative to the rest of the program lest it skew the results.

svaarala commented 7 years ago

If one samples every Nth opcode, that has the inherent limitation that it doesn't take into account the performance difference between opcodes. For example, assume N is 2, and one is executing the sequence:

OP1
OP2
OP3
OP1
...

over and over again. With N=2 each opcode would be sampled an equal number of times (+/- 1) so one might conclude each opcode is using the same amount of time. This would be the case even if OP1 took 100x longer to execute than OP2 and OP3.

So there's some inaccuracy built in, but the upside is that wall clock doesn't matter at all.

To overcome this limitation, the profiling code could measure the real time difference between two interrupts, and attribute the time taken to either the previous or current position (consistently one or the other). This also means the sampling interval doesn't need to be fixed because the "weight" multiplier (= time interval) takes it into account.

The downside of this approach that it will include the overhead of the interrupt mechanism, and the tighter the sample frequency, the more impact the mechanism will have on the measurement validity. This is probably what you mean by skew?

If a high frequency timer is available, one can discount the profiling overhead time to some extent by taking a timer reading immediately when the interrupt begins, and another when it ends, and discounting that from the measurement. It's not 100% but may be useful if a nanosecond time source exists.

The best approach uses some kind of virtualized time and somehow discounts the profiling mechanism itself. Using valgrind as a base and implementing a custom valgrind tool would probably allow that, but at a relatively large development cost. Further, it wouldn't work on all platforms.

So some form of compromise approach which remains widely portable is the best baseline approach.

fatcerberus commented 7 years ago

The skew I was referring to is specifically the difference in bytecode dispatch performance between checked vs. non-checked execution. It's kind of what you alluded to with the differences in execution time between different opcodes, but from the other direction: If you have an instruction that, e.g. performs a regexp match, that match operation will take the same amount of time regardless of whether the debugger/profiler is running or not, even though bytecode dispatch will be slower. So if you're trying to correlate samples with wall clock time, it's important that dispatch not be slowed down excessively.

svaarala commented 7 years ago

Right - that comes into play whenever samples are compared to wall clock time.

That effect could be reduced if the interrupt mechanism is used without the debugger (which is quite feasible) and a C API is used. One could still use the C API if the debugger protocol is only used to fetch results or send Status messages without the execution state being "checked" (= breakpoints etc).

Btw, this is one use case which illustrates what I meant earlier on about being able to attach the debugger without actually having a debugger session active (checked execution, breakpoints, etc). One would still be able to communicate with the debug client e.g. for profiling in this case.

fatcerberus commented 7 years ago

By the way, one issue with doing profiling over the debug protocol using, e.g. status notifies, is that you don't get a snapshot of the callstack (GetCallStack is useless in this case because the result is delayed). Without callstack information, sampling only tells you where the PC is for the topmost call, without telling you what other calls led up to it (which is sometimes more useful).

svaarala commented 7 years ago

That's true. Maybe profiling needs its own "ProfilingSnapshot" or some extended Status notify? (Assuming the debugger approach is taken of course.)

svaarala commented 7 years ago

Some details related to that callstack snapshot:

Each call stack level needs to identify, minimally, a source file+line, or a source file+PC.
PC is cheaper than line because the pc2line conversion needs a lookup; that's currently a bit-packed structure with a skip list. But maybe during profiling it could be an expanded array so that it'd be just an array lookup.
Instead of source file, the snapshot could identify the function directly - but because the same lexical function has potentially multiple function instances this gets a bit confusing IMO.
If PC is used, the profiling debug client will ultimately need to be able to do a pc2line lookup for the final results. However, that translation needs the identity of the related function to access the correct pc2line data.

fatcerberus commented 7 years ago

On that last point, my instinct is that, for the purposes of profiling, identifying the lexical function is enough. In most cases multiple instances should have similar performance characteristics (closures do complicate things though, not denying that).

svaarala commented 7 years ago

Agree with that - the lexical view is often what users want anyway. More detailed information might be useful in some sort of "performance dump" that you could dig into if you wanted to.

svaarala commented 7 years ago

To summarize some of the high level issues/themes so far:

Debugger vs. C API?
Interrupt executor mechanism improvements necessary for profiling?
Execution mode improvements necessary for profiling, e.g. normal, checked, profiling?
Internal performance counter state vs. just delivering stateless snapshots and letting the debug client / caller handle the rest?
File+line vs. file+PC; with PC needs a separate pc2line conversion which is a bit difficult after-the-fact because the lookup is function instance (not source file) related.
Config option to enable fast pc2line handling, i.e. a direct pc->line array instead of bit packing, so that pc2line conversions will be a lot faster to support profiling better.

svaarala commented 7 years ago

Documentation-wise it'd maybe be useful to document the execution states separately from the debugger documentation. It's a high level design feature which is related to things besides debugger, e.g. profiling here, but also maybe script timeouts, etc.

fatcerberus commented 7 years ago

I guess it does make sense to do profiling over the debug protocol. In a way it's a form of debugging too, just that you're debugging performance rather than logic.

svaarala commented 7 years ago

I'd prefer that as the first profiling step too, and then ensuring that (1) the debugger overhead is reduced if that's an issue (for example, maybe support a non-checked but instrumented execution mode) and (2) spend a bit of time to reduce the barrier of starting to use the debugger.

For example, it'd be great to add profiling support to the debugger default web UI because many users start with that. (It'd be even greater to spend a little time in making the debugger web UI more than the placeholder it currently is :-)

svaarala / duktape

Options for implementing a Profiling API #1307