svaarala / duktape

Duktape - embeddable Javascript engine with a focus on portability and compact footprint
MIT License
5.95k stars 515 forks source link

Improve opcode dispatch further by using a local curr_pc in executor #294

Closed svaarala closed 9 years ago

svaarala commented 9 years ago

With #265 bytecode dispatch uses a cached thr->curr_pc pointer which is copied back into the topmost activation act->curr_pc whenever activation states may change or an up-to-date activation state is needed (e.g. for line numbers in the debugger).

This can probably be improved a bit further:

This is faster because thr is sometimes spilled but the stack frame pointer is not.

Preliminary testing suggests this is roughly 5% faster than #265 (to be verified with more careful testing).

Changes:

svaarala commented 9 years ago

After #265 there seems to be a spurious compiler warning:

duk_js_call.c: In function ‘duk_handle_call’:
duk_js_call.c:835:15: warning: variable ‘entry_thr_curr_pc’ might be clobbered by ‘longjmp’ or ‘vfork’ [-Wclobbered]

See if that can be fixed with this change too.

svaarala commented 9 years ago

Quick test using a dispatch heavy test (LDCONST micro benchmark):

# Local curr_pc
$ sudo nice -20 python util/time_multi.py --count 10 --sleep 10 --mode all --verbose ./duk.O2.localpc tests/perf/test-const-load.js 
Running: 5.430000 5.410000 5.780000 5.370000 5.480000 5.560000 5.450000 5.380000 5.390000 5.370000
min=5.37, max=5.78, avg=5.46, count=10: [5.43, 5.41, 5.78, 5.37, 5.48, 5.56, 5.45, 5.38, 5.39, 5.37]

# Master, including thr->curr_pc dispatch and local consts base pointer
$ sudo nice -20 python util/time_multi.py --count 10 --sleep 10 --mode all --verbose ./duk.O2.master tests/perf/test-const-load.js 
Running: 6.430000 6.450000 6.440000 6.440000 6.420000 6.450000 6.410000 6.420000 7.190000 7.220000
min=6.41, max=7.22, avg=6.59, count=10: [6.43, 6.45, 6.44, 6.44, 6.42, 6.45, 6.41, 6.42, 7.19, 7.22]

Practical difference will be less, depending on how dispatch heavy the executed bytecode is.

fatcerberus commented 9 years ago

By "dispatch heavy" I assume you mean, e.g. tight loops?

svaarala commented 9 years ago

Anything that does a lot of opcode dispatch compared to other execution - if a tight loop involves a regexp match for instance, the regexp performance will of course dominate.

svaarala commented 9 years ago

I'll polish and merge this branch tomorrow. Performance work for 1.3 will then be finished, yay :-)

svaarala commented 9 years ago

C declarations are weird:

cdecl> explain int * volatile * volatile curr_pc
declare curr_pc as volatile pointer to volatile pointer to int
fatcerberus commented 9 years ago

I guess that's what happens when you're about half a step away from assembly. ;-)

svaarala commented 9 years ago

Quick performance test between master (thr->curr_pc and local curr_pc):

test-array-read.js            : duk-O2  4.00 duk-O2-master  4.31
test-array-write.js           : duk-O2  4.39 duk-O2-master  4.75
test-bitwise-ops.js           : duk-O2  0.87 duk-O2-master  0.94
test-buffer-nodejs-read.js    : duk-O2  4.41 duk-O2-master  4.84
test-buffer-nodejs-write.js   : duk-O2  5.56 duk-O2-master  6.15
test-buffer-object-read.js    : duk-O2  4.62 duk-O2-master  4.75
test-buffer-object-write.js   : duk-O2  5.78 duk-O2-master  5.94
test-buffer-plain-read.js     : duk-O2  3.97 duk-O2-master  4.50
test-buffer-plain-write.js    : duk-O2  3.64 duk-O2-master  4.17
test-call-basic.js            : duk-O2 15.08 duk-O2-master 16.04
test-compile-mandel-nofrac.js : duk-O2 13.90 duk-O2-master 13.59
test-compile-mandel.js        : duk-O2 17.27 duk-O2-master 17.01
test-compile-short.js         : duk-O2 10.42 duk-O2-master 10.84
test-const-load.js            : duk-O2  6.06 duk-O2-master  7.40
test-empty-loop.js            : duk-O2  2.18 duk-O2-master  2.33
test-fib.js                   : duk-O2  8.52 duk-O2-master  9.70
test-global-lookup.js         : duk-O2 11.66 duk-O2-master 10.42
test-hello-world.js           : duk-O2  0.00 duk-O2-master  0.00
test-hex-decode.js            : duk-O2  5.32 duk-O2-master  5.33
test-hex-encode.js            : duk-O2 26.94 duk-O2-master 27.30
test-json-integer-parse.js    : duk-O2  0.40 duk-O2-master  0.40
test-json-number-parse.js     : duk-O2  5.07 duk-O2-master  5.15
test-json-serialize.js        : duk-O2  2.97 duk-O2-master  3.00
test-json-string-bench.js     : duk-O2  5.59 duk-O2-master  5.50
test-json-string-parse.js     : duk-O2  0.59 duk-O2-master  0.59
test-json-string-stringify.js : duk-O2  1.27 duk-O2-master  1.27
test-prop-read.js             : duk-O2  8.01 duk-O2-master  7.96
test-prop-write.js            : duk-O2  7.18 duk-O2-master  7.17
test-reg-load.js              : duk-O2  2.76 duk-O2-master  3.12
test-reg-readwrite-object.js  : duk-O2  3.73 duk-O2-master  3.82
test-reg-readwrite-plain.js   : duk-O2  2.51 duk-O2-master  2.84
test-regexp-string-parse.js   : duk-O2 10.55 duk-O2-master 10.99
test-string-array-concat.js   : duk-O2  6.41 duk-O2-master  7.12
test-string-compare.js        : duk-O2  4.16 duk-O2-master  4.12
test-string-decodeuri.js      : duk-O2  3.67 duk-O2-master  3.63
test-string-encodeuri.js      : duk-O2  4.30 duk-O2-master  4.53
test-string-intern-match.js   : duk-O2  2.35 duk-O2-master  2.37
test-string-intern-miss.js    : duk-O2  2.54 duk-O2-master  2.56
test-string-plain-concat.js   : duk-O2  4.11 duk-O2-master  4.20
test-string-uppercase.js      : duk-O2  2.60 duk-O2-master  2.73
svaarala commented 9 years ago

This is now in master and is the final performance change for 1.3. There's potential for very weird fallout if something doesn't work right, so keep your eyes peeled :-)

fatcerberus commented 9 years ago

I'll pull the latest master and try it out with minisphere. If there are any issues I'll let you know!

fatcerberus commented 9 years ago

Just got around to looking at the recent changes. I must say this was quite clever:

duk_instr_t **ptr_curr_pc;

The duk_hthread holds a backpointer to the executor's local PC to avoid an extra level of syncing (which is what I thought this change was going to entail).

Only one thing worries me about this setup, and that's how it will scale to recursive executor calls (as would happen in the case of, e.g. JS -> C -> JS.

fatcerberus commented 9 years ago

Well, no weird issues showed up in my (admittedly not very thorough) testing. So it looks like this bit of black magic worked. ;-)

svaarala commented 9 years ago

Each time the bytecode executor is called recursively the call handling will store and restore ptr_curr_pc so it should not be an issue. That's not to say I'm not a bit uncomfortable with "syncing back", but opcode dispatch is so central to performance that it seems justified.

fatcerberus commented 9 years ago

I may have found a regression with this change, I have to do a bit more testing to see if I can reproduce it.

svaarala commented 9 years ago

What was the outward effect?

fatcerberus commented 9 years ago

The one thing I did notice was, for my pause-on-error change, when stopped at an error GetCallStack would sometimes report the wrong line number and PC (the beginning of a function instead of the correct place).

svaarala commented 9 years ago

Things like that will be incorrect when a "sync" is missing (copying the local curr_pc to act->pc). I think I'll find this one simply by code reading.

But was this the regression you were referring to or was it something else?

svaarala commented 9 years ago

I think there's at least a sync missing in the case where the bytecode executor triggers a side effect like a getter call. It won't cause any harm other than the act->pc being incorrect so tracebacks etc will look wrong.

I'll confirm and fix that shortly.

fatcerberus commented 9 years ago

The specific scenario where I noticed this was where a Duktape/C function throws an error and the debugger pauses before longjmp, the reported PC for the ecma function directly below is incorrect

svaarala commented 9 years ago

I can replicate this behavior by using a breakpoint in e.g. a getter function and resuming until the breakpoint is hit. It can't be reproduced by stepping because that will continuously sync the PC. That's why I missed this in my basic debugger test before merge.

Fix and bug testcase coming shortly. Thanks!

fatcerberus commented 9 years ago

What's weird is that the "enter debugger now" helper I added for #286 already includes a sync before the interrupt, so I'm not actually sure why this happens.

svaarala commented 9 years ago

The missing sync was for a non-ecma-to-ecma call, so that the PC of the calling activation would be outdated. The sync you added is for the current activation?

fatcerberus commented 9 years ago

Ah, yes, that makes sense. :)

svaarala commented 9 years ago

Ok, fix (and bug test) is in master, here's a snapshot: http://duktape.org/snapshots/duktape-1.2.99-20150826172312-v1.2.0-401-g8b5bcca.tar.xz

fatcerberus commented 9 years ago

I have to go to work right now, but I'll test out the snapshot tonight and let you know.

svaarala commented 9 years ago

Sure, no problem - I think the fix should work for at least that GetCallStack issue, but it's of course best if you can verify it.

fatcerberus commented 9 years ago

The snapshot fixed the issue. If you're curious, the specific issue I discovered turned out to have nothing to do with ecmascript-to-native calls, the bug was in new handling:

// line 1
(function() {
    print("The 8:12 pig was here!");
    function f() { print(Duktape.act(-3).lineNumber); }
    new f();  // prints 3, should print 5
})();

But whatever you did appears to have fixed this case too.

svaarala commented 9 years ago

It was actually related: the fix was in handle_call() and new uses handle_call() even when calling an Ecmascript constructor.

fatcerberus commented 9 years ago

Just triggered an assert that might be related to this (without assertions, this ends in a segfault):

PANIC 54: assertion failed: curr_pc < DUK_HCOMPILEDFUNCTION_GET_CODE_END(thr->heap, fun) (duk_js_executor.c:2266) (calling abort)
fatcerberus commented 9 years ago

The value of curr_pc at the time according to MSVC: curr_pc = 0x000000dfab127cb8 {4261281277}

This value is consistent from run to run (which is odd, since it's a pointer).

svaarala commented 9 years ago

Any idea what's going on when this happens?

fatcerberus commented 9 years ago

Well, here's my native callstack for starters:

    engine64.exe!issue_debug_notification(const wchar_t * const message) Line 125   C++
    engine64.exe!__acrt_report_runtime_error(const wchar_t * message) Line 142  C++
    engine64.exe!abort() Line 61    C++
    engine64.exe!duk_default_panic_handler(int code, const char * msg) Line 127 C
>   engine64.exe!duk_js_execute_bytecode(duk_hthread * exec_thr) Line 2266  C
    engine64.exe!duk_handle_call(duk_hthread * thr, int num_stack_args, unsigned int call_flags) Line 1496  C
    engine64.exe!duk_call(duk_hthread * ctx, int nargs) Line 61 C
    engine64.exe!run_script(script * script, bool allow_reentry) Line 189   C
    engine64.exe!render_map() Line 1284 C
    engine64.exe!js_RenderMap(duk_hthread * ctx) Line 2856  C
    engine64.exe!duk_handle_call(duk_hthread * thr, int num_stack_args, unsigned int call_flags) Line 1383  C
    engine64.exe!duk_js_execute_bytecode(duk_hthread * exec_thr) Line 3408  C
    engine64.exe!duk_handle_call(duk_hthread * thr, int num_stack_args, unsigned int call_flags) Line 1496  C
    engine64.exe!duk_call_method(duk_hthread * ctx, int nargs) Line 84  C
    engine64.exe!duk_bi_function_prototype_apply(duk_hthread * ctx) Line 206    C
    engine64.exe!duk_handle_call(duk_hthread * thr, int num_stack_args, unsigned int call_flags) Line 1383  C
    engine64.exe!duk_js_execute_bytecode(duk_hthread * exec_thr) Line 3408  C
    engine64.exe!duk_handle_call(duk_hthread * thr, int num_stack_args, unsigned int call_flags) Line 1496  C
    engine64.exe!duk_call(duk_hthread * ctx, int nargs) Line 61 C
    engine64.exe!run_script(script * script, bool allow_reentry) Line 189   C
    engine64.exe!update_map_engine(bool is_main_loop) Line 1397 C
    engine64.exe!js_MapEngine(duk_hthread * ctx) Line 1546  C
    engine64.exe!duk_handle_call(duk_hthread * thr, int num_stack_args, unsigned int call_flags) Line 1383  C
    engine64.exe!duk_js_execute_bytecode(duk_hthread * exec_thr) Line 3408  C
    engine64.exe!duk_handle_call(duk_hthread * thr, int num_stack_args, unsigned int call_flags) Line 1496  C
    engine64.exe!duk_pcall(duk_hthread * ctx, int nargs) Line 138   C
    engine64.exe!main(int argc, char * * argv) Line 337 C
    [External Code] 
fatcerberus commented 9 years ago

Hm, I wonder if I could use Duktape's JS debugger to narrow things down? There are a few levels of native calls in there though that might make it more difficult, though... and that's assuming single-stepping doesn't mask the bug.

svaarala commented 9 years ago

The most likely reason is that some PC sync operation uses an out-of-date thr->ptr_curr_pc somewhere. Just to narrow things down, do you use coroutines?

svaarala commented 9 years ago

I think I'll make a small change where thr->ptr_curr_pc is set to NULL whenever any activation states change: otherwise a PC sync operation would copy the previous curr_pc value into another activation.

fatcerberus commented 9 years ago

No coroutines, no. But some weird corruption is definitely going on here.

Specifically, this comes up when starting the Spectacles battle engine (that being the game I wrote minisphere for). The first battle I tested lead to the segfault and failed assert mentioned above. I tried a different battle, and I'm SO glad I implemented pause-on-throw now... Duktape threw a nonsensical "invalid index" error (valid script filename but line number of 0), allowing me to view the Ecmascript callstack (which is quite deep!) and local variables.

With that information, I found the function responsible for the segfault:

// .render() method
// Renders the BattleActor in its current state.
BattleActor.prototype.render = function()
{
    if (!this.isVisible && this.damages.length == 0 && this.healings.length == 0) {
        return;
    }
    this.sprite.blit(this.x, this.y, this.opacity * 255);
    for (var i = 0; i < this.damages.length; ++i) {
        var text = this.damages[i].text;
        var x = this.x + 16 - this.messageFont.getStringWidth(text) / 2;
        for (var i2 = 0; i2 < text.length; ++i2) {
            var yName = 'y' + i2.toString();
            var y = this.y + this.damages[i][yName];
            var color = this.damages[i].color !== null ? this.damages[i].color
                : CreateColor(255, 255, 255, 255);
            DrawTextEx(this.messageFont, x, y, text[i2], color, 1);
            x += this.messageFont.getStringWidth(text[i2]);
        }
    }
    for (var i = 0; i < this.healings.length; ++i) {
        var y = this.y + this.healings[i].y;
        var color = this.healings[i].color !== null ? this.healings[i].color : CreateColor(64, 255, 128, 255);
        var textColor = BlendColors(color, color);
        textColor.alpha *= this.healings[i].alpha / 255;
        DrawTextEx(this.messageFont, this.x + 16, y, this.healings[i].amount, textColor, 1, 'center');
    }
};

And the line which called it:

this.actors[type][i].render();
fatcerberus commented 9 years ago

Worth noting is that the local variables (including the loop counter) are all listed as being set to undefined. Which means the render() function probably didn't even start executing yet.

fatcerberus commented 9 years ago

Forgive me if I seem a little overexcited here; I probably enjoy debugging more than would generally be considered healthy. :stuck_out_tongue_closed_eyes:

svaarala commented 9 years ago

:-)

fatcerberus commented 9 years ago

What I don't understand is why the value of the corrupt curr_pc isn't changing from run to run. If it were using an outdated value (from a previous activation, say), the actual value should still change as memory allocation isn't exactly known for being deterministic, even with virtual addressing. It looks more to me like curr_pc is being clobbered.

svaarala commented 9 years ago

This debug log just before the assert error might help shed some light:

DUK_D(DUK_DPRINT("curr_pc=%p base=%p end=%p &curr_pc=%p thr->ptr_curr_pc=%p",
    (void *) curr_pc,
    (void *) DUK_HCOMPILEDFUNCTION_GET_CODE_BASE(thr->heap, fun),
    (void *) DUK_HCOMPILEDFUNCTION_GET_CODE_END(thr->heap, fun),
    (void *) &curr_pc,
    (void *) thr->ptr_curr_pc));
fatcerberus commented 9 years ago

Okay, I'll add that debug print now. By the way, the error is reproducible in a 32-bit x86 build as well, and will wonders never cease: curr_pc = 0x0671d87c {4261281277}

Still the same! Of course I realize now that the value in braces is the result of dereferencing the pointer, but still intriguing that it's the same value.

svaarala commented 9 years ago

I'm trying to figure out what your game code does differently from the testcases: all testcases pass with assertions enabled. One difference is that your calling C side is obviously more complex than the "duk" command line tool, and that there are more Duktape/C functions involved.

I've added setting ptr_curr_pc to NULL whenever activations are changed (when doing call setup and when throwing errors). This is so that while the pointer itself may be valid, it's no longer related to the topmost activation's curr_pc pointer so syncing would be incorrect. I'm not sure if this is what's happening in your case; but here's a snapshot with the change: http://duktape.org/snapshots/duktape-1.2.99-20150827100951-v1.2.0-403-gf0535a3.tar.xz

svaarala commented 9 years ago

That decimal value in hex is 0xfdfdfdfd by the way.

fatcerberus commented 9 years ago

Lovely, that debug print generates so much output that the game never gets started because the console is bogged down printing stuff.

fatcerberus commented 9 years ago

Got it, had to be clever with a conditional: [D] duk_js_executor.c:2271 (unknown): curr_pc=00000094B45DE5B8 base=00000094B5003298 end=00000094B50035BC &curr_pc=00000094B13E9F58 thr->ptr_curr_pc=00000094B13E9F58

fatcerberus commented 9 years ago

As for that 0xfdfdfdfd value, I found this bit of info: "0xFD Fence Memory Also known as "no mans land." This is used to wrap the allocated memory (like surrounding it with fences) and is used to detect indexing arrays out of bounds."

Interesting.

fatcerberus commented 9 years ago

Unfortunately it's 3:30am and I need some sleep, so further bug-hunting will have to wait. :)

Looking at that debug print though, it looks like the PC someone ended up just outside the bytecode. Which is consistent with the "fence memory" thing.

svaarala commented 9 years ago

No worries, I'll try to figure this out. Thanks for the help!

svaarala commented 9 years ago

I think I'll add a thr->idx_curr_act to accompany thr->ptr_curr_pc which identifies the activation the ptr_curr_pc is related to. This makes syncing always correct even when there are activation changes, so it'll be much easier to get the syncing correct.