rkoeninger / ShenScript

Shen for JavaScript
BSD 3-Clause "New" or "Revised" License

Improving Startup and Overall Performance #7

Closed: rkoeninger closed this issue 5 years ago

rkoeninger commented 5 years ago


Original Post

Picking up on the performance issue mentioned in #5

Observations

Startup performance in async mode is unacceptably slow, and it seems to vary mainly with the browser: Brave creates the kernel in about 1.2s, but Firefox takes about 15s.

@vrescobar reported a 25s startup time in Chrome.

I think startup needs to happen in at most 500ms for ShenScript to be feasible for production use.

Startup time is normally much lower in sync mode, meaning async/await performance likely differs considerably between JavaScript engines.

Solutions

Some work has already been done to reduce the number of functions that are actually async but never get awaited, to cut down on unnecessary asynchronicity. More opportunities could be found by build-time analysis, or by a manual survey that simply lists the function names in configuration.

A more drastic approach would be to abandon async/await: it's convenient, but I've heard it's slower than writing callback-style code. That would require a much more involved transpiler.

vrescobar commented 5 years ago

I didn't read the codebase, but I can tell that, JavaScript being single-threaded, async will be the right answer for:

In the case of ShenScript I believe this job should be delegated to the guest language, which will also offer constructs to handle IO/events; if the code does not reflect that, it will end up freezing the UI. But everything should be up to Shen's code, I believe.

rkoeninger commented 5 years ago

@vrescobar One of the sections of the docs I'm in the process of writing is the explanation of why async/await was needed.

The primitive IO functions in Shen are synchronous. The read-byte function in particular makes porting Shen to JavaScript difficult. Code that uses read-byte, including kernel functions like read-file-as-charlist and load are written in synchronous style. So we need a read-byte function that is asynchronous in the host language but still looks like a synchronous function in the guest language.

I was thinking about this and realized "that's what async/await does!" So read-byte and other IO functions are async functions or return Promises. And so the functions that call the primitive IO functions need to be async so they can await the primitive IO functions. And then functions that call those functions need to be async, and so on, all the way to the repl function, shen.shen.

So I don't think you can deal with asynchronicity in the guest language because the guest language was designed in a way that counts on the host language to deal with it.
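
To make that concrete, here is a minimal sketch of how one async primitive forces asynchronicity up the call chain (hypothetical names, not ShenScript's actual internals):

// Hypothetical sketch: readByte is async, so every caller must be async too.
const readByte = async stream => stream.bytes.length > 0 ? stream.bytes.shift() : -1;

// Kernel code written in synchronous style becomes async to await the primitive...
const readFileAsCharlist = async stream => {
  const chars = [];
  for (let b = await readByte(stream); b >= 0; b = await readByte(stream)) {
    chars.push(String.fromCharCode(b));
  }
  return chars;
};

// ...and its callers must await it in turn, all the way up to the repl.
readFileAsCharlist({ bytes: [104, 105] }).then(console.log); // ['h', 'i']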

rkoeninger commented 5 years ago

First draft of this async explanation now in the docs: https://shenscript.readthedocs.io/en/latest/design.html#optional-pervasive-asynchronocity

rkoeninger commented 5 years ago

I also wrote about the hybrid sync/async approach here: https://shenscript.readthedocs.io/en/latest/design.html#hybrid-sync-async-kernel

rkoeninger commented 5 years ago

@vrescobar Also, I forgot to mention that having the Chrome dev tools closed while loading the page increases speed by about 10x: it takes 15s for me with the console open, 1.2s with it closed.

vrescobar commented 5 years ago

You are right about the dev tools; I cannot understand why. Maybe the Chrome debugger has a high impact on the load time.

vrescobar commented 5 years ago

I've been trying the synchronous mode in the browser and it consistently starts in less than two seconds (1261ms) in the same setup as the other day (Chrome with or without devtools, on a big high-res laptop).

That's very close to the original goal.

On Safari (iPhone 6s) it takes 5942ms, which is still a bit far from optimal. If I switch webpack from development to production mode it improves to 3028ms, a great improvement, but still short of the goal (only on mobile, by the way).

There must be a way to lighten the AST.

Checking the Chrome performance tab and looking at the profiler is very revealing: basically Cons, trampoline, throw, and the ...args from funSync are taking the most time. Maybe some small rewrites there could make the difference:

[Screenshot: Chrome profiler results, 2019-08-22]
vrescobar commented 5 years ago

I added a commit which significantly improves performance on mobile, but it doesn't change much in desktop browsers. It contributes to fixing this issue but doesn't completely solve it (we are not subsecond yet). See the commit comments for the actual metric values and improvements.

Regarding dropping Object.freeze: I wouldn't worry too much, since it is mostly a "security" contract, and all the calls and accesses to this code happen within the Shen engine and your own code, so there is no risk that an uninformed user changes these objects while the trampoline is running.

Unfortunately GitHub stacked my pull requests together since they depended on each other.

As for next steps, keep looking at the profiler, but my gut feeling is that argument passing, the trampoline, and cons (the bread and butter of any Lisp) have the biggest impact on the system's performance.

rkoeninger commented 5 years ago

Attempted (and then reverted) a change where, in the generated kernel.*.js, system functions would be bound to local const variables and then referred to by those bindings in other system functions.

The idea was that this way, the kernel functions wouldn't have to dynamically resolve or call functions that would never change anyway, and the JS runtime would be able to optimize these calls, but that's not what happened. There was no significant change in performance, saving at most 1 second in sync mode kernel tests and maybe 2 in async mode.

This approach actually helped quite a bit in ShenSharp, so I don't know why it didn't help here.

rkoeninger commented 5 years ago

My other thought is that the fun() wrapper that makes functions partially-applicable might be wasting time. When rendering JS, we could bake that partial-application logic right into the function.

Also, curried application is not a requirement, so I might just remove that. shen-cl does not support it.

vrescobar commented 5 years ago

I tried some optimizations there too, also without much success. My guess was argument serialization: using ...args vs using f.apply(args) could trigger an additional transformation under the covers, but it didn't make any positive difference.

I still think that the Cons list and argument serialization play a big role. I would like to pass them by reference rather than copy, and to convert the Cons to a more imperative structure, but I didn't get that far.

The trampoline code also carries a significant performance penalty, but unless we rewrite the AST using gotos or loops it won't get any better; I tried a Y combinator in real Shen but performance suffered badly.

The other thing still to try would be to move some parts to wasm.


rkoeninger commented 5 years ago

My plan was to avoid variable-arity functions and applications entirely if possible.

A function like

(defun add (X Y) (+ X Y))

would normally get translated to something functionally equivalent to

$.f['add'] = $.fun((X, Y) => $.asNumber(X) + $.asNumber(Y));

with the fun making it functionally equivalent to

$.f['add'] = (...args) =>
  args.length === 2 ? $.asNumber(args[0]) + $.asNumber(args[1]) :
  args.length < 2 ? (...more) => $.f['add'](...args, ...more) :
  $.asFunction($.f['add'](...args.slice(0, 2)))(...args.slice(2));

where it might perform better if we remove the rest arguments (and the extra stack frame) like:

$.f['add'] = function (x, y) { // a function expression, not an arrow: arrows don't bind 'arguments'
    return arguments.length === 2 ? $.asNumber(x) + $.asNumber(y) :
           arguments.length === 1 ? y => $.asNumber(x) + $.asNumber(y) :
           arguments.length === 0 ? arguments.callee :
           raise('invalid number of args');
};

Some problems:

Dynamic reference to self:

$.f['add'] = function (x, y) {
    return arguments.length === 2 ? $.asNumber(x) + $.asNumber(y) :
           arguments.length === 1 ? y => $.f['add'](x, y) :
           arguments.length === 0 ? $.f['add'] :
           raise('invalid number of args');
};

Named reference to self:

$.f['add'] = function add$(x, y) {
    return arguments.length === 2 ? $.asNumber(x) + $.asNumber(y) :
           arguments.length === 1 ? y => add$(x, y) : // bug: partially-applied function won't be partially applicable itself
           arguments.length === 0 ? add$ :
           raise('invalid number of args');
};

Explicit argument count passed as first argument to every function to completely avoid the arguments object for performance and maybe compatibility reasons:

$.f['add'] = function add$(n$, x, y) {
    return n$ === 2 ? $.asNumber(x) + $.asNumber(y) :
           n$ === 1 ? y => add$(2, x, y) :
           n$ === 0 ? add$ :
           raise('invalid number of args');
};

Sorry for the long post, but there are a lot of options here.

rkoeninger commented 5 years ago

Re: Conses

I'm really not sure what to do about conses since so much of the Shen kernel is built around exploding things into cons lists and then recursively processing them.

I had an idea in the past to have an ArrayCons which was considered a cons by cons?/isCons but actually wrapped an array with an offset. When you called (tl X), that would return a new ArrayCons that wrapped the same array but with an increased offset. This would offer greater cache coherence in the CPU, keeping sequential data adjacent in cache.
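
A minimal sketch of that idea (hypothetical, not the actual implementation):

// Hypothetical ArrayCons: a cons-like view over a shared array plus an offset.
class ArrayCons {
  constructor(array, offset = 0) {
    this.array = array;
    this.offset = offset;
  }
  head() { return this.array[this.offset]; }
  tail() { // allocates a new view over the same array, one element further in
    return this.offset + 1 < this.array.length
      ? new ArrayCons(this.array, this.offset + 1)
      : null;
  }
}

const xs = new ArrayCons([1, 2, 3]);
console.log(xs.tail().head()); // 2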

I don't think it helped, since conses would normally be allocated sequentially anyway, and it involved allocating a new object for every value in an iterated list instead of just dereferencing a pointer.

An effective solution to cons performance might require static analysis of recursive list processing code and optimizing them into counting loops, which I'm trying to avoid.

vrescobar commented 5 years ago

Both of these are expensive ideas to implement, with lots of refactoring, and I guess you are comfortable in the codebase, but I am not that familiar with it yet, since I have a day job and am still learning official Shen.

How do you measure performance? I think the "final" solution should be WebAssembly and custom data structures, but I cannot help with that right now since I don't really know wasm (its text format is a Lisp, btw). Maybe we can start by implementing a new persistent Cons based on JS arrays and hardcoding some common use cases; after all, performance is an illusion.

vrescobar commented 5 years ago

What do you think about using an immutable list in the style of ClojureScript's persistent data structures?

https://github.com/immutable-js/immutable-js

If we just import { List }, Webpack should do some dead-code elimination, shouldn't it?
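
For reference, a minimal usage sketch (immutable-js is Clojure-inspired rather than ClojureScript's own implementation):

import { List } from 'immutable';

const xs = List([1, 2, 3]);
console.log(xs.first());        // 1
console.log(xs.shift().toJS()); // [2, 3], sharing structure with xs instead of copying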

rkoeninger commented 5 years ago

I was thinking something like this:

// postBuildLambda enhances a transpiled lambda with partial applicability
// body is already a compiled ast (SafeId constructor)
// params is an array of js-ast's
const postBuildLambda = (params, body, async) => {
  const arity = params.length;
  const arityCheck = n => Binary('===', Literal(n), Member(Arguments(), Id('length')));
  return Function(
    'rec$',
    params,
    Conditionals(
      [arityCheck(arity),
        body],
      ...range(arity).slice(1, arity - 1).reverse().map(i =>
        [arityCheck(i),
          postBuildLambda(
            params.slice(i),
            Call(Id('rec$'), params.slice(0, i)),
            async)]),
      [arityCheck(0),
        Id('rec$')],
      Call$('raise', [Literal('invalid arity')])),
    async);
};

full application is checked first as it is the most common case

where Conditionals builds out the ? : chain

const Conditionals = (...args) =>
  butLast(args).reduceRight(
    (tail, [cond, ifTrue]) => Conditional(cond, ifTrue, tail),
    last(args));

Decent amount of work for an experiment, but I'm going to give it a try.

rkoeninger commented 5 years ago

Yet another option is to have the transpiler just generate a lambda that takes the remaining arguments, based on the arity of the called function at the call site. (So some code calls add with 1 as the only argument and y => add(1, y) is generated at the call site.) This is what shen-cl does. The downside is that the generated code won't adapt when the called function changes to a different arity, like in the repl, and neither will existing partial applications, but I'm not sure how that would work anyway. Partial applications interact strangely with a dynamic environment.

rkoeninger commented 5 years ago

Removed the anyAwaits attempt at optimization and saw either a slight performance boost or negligible effect.

anyAwaits would determine whether the generated AST contained any await expressions; if it didn't, the enclosing function would not be made async, in order to avoid creating more promises or denser promise chains.

A better way to do this might be to have overrides (peephole optimizations) for I/O functions that read a whole line or file. In the kernel, those are generated as recursive functions that build a long list of Trampoline-Promise chains for every byte that's read.

rkoeninger commented 5 years ago

Inlining the arity check logic would change this

$.d("add", (X, Y) => $.asNumber(X) + $.asNumber(Y))

into this

$.d("add", function add$(n$, X, Y) {
  return n$ === 2 ? $.asNumber(X) + $.asNumber(Y) :
         n$ === 1 ? function add$1$(n$, Y) {
          return n$ === 1 ? add$(2, X, Y) :
                 n$ === 0 ? add$1$ :
                 argErr("add", n$ + 1);
         } :
         n$ === 0 ? add$ :
         argErr("add", n$);
})

That's a lot more code, and it grows exponentially with respect to the original arity.

I was thinking about this, and I'm going to start looking at the shen-cl approach instead (inlining a lambda at the call site based on the known arity at that time). It might still be necessary to have logic to check for superfluous arguments, as JS will silently ignore them, which would be unexpected behavior.

tizoc commented 5 years ago

@rkoeninger take a look at how shen-scheme handles this: most of the time you know the arity of the function being invoked, and you can handle partial applications at the call site; there is no need to do anything special with the compiled functions themselves in regard to currying and partial application.

tizoc commented 5 years ago

Relevant code is here: https://github.com/tizoc/shen-scheme/blob/master/src/compiler.shen#L188-L295

tizoc commented 5 years ago

Another thing, you are allowed to make this fail or return a weird value when code is not statically type checked:

$.f['add'] = $.fun((X, Y) => $.asNumber(X) + $.asNumber(Y));

Those asNumber calls are overhead that you are paying even when it is not required (that is, 99.99% of the time). If the JavaScript port of Shen returns "12" for (+ "1" 2), that is still a valid implementation choice.

If you still want to raise an error in such cases another option is to provide "safe" and "unsafe" compilation modes, and only do this check when compiling in "safe" mode.
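
For illustration, the two modes might emit something like this for add (hypothetical output, reusing the $.fun/$.asNumber forms from above):

// "safe" mode: checked arithmetic that raises on non-numeric arguments
$.f['add'] = $.fun((X, Y) => $.asNumber(X) + $.asNumber(Y));

// "unsafe" mode: trusts the typechecker and falls through to JS semantics,
// so (+ "1" 2) would yield "12"
$.f['add'] = $.fun((X, Y) => X + Y);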

rkoeninger commented 5 years ago

@tizoc So there's one way in which I want to replicate a behavior in shen-cl and one I want to avoid.

Type Checks

I wanted Shen on JS to have better type semantics than JS, disallowing mixed-argument-type behavior of the + operator and catching type errors early wherever possible.

It also comes closer to the behavior of shen-cl's + where it raises an error for code like (+ 1 "2").

Partials at Call Site

So this is behavior that shen-cl also has that I was trying to avoid. I was thinking of a function as an object that handles whatever number of arguments by returning a partially-applied version of itself - or at least seeming that way. This allows flexibility in the definition order of functions.

So I tried this in shen-cl:

(define curried-add X -> (add X))
(define add X Y -> (+ X Y))
(curried-add 1) \\ invalid number of arguments: 1

Because curried-add is defined first, the transpiler doesn't know that add is going to be partially applied, and generates code to apply add to a single argument, which fails later.

The convention I've noticed in Shen is to define dependent functions before the functions they depend on (main-at-top), so this could be a problem.

The current approach I have in ShenScript doesn't have this problem, but yes, it could be making things very slow.

Curried Function Application

Also, I know this isn't part of the spec; should it just be removed? Do other ports have this? (I know ShenScript and ShenSharp do.)

Right now, when a function is given more args than it needs, it takes args up to its arity and then applies the resulting function to the remaining args. This allows for transparent application of curried functions:

((/. X (/. Y (+ X Y))) 3 4) \\ 7
tizoc commented 5 years ago

Type Checks

Agreed, the way JavaScript behaves is error-prone, but the solution makes things slower. An alternative is to add those checks when dynamic code is compiled, but to generate code without checks when type-checking and the optimise flag are enabled:

(0-) (optimise +)
true

(1-) (tc +)
true

(2+) (define add { number --> number --> number } X Y -> (+ X Y))
add : (number --> (number --> number))

(3+) (tc -) 
false : boolean

(4-) (ps add)
[defun add [V1187 V1188] [+ [type V1187 number] [type V1188 number]]]

I don't know if anyone has taken advantage of these type annotations already, so I'm not sure how reliable they are. I have played with them a bit in the past (when exploring the possibility of writing an OCaml port), but never did anything serious.

Partials at Call Site

You can do proper handling of dependencies on functions that have not been defined yet if you want. But since the spec doesn't require it (somewhere in the docs it is explicitly stated that this cannot be depended on), and other ports don't handle it, portable code is not going to depend on it. That means your port ends up paying a non-trivial cost for something that is not going to happen in code that expects to be portable across ports (and not going to happen much in regular code either).

Same for changing arities; the compiler even raises a warning when you do that.

(5-) (define add X Y Z -> 1)

warning: changing the arity of add can cause errors.

Curried Function Application

The lambda example you wrote there works as expected in both Shen/Scheme and Shen/CL too (for both static function calls and lambdas), I don't know about other ports. In the case of Shen/Scheme it is done without incurring any extra cost to normal calls, and I'm almost sure the same applies to Shen/CL.

I am not sure if the spec requires it (I would guess it does); I would have to re-check to be sure.

rkoeninger commented 5 years ago

re: curried application - right, that does work. My unit test checks for

(define add X -> (/. Y (+ X Y)))
(add 3 4) \\ 7

That's not supported by shen-cl, but I don't know how necessary it is. The fact that function types appear curried and that functions can return functions gives me the impression this should work:

(0-) (tc +)
true

(1+) (define add {number --> number --> number} X -> (/. Y (+ X Y)))
add : (number --> (number --> number))

(2+) (add 3 4)
invalid number of arguments: 2
tizoc commented 5 years ago

Ahhh, I see. I don't know why that doesn't work in Shen/CL; it is easy once you have solved everything else that shen-cl has already solved.

It works in Shen/Scheme, and is handled by this code in the compiler; not much is needed:

(0-) (define add X -> (/. Y Z (+ X Y Z)))
add

(1-) (add 1 2 3)
6

(2-) (tc +)
true

(3+) (define add {number --> number --> number} X -> (/. Y (+ X Y)))
add : (number --> (number --> number))

(4+) (add 3 4)
7 : number
rkoeninger commented 5 years ago

I tried taking out most of the type check functions (in primitives and inlined in generated code) and it had no noticeable effect on the time of the test suite or web startup time so I'm leaving them in.

Note that the transpiler does some basic type analysis and removes unnecessary type checks. (+ 4 (* 3 X)) would compile to 4 + 3 * $.asNumber(X) unless X was bound to an expression known to be of a numeric type, in which case it would be 4 + 3 * X.

tizoc commented 5 years ago

@rkoeninger you will not see much difference in the time it takes to run the test suite because none of the tests are numerically heavy.

Btw, for some reason some of the code blocks are missing in the documentation; see the bottom of https://shenscript.readthedocs.io/en/latest/design.html for example (but GitHub does render them).

rkoeninger commented 5 years ago

The casts also got removed from all the non-numeric operations (except conversions between Shen and JS booleans), and right now we're just focusing on improving web startup time so a page using ShenScript won't take 15s to load.

PS - thanks for the tip on the missing code blocks. Sphinx likes those to be labelled I guess.

tizoc commented 5 years ago

Sorry, didn't intend to derail things.

How do I compile a chunk of KL code into JavaScript? I have tried these, but I always get a "not a valid form" error message:

(js.ast.render (js.ast.compile [defun a [X] [+ X X]]))
(js.ast.render (js.ast.compile (hd (read-from-string "(defun a (X) (+ X X))"))))

I also see that the generated bundle has a lot of calls to eval; I guess this is webpack's doing.

tizoc commented 5 years ago

@rkoeninger it seems a good amount of time is being spent in the dictionary functions. Something you can try is to override those with a native implementation of dicts using either objects or Map (those were made overridable with JavaScript in mind in particular because, unlike in most other languages, no native hashing function is exposed by the platform). That would avoid a lot of try/catch blocks being executed and a lot of raises, and in addition it would speed up all uses of get and put, which should help with startup time.
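
A sketch of what such an override could look like, assuming $.d defines a global function and guessing at the kernel's dict function names (both are assumptions):

// Hypothetical sketch: backing Shen's dicts with a native Map to avoid
// try/catch-heavy lookups. The function names here are assumptions.
$.d('shen.dict', _ => new Map());
$.d('shen.dict?', x => x instanceof Map);
$.d('shen.dict-count', d => d.size);
$.d('shen.dict->', (d, k, v) => (d.set(k, v), v));
$.d('shen.<-dict', (d, k) => {
  if (!d.has(k)) throw new Error('value not found in dict');
  return d.get(k);
});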

rkoeninger commented 5 years ago

@tizoc Fixed js.ast.compile in latest master (f6807ca). It was expecting a tree of JS arrays as opposed to a tree of Shen/KL lists and needed a conversion function thrown in there.

And here's something that might be taking up time: every KL expression gets converted to an array tree before being transpiled. It makes the transpiler easier to write, but it's another pass of transforms.

tizoc commented 5 years ago

@rkoeninger thank you! it works now.

Another thing I see taking up a lot of time is symbols. Unless the JavaScript compiler somehow optimizes this magically, I think compiling symbols to s`symbol-name` adds considerable overhead, because each time execution reaches a symbol it has to do a lookup to find it. If you can figure out a way to instantiate the symbol only once and then reference it through a variable, that would help too. This is easy in Scheme and Common Lisp because both languages provide syntax for symbols; I'm not sure what the best way to handle it in JavaScript is.

rkoeninger commented 5 years ago

Symbols in JS are a little different from symbols in Lisp. JS symbols are all unique even if they have the same description string and they don't even necessarily have a description string. So Symbol('abc') === Symbol('abc') is false.

Symbol.for(name) will intern a symbol by a particular name and always return the same symbol for the same name. The s`symbol-name` syntax is ultimately the same as Symbol.for('symbol-name'). I like how concise s is for user code. Generated code could use symbolOf or even Symbol.for instead, but all of the above are indirect calls to Symbol.for.
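
Concretely:

console.log(Symbol('abc') === Symbol('abc'));         // false: every Symbol() is unique
console.log(Symbol.for('abc') === Symbol.for('abc')); // true: interned in a global registry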

Custom Symbol Type

Or are you suggesting not to use JS symbols? I could add a Sym class with custom equality logic that compares names. They might be faster to create, but they would take longer to compare. I think JS symbols use an identity comparison that is about as fast as comparing two pointers.

Shouldn't be too hard to give it a try.
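
A minimal sketch of such a class:

// Hypothetical Sym type: cheap to create, compared by name instead of identity.
class Sym {
  constructor(name) { this.name = name; }
  toString() { return this.name; }
}

const symEquals = (x, y) =>
  x instanceof Sym && y instanceof Sym && x.name === y.name;

console.log(symEquals(new Sym('abc'), new Sym('abc'))); // true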

tizoc commented 5 years ago

No, what I mean is that every time the execution path reaches a call to Symbol.for, it has to do a lookup, right? Then, depending on whether the symbol already exists, it interns the string or returns the existing instance. This would be like compiling to Lisp/Shen that uses calls to (intern "thesymbol") instead of 'symbol. Having to do a lookup from an input string is far more expensive than just referencing an address.

What I have in mind (not familiar enough with the compiler at this moment to know how viable it is btw) is to "hoist" symbols to top-level variables so that they can be referenced directly to avoid calls to Symbol.for:

$.f.define($.s`test`, $.s`X`, $.s`->`, $.r([$.s`symbol`, $.s`symbol`]));

would instead be:

const $SYMS$symbol = Symbol.for("symbol"); // Symbol.for is called only ever once
// then this variable is referenced every time we want that symbol
// (only doing it for a single symbol here, but the same would apply to test, X, etc.)
$.f.define($.s`test`, $.s`X`, $.s`->`, $.r([$SYMS$symbol, $SYMS$symbol]));

Maybe this is a dumb idea, not sure. I have to leave for a bit now, but I'm going to write a test in plain JavaScript to see whether Symbol.for adds the overhead I think it does, and I will let you know.

rkoeninger commented 5 years ago

Interned Symbol Hoisting

Oh, I see. That would be most applicable in the kernel where there's a large block of pre-rendered code with many repeated symbols. If you did this for every individual expression, it wouldn't prevent many lookups.

I just tried a quick benchmark of looking up randomly-named symbols with Symbol.for and it appears to do about 17,000,000/sec in Chromium. At least on this hexacore laptop.

It also just occurred to me that it might be spending so much time in s because s calls String.raw, which does toString conversions and interleaving, and so a decent amount of conditional logic. When the symbol doesn't need to be a template string, we might as well use symbolOf at least. That's an easy thing to try out.
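
A sketch of the difference, assuming s and symbolOf are implemented roughly like this:

// Template-tag form: pays for String.raw's interleaving logic on every call.
const s = (strs, ...vals) => Symbol.for(String.raw(strs, ...vals));

// Plain-string form: skips the template machinery entirely.
const symbolOf = name => Symbol.for(name);

console.log(s`abc` === symbolOf('abc')); // true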

rkoeninger commented 5 years ago

Peephole Optimizations

I also haven't done anything with optimizing specific functions.

The map function looks ripe for optimization as right now it's recursing through trampolines (and maybe a promise-trampoline chain). I could have it use a JS loop and also take advantage of the fact that conses are mutable under the hood and build the list by the tail in a single pass.
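
A sketch of what that override could look like (hypothetical cons representation, not ShenScript's actual internals):

// Hypothetical: map as a JS loop that builds the result tail-first in one
// pass by mutating fresh cons cells, instead of recursing through trampolines.
const mapLoop = (f, list) => {
  let head = null, last = null;
  for (let c = list; c !== null; c = c.tail) {
    const cell = { head: f(c.head), tail: null };
    if (last === null) head = cell; else last.tail = cell;
    last = cell;
  }
  return head;
};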

@tizoc In your experience, what are some of the best functions to optimize?

tizoc commented 5 years ago
let res = 0;
let arr = [0, 0, 0]
let sym = Symbol.for("string");

for (let i = 0; i < 99999999; i++) {
  res += 1;
  arr[i % 3] = Symbol.for("string");
  //arr[i % 3] = sym;
}

console.log(res);
console.log(arr);

On my laptop, the version that uses Symbol.for on every iteration of the loop runs in 2.58s; the one that references the variable to avoid the symbol lookup runs in 0.37s (and 0.18s if the array is not touched, just to make sure it is not faster only because the compiler optimized it away). I am not sure how much of a problem this is in the current implementation; I just see in the profiler results that a lot of time is spent in s.

For optimizations, the dict.* functions (or just hash, but that's not doable in JavaScript) and get are good candidates (the last two depending on how expensive exception handling is). Then there are other functions that help with shen->kl compilation and typechecking speed, like shen.pvar?, numbyte?, symbol?, shen.analyze-symbol, etc., but I don't think those are affecting your load times.

See https://github.com/tizoc/shen-scheme/blob/master/src/overrides.shen

If exception handling adds overhead, there are some things you can do when compiling the kernel code that you can't enable for code in general because they are not safe: https://github.com/tizoc/shen-scheme/blob/master/src/compiler.shen#L128-L147

The transformation there makes use of */or functions that return a default value instead of raising an exception; by doing so, a bunch of functions get rewritten into versions that don't set up an exception handler.

Since the profiler results also show that a lot of time is being spent on raise, this is something that may help a bit too.

Note that I don't override get in overrides.shen because the above transformation takes care of avoiding the setup of an exception handler in get. (Actually, I don't handle the dict accessor in there, but I should. But the get/or variant that is used when get is wrapped in trap-error doesn't set up an exception handler and is faster than the regular get, so it is still faster than if nothing was done.)
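
A sketch of the get vs get/or distinction in JS terms (hypothetical shapes; the default here is modeled as a thunk):

// get must raise on a missing key, so callers wrap it in an exception handler...
const get = (k, dict) => {
  if (!dict.has(k)) throw new Error('value not found');
  return dict.get(k);
};

// ...while get/or takes a default, so (trap-error (get ...) handler) can be
// rewritten into a call that never sets up an exception handler at all.
const getOr = (k, dict, or) => dict.has(k) ? dict.get(k) : or();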

tizoc commented 5 years ago

@rkoeninger this version is as fast as the variable reference (not sure if it translates to the case where the table is much bigger):

let res = 0;
let arr = [0, 0, 0];
let symbols = {};
symbols.string = Symbol.for("string");

for (let i = 0; i < 99999999; i++) {
  res += 1;
  arr[i % 3] = symbols.string;
}

console.log(res);
console.log(arr);

Edit: if I fill symbols with a lot of mappings to other symbols it gets as slow as the Symbol.for version.

rkoeninger commented 5 years ago

I was resistant to the idea of binding symbol values to locals at first because it wasn't "neat" enough and seemed too ad hoc, and I wasn't sure how to fit it into the transpiler, but then I realized it could be done with the established IIFE pattern.

So if a function is going to refer to a symbol repeatedly, it can be wrapped in another function that takes the symbol value and refers to the parameter instead; that outer function is then immediately applied to the looked-up values the inner function will need.

(defun sign (X)
  (cond ((> X 0) positive)
        ((< X 0) negative)
        (true    zero)))

becomes

f.sign = ((positive$, negative$, zero$) =>
  X => X > 0 ? positive$ :
       X < 0 ? negative$ :
               zero$)
    (s`positive`, s`negative`, s`zero`);

The X => ... function is what gets assigned to f.sign and it has positive$, etc in scope with the interned symbol values so it shouldn't have to look them up again since they never change.

This can be taken a step further by inlining reference cells that hold onto global functions and symbols. This way, the function can still be redefined, and existing referrers will still have a direct handle to where it's stored. I originally thought the $.functions map type of structure could be optimized by the JS runtime, but maybe it isn't; I'll have to experiment with this. Each invocation of the function would still have to fetch the function from the reference cell, but accessing index 0 of an array is probably faster than looking up a key in an object with 1000 properties.
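
A sketch of the reference-cell idea (hypothetical helpers, using the same IIFE pattern as above):

// Hypothetical: one-element arrays as reference cells for global functions.
const cells = {};
const cellOf = name => cells[name] || (cells[name] = [undefined]);
const define = (name, f) => { cellOf(name)[0] = f; };

define('add', (x, y) => x + y);

// Generated code captures the cell once, not the current function value,
// so redefining 'add' still takes effect for existing referrers.
const double = (add$ => x => add$[0](x, x))(cellOf('add'));

console.log(double(3));         // 6
define('add', (x, y) => x * y); // redefine
console.log(double(3));         // 9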

I like this IIFE approach because it retains expression-oriented syntax and allows for memoizing the results of whatever pre-computation is desired: symbol interns, global function reference cells, and transpile-time constant expressions, e.g. building a long constant list once instead of every time:

https://github.com/Shen-Language/shen-sources/blob/304d7c15cd0a71ff11c161cdfb807a50d737ebc1/sources/sys.shen#L158-L164

A technique like this is used in ShenSharp, which uses an AST interpreter where the enhanced AST holds onto the reference cells for global functions and symbols. I didn't think it would be relevant here since ShenScript isn't an interpreter, but here's an opportunity to apply it anyway.

Inlining a pointer to the reference cell that holds a global function value:

https://github.com/rkoeninger/ShenSharp/blob/b57cfb9797d038c06de41b1d54009f0fee561979/Kl/Evaluator.fs#L72-L79

tizoc commented 5 years ago

I like that approach; it seems like it would work quite well while at the same time being considerably easier to implement than what I had in mind with variable references (which would require a quite different compilation model from what you have now).

tizoc commented 5 years ago

@rkoeninger I'm thinking that you can use a custom object type instead of an array, and the JS compiler is probably going to do a better job at optimizing it (for example, being able to avoid checks for out-of-bounds accesses). Will try it now and post results in a few minutes.

tizoc commented 5 years ago

Well, guess what: it is mostly the same in all cases. For reads, both array and custom object accesses are almost as fast as referencing a variable. It is only when there are writes that arrays get slower; I would have expected the opposite (reads more penalized than writes).

These are the tests I did first, but they have too many writes; once I noticed that I was testing the wrong thing and removed the writes, the times were 0.67s for variables and 0.71s for both object and array.

Test with writes (but again, not what is needed in this case):

Plain version, finishes in 1.04s:

let oldboxed = 0;
let boxed = 0;

for (let i = 0; i <= 999999999; i++) {
  oldboxed = boxed;
  boxed = i;
}

console.log(oldboxed);
console.log(boxed);

Array version, finishes in 1.95s:

const oldboxed = [0];
const boxed = [0];

for (let i = 0; i <= 999999999; i++) {
  oldboxed[0] = boxed[0];
  boxed[0] = i;
}

console.log(oldboxed[0]);
console.log(boxed[0]);

Object version, finishes in 1.15s:

function Box(value) {
  this.value = value;
}

Box.prototype.read = function () { return this.value; };
Box.prototype.write = function (value) { this.value = value };

const oldboxed = new Box(0);
const boxed = new Box(0);

for (let i = 0; i <= 999999999; i++) {
  oldboxed.write(boxed.read());
  boxed.write(i);
}

console.log(oldboxed.read());
console.log(boxed.read());
vrescobar commented 5 years ago

I don't want to be discouraging, but I put together the sources for the new 2.4.0 Shen kernel [1] with the latest changes announced by Mark on head, hoping that with the new release everything would be smoother, and instead the times tripled (in fact the Chrome tab became unresponsive the first time I loaded it). Desktop is at 24 seconds and mobile at 39 seconds.

If I activate webpack's production mode, curiously enough, Chrome desktop loads in 1.8 seconds, Safari crashes, and Firefox desktop takes 33 seconds; mobile did not even load (I guess the OS killed the page).

[1] The sources are in the repo, but it seems there is a broken link: https://github.com/Shen-Language/shen-sources/issues/76

rkoeninger commented 5 years ago

2.4.0 is the version of shen-cl; the last Shen kernel release was 21.1.

You're saying you built klambda from what's on shen-sources/master, then built ShenScript with that, and it's slower?

Here's the diff; I don't know why it would be so much slower: https://github.com/Shen-Language/shen-sources/compare/4b17acab66e7c4d7d0d26f8b82fb11bee36a52a9...master

vrescobar commented 5 years ago

I did more or less something like that (I wanted ShenScript to always pick up the latest sources):

git clone https://github.com/Shen-Language/shen-sources kernel
pushd kernel && make pure && make klambda Shen=/usr/local/bin/shen-sbcl && make release && popd
npm run lint || true
npm run test-backend
npm run render-kernel
npm run test-kernel
npm run test-frontend
vrescobar commented 5 years ago

I retried with the newest shen-sbcl 2.5.0 (just in case the generated klambda sources could make any difference) and:

-------------------------------------------
| sync  | 1 failed     | 14s, 539ms       |
| async | all passed   | 45s, 53ms        |
| both  |              | 59s, 592ms       |
-------------------------------------------

Time Chrome: environment created in 1286ms.
Time Firefox: "A page is slowing down your browser. What would you like to do?", then 27465ms.
Time Safari: environment created in 24004ms.
Time Safari (iOS): (still waiting)

The problem is that running the page with a performance profiler makes it even slower, and I think Firefox just won't finish (I gave up after 9 minutes of waiting). But even running it on Chrome, I can get some valuable information:

[Screenshot: Chrome profiler results, 2019-09-01]

The Chrome performance profiler points out that most of the CPU time, about 40%, is spent in the internal JavaScript eval function, which most probably means that the cause of the slowdown must have been some of the commits from the previous days (fast on Chrome, slow everywhere else).

That brought me to retry the test with an older commit (c47328a95b8011c9bdbe1979561aa90247c1b7bd) but still the newest Shen sources. That leads to apparently slower runtime test times:

--------------------------------
| sync  | 13s, 764ms           |
| async | 1m, 15s, 933ms       |
| both  | 1m, 29s, 697ms       |
--------------------------------

Using webpack in development mode:

Chrome: environment created in 1385ms.
Firefox: environment created in 13095ms.
Safari: environment created in 23509ms.
Safari (iOS): (never shows the message)

Conclusion: test times are not a reliable measure of boot times. Somewhere between d849ea688545f4c12ce1489afbd732945efa3ca0 and c47328a95b8011c9bdbe1979561aa90247c1b7bd performance took a serious hit, and from reading the conversation and the performance profiler I guess it might be the usage of local variables, but I cannot tell for sure.

Browser tests: it might make sense to run the slower tests against an array of Selenium drivers, but I have never tried that in an actual project.

Edit: quick googling turns up some nice testing frameworks that might be perfect for boot-time and runtime measurements on headless browsers.

rkoeninger commented 5 years ago

For those two commits, the main difference is that the trap functions were replaced with try/catch in IIFEs: https://github.com/rkoeninger/ShenScript/compare/c47328a95b8011c9bdbe1979561aa90247c1b7bd...d849ea688545f4c12ce1489afbd732945efa3ca0

rkoeninger commented 5 years ago

Hoisting idle symbol and global references helped performance a good amount:

-----------------------------------------------------
| scenario                    | before | after      |
-----------------------------------------------------
| sync test suite (node)      | 11s    | 7s         |
| async test suite (node)     | 1m 4s  | 56s        |
| sync web startup (brave)    | 800ms  | 550ms      |
| async web startup (brave)   | 1200ms | 950ms      |
| sync web startup (firefox)  | 1800ms | 1200ms     |
| async web startup (firefox) | 22s    | 15s        |
-----------------------------------------------------

Firefox is still slower than Brave (Chromium), especially in async mode, and sync is still much faster than async, but there are measurable benefits all around.

rkoeninger commented 5 years ago

Idle symbol hoisting is still not done for symbols starting with a non-alphabetic character; there might be certain symbol(s) that get broken by that process and need to be excluded explicitly.