rkoeninger commented 5 years ago

Check List

[x] Hoist idle symbols, global function and symbol references
[x] Optimize specific commonly used functions in both sync and async form
- [x] shen.mod, integer?, @p, shen.numbyte, shen.byte->digit
- [x] symbol?, variable?, shen.pvar?
- [x] shen.fillvector
- [x] shen.dict functions
[x] Bring in the dynamic eval expansion (when kernel 22.0 is released)
[x] Find a way to cut down on the amount of async/await used in the async kernel (Firefox load time is 10x faster without async)
[ ] Used fixed-arity application where possible, avoid rest arguments in wrapper functions created by fun

Original Post

Picking up on the performance issue mentioned in #5

Observations

Startup performance in async mode is unacceptably slow. Seems to vary mainly with the browser. Brave creates the kernel in about 1.2s, but Firefox takes about 15s.

@vrescobar reported a 25s startup time in Chrome.

I think startup needs to happen in at most 500ms for ShenScript to be feasible for production use.

Startup time is normally much lower in sync mode, meaning async/await performance likely differs considerably between JavaScript engines.

Solutions

Some work has already been done to reduce the number of functions that are actually async and then don't get awaited to cut down on unnecessary asynchronicity. More opportunities for this could be done by analysis at build time or a manual survey and just listing the function names by configuration.

A more drastic approach would be to abandon async/await, even though it's convenient, I've heard it's slower than writing callback-style code. That would require a much more involved transpiler.

rkoeninger commented 5 years ago

Adding overrides has rendered another significant improvement:

scenario	before	after
sync test suite (node)	7s	4s
async test suite (node)	56s	39s
sync web startup (brave)	550ms	280ms
async web startup (brave)	950ms	400ms
sync web startup (firefox)	1200ms	500ms
async web startup (firefox)	15s	3s

The sync test suite run is on par with SBCL (~3s on the same machine)

tizoc commented 5 years ago

@rkoeninger looking great. Pre-compiling declare and other things that use eval-kl will probably improve the startup time a good amount too.

tizoc commented 5 years ago

Just checked the overrides, most of those are things that make the Shen compiler faster (dicts, variable?, etc), which kind of confirms my suspicion that at startup a lot of the time is being spent calling Shen's eval.

tizoc commented 5 years ago

The precompilation I'm talking about is https://github.com/Shen-Language/shen-sources/issues/83 btw, I haven't done anything on that front yet, but it is planned.

rkoeninger commented 5 years ago

Using otabat's improved types.kl takes that startup time even lower (little effect on test suite time as expected):

scenario	before	after
sync web startup (brave)	280ms	130ms
async web startup (brave)	400ms	60ms
sync web startup (firefox)	500ms	270ms
async web startup (firefox)	3s	170ms

Although, the fact that the async time is now shorter than the sync time is suspicious. I'm guessing there are promises that aren't being appropriately awaited, allowing the startup to return before it's actually finished, but I'm not sure where.

tizoc commented 5 years ago

Yes, the async times are suspicious, but the sync improvements confirm what I was saying.

rkoeninger commented 5 years ago

Upgrading to kernel 22.0 has yielded the performance benefits seen when testing with the hand-optimized types.kl.

I'm questioning the need to have a sync mode now that async startup performance is so good, even in Firefox.

tizoc commented 5 years ago

@rkoeninger so, are the times from your previous post correct? async is faster than sync?

rkoeninger commented 5 years ago

I'm looking at that now. Especially in Firefox, you can see that it prints "done" after about 150ms and but the activity indicator doesn't stop for another ~500ms. There's probably an await missing somewhere that I just never noticed.

rkoeninger commented 5 years ago

I think the time difference is the time it takes for the definitions in frontend to complete. They're not being awaited.

rkoeninger commented 5 years ago

Alright, with all the declarations in frontend being properly awaited, I get async startup time of 200ms (Brave) and 1500ms (Firefox).

I think the prescribed way of running Shen in the browser will be to have it run in the background in a worker, so the page won't stall for 1.5s or anything, but 1.5s is still not great.

The frontend declarations are doing a long series of declares, calling declare from JS, that could be optimized in a similar way to the expand-dynamic extension.

rkoeninger commented 5 years ago

Optimized the defineTyped function in frontend.js the same way expand-dynamic does and got back to the 60ms (Chrome), 140ms (Firefox) startup times. (Actual startup times this time, not just the time for the init to return) sync startup times are 50ms for Chrome and 90ms for Firefox so no more of that sync slower than async nonsense.

This hasn't affected overall performance much: test suite time is still 4s for sync, 38s for async so 10x slower for async.

rkoeninger commented 5 years ago

@vrescobar Can try do the iOS measurement again so we get a side-by-side comparison?

tizoc commented 5 years ago

@rkoeninger how much of the startup time is spent on shen.initialise + starting REPL related stuff? (just trying to get an idea of how much overhead is in there for programs that don't use interactive features)

rkoeninger commented 5 years ago

I don't have measurements for that, but prior experience makes me think the time is almost all in shen.initialise and since the startup time refers to web startup time, there's no repl being started. Running the defuns has always taken a trivial amount of time.

Rough measurement in ms: Hoisting symbols: 2.3849999997764826 0.14635740226114227% Setting defuns: 0.4350000526756048 0.027115315061084767% Settings overrides: 0.2950001507997513 0.016987919961329274% shen.initialise: 12.119999853894114 0.8092126696075385%

So about 81% of the time in shen.initialise and 15% hoisting symbols.

tizoc commented 5 years ago

Ok, good to know (that was expected, but just wanted to be sure).

There are parts of initialise that you can skip if you are not launching the REPL.

shen.initialise calls these four functions, and only the first two are required, the last two are for typechecking:

shen.initialise-environmentsets most globals, most of which are not needed unless you are using Shen's eval, but there are some things here that are needed for running programs, maybe I could look into splitting this one for the next release.
shen.initialise-lambda-forms associates lambda forms to function names
shen.initialise-signedfunc-lambda-forms same but for signature functions used for typechecking
shen.initialise-signedfuncs stores the signatures of functions

tizoc commented 5 years ago

I'm also thinking that both initialise-*-lambda-forms functions can probably be re-defined statically in Javascript too which may save some extra time. At the least, with the "lambdas" being hoisted so that they are not created inside that function (the function would just associate the names).

For example:

(shen.set-lambda-form-entry (cons shen.datatype-error (lambda X (shen.datatype-error X))))

becomes:

(shen.set-lambda-form-entry (cons shen.datatype-error $referenceToHoistedLambda)).

I don't know if this makes a difference for the Javascript VMs, maybe when the compiler sees that there are no external references in the lambdas it takes care of this itself and avoids the allocation, but I'm not sure.

I should probable update this in the kernel to avoid the conses, it is how it was originally, but it is not needed now when using the expand-dynamic extension. Not sure how much that will help, but it will not hurt.

rkoeninger commented 5 years ago

I experimented un-wrapping "wrapper lambdas", ran into some issue, it wasn't working, started to question how much it would help, set it aside.

(Could also provide an override for the function function that just looks up the function)

It could also introduce weird arity issues where a function is arity 2 but it's wrapped inside 2 layers of lambda, each arity 1. I guess that's what curried application is for.

As far as the note on the one commit about package size, I plan to remove the sync rendering of the kernel and sync mode, which will cut the deployable size by almost half.

tizoc commented 5 years ago

Note that what I mentioned in the previous comment is not about unwrapping anything, it is about "creating" the lambdas at the top level, and not during initialise. It may or may not make a difference to the JS engine, depending on if it is making closures or not when the functions are created inside another function.

An easy way to know is to skip the *-lambda-forms functions and see how much that speedups startup, if it makes no difference then thats not where time is being spent.

tizoc commented 5 years ago

Regarding wrapper lambdas, I tried that in Shen/Scheme (for cases where the arity = 1), turns out that it doesn't work because sometimes lambdas are created before the function is defined, so you cannot unwrap them without getting an invalid reference. It happens a lot inside the kernel because lambda forms for macros are registered before the macro function is evaluated, and other similar situations. Looking into that is something I have left for later.

rkoeninger commented 5 years ago

it doesn't work because sometimes lambdas are created before the function is defined

Right, I ran into that. And there's no point trying to get around it because if you wrap the function reference in a lookup function, then you're re-creating the same problem. Except maybe an opportunity to avoid async and the async test suite run does take 10x longer.

I don't know if hoisting lambdas would help a whole lot - it shouldn't take a JS engine any longer to create a lambda instance than any other value.

I did just push a change that flattens rendered lambdas. Very small performance improvement (might be within variation), but produces slightly more compact code and I like it.

rkoeninger commented 5 years ago

Providing overrides for put, shen.initialise_arity_table and shen.active-cons have reduced web startup time from a minimum observed of 50ms to 40ms. The shen.initialise_arity_table assumes there are no entries in the *property-vector* yet for those functions. Also, if shen.initialise_arity_table is overridden, put might not need to be overridden.

I overrode shen.active-cons, but now shen.curry-type-h is taking up time, I might do shen.curry-type as a single override. The fact that these functions are non-tail recursive means there will be a lot of trampolining interlaced with promise chaining (bit waste of time).

Another profiling observation is that a decent amount of time is spent throwing and catching exceptions and the kernel would be faster if it didn't depend on exceptions for flow control.

tizoc commented 5 years ago

Another profiling observation is that a decent amount of time is spent throwing and catching exceptions and the kernel would be faster if it didn't depend on exceptions for flow control.

See this, may help.

vrescobar commented 5 years ago

[DELETED]

Sorry I made the measurements with the HEAD of a branch in another repository!

vrescobar commented 5 years ago

New measurements with the HEAD of this repo, hash: 73d9461b626eaa8fefae844194d81718181e0a83 (actual latest master HEAD)

Desktop firefox: 166ms. Desktop chrome: 104ms. Desktop safari: 140ms.

mobile safari: 196ms.

The last weeks I've been busy with something personal and I did not follow how things developed (I think I read about an earlier expansion of klambda), and I am very impressed with the results!

The size grew to 1.2MB (184Kb when gzip), I wonder if we could go smaller than that while keeping the klambda expanded, bit now it is already usable for a realistic webpage (especially since I expect libraries to be very small and not grow much the total size).

In the other hand every single browser did not properly load the webpage on the first time I wrote the URL and had to reload it. My settings were that it should always remove the cache, so I cannot tell if it also kept an old cash before I reloaded it again; now it seems to work without any issues!

From my side this issue could be closed already now (I will test it in details when I have time again)

tizoc commented 5 years ago

Desktop firefox: 166ms. Desktop chrome: 104ms. Desktop safari: 140ms.

mobile safari: 196ms.

Looking good!

The size grew to 1.2MB (184Kb when gzip), I wonder if we could go smaller than that while keeping the klambda expanded, bit now it is already usable for a realistic webpage (especially since I expect libraries to be very small and not grow much the total size).

It may be possible but I don't think it is worth it. As I mentioned before, consider that the size you see there is for loading the whole compiler and interactive environment, and unless your program uses the REPL, eval, and the typechecker at runtime, most of that code doesn't have to be loaded (same applies for the platform's kl-> compiler itself), Shen's runtime for static programs is quite small.

rkoeninger commented 5 years ago

@vrescobar Thanks for taking the time to get those measurements. Good to see the times are decent even on mobile.

As for the bundle size, I'm looking at removing the sync version of the kernel which would cut the size in half, and make the codebase simpler. Now that async mode is so fast, sync mode is not as necessary (but it is still 10x faster when running the test suite and 2x faster on web startup which is hard to throw away).

tizoc commented 5 years ago

@rkoeninger README is missing a render-kernel step, right? doesn't work for me otherwise.

rkoeninger commented 5 years ago

@tizoc The "Building and Testing" section tells you to just read the travis build script, which runs the render command.

I'll elaborate on that part.

tizoc commented 5 years ago

Btw, after a few loads so that things get cached, the environment load can go as low as 71ms on chrome on my computer (2013 macbook pro, not particularly modern or fast) and 190ms on firefox (but most of the time ~250-300ms).

rkoeninger commented 5 years ago

It works in Safari, right?

~300ms is the slowest I've seen in Firefox on my 2011 thinkpad.

rkoeninger commented 5 years ago

I think Safari is the only browser that has implemented tail calls. Would be interesting to try ShenScript in Safari with trampolines removed and see if it works. And what the performance is.

tizoc commented 5 years ago

~185ms on safari. Works, ~but for some reason setting console.log as the resolve function of the promise doesn't work, I have to use alert to see the result~ (it was me being dumb, was on the wrong console tab)

rkoeninger commented 5 years ago

Cool! So it works and loads in a decent time in Chrome, Firefox and Safari. And by "Chrome" I mean Chromium-based, it works in Brave and probably works in Opera and Vivaldi and a lot of others.

I know it doesn't work in Edge as Edge does not support the object spread syntax e.g. { ...obj1, ...obj2 }. Maybe I can get webpack or babel to factor those out.

rkoeninger commented 5 years ago

Added babel plugin to webpack config to replace use of spread operator with Edge-compatible code. Now works in Edge as well. Load time ~250ms (compare 50ms in Chrome, 100ms in Firefox).

Speed isn't great but it works.

rkoeninger commented 5 years ago

Added functionality to the compiler to detect when a defun refers only to primitive (or overridden with strictly sync version) functions or itself and then render that function in sync mode. At least a 20% speedup.

At this point I deemed the async mode fast enough and dumped the pre-rendered sync kernel. Cut the web bundle and npm packages' size by about 45%.

Version 0.9.0 is the first to be deployed to npm.

Time for this thread to close. Thanks, guys!

rkoeninger / ShenScript

Improving Startup and Overall Performance #7

Check List

Original Post