Closed rkoeninger closed 5 years ago
Adding overrides has rendered another significant improvement:
scenario | before | after |
---|---|---|
sync test suite (node) | 7s | 4s |
async test suite (node) | 56s | 39s |
sync web startup (brave) | 550ms | 280ms |
async web startup (brave) | 950ms | 400ms |
sync web startup (firefox) | 1200ms | 500ms |
async web startup (firefox) | 15s | 3s |
The sync test suite run is on par with SBCL (~3s on the same machine)
@rkoeninger looking great. Pre-compiling declare
and other things that use eval-kl
will probably improve the startup time a good amount too.
Just checked the overrides, most of those are things that make the Shen compiler faster (dicts, variable?
, etc), which kind of confirms my suspicion that at startup a lot of the time is being spent calling Shen's eval
.
The precompilation I'm talking about is https://github.com/Shen-Language/shen-sources/issues/83 btw, I haven't done anything on that front yet, but it is planned.
Using otabat's improved types.kl takes that startup time even lower (little effect on test suite time as expected):
scenario | before | after |
---|---|---|
sync web startup (brave) | 280ms | 130ms |
async web startup (brave) | 400ms | 60ms |
sync web startup (firefox) | 500ms | 270ms |
async web startup (firefox) | 3s | 170ms |
Although, the fact that the async time is now shorter than the sync time is suspicious. I'm guessing there are promises that aren't being appropriately awaited, allowing the startup to return before it's actually finished, but I'm not sure where.
Yes, the async times are suspicious, but the sync improvements confirm what I was saying.
Upgrading to kernel 22.0 has yielded the performance benefits seen when testing with the hand-optimized types.kl.
I'm questioning the need to have a sync mode now that async startup performance is so good, even in Firefox.
@rkoeninger so, are the times from your previous post correct? async is faster than sync?
I'm looking at that now. Especially in Firefox, you can see that it prints "done" after about 150ms and but the activity indicator doesn't stop for another ~500ms. There's probably an await
missing somewhere that I just never noticed.
I think the time difference is the time it takes for the definitions in frontend
to complete. They're not being awaited.
Alright, with all the declarations in frontend
being properly awaited, I get async startup time of 200ms (Brave) and 1500ms (Firefox).
I think the prescribed way of running Shen in the browser will be to have it run in the background in a worker, so the page won't stall for 1.5s or anything, but 1.5s is still not great.
The frontend
declarations are doing a long series of declare
s, calling declare
from JS, that could be optimized in a similar way to the expand-dynamic
extension.
Optimized the defineTyped
function in frontend.js
the same way expand-dynamic
does and got back to the 60ms (Chrome), 140ms (Firefox) startup times. (Actual startup times this time, not just the time for the init to return) sync startup times are 50ms for Chrome and 90ms for Firefox so no more of that sync slower than async nonsense.
This hasn't affected overall performance much: test suite time is still 4s for sync, 38s for async so 10x slower for async.
@vrescobar Can try do the iOS measurement again so we get a side-by-side comparison?
@rkoeninger how much of the startup time is spent on shen.initialise
+ starting REPL related stuff? (just trying to get an idea of how much overhead is in there for programs that don't use interactive features)
I don't have measurements for that, but prior experience makes me think the time is almost all in shen.initialise
and since the startup time refers to web startup time, there's no repl being started. Running the defun
s has always taken a trivial amount of time.
Rough measurement in ms: Hoisting symbols: 2.3849999997764826 0.14635740226114227% Setting defuns: 0.4350000526756048 0.027115315061084767% Settings overrides: 0.2950001507997513 0.016987919961329274% shen.initialise: 12.119999853894114 0.8092126696075385%
So about 81% of the time in shen.initialise and 15% hoisting symbols.
Ok, good to know (that was expected, but just wanted to be sure).
There are parts of initialise that you can skip if you are not launching the REPL.
shen.initialise
calls these four functions, and only the first two are required, the last two are for typechecking:
shen.initialise-environment
sets most globals, most of which are not needed unless you are using Shen's eval
, but there are some things here that are needed for running programs, maybe I could look into splitting this one for the next release.shen.initialise-lambda-forms
associates lambda forms to function namesshen.initialise-signedfunc-lambda-forms
same but for signature functions used for typecheckingshen.initialise-signedfuncs
stores the signatures of functionsI'm also thinking that both initialise-*-lambda-forms
functions can probably be re-defined statically in Javascript too which may save some extra time. At the least, with the "lambdas" being hoisted so that they are not created inside that function (the function would just associate the names).
For example:
(shen.set-lambda-form-entry (cons shen.datatype-error (lambda X (shen.datatype-error X))))
becomes:
(shen.set-lambda-form-entry (cons shen.datatype-error $referenceToHoistedLambda))
.
I don't know if this makes a difference for the Javascript VMs, maybe when the compiler sees that there are no external references in the lambdas it takes care of this itself and avoids the allocation, but I'm not sure.
I should probable update this in the kernel to avoid the cons
es, it is how it was originally, but it is not needed now when using the expand-dynamic
extension. Not sure how much that will help, but it will not hurt.
I experimented un-wrapping "wrapper lambdas", ran into some issue, it wasn't working, started to question how much it would help, set it aside.
(Could also provide an override for the function
function that just looks up the function)
It could also introduce weird arity issues where a function is arity 2 but it's wrapped inside 2 layers of lambda, each arity 1. I guess that's what curried application is for.
As far as the note on the one commit about package size, I plan to remove the sync rendering of the kernel and sync mode, which will cut the deployable size by almost half.
Note that what I mentioned in the previous comment is not about unwrapping anything, it is about "creating" the lambdas at the top level, and not during initialise. It may or may not make a difference to the JS engine, depending on if it is making closures or not when the functions are created inside another function.
An easy way to know is to skip the *-lambda-forms
functions and see how much that speedups startup, if it makes no difference then thats not where time is being spent.
Regarding wrapper lambdas, I tried that in Shen/Scheme (for cases where the arity = 1), turns out that it doesn't work because sometimes lambdas are created before the function is defined, so you cannot unwrap them without getting an invalid reference. It happens a lot inside the kernel because lambda forms for macros are registered before the macro function is evaluated, and other similar situations. Looking into that is something I have left for later.
it doesn't work because sometimes lambdas are created before the function is defined
Right, I ran into that. And there's no point trying to get around it because if you wrap the function reference in a lookup function, then you're re-creating the same problem. Except maybe an opportunity to avoid async and the async test suite run does take 10x longer.
I don't know if hoisting lambdas would help a whole lot - it shouldn't take a JS engine any longer to create a lambda instance than any other value.
I did just push a change that flattens rendered lambdas. Very small performance improvement (might be within variation), but produces slightly more compact code and I like it.
Providing overrides for put
, shen.initialise_arity_table
and shen.active-cons
have reduced web startup time from a minimum observed of 50ms to 40ms. The shen.initialise_arity_table
assumes there are no entries in the *property-vector*
yet for those functions. Also, if shen.initialise_arity_table
is overridden, put
might not need to be overridden.
I overrode shen.active-cons
, but now shen.curry-type-h
is taking up time, I might do shen.curry-type
as a single override. The fact that these functions are non-tail recursive means there will be a lot of trampolining interlaced with promise chaining (bit waste of time).
Another profiling observation is that a decent amount of time is spent throwing and catching exceptions and the kernel would be faster if it didn't depend on exceptions for flow control.
Another profiling observation is that a decent amount of time is spent throwing and catching exceptions and the kernel would be faster if it didn't depend on exceptions for flow control.
See this, may help.
[DELETED]
Sorry I made the measurements with the HEAD of a branch in another repository!
New measurements with the HEAD of this repo, hash: 73d9461b626eaa8fefae844194d81718181e0a83 (actual latest master HEAD)
Desktop firefox: 166ms. Desktop chrome: 104ms. Desktop safari: 140ms.
mobile safari: 196ms.
The last weeks I've been busy with something personal and I did not follow how things developed (I think I read about an earlier expansion of klambda), and I am very impressed with the results!
The size grew to 1.2MB (184Kb when gzip), I wonder if we could go smaller than that while keeping the klambda expanded, bit now it is already usable for a realistic webpage (especially since I expect libraries to be very small and not grow much the total size).
In the other hand every single browser did not properly load the webpage on the first time I wrote the URL and had to reload it. My settings were that it should always remove the cache, so I cannot tell if it also kept an old cash before I reloaded it again; now it seems to work without any issues!
From my side this issue could be closed already now (I will test it in details when I have time again)
Desktop firefox: 166ms. Desktop chrome: 104ms. Desktop safari: 140ms.
mobile safari: 196ms.
Looking good!
The size grew to 1.2MB (184Kb when gzip), I wonder if we could go smaller than that while keeping the klambda expanded, bit now it is already usable for a realistic webpage (especially since I expect libraries to be very small and not grow much the total size).
It may be possible but I don't think it is worth it. As I mentioned before, consider that the size you see there is for loading the whole compiler and interactive environment, and unless your program uses the REPL, eval, and the typechecker at runtime, most of that code doesn't have to be loaded (same applies for the platform's kl->
@vrescobar Thanks for taking the time to get those measurements. Good to see the times are decent even on mobile.
As for the bundle size, I'm looking at removing the sync version of the kernel which would cut the size in half, and make the codebase simpler. Now that async mode is so fast, sync mode is not as necessary (but it is still 10x faster when running the test suite and 2x faster on web startup which is hard to throw away).
@rkoeninger README is missing a render-kernel
step, right? doesn't work for me otherwise.
@tizoc The "Building and Testing" section tells you to just read the travis build script, which runs the render command.
I'll elaborate on that part.
Btw, after a few loads so that things get cached, the environment load can go as low as 71ms on chrome on my computer (2013 macbook pro, not particularly modern or fast) and 190ms on firefox (but most of the time ~250-300ms).
It works in Safari, right?
~300ms is the slowest I've seen in Firefox on my 2011 thinkpad.
I think Safari is the only browser that has implemented tail calls. Would be interesting to try ShenScript in Safari with trampolines removed and see if it works. And what the performance is.
~185ms on safari. Works, ~but for some reason setting console.log as the resolve function of the promise doesn't work, I have to use alert
to see the result~ (it was me being dumb, was on the wrong console tab)
Cool! So it works and loads in a decent time in Chrome, Firefox and Safari. And by "Chrome" I mean Chromium-based, it works in Brave and probably works in Opera and Vivaldi and a lot of others.
I know it doesn't work in Edge as Edge does not support the object spread syntax e.g. { ...obj1, ...obj2 }
. Maybe I can get webpack or babel to factor those out.
Added babel plugin to webpack config to replace use of spread operator with Edge-compatible code. Now works in Edge as well. Load time ~250ms (compare 50ms in Chrome, 100ms in Firefox).
Speed isn't great but it works.
Added functionality to the compiler to detect when a defun refers only to primitive (or overridden with strictly sync version) functions or itself and then render that function in sync mode. At least a 20% speedup.
At this point I deemed the async mode fast enough and dumped the pre-rendered sync kernel. Cut the web bundle and npm packages' size by about 45%.
Version 0.9.0 is the first to be deployed to npm.
Time for this thread to close. Thanks, guys!
Check List
shen.mod
,integer?
,@p
,shen.numbyte
,shen.byte->digit
symbol?
,variable?
,shen.pvar?
shen.fillvector
shen.dict
functionsasync/await
used in the async kernel (Firefox load time is 10x faster without async)fun
Original Post
Picking up on the performance issue mentioned in #5
Observations
Startup performance in async mode is unacceptably slow. Seems to vary mainly with the browser. Brave creates the kernel in about 1.2s, but Firefox takes about 15s.
@vrescobar reported a 25s startup time in Chrome.
I think startup needs to happen in at most 500ms for ShenScript to be feasible for production use.
Startup time is normally much lower in sync mode, meaning async/await performance likely differs considerably between JavaScript engines.
Solutions
Some work has already been done to reduce the number of functions that are actually async and then don't get awaited to cut down on unnecessary asynchronicity. More opportunities for this could be done by analysis at build time or a manual survey and just listing the function names by configuration.
A more drastic approach would be to abandon async/await, even though it's convenient, I've heard it's slower than writing callback-style code. That would require a much more involved transpiler.