purescript / purescript-in-purescript

PureScript compiler written in PureScript (On hold / inactive)

Performance Ideas #29

Open paf31 opened 10 years ago

paf31 commented 10 years ago

We should also try to profile the running JavaScript to see where the bottleneck is.

jdegoes commented 10 years ago

I do recommend profiling before optimization. JavaScript engines do remarkable transformations that make it even more difficult than usual to identify trouble spots without profiling.

Also, to the extent that we can parallelize compilation, we can take advantage of web workers on the front end (I'm not sure whether anything similar is available on the back end). With most computers having 4+ cores these days, you can get an easy 2-10x performance increase just from parallelizing.
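
For the front end, the shape would be something like this (a minimal sketch; `compileModule`, the worker script name, and the message shape are all made up, and dependency ordering between modules is ignored here):

```ts
// worker.ts -- one Web Worker per compile job (browser API).
// `compileModule` stands in for a hypothetical per-module compile entry point.
declare function compileModule(source: string): string;

self.onmessage = (e: MessageEvent) => {
  const { name, source } = e.data as { name: string; source: string };
  const js = compileModule(source);
  (self as unknown as Worker).postMessage({ name, js });
};

// main.ts -- fan independent modules out over a small pool of workers.
declare const modules: { name: string; source: string }[];
const pool = Array.from({ length: 4 }, () => new Worker("worker.js"));
modules.forEach((m, i) => pool[i % pool.length].postMessage(m));
```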

garyb commented 10 years ago

You can spawn additional processes in Node and communicate with JSON over IPC, but I suspect it's a bit clunkier than web workers.
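
Roughly like this - the fork/IPC channel is built into Node, though the script name and message shape here are just illustrative:

```ts
// parent.ts -- fork a worker process; messages sent over the IPC channel
// are serialized as JSON under the hood.
import { fork } from "child_process";

const worker = fork("./compile-worker.js");
worker.on("message", (result) => console.log("compiled:", result));
worker.send({ cmd: "compile", module: "Data.Map" });

// compile-worker.ts -- the forked side of the channel.
process.on("message", (msg: { cmd: string; module: string }) => {
  // ... run the compiler on msg.module here ...
  process.send!({ module: msg.module, ok: true });
});
```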

paf31 commented 10 years ago

Well, psc-make could be parallelized at a fairly high level, and the module graph gives a natural scheme for data-flow parallelism. Definitely worth looking into later, I think.
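
The scheduling scheme would look something like the following sketch, where `compile` stands in for whatever per-module work we farm out:

```ts
type Module = { name: string; deps: string[] };

// Schedule a module as soon as all of its dependencies have finished.
// Assumes `mods` is in topological order (which sortModules already
// gives us), so each dependency's promise exists before we look it up.
async function compileAll(
  mods: Module[],
  compile: (m: Module) => Promise<void>
): Promise<void> {
  const done = new Map<string, Promise<void>>();
  for (const m of mods) {
    const deps = m.deps.map((d) => done.get(d)!);
    done.set(m.name, Promise.all(deps).then(() => compile(m)));
  }
  await Promise.all(done.values());
}
```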

paf31 commented 10 years ago

I'm more concerned by the several-orders-of-magnitude difference in single-threaded performance right now, though :P

paf31 commented 10 years ago

Also, the Map implementation used in the typechecker has O(n) lookup due to the order of the insertions. Either we should make Map self-balancing (preferred), or we should try a different data structure.
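
To spell out the failure mode (toy code, not our actual Data.Map): inserting keys in sorted order - which is exactly what happens when an environment is built up in order - produces a right spine, so lookups degrade to O(n). A self-balancing tree (AVL or red-black) rebalances on insert and keeps lookups at O(log n).

```ts
type Tree<K, V> = null | { key: K; value: V; left: Tree<K, V>; right: Tree<K, V> };

// Naive, non-balancing insert.
function insert<K extends string | number, V>(t: Tree<K, V>, key: K, value: V): Tree<K, V> {
  if (t === null) return { key, value, left: null, right: null };
  if (key < t.key) return { ...t, left: insert(t.left, key, value) };
  if (key > t.key) return { ...t, right: insert(t.right, key, value) };
  return { ...t, value }; // key already present: replace
}

// Sorted insertions: every new key goes to the right, so the "tree" is
// really a linked list of right children.
let t: Tree<number, string> = null;
for (let i = 0; i < 5; i++) t = insert(t, i, "v" + i);
```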

garyb commented 10 years ago

I meant to say, I tried profiling psc-in-ps the other day and got slightly strange results. As far as I can tell from the output, eqArray and bindStateT seem to be the most expensive operations, with a grand total of ~5% of the total JS execution time each. It makes no sense.

This was using nprof. I couldn't get dtrace or look to work.

paf31 commented 10 years ago

bindStateT sounds plausible, because both desugaring and type checking use StateT. We can probably use an IORef and exceptions in place of ErrorT in both cases, which should speed things up significantly.
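
Shape-wise, the win in the generated JS would be something like this (hypothetical names, not the compiler's actual code):

```ts
// Before: every step returns a fresh { state, value } pair, which is
// what all those bindStateT calls are allocating.
type Step<S, A> = (s: S) => { state: S; value: A };

// After: one mutable cell (the "IORef") and exceptions in place of ErrorT.
class CheckError extends Error {}

function runCheck<S, A>(initial: S, body: (ref: { state: S }) => A): A {
  const ref = { state: initial };
  return body(ref); // body mutates ref.state in place, throws CheckError on failure
}

try {
  runCheck({ nextVar: 0 }, (ref) => {
    ref.state.nextVar++; // ...typecheck, reading and writing state directly...
  });
} catch (e) {
  if (e instanceof CheckError) console.error(e.message);
  else throw e;
}
```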

Edit: Do you have the profiler results?

garyb commented 10 years ago

Oh sure, it makes sense in terms of what is expensive; I just meant it's odd that almost 80% of the time is missing from the profiler results, and that the two supposedly most expensive functions are only responsible for 10% of the total time.

I'll make a gist with the results when I get a minute.

garyb commented 10 years ago

https://gist.github.com/garyb/4888f29a11c09e469a40

Different results on my home machine: the total time % is even lower for bindStateT / eqArray!

paf31 commented 10 years ago

Interesting. There are a few modules which show up quite often: Data.Map, the typechecker, Names, and PatternArrows. Maybe we can try optimizing those and see if that helps.

garyb commented 10 years ago

I'm going to see if I can get dtrace working on a SmartOS VM, might get some more sense out of that too.

garyb commented 10 years ago

I hacked some timing into the compiler; here's what I get for just running psc:

read & parse files: 10ms
  sortModules: 18ms
  desugar: 531ms
  typecheck: 6212ms
  binding groups: 96ms
  js codegen: 224ms
compile: 8519ms

So the type checker is by far the main culprit. The parser is amazingly fast, considering.

garyb commented 10 years ago

I've put it in a branch c1c7f8fda4a430d003cd4409b8f1423451b8fb88

paf31 commented 10 years ago

Great to know :) I'll have a look at the typechecker in the next couple of days. I have a few ideas for how it could be improved.

garyb commented 10 years ago

Typecheck time on the new branch is now ~3926ms, down from ~5050ms. Pretty good, but considering the amount of work you did on it, it's a shame it's not a bit more!

paf31 commented 10 years ago

I have more up my sleeve :)

garyb commented 10 years ago

Good stuff :)

paf31 commented 10 years ago

I have a couple of other things I'd like to try, but to be honest, I'm starting to lose faith that the JS version can be fast enough to be practical.

If we can't get this to within a factor of 10 of the performance of the Haskell version, I think we should seriously consider sticking with what we have, and just work on packaging it better for end users (binary distributions, apt packages, brew etc.) We should chat about it on IRC when everyone is online.

Edit: I'd also like to try replacing Data.Map with a self-balancing implementation to see if that helps.

jdegoes commented 10 years ago

Unless the optimization is guided by fine-grained profiling, it'll just be blind luck if it improves anything (I know I say this all the time but it's so true it bears repeating... all the time :) ).

If modules can be compiled independently, then user compile time will be fairly small in the common case that only a couple of files have changed (obviously this would require a more intelligent build process, but that's something we need anyway). The speed of incremental compilation is really all that matters, since developers spend 99% of their time compiling a project that's only changed by a few lines.

The other thing is that PureScript does not have a sophisticated optimizer. The difference between GHC with no optimizations and GHC with all optimizations is astounding. Optimizations will be driven by slow performance, and they should have a profound impact not just on ps-in-ps but on user-land code too. It could be a really good thing that the compiler is slow, because it will accelerate work on optimizations.

paf31 commented 10 years ago

I agree about profiling, but I have no idea how to extract any meaningful data from a running instance of psc, other than to stick a console.time on every other line.
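
That is, something like this, with made-up phase functions:

```ts
declare function desugar(mods: unknown): unknown;
declare function typecheck(mods: unknown): unknown;
declare const modules: unknown;

// Bracket each phase; Node prints the elapsed time for each label.
console.time("desugar");
const desugared = desugar(modules);
console.timeEnd("desugar");

console.time("typecheck");
const checked = typecheck(desugared);
console.timeEnd("typecheck");
```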

Regarding incremental compilation, psc-make will avoid recompiling entire modules if they haven't changed, which does help a lot in the Haskell compiler, but 5s to build the Prelude is just far too slow to be practical. Incremental compilation is only going to get us so far if the single-module compile time is high. Related: finer-grained incremental compilation could also be worth looking at - per binding group, for example. Also, we can be cleverer about which modules need to be recompiled; right now we're recompiling too many things when a module low in the dependency DAG changes.
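
One sketch of being cleverer (all names hypothetical): hash each module's interface - its exported names and types - and, walking in topological order, rebuild a module only when its own source changed or some dependency's interface hash changed:

```ts
function rebuildPlan(
  topoOrder: string[],         // modules, dependencies first
  deps: Map<string, string[]>, // module -> direct dependencies
  sourceChanged: Set<string>   // modules whose source files changed
): string[] {
  const ifaceChanged = new Set<string>();
  const plan: string[] = [];
  for (const m of topoOrder) {
    const depChanged = (deps.get(m) ?? []).some((d) => ifaceChanged.has(d));
    if (sourceChanged.has(m) || depChanged) {
      plan.push(m);
      // After actually compiling m we'd compare its new interface hash with
      // the old one and only mark it changed if they differ; here we
      // pessimistically assume it changed.
      ifaceChanged.add(m);
    }
  }
  return plan;
}
```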

As for optimizations, I'm all in favour if anyone wants to investigate. Hopefully after the hangout today we'll have a whole bunch of newly initiated compiler developers :) My only stipulation is that the optimizations should keep the generated code complexity approximately the same. Aggressive inlining, for example, could be a big help to performance, but not to readability.

garyb commented 10 years ago

I had no luck getting an OS that dtrace works on to run, but I'm trying something else now which will hopefully let us profile.

garyb commented 10 years ago

One thing that might be worth reviewing is whether we have things left in the code that are evaluated unnecessarily due to strict evaluation.

One example of this is our use of guardWith from the typechecker monad. When doing something like this:

guardWith (strMsg $ "Expected type of kind *, was " ++ prettyPrintKind kind) $ kind == Star

we're calling strMsg and prettyPrintKind even if the error never happens, I think.
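
The usual fix is to take the message as a thunk (or build it inside the failure branch) so the pretty-printing only runs when the guard actually fails. The same trap and fix sketched in TypeScript, with made-up names:

```ts
declare function prettyPrintKind(k: unknown): string;
declare const kind: unknown;
declare const star: unknown;

// Eager: the caller builds the message before guardWith even runs, so
// prettyPrintKind is paid for on the happy path too.
function guardWithEager(err: Error, cond: boolean): void {
  if (!cond) throw err;
}

// Lazy: take a thunk, so the message is only built on failure.
function guardWithLazy(mkErr: () => Error, cond: boolean): void {
  if (!cond) throw mkErr();
}

guardWithLazy(
  () => new Error("Expected type of kind *, was " + prettyPrintKind(kind)),
  kind === star
);
```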

garyb commented 10 years ago

After doing that stuff with isNullaryConstructor in codegen the other day, it occurred to me that I could probably hack newtypes in during codegen (applying the optimization to any data constructor that is defined like a newtype). I was hoping it would give us a nice boost, since StateT's data constructor and the like would then behave as newtypes.

I think it knocked an entire 20-30ms off. :)

paf31 commented 10 years ago

I've changed my mind on newtypes a little. I think giving them their own keyword would be a good idea, just because it makes the code generation strategy easier to explain: data types always get a class, while newtypes are just a type wrapper with no runtime cost.
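
Roughly the runtime difference in question (illustrative shapes, not the compiler's actual output):

```ts
// data-style: every construction allocates a wrapper object, and every
// use has to unwrap it.
class Length {
  constructor(readonly value: number) {}
}
const a = new Length(5).value; // allocate, then unwrap

// newtype-style: the constructor erases to the identity, so the wrapped
// value is used directly at runtime.
const LengthN = (value: number): number => value;
const b = LengthN(5); // no allocation at all
```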

garyb commented 10 years ago

Yeah, agreed. That way, as with Haskell, you know for sure they are inlined too, since using newtype can enforce the one-argument, one-constructor rule.

My comment above was more just saying that even applying it globally makes only a tiny difference to performance, so it doesn't need to be that high on the list of optimizations to add. The main thing it will help with is optimizing algorithms using f-algebras a la recursion-schemes and the like, as TCO starts getting applied in places it didn't before.

garyb commented 9 years ago

Looks like there's a way to get dtrace equivalent info out of Node on Linux now: http://www.brendangregg.com/blog/2014-09-17/node-flame-graphs-on-linux.html

Should be helpful when looking at performance again!